


Project: https://github.com/PromtEngineer/localGPT

Model: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML


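1. Summary

Hosted LLMs such as OpenAI's ChatGPT require a network connection and are called in the cloud with an API key, which raises data privacy and security concerns. This guide builds a local, customizable knowledge-base AI chatbot on Llama 2 and LangChain: a pretrained LLM is deployed locally so that you can ingest your own files and ask questions about them without a network connection. The deployment is 100% private and local, and no data leaves your runtime environment at any point.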






2. Prerequisites

2.1 Meta’s Llama 2 7b Chat GGML

These files are GGML format model files for Meta’s Llama 2 7b Chat.

GGML files are for CPU + GPU inference using llama.cpp and the libraries and UIs that support this format.

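2.2 Install Conda

See "Quickly installing the Conda package manager on CentOS" (Entropy-Go's blog, CSDN).

2.3 Upgrade gcc

See "Introduction to gcc on CentOS and how to upgrade it quickly" (Entropy-Go's blog, CSDN).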

3. Clone or download the localGPT project

git clone https://github.com/PromtEngineer/localGPT.git

4. Install dependencies

4.1 Create and activate the Conda environment

conda create -n localGPT
conda activate localGPT

4.2 Install the dependency packages

If the Conda environment variables are set correctly, run pip install directly:

pip install -r requirements.txt


If the system Python is picked up instead, locate Conda and call its pip by absolute path:

whereis conda
conda: /root/miniconda3/bin/conda /root/miniconda3/condabin/conda
/root/miniconda3/bin/pip install -r requirements.txt

If installation fails with the error below, upgrade gcc as described in 2.3. Upgrading to gcc 11 is recommended.

ERROR: Could not build wheels for llama-cpp-python, hnswlib, lxml, which is required to install pyproject.toml-based project
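
Before retrying the install, it can help to confirm which gcc is actually on the PATH. The following is a minimal sketch, not part of localGPT: the helper names `parse_gcc_major` and `gcc_is_recent_enough` are made up for illustration, and the threshold of 11 simply mirrors the recommendation above.

```python
import re
import shutil
import subprocess


def parse_gcc_major(version_output: str):
    """Return the major version found in `gcc --version` output, or None."""
    match = re.search(r"\b(\d+)\.\d+\.\d+\b", version_output)
    return int(match.group(1)) if match else None


def gcc_is_recent_enough(minimum_major: int = 11) -> bool:
    """True if a gcc on PATH reports at least `minimum_major`."""
    if shutil.which("gcc") is None:
        return False
    result = subprocess.run(
        ["gcc", "--version"], capture_output=True, text=True, check=False
    )
    major = parse_gcc_major(result.stdout)
    return major is not None and major >= minimum_major


if __name__ == "__main__":
    print("gcc >= 11:", gcc_is_recent_enough())
```

If this prints False after an upgrade, the old compiler is probably still first on the PATH of the shell that runs pip.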

5. Add documents as the knowledge base

5.1 Document directory and sample document




/root/miniconda3/bin/python ingest.py --help
Usage: ingest.py [OPTIONS]

  --device_type [cpu|cuda|ipu|xpu|mkldnn|opengl|opencl|ideep|hip|ve|fpga|ort|xla|lazy|vulkan|mps|meta|hpu|mtia]
                                  Device to run on. (Default is cuda)
  --help                          Show this message and exit.

5.2 Ingest the documents


By default, ingestion runs on cuda:

/root/miniconda3/bin/python ingest.py

To run on the CPU instead, pass --device_type cpu:

/root/miniconda3/bin/python ingest.py --device_type cpu
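
Choosing between the two commands can be automated. The sketch below is an illustration only: `pick_device_type` and `ingest_command` are hypothetical helpers, and it assumes PyTorch is importable in the same environment (it falls back to cpu when it is not). It only considers cuda and cpu, not the other device types listed by --help.

```python
def pick_device_type() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # assumed to be installed with the project's requirements
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def ingest_command(python_bin: str = "python") -> list:
    """Build the ingest.py command line for the detected device."""
    return [python_bin, "ingest.py", "--device_type", pick_device_type()]


if __name__ == "__main__":
    print(" ".join(ingest_command("/root/miniconda3/bin/python")))
```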

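On the first ingestion, the embedding model is downloaded (hkunlp/instructor-large in the log below), and the vector database is created and stored in /root/localGPT/DB.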


/root/miniconda3/bin/python ingest.py
2023-08-18 09:36:55,389 - INFO - ingest.py:122 - Loading documents from /root/localGPT/SOURCE_DOCUMENTS
all files: ['constitution.pdf']
2023-08-18 09:36:55,398 - INFO - ingest.py:34 - Loading document batch
2023-08-18 09:36:56,818 - INFO - ingest.py:131 - Loaded 1 documents from /root/localGPT/SOURCE_DOCUMENTS
2023-08-18 09:36:56,818 - INFO - ingest.py:132 - Split into 72 chunks of text
2023-08-18 09:36:57,994 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
Downloading (…)c7233/.gitattributes: 100%|███████████████████████████████████████████████████████████████████████████| 1.48k/1.48k [00:00<00:00, 4.13MB/s]
Downloading (…)_Pooling/config.json: 100%|████████████████████████████████████████████████████████████████████████████████| 270/270 [00:00<00:00, 915kB/s]
Downloading (…)/2_Dense/config.json: 100%|████████████████████████████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 380kB/s]
Downloading pytorch_model.bin: 100%|█████████████████████████████████████████████████████████████████████████████████| 3.15M/3.15M [00:01<00:00, 2.99MB/s]
Downloading (…)9fb15c7233/README.md: 100%|████████████████████████████████████████████████████████████████████████████| 66.3k/66.3k [00:00<00:00, 359kB/s]
Downloading (…)b15c7233/config.json: 100%|███████████████████████████████████████████████████████████████████████████| 1.53k/1.53k [00:00<00:00, 5.70MB/s]
Downloading (…)ce_transformers.json: 100%|████████████████████████████████████████████████████████████████████████████████| 122/122 [00:00<00:00, 485kB/s]
Downloading pytorch_model.bin: 100%|█████████████████████████████████████████████████████████████████████████████████| 1.34G/1.34G [03:15<00:00, 6.86MB/s]
Downloading (…)nce_bert_config.json: 100%|██████████████████████████████████████████████████████████████████████████████| 53.0/53.0 [00:00<00:00, 109kB/s]
Downloading (…)cial_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████| 2.20k/2.20k [00:00<00:00, 8.96MB/s]
Downloading spiece.model: 100%|████████████████████████████████████████████████████████████████████████████████████████| 792k/792k [00:00<00:00, 3.46MB/s]
Downloading (…)c7233/tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████| 2.42M/2.42M [00:00<00:00, 3.01MB/s]
Downloading (…)okenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████| 2.41k/2.41k [00:00<00:00, 9.75MB/s]
Downloading (…)15c7233/modules.json: 100%|███████████████████████████████████████████████████████████████████████████████| 461/461 [00:00<00:00, 1.92MB/s]
load INSTRUCTOR_Transformer
2023-08-18 09:40:26,658 - INFO - instantiator.py:21 - Created a temporary directory at /tmp/tmp47gnnhwi
2023-08-18 09:40:26,658 - INFO - instantiator.py:76 - Writing /tmp/tmp47gnnhwi/_remote_module_non_scriptable.py
max_seq_length  512
2023-08-18 09:40:30,076 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-08-18 09:40:30,248 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /root/localGPT/DB
2023-08-18 09:40:30,252 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-08-18 09:40:30,257 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-08-18 09:40:30,295 - INFO - duckdb.py:454 - No existing DB found in /root/localGPT/DB, skipping load
2023-08-18 09:40:30,295 - INFO - duckdb.py:466 - No existing DB found in /root/localGPT/DB, skipping load
2023-08-18 09:40:32,800 - INFO - duckdb.py:414 - Persisting DB to disk, putting it in the save folder: /root/localGPT/DB
2023-08-18 09:40:32,813 - INFO - duckdb.py:414 - Persisting DB to disk, putting it in the save folder: /root/localGPT/DB


ACKNOWLEDGEMENT.md  CONTRIBUTING.md  ingest.py   localGPT_UI.py  README.md            run_localGPT.py
constants.py        DB               LICENSE     __pycache__     requirements.txt     SOURCE_DOCUMENTS
constitution.pdf    Dockerfile       localGPTUI  pyproject.toml  run_localGPT_API.py 

6. Run the knowledge-base AI chatbot


6.1 Ask questions from the command line

On the first run, the default model configured in ~/localGPT/constants.py is downloaded:

# model link: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML

MODEL_ID = "TheBloke/Llama-2-7B-Chat-GGML"

MODEL_BASENAME = "llama-2-7b-chat.ggmlv3.q4_0.bin"

The model is downloaded to /root/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML
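
The cache folder name is derived from MODEL_ID: the Hugging Face hub cache prefixes the repository id with "models--" and replaces each "/" with "--". A small sketch of that mapping follows; `hf_cache_folder` is a hypothetical helper, and the default cache root is an assumption that holds only when the cache location has not been overridden through environment variables.

```python
from pathlib import Path


def hf_cache_folder(repo_id: str, cache_root: str = "~/.cache/huggingface/hub") -> Path:
    """Map a model repo id to its folder under the Hugging Face hub cache."""
    folder_name = "models--" + repo_id.replace("/", "--")
    return Path(cache_root).expanduser() / folder_name


if __name__ == "__main__":
    folder = hf_cache_folder("TheBloke/Llama-2-7B-Chat-GGML")
    print(folder, "exists" if folder.exists() else "not downloaded yet")
```

This is handy for checking whether the multi-gigabyte model file is already cached before starting a run on a machine without network access.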


/root/miniconda3/bin/python run_localGPT.py



Enter a query:


/root/miniconda3/bin/python run_localGPT.py
2023-08-18 09:43:02,433 - INFO - run_localGPT.py:180 - Running on: cuda
2023-08-18 09:43:02,433 - INFO - run_localGPT.py:181 - Display Source Documents set to: False
2023-08-18 09:43:02,676 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-08-18 09:43:05,301 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-08-18 09:43:05,317 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /root/localGPT/DB
2023-08-18 09:43:05,328 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-08-18 09:43:05,336 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-08-18 09:43:05,402 - INFO - duckdb.py:460 - loaded in 72 embeddings
2023-08-18 09:43:05,405 - INFO - duckdb.py:472 - loaded in 1 collections
2023-08-18 09:43:05,406 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-08-18 09:43:05,406 - INFO - run_localGPT.py:45 - Loading Model: TheBloke/Llama-2-7B-Chat-GGML, on: cuda
2023-08-18 09:43:05,406 - INFO - run_localGPT.py:46 - This action can take a few minutes!
2023-08-18 09:43:05,406 - INFO - run_localGPT.py:50 - Using Llamacpp for GGML quantized models
Downloading (…)chat.ggmlv3.q4_0.bin: 100%|███████████████████████████████████████████████████████████████████████████| 3.79G/3.79G [09:53<00:00, 6.39MB/s]
llama.cpp: loading model from /root/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

Enter a query:


/root/miniconda3/bin/python run_localGPT.py --show_sources


/root/miniconda3/bin/python run_localGPT.py --show_sources
2023-08-18 10:03:55,466 - INFO - run_localGPT.py:180 - Running on: cuda
2023-08-18 10:03:55,466 - INFO - run_localGPT.py:181 - Display Source Documents set to: True
2023-08-18 10:03:55,708 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-08-18 10:03:58,302 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-08-18 10:03:58,307 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /root/localGPT/DB
2023-08-18 10:03:58,312 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-08-18 10:03:58,318 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-08-18 10:03:58,372 - INFO - duckdb.py:460 - loaded in 72 embeddings
2023-08-18 10:03:58,373 - INFO - duckdb.py:472 - loaded in 1 collections
2023-08-18 10:03:58,373 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-08-18 10:03:58,374 - INFO - run_localGPT.py:45 - Loading Model: TheBloke/Llama-2-7B-Chat-GGML, on: cuda
2023-08-18 10:03:58,374 - INFO - run_localGPT.py:46 - This action can take a few minutes!
2023-08-18 10:03:58,374 - INFO - run_localGPT.py:50 - Using Llamacpp for GGML quantized models
llama.cpp: loading model from /root/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

Enter a query: how many times could president act, and how many years as max?

llama_print_timings:        load time = 19737.32 ms
llama_print_timings:      sample time =   101.14 ms /   169 runs   (    0.60 ms per token,  1671.02 tokens per second)
llama_print_timings: prompt eval time = 19736.91 ms /   925 tokens (   21.34 ms per token,    46.87 tokens per second)
llama_print_timings:        eval time = 36669.35 ms /   168 runs   (  218.27 ms per token,     4.58 tokens per second)
llama_print_timings:       total time = 56849.80 ms

> Question:
how many times could president act, and how many years as max?

> Answer:
 The answer to this question can be found in Amendment XXII and Amendment XXIII of the US Constitution. According to these amendments, a person cannot be elected President more than twice, and no person can hold the office of President for more than two years of a term to which someone else was elected President. However, if the President is unable to discharge their powers and duties due to incapacity, the Vice President will continue to act as President until Congress determines the issue.
In summary, a person can be elected President at most twice, and they cannot hold the office for more than two years of a term to which someone else was elected President. If the President becomes unable to discharge their powers and duties, the Vice President will continue to act as President until Congress makes a determination.
----------------------------------SOURCE DOCUMENTS---------------------------

> /root/localGPT/SOURCE_DOCUMENTS/constitution.pdf:
Amendment  XXII.

Amendment  XXIII.

Passed by Congress March 21, 1947. Ratified February 27,

Passed by Congress June 16, 1960. Ratified March 29, 1961.





----------------------------------SOURCE DOCUMENTS---------------------------

Enter a query: exit
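
The llama_print_timings lines in the session above give a quick read on throughput: the rate is the token count divided by the elapsed time in seconds. The sketch below recomputes it from the raw log lines. The regular expression and the helper name `tokens_per_second` are illustrative assumptions about this particular log format, not part of llama.cpp.

```python
import re

# Matches e.g. "llama_print_timings: eval time = 36669.35 ms / 168 runs"
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[\w ]+?) time =\s*(?P<ms>[\d.]+) ms"
    r"(?: /\s*(?P<count>\d+) (?:runs|tokens))?"
)


def tokens_per_second(line: str):
    """Return count / seconds for a timing line, or None if it has no count."""
    match = TIMING_RE.search(line)
    if not match or match.group("count") is None:
        return None
    seconds = float(match.group("ms")) / 1000.0
    return int(match.group("count")) / seconds


if __name__ == "__main__":
    eval_line = "llama_print_timings:        eval time = 36669.35 ms /   168 runs"
    prompt_line = "llama_print_timings: prompt eval time = 19736.91 ms /   925 tokens"
    print(round(tokens_per_second(eval_line), 2))    # generation speed
    print(round(tokens_per_second(prompt_line), 2))  # prompt processing speed
```

For this run the generation rate works out to about 4.58 tokens per second and prompt evaluation to about 46.87, matching the figures llama.cpp printed.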

6.2 Ask questions through the Web UI

6.2.1 Start the server-side API

To use the Web UI, first start the server-side API, which listens on port 5110.

/root/miniconda3/bin/python run_localGPT_API.py

If you hit the error below, the cause is once again that the python invoked inside the code is not the one under the Conda path.

/root/miniconda3/bin/python run_localGPT_API.py
load INSTRUCTOR_Transformer
max_seq_length  512
The directory does not exist
run_langest_commands ['python', 'ingest.py']
Traceback (most recent call last):
  File "/root/localGPT/run_localGPT_API.py", line 56, in <module>
    raise FileNotFoundError(
FileNotFoundError: No files were found inside SOURCE_DOCUMENTS, please put a starter file inside before starting the API!


To fix it, edit run_localGPT_API.py and change

run_langest_commands = ["python", "ingest.py"]

to

run_langest_commands = ["/root/miniconda3/bin/python", "ingest.py"]
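
A more portable variant of the same fix avoids hard-coding the Conda path. This is a sketch under one assumption: the API script itself is started with the Conda interpreter, as in the commands above, so `sys.executable` points at that interpreter.

```python
import sys

# sys.executable is the absolute path of the interpreter running this script,
# so the ingest step is launched with the same Python as the API itself.
run_langest_commands = [sys.executable, "ingest.py"]

if __name__ == "__main__":
    print(run_langest_commands)
```

The trade-off is that the behavior now depends on how the API was launched, whereas the hard-coded path always selects the same interpreter.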


Lines starting with INFO:werkzeug: indicate that the API started successfully. Keep this window open for debugging.

/root/miniconda3/bin/python run_localGPT_API.py
load INSTRUCTOR_Transformer
max_seq_length  512
WARNING:chromadb:Using embedded DuckDB with persistence: data will be stored in: /root/localGPT/DB
llama.cpp: loading model from /root/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-Chat-GGML/snapshots/b616819cd4777514e3a2d9b8be69824aca8f5daf/llama-2-7b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
 * Serving Flask app 'run_localGPT_API'
 * Debug mode: on
INFO:werkzeug:WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug: * Restarting with watchdog (inotify)

6.2.2 Start the server-side UI


/root/miniconda3/bin/python localGPTUI.py

For access from other machines on the LAN, change the default of --host in localGPTUI.py:

parser.add_argument("--host", type=str, default="",
                        help="Host to run the UI on. Defaults to "
                             "Set to to make the UI externally "
                             "accessible from other devices.")


/root/miniconda3/bin/python localGPTUI.py 
 * Serving Flask app 'localGPTUI'
 * Debug mode: on
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (
 * Running on
 * Running on http://IP:5111


netstat -nltp | grep 511
tcp        0      0*               LISTEN      57479/python
tcp        0      0  *               LISTEN      21718/python
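
The same check can be done from Python when netstat is not available. The sketch below tries a TCP connection to the API port (5110) and the UI port (5111). The helper name `port_is_listening` is made up for illustration, and it probes only the loopback address by default.

```python
import socket


def port_is_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for name, port in (("API", 5110), ("UI", 5111)):
        state = "listening" if port_is_listening(port) else "not listening"
        print(name, port, state)
```

To test reachability from another machine, pass the server's LAN address as `host` instead of the loopback default.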

6.2.3 Open the Web UI in a browser


LAN: http://IP:5111



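6.3 Replace the knowledge-base documents

6.3.1 From the command line

Copy the documents directly into ~/localGPT/SOURCE_DOCUMENTS/, then run ingest.py again as in 5.2 so that they are added to the vector database.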


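6.3.2 From the Web UI


1. To upload a document for the application to ingest as its new knowledge base, click the upload button.

2. Select the document you want to use as the new knowledge base for the conversation.


4. There is a short wait while the document is ingested into the vector database as the new knowledge base.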





7.1 Ingesting Chinese documents


max_ctx_size = 4096


text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=200)
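
To see what chunk_size=800 and chunk_overlap=200 mean in practice, the sketch below splits text with a plain fixed-width sliding window. It is deliberately simpler than LangChain's RecursiveCharacterTextSplitter, which prefers to break on separators such as paragraphs and sentences, so the chunk boundaries it produces will differ. `split_with_overlap` is a hypothetical stand-in used only to show how the two parameters interact.

```python
def split_with_overlap(text: str, chunk_size: int = 800, chunk_overlap: int = 200) -> list:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(65 + i % 26) for i in range(2000))
    pieces = split_with_overlap(sample)
    print(len(pieces), [len(p) for p in pieces])
```

Each window advances by chunk_size minus chunk_overlap characters, so a larger overlap yields more chunks and more repeated context per chunk.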

7.2 The web page opens but questions get no reply (response.status_code = 504, 304)


unset http_proxy
unset https_proxy
unset ftp_proxy
/root/miniconda3/bin/python localGPTUI.py
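
The same effect can be had without touching the shell's environment by launching the UI with a filtered copy of the environment. This is a sketch: `env_without_proxies` is a hypothetical helper, and unlike the unset commands above it also drops the upper-case spellings (HTTP_PROXY and so on), which is an assumption about what is wanted.

```python
import os

PROXY_VARS = ("http_proxy", "https_proxy", "ftp_proxy")


def env_without_proxies(env=None) -> dict:
    """Return a copy of the environment with the proxy variables removed."""
    source = os.environ if env is None else env
    blocked = {name.lower() for name in PROXY_VARS}
    return {key: value for key, value in source.items() if key.lower() not in blocked}


if __name__ == "__main__":
    cleaned = env_without_proxies()
    print(len(os.environ) - len(cleaned), "proxy variables removed")
    # To start the UI with the cleaned environment:
    # subprocess.run([sys.executable, "localGPTUI.py"], env=cleaned)
```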

7.3 How localGPT works

By selecting the right local models and using the power of LangChain, you can run the entire pipeline locally, without any data leaving your environment, and with reasonable performance.

  • ingest.py uses LangChain tools to parse the document and create embeddings locally using InstructorEmbeddings. It then stores the result in a local vector database using Chroma vector store.
  • run_localGPT.py uses a local LLM to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the docs.
  • You can replace this local LLM with any other LLM from HuggingFace. Make sure whatever LLM you select is in the HF format.
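
The similarity search mentioned above can be illustrated with a toy example: each chunk is stored with an embedding vector, the question is embedded the same way, and the chunks whose vectors point in the most similar direction are returned as context. The sketch below uses hand-made 3-dimensional vectors and pure Python; it is not how Chroma or InstructorEmbeddings are implemented, and the vectors and chunk texts are invented for illustration.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """store is a list of (chunk_text, embedding) pairs; return the k closest texts."""
    ranked = sorted(
        store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True
    )
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    # Made-up embeddings; a real model produces hundreds of dimensions.
    store = [
        ("chunk about presidential terms", [0.9, 0.1, 0.0]),
        ("chunk about taxation", [0.0, 0.2, 0.9]),
        ("chunk about elections", [0.7, 0.6, 0.1]),
    ]
    print(top_k([1.0, 0.2, 0.0], store, k=2))
```

The retrieved chunks are then placed in the prompt, which is why the prompt eval step in the timing output processes far more tokens than the question itself contains.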

7.4 How to choose a different LLM

The following will provide instructions on how you can select a different LLM model to create your response:

  1. Open up constants.py in the editor of your choice.

  2. Change the MODEL_ID and MODEL_BASENAME. If you are using a quantized model (GGML, GPTQ), you will need to provide MODEL_BASENAME. For unquantized models, set MODEL_BASENAME to None.

  3. There are a number of example models from HuggingFace that have already been tested to be run with the original trained model (ending with HF or have a .bin in its “Files and versions”), and quantized models (ending with GPTQ or have a .no-act-order or .safetensors in its “Files and versions”).

  4. For models that end with HF or have a .bin inside its “Files and versions” on its HuggingFace page.

    • Make sure you have a model_id selected. For example -> MODEL_ID = "TheBloke/guanaco-7B-HF"
    • If you go to its HuggingFace repo and go to “Files and versions” you will notice model files that end with a .bin extension.
    • Any model files that contain .bin extensions will be run with the following code where the # load the LLM for generating Natural Language responses comment is found.
    • MODEL_ID = "TheBloke/guanaco-7B-HF"
  5. For models that contain GPTQ in their name and/or have a .no-act-order or .safetensors extension inside “Files and versions” on their HuggingFace page.

    • Make sure you have a model_id selected. For example -> model_id = "TheBloke/wizardLM-7B-GPTQ"

    • You will also need its model basename file selected. For example -> model_basename = "wizardLM-7B-GPTQ-4bit.compat.no-act-order.safetensors"

    • If you go to its HuggingFace repo and go to “Files and versions” you will notice a model file that ends with a .safetensors extension.

    • Any model files that contain no-act-order or .safetensors extensions will be run with the following code where the # load the LLM for generating Natural Language responses comment is found.

    • MODEL_ID = "TheBloke/WizardLM-7B-uncensored-GPTQ"

      MODEL_BASENAME = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"

  6. Comment out all other instances of MODEL_ID="other model names", MODEL_BASENAME="other base model names", and llm = load_model(args*).
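
The naming rules in the steps above can be summarized as a small classifier. This is a hypothetical helper written for illustration, not localGPT's actual loading logic: it only encodes the conventions described here (no basename for unquantized models, "ggml" in the file name for GGML, a .safetensors file or GPTQ in the repo name for GPTQ).

```python
def model_kind(model_id: str, model_basename=None) -> str:
    """Guess the model family from MODEL_ID and MODEL_BASENAME naming conventions."""
    if model_basename is None:
        return "full"  # unquantized model in HF format
    name = model_basename.lower()
    if "ggml" in name:
        return "ggml"
    if name.endswith(".safetensors") or "gptq" in model_id.lower():
        return "gptq"
    return "unknown"


if __name__ == "__main__":
    MODEL_ID = "TheBloke/Llama-2-7B-Chat-GGML"
    MODEL_BASENAME = "llama-2-7b-chat.ggmlv3.q4_0.bin"
    print(MODEL_ID, "->", model_kind(MODEL_ID, MODEL_BASENAME))
```

A check like this can catch a mismatched pair, for example a GPTQ MODEL_ID left next to a GGML MODEL_BASENAME after editing constants.py.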

7.5 For more issues, see

Issues · PromtEngineer/localGPT · GitHub

