Support public LLMs and OpenAI API as an LLM service in QAnything #78

Merged 10 commits on Jan 28, 2024
13 changes: 9 additions & 4 deletions README.md
@@ -163,7 +163,11 @@ If you need to use it for commercial purposes, please follow the license of Qwen
git clone https://github.com/netease-youdao/QAnything.git
```
### step2: Enter the project root directory and execute the startup script.
If you are on Windows 11, you need to enter the WSL environment.
* [📖 QAnything_Startup_Usage](docs/QAnything_Startup_Usage_README.md)
* Get detailed usage of the LLM interface with ```bash ./run.sh -h```


If you are on Windows 11, you need to enter the **WSL** environment.
```shell
cd QAnything
bash run.sh # Start on GPU 0 by default.
@@ -174,7 +178,7 @@ bash run.sh # Start on GPU 0 by default.

```shell
cd QAnything
bash run.sh 0 # gpu id 0
bash ./run.sh -c local -i 0 -b default # gpu id 0
```
</details>

@@ -183,7 +187,7 @@

```shell
cd QAnything
bash run.sh 0,1 # gpu ids: 0,1. Please confirm how many GPUs are available; at most two cards are supported for startup.
bash ./run.sh -c local -i 0,1 -b default # gpu ids: 0,1. Please confirm how many GPUs are available; at most two cards are supported for startup.
```
</details>

@@ -265,7 +269,8 @@ Reach out to the maintainer at one of the following places:
`QAnything` adopts dependencies from the following:
- Thanks to our [BCEmbedding](https://github.com/netease-youdao/BCEmbedding) for the excellent embedding and rerank model.
- Thanks to [Qwen](https://github.com/QwenLM/Qwen) for strong base language models.
- Thanks to [Triton Inference Server](https://github.com/triton-inference-server/server) for providing great open source inference serving.
- Thanks to [Triton Inference Server](https://github.com/triton-inference-server/server) and [vllm](https://github.com/vllm-project/vllm) for providing excellent open-source inference serving.
- Thanks to [FastChat](https://github.com/lm-sys/FastChat) for providing a fully OpenAI-compatible API server.
- Thanks to [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) for the highly optimized LLM inference backend.
- Thanks to [Langchain](https://github.com/langchain-ai/langchain) for the wonderful LLM application framework.
- Thanks to [Langchain-Chatchat](https://github.com/chatchat-space/Langchain-Chatchat) for the inspiration provided on local knowledge base Q&A.
11 changes: 8 additions & 3 deletions README_zh.md
@@ -154,7 +154,10 @@ QAnything's retrieval component [BCEmbedding](https://github.com/netease-youdao/BC
git clone https://github.com/netease-youdao/QAnything.git
```
### step2: 进入项目根目录执行启动脚本
If you are on Windows, first enter the wsl environment
* [📖 QAnything_Startup_Usage](docs/QAnything_Startup_Usage_README.md)
* Run ```bash ./run.sh -h``` to get detailed LLM service configuration options

If you are on Windows, first enter the **WSL** environment
```shell
cd QAnything
bash run.sh # Start on GPU 0 by default
@@ -165,7 +168,7 @@ bash run.sh # Start on GPU 0 by default

```shell
cd QAnything
bash run.sh 0 # Start on GPU 0; GPU ids start from 0; Windows machines usually have only one card, so only GPU 0 can be specified
bash ./run.sh -c local -i 0 -b default # Start on GPU 0; GPU ids start from 0; Windows machines usually have only one card, so only GPU 0 can be specified
```
</details>

@@ -174,7 +177,7 @@ bash run.sh 0 # Start on GPU 0; GPU ids start from 0; Windows machines usually

```shell
cd QAnything
bash run.sh 0,1 # Start on GPUs 0,1; please confirm multiple GPUs are available; at most two cards are supported
bash ./run.sh -c local -i 0,1 -b default # Start on GPUs 0,1; please confirm multiple GPUs are available; at most two cards are supported
```
</details>

@@ -248,6 +251,8 @@ [email protected]
- [BCEmbedding](https://github.com/netease-youdao/BCEmbedding)
- [Qwen](https://github.com/QwenLM/Qwen)
- [Triton Inference Server](https://github.com/triton-inference-server/server)
- [vllm](https://github.com/vllm-project/vllm)
- [FastChat](https://github.com/lm-sys/FastChat)
- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)
- [Langchain](https://github.com/langchain-ai/langchain)
- [Langchain-Chatchat](https://github.com/chatchat-space/Langchain-Chatchat)
Empty file added assets/custom_models/.gitignore
Empty file.
5 changes: 3 additions & 2 deletions docker-compose-linux.yaml
@@ -78,7 +78,7 @@ services:

qanything_local:
container_name: qanything-container-local
image: freeren/qanything:v1.0.9
image: freeren/qanything:v1.1.1
# runtime: nvidia
deploy:
resources:
@@ -87,11 +87,12 @@
- driver: nvidia
count: "all"
capabilities: ["gpu"]
command: /workspace/qanything_local/scripts/run_for_local.sh
command: /bin/bash -c 'if [ "${LLM_API}" = "local" ]; then /workspace/qanything_local/scripts/run_for_local_option.sh -c $LLM_API -i $DEVICE_ID -b $RUNTIME_BACKEND -m $MODEL_NAME -t $CONV_TEMPLATE -p $TP -r $GPU_MEM_UTILI; else /workspace/qanything_local/scripts/run_for_cloud_option.sh -c $LLM_API -i $DEVICE_ID -b $RUNTIME_BACKEND; fi; while true; do sleep 5; done'
privileged: true
shm_size: '8gb'
volumes:
- ${DOCKER_VOLUME_DIRECTORY:-.}/models:/model_repos/QAEnsemble
- ${DOCKER_VOLUME_DIRECTORY:-.}/assets/custom_models:/model_repos/CustomLLM
- ${DOCKER_VOLUME_DIRECTORY:-.}/:/workspace/qanything_local/
ports:
- "5052:5052"
5 changes: 3 additions & 2 deletions docker-compose-windows.yaml
@@ -78,7 +78,7 @@ services:

qanything_local:
container_name: qanything-container-local
image: freeren/qanything-win:v1.0.9
image: freeren/qanything-win:v1.1.1
# runtime: nvidia
deploy:
resources:
@@ -87,11 +87,12 @@
- driver: nvidia
count: "all"
capabilities: ["gpu"]
command: /workspace/qanything_local/scripts/run_for_local.sh
command: sh -c 'if [ "${LLM_API}" = "local" ]; then /workspace/qanything_local/scripts/run_for_local_option.sh -c $LLM_API -i $DEVICE_ID -b $RUNTIME_BACKEND -m $MODEL_NAME -t $CONV_TEMPLATE -p $TP -r $GPU_MEM_UTILI; else /workspace/qanything_local/scripts/run_for_cloud_option.sh -c $LLM_API -i $DEVICE_ID -b $RUNTIME_BACKEND; fi; while true; do sleep 5; done'
privileged: true
shm_size: '8gb'
volumes:
- ${DOCKER_VOLUME_DIRECTORY:-.}/models:/model_repos/QAEnsemble
- ${DOCKER_VOLUME_DIRECTORY:-.}/assets/custom_models:/model_repos/CustomLLM
- ${DOCKER_VOLUME_DIRECTORY:-.}/:/workspace/qanything_local/
ports:
- "5052:5052"
114 changes: 114 additions & 0 deletions docs/QAnything_Startup_Usage_README.md
@@ -0,0 +1,114 @@


## Table of Contents

- [QAnything Service Startup Command Usage](#QAnything-Service-Startup-Command-Usage)
- [Supported Public LLM using FastChat API](#Supported-Public-LLM-using-FastChat-API-with-Huggingface-Transformers/vllm-runtime-backend)
- [Tricks for saving GPU VRAM](#Tricks-for-saving-GPU-VRAM)
- [Coming Soon](#Coming-Soon)


## QAnything Service Startup Command Usage

```bash
Usage: bash run.sh [-c <llm_api>] [-i <device_id>] [-b <runtime_backend>] [-m <model_name>] [-t <conv_template>] [-p <tensor_parallel>] [-r <gpu_memory_utilization>]

-c <llm_api>: "Options {local, cloud} to specify the LLM API mode, default is 'local'. If set to '-c cloud', please manually set the environment variables {OPENAI_API_KEY, OPENAI_API_BASE, OPENAI_API_MODEL_NAME, OPENAI_API_CONTEXT_LENGTH} in .env first."
-i <device_id>: "Specify the GPU device_id."
-b <runtime_backend>: "Specify the LLM inference runtime backend, options={default, hf, vllm}"
-m <model_name>: "Specify the model name of the public LLM to load via the FastChat serve API, options={Qwen-7B-Chat, deepseek-llm-7b-chat, ...}"
-t <conv_template>: "Specify the conversation template matching the public LLM model when using the FastChat serve API, options={qwen-7b-chat, deepseek-chat, ...}"
-p <tensor_parallel>: "Set the tensor parallel size {1, 2} for the vllm backend when using the FastChat serve API, default tensor_parallel=1"
-r <gpu_memory_utilization>: "Specify gpu_memory_utilization in (0,1] for the vllm backend when using the FastChat serve API, default gpu_memory_utilization=0.81"
-h: "Display help usage message"
```
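
For `-c cloud`, the four OpenAI variables must already be in `.env` before startup. A minimal preflight sketch under that assumption (the `check_cloud_env` helper is illustrative, not part of QAnything):

```python
import os
from dotenv import load_dotenv  # python-dotenv, already used by QAnything

# Variable names taken from the usage text above.
REQUIRED_CLOUD_VARS = [
    "OPENAI_API_KEY",
    "OPENAI_API_BASE",
    "OPENAI_API_MODEL_NAME",
    "OPENAI_API_CONTEXT_LENGTH",
]

def check_cloud_env() -> None:
    """Fail fast if a cloud-mode setting is missing from .env."""
    load_dotenv()  # reads .env from the current working directory
    missing = [name for name in REQUIRED_CLOUD_VARS if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing cloud-mode settings in .env: {missing}")

if __name__ == "__main__":
    check_cloud_env()
```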

| Service Startup Command | GPUs | LLM Runtime Backend | LLM model |
| --------------------------------------------------------------------------------------- | -----|--------------------------| -------------------------------- |
| ```bash ./run.sh -c cloud -i 0 -b default``` | 1 | OpenAI API | OpenAI API |
| ```bash ./run.sh -c local -i 0 -b default``` | 1 | FasterTransformer | Qwen-7B-QAnything |
| ```bash ./run.sh -c local -i 0 -b hf -m MiniChat-2-3B -t minichat``` | 1 | Huggingface Transformers | Public LLM (e.g., MiniChat-2-3B) |
| ```bash ./run.sh -c local -i 0 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.81``` | 1 | vllm | Public LLM (e.g., MiniChat-2-3B) |
| ```bash ./run.sh -c local -i 0,1 -b default``` | 2 | FasterTransformer | Qwen-7B-QAnything |
| ```bash ./run.sh -c local -i 0,1 -b hf -m MiniChat-2-3B -t minichat``` | 2 | Huggingface Transformers | Public LLM (e.g., MiniChat-2-3B) |
| ```bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.81``` | 2 | vllm | Public LLM (e.g., MiniChat-2-3B) |
| ```bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 2 -r 0.81``` | 2 | vllm | Public LLM (e.g., MiniChat-2-3B) |

```bash
Note: Choose the most suitable service startup command for your own device.
(1) With "-i 0,1", the local Embedding/Rerank services run on device gpu_id_1; otherwise gpu_id_0 is used by default.
(2) "-c cloud" uses local Embedding/Rerank plus the OpenAI LLM API and requires only about 4GB of VRAM (recommended for GPU devices with VRAM <= 8GB).
(3) When using the OpenAI LLM API, you will be prompted to enter {OPENAI_API_KEY, OPENAI_API_BASE, OPENAI_API_MODEL_NAME, OPENAI_API_CONTEXT_LENGTH} immediately.
(4) "-b hf" is the most compatible way to run public LLM inference, but also the slowest.
(5) When choosing a public chat LLM for QAnything, set a suitable **PROMPT_TEMPLATE** (/path/to/QAnything/qanything_kernel/configs/model_config.py) for the specific LLM; a template sketch follows this block.
```
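
For note (5), a hypothetical PROMPT_TEMPLATE variant that mirrors the structure of the existing retrieval template; the `{context}`/`{question}` placeholders are assumptions based on QUERY_PROMPT_TEMPLATE in model_config.py, so match them to the real template before use:

```python
# A hypothetical PROMPT_TEMPLATE variant for terse chat LLMs.
# The {context}/{question} placeholders are assumptions; match them to the
# placeholders the real template in model_config.py uses before swapping in.
PROMPT_TEMPLATE = """Answer strictly based on the reference information below.
If the references are irrelevant, say you do not know instead of guessing.

Reference information:
{context}

My question or instruction:
{question}

Your reply:"""
```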

## Supported Public LLM using FastChat API with Huggingface Transformers/vllm runtime backend

| model_name                                 | conv_template       | Supported Public LLM List                                                         |
|-------------------------------------------|---------------------|---------------------------------------------------------------------------------|
| Qwen-7B-QAnything | qwen-7b-qanything | [Qwen-7B-QAnything](https://huggingface.co/netease-youdao/Qwen-7B-QAnything) |
| Qwen-1.8B-Chat/Qwen-7B-Chat/Qwen-14B-Chat | qwen-7b-chat | [Qwen](https://huggingface.co/Qwen) |
| Baichuan2-7B-Chat/Baichuan2-13B-Chat | baichuan2-chat | [Baichuan2](https://huggingface.co/baichuan-inc) |
| MiniChat-2-3B | minichat | [MiniChat](https://huggingface.co/GeneZC/MiniChat-2-3B) |
| deepseek-llm-7b-chat | deepseek-chat | [Deepseek](https://huggingface.co/deepseek-ai/deepseek-llm-7b-chat) |
| Yi-6B-Chat | Yi-34b-chat | [Yi](https://huggingface.co/01-ai/Yi-6B-Chat) |
| chatglm3-6b | chatglm3 | [ChatGLM3](https://huggingface.co/THUDM/chatglm3-6b) |
| ... ```check or add conv_template for more LLMs in "/path/to/QAnything/third_party/FastChat/fastchat/conversation.py"``` |

### 1. Run QAnything using FastChat API with the **Huggingface transformers** runtime backend (recommended for GPU devices with VRAM <= 16GB).
```bash
## Step 1. Download the public LLM model (e.g., MiniChat-2-3B) and save to "/path/to/QAnything/assets/custom_models"
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/GeneZC/MiniChat-2-3B

## Step 2. Execute the service startup command.
## "-b hf" selects the Huggingface transformers backend, which loads the model in 8-bit but runs bf16 inference by default to save VRAM.
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b hf -m MiniChat-2-3B -t minichat

```
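
As a quick sanity check that the downloaded weights are usable, they can be loaded directly with transformers. This sketch mirrors the 8-bit default described above; it assumes the bitsandbytes package and a CUDA device, and it is not QAnything's own loading code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/QAnything/assets/custom_models/MiniChat-2-3B"

tokenizer = AutoTokenizer.from_pretrained(model_path)
# load_in_8bit=True needs bitsandbytes; drop it to load in full precision.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    load_in_8bit=True,
)

inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```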

### 2. Run QAnything using FastChat API with the **vllm** runtime backend (recommended for GPU devices with enough VRAM).

```bash
## Step 1. Download the public LLM model (e.g., MiniChat-2-3B) and save to "/path/to/QAnything/assets/custom_models"
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/GeneZC/MiniChat-2-3B

## Step 2. Execute the service startup command.
## Here "-b vllm" selects the vllm backend, which runs bf16 inference by default.
## Note: adjust gpu_memory_utilization to the model size to avoid running out of memory (gpu_memory_utilization=0.81 is the default for 7B models; here it is set to 0.5 with "-r 0.5").
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.5

## (Optional) Step 2. Start the service with "-i 0,1 -p 2" to run the vllm backend in tensor-parallel mode across 2 GPUs for faster inference.
## bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 2 -r 0.5

```
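
The `-p`/`-r` flags map to vLLM's `tensor_parallel_size` and `gpu_memory_utilization` engine arguments. A minimal offline sketch with the same values (the model path and prompt are illustrative):

```python
from vllm import LLM, SamplingParams

# Mirrors "-p 1 -r 0.5": single-GPU tensor parallelism, 50% of VRAM preallocated.
llm = LLM(
    model="/path/to/QAnything/assets/custom_models/MiniChat-2-3B",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.5,
)

outputs = llm.generate(["Hello, who are you?"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```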

## Tricks for saving GPU VRAM
```bash
## Trick 1. (Recommended for VRAM <= 12GB) Run the PaddleOCR server in CPU mode by setting **use_gpu=False** in '/path/to/QAnything/qanything_kernel/dependent_server/ocr_serve/ocr_server.py'
# Note that **use_gpu=False** must be set when using an RTX 1080Ti GPU; otherwise PaddleOCR always returns an **empty OCR result** with **use_gpu=True**.
ocr_engine = PaddleOCR(use_angle_cls=True, lang="ch", use_gpu=False, show_log=False)

## Trick 2. Try 1.8B/3B size LLM, such as Qwen-1.8B-Chat and MiniChat-2-3B.

## Trick 3. Limit the maximum context window by decreasing **token_window** and increasing **offcut_token** in:
# /path/to/QAnything/qanything_kernel/connector/llm/llm_for_fastchat.py
# /path/to/QAnything/qanything_kernel/connector/llm/llm_for_local.py

## Trick 4. Try INT4 weight-only quantization methods such as GPTQ/AWQ. Take care with the sampling parameters, considering the possible loss of accuracy.

```
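
For Trick 3, the two values trade context length for VRAM: a smaller prompt budget means a smaller KV cache. A schematic of the relationship (the variable names come from the trick; the semantics and arithmetic here are an assumption, so check the two connector files for the real logic):

```python
# Assumed meaning: token_window is the total context the LLM may see, and
# offcut_token is the slack trimmed from it before the prompt is built.
token_window = 4096   # decrease this to shrink the KV cache and save VRAM
offcut_token = 50     # increase this to trim more of the retrieved context

prompt_budget = token_window - offcut_token
print(f"Tokens left for retrieved context plus question: {prompt_budget}")
```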


## Coming Soon
<details><summary>Feature Request</summary>

- Support the one-api interface (https://github.com/songquanpeng/one-api) to add more commercial LLM APIs.
- Support more runtime backends, such as llama.cpp (https://github.com/ggerganov/llama.cpp) and sglang (https://github.com/sgl-project/sglang).
- ...

</details>
17 changes: 15 additions & 2 deletions qanything_kernel/configs/model_config.py
@@ -27,6 +27,9 @@
请根据上述参考信息回答我的问题或回复我的指令。前面的参考信息可能有用,也可能没用,你需要从我给出的参考信息中选出与我的问题最相关的那些,来为你的回答提供依据。回答一定要忠于原文,简洁但不丢信息,不要胡乱编造。我的问题或指令是什么语种,你就用什么语种回复,
你的回复:"""

# For LLM Chat w/o Retrieval context
# PROMPT_TEMPLATE = """{question}"""

QUERY_PROMPT_TEMPLATE = """{question}"""

# 缓存知识库数量
@@ -57,20 +60,30 @@

# MILVUS向量数据库地址
MILVUS_HOST_LOCAL = 'milvus-standalone-local'
MILVUS_HOST_ONLINE = '10.55.163.98' # gpu63
MILVUS_HOST_ONLINE = 'milvus-standalone-local'
MILVUS_PORT = 19530

MYSQL_HOST_LOCAL = 'mysql-container-local'
MYSQL_HOST_ONLINE = '10.55.163.98'
MYSQL_HOST_ONLINE = 'mysql-container-local'
MYSQL_PORT = 3306
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_DATABASE = 'qanything'

llm_api_serve_model = os.getenv('LLM_API_SERVE_MODEL')
llm_api_serve_port = os.getenv('LLM_API_SERVE_PORT')
rerank_port = os.getenv('RERANK_PORT')
embed_port = os.getenv('EMBED_PORT')

print("llm_api_serve_port:", llm_api_serve_port)
print("rerank_port:", rerank_port)
print("embed_port:", embed_port)


LOCAL_LLM_SERVICE_URL = f"localhost:{llm_api_serve_port}"
LOCAL_LLM_MODEL_NAME = llm_api_serve_model
LOCAL_LLM_MAX_LENGTH = 4096

LOCAL_RERANK_SERVICE_URL = f"localhost:{rerank_port}"
LOCAL_RERANK_MODEL_NAME = 'rerank'
LOCAL_RERANK_MAX_LENGTH = 512
13 changes: 11 additions & 2 deletions qanything_kernel/connector/llm/__init__.py
@@ -1,2 +1,11 @@
from .llm_for_online import OpenAILLM
from .llm_for_local import ZiyueLLM
import os
from dotenv import load_dotenv
from .llm_for_openai_api import OpenAILLM

load_dotenv()
RUNTIME_BACKEND = os.getenv("RUNTIME_BACKEND")

if RUNTIME_BACKEND == "default":
from .llm_for_local import ZiyueLLM
else: # hf/vllm
from .llm_for_fastchat import OpenAICustomLLM as ZiyueLLM
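
With this change, callers keep importing a single `ZiyueLLM` name and the backend is resolved at import time from `RUNTIME_BACKEND`. A hedged usage sketch; the no-argument constructor is an assumption, since the diff does not show the class signatures:

```python
import os

# RUNTIME_BACKEND is read when the package is first imported, so set it
# (or put it in .env) before importing the connector.
os.environ.setdefault("RUNTIME_BACKEND", "vllm")

from qanything_kernel.connector.llm import ZiyueLLM  # noqa: E402

llm = ZiyueLLM()  # hypothetical construction; check the connector class for real args
```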