Merge pull request #78 from netease-youdao/llm_dev
Support public LLMs and OpenAI API as an LLM service in QAnything
xixihahaliu authored Jan 28, 2024
2 parents c75f724 + 71812d0 commit 63b85f3
Showing 191 changed files with 27,522 additions and 186 deletions.
13 changes: 9 additions & 4 deletions README.md
@@ -163,7 +163,11 @@ If you need to use it for commercial purposes, please follow the license of Qwen
git clone https://github.com/netease-youdao/QAnything.git
```
### step2: Enter the project root directory and execute the startup script.
If you are in the Windows11 system: Need to enter the WSL environment.
* [📖 QAnything_Startup_Usage](docs/QAnything_Startup_Usage_README.md)
* Get detailed usage of the LLM interface with ```bash ./run.sh -h```


If you are on a Windows 11 system, you need to enter the **WSL** environment first.
```shell
cd QAnything
bash run.sh # Start on GPU 0 by default.
@@ -174,7 +178,7 @@ bash run.sh # Start on GPU 0 by default.

```shell
cd QAnything
bash run.sh 0 # gpu id 0
bash ./run.sh -c local -i 0 -b default # gpu id 0
```
</details>

@@ -183,7 +187,7 @@

```shell
cd QAnything
bash run.sh 0,1 # gpu ids: 0,1, Please confirm how many GPUs are available. Supports up to two cards for startup.
bash ./run.sh -c local -i 0,1 -b default # gpu ids: 0,1, Please confirm how many GPUs are available. Supports up to two cards for startup.
```
</details>

@@ -265,7 +269,8 @@ Reach out to the maintainer at one of the following places:
`QAnything` adopts dependencies from the following:
- Thanks to our [BCEmbedding](https://github.com/netease-youdao/BCEmbedding) for the excellent embedding and rerank model.
- Thanks to [Qwen](https://github.com/QwenLM/Qwen) for strong base language models.
- Thanks to [Triton Inference Server](https://github.com/triton-inference-server/server) for providing great open source inference serving.
- Thanks to [Triton Inference Server](https://github.com/triton-inference-server/server) and [vllm](https://github.com/vllm-project/vllm) for providing great open source inference serving.
- Thanks to [FastChat](https://github.com/lm-sys/FastChat) for providing a fully OpenAI-compatible API server.
- Thanks to [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) for highly optimized LLM inference backend.
- Thanks to [Langchain](https://github.com/langchain-ai/langchain) for the wonderful llm application framework.
- Thanks to [Langchain-Chatchat](https://github.com/chatchat-space/Langchain-Chatchat) for the inspiration provided on local knowledge base Q&A.
11 changes: 8 additions & 3 deletions README_zh.md
@@ -154,7 +154,10 @@ QAnything使用的检索组件[BCEmbedding](https://github.com/netease-youdao/BC
git clone https://github.com/netease-youdao/QAnything.git
```
### step2: 进入项目根目录执行启动脚本
如果在Windows系统下请先进入wsl环境
* [📖 QAnything_Startup_Usage](docs/QAnything_Startup_Usage_README.md)
* 执行 ```bash ./run.sh -h``` 获取详细的LLM服务配置方法

如果在Windows系统下请先进入**WSL**环境
```shell
cd QAnything
bash run.sh # 默认在0号GPU上启动
@@ -165,7 +168,7 @@ bash run.sh # 默认在0号GPU上启动

```shell
cd QAnything
bash run.sh 0 # 指定0号GPU启动 GPU编号从0开始 windows机器一般只有一张卡,所以只能指定0号GPU
bash ./run.sh -c local -i 0 -b default # 指定0号GPU启动 GPU编号从0开始 windows机器一般只有一张卡,所以只能指定0号GPU
```
</details>

@@ -174,7 +177,7 @@ bash run.sh 0 # 指定0号GPU启动 GPU编号从0开始 windows机器一般只

```shell
cd QAnything
bash run.sh 0,1 # 指定0,1号GPU启动,请确认有多张GPU可用,最多支持两张卡启动
bash ./run.sh -c local -i 0,1 -b default # 指定0,1号GPU启动,请确认有多张GPU可用,最多支持两张卡启动
```
</details>

@@ -248,6 +251,8 @@ [email protected]
- [BCEmbedding](https://github.com/netease-youdao/BCEmbedding)
- [Qwen](https://github.com/QwenLM/Qwen)
- [Triton Inference Server](https://github.com/triton-inference-server/server)
- [vllm](https://github.com/vllm-project/vllm)
- [FastChat](https://github.com/lm-sys/FastChat)
- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)
- [Langchain](https://github.com/langchain-ai/langchain)
- [Langchain-Chatchat](https://github.com/chatchat-space/Langchain-Chatchat)
Empty file added assets/custom_models/.gitignore
5 changes: 3 additions & 2 deletions docker-compose-linux.yaml
@@ -78,7 +78,7 @@ services:

qanything_local:
container_name: qanything-container-local
image: freeren/qanything:v1.0.9
image: freeren/qanything:v1.1.1
# runtime: nvidia
deploy:
resources:
@@ -87,11 +87,12 @@
- driver: nvidia
count: "all"
capabilities: ["gpu"]
command: /workspace/qanything_local/scripts/run_for_local.sh
command: /bin/bash -c 'if [ "${LLM_API}" = "local" ]; then /workspace/qanything_local/scripts/run_for_local_option.sh -c $LLM_API -i $DEVICE_ID -b $RUNTIME_BACKEND -m $MODEL_NAME -t $CONV_TEMPLATE -p $TP -r $GPU_MEM_UTILI; else /workspace/qanything_local/scripts/run_for_cloud_option.sh -c $LLM_API -i $DEVICE_ID -b $RUNTIME_BACKEND; fi; while true; do sleep 5; done'
privileged: true
shm_size: '8gb'
volumes:
- ${DOCKER_VOLUME_DIRECTORY:-.}/models:/model_repos/QAEnsemble
- ${DOCKER_VOLUME_DIRECTORY:-.}/assets/custom_models:/model_repos/CustomLLM
- ${DOCKER_VOLUME_DIRECTORY:-.}/:/workspace/qanything_local/
ports:
- "5052:5052"
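For context, the new `command:` above selects between `run_for_local_option.sh` and `run_for_cloud_option.sh` based on environment variables that docker-compose substitutes, presumably exported by `run.sh` or read from a root-level `.env` file. A hedged sketch of those variables follows — the names are taken from the command itself, while the values are purely illustrative, not project defaults:

```bash
# Sketch only — illustrative values for the variables referenced by the compose command above.
LLM_API=local              # or "cloud" to route to run_for_cloud_option.sh
DEVICE_ID=0                # GPU id(s), e.g. "0" or "0,1"
RUNTIME_BACKEND=default    # default | hf | vllm
MODEL_NAME=MiniChat-2-3B   # public LLM name (used by the hf/vllm backends)
CONV_TEMPLATE=minichat     # FastChat conversation template for that model
TP=1                       # tensor parallel size for vllm
GPU_MEM_UTILI=0.81         # gpu_memory_utilization for vllm
```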
5 changes: 3 additions & 2 deletions docker-compose-windows.yaml
@@ -78,7 +78,7 @@ services:

qanything_local:
container_name: qanything-container-local
image: freeren/qanything-win:v1.0.9
image: freeren/qanything-win:v1.1.1
# runtime: nvidia
deploy:
resources:
@@ -87,11 +87,12 @@
- driver: nvidia
count: "all"
capabilities: ["gpu"]
command: /workspace/qanything_local/scripts/run_for_local.sh
command: sh -c 'if [ "${LLM_API}" = "local" ]; then /workspace/qanything_local/scripts/run_for_local_option.sh -c $LLM_API -i $DEVICE_ID -b $RUNTIME_BACKEND -m $MODEL_NAME -t $CONV_TEMPLATE -p $TP -r $GPU_MEM_UTILI; else /workspace/qanything_local/scripts/run_for_cloud_option.sh -c $LLM_API -i $DEVICE_ID -b $RUNTIME_BACKEND; fi; while true; do sleep 5; done'
privileged: true
shm_size: '8gb'
volumes:
- ${DOCKER_VOLUME_DIRECTORY:-.}/models:/model_repos/QAEnsemble
- ${DOCKER_VOLUME_DIRECTORY:-.}/assets/custom_models:/model_repos/CustomLLM
- ${DOCKER_VOLUME_DIRECTORY:-.}/:/workspace/qanything_local/
ports:
- "5052:5052"
114 changes: 114 additions & 0 deletions docs/QAnything_Startup_Usage_README.md
@@ -0,0 +1,114 @@


## Table of Contents

- [QAnything Service Startup Command Usage](#qanything-service-startup-command-usage)
- [Supported Public LLM using FastChat API](#supported-public-llm-using-fastchat-api-with-huggingface-transformersvllm-runtime-backend)
- [Tricks for saving GPU VRAM](#tricks-for-saving-gpu-vram)
- [Coming Soon](#coming-soon)


## QAnything Service Startup Command Usage

```bash
Usage: bash run.sh [-c <llm_api>] [-i <device_id>] [-b <runtime_backend>] [-m <model_name>] [-t <conv_template>] [-p <tensor_parallel>] [-r <gpu_memory_utilization>]

-c <llm_api>: "Options {local, cloud} to specify the LLM API mode; default is 'local'. If set to '-c cloud', please manually set the environment variables {OPENAI_API_KEY, OPENAI_API_BASE, OPENAI_API_MODEL_NAME, OPENAI_API_CONTEXT_LENGTH} in .env first."
-i <device_id>: "Specify the GPU device_id."
-b <runtime_backend>: "Specify the LLM inference runtime backend, options={default, hf, vllm}"
-m <model_name>: "Specify the name of the public LLM model to load via the FastChat serve API, options={Qwen-7B-Chat, deepseek-llm-7b-chat, ...}"
-t <conv_template>: "Specify the conversation template matching the public LLM model when using the FastChat serve API, options={qwen-7b-chat, deepseek-chat, ...}"
-p <tensor_parallel>: "Use options {1, 2} to set the tensor parallel size for the vllm backend when using the FastChat serve API, default tensor_parallel=1"
-r <gpu_memory_utilization>: "Specify gpu_memory_utilization in (0,1] for the vllm backend when using the FastChat serve API, default gpu_memory_utilization=0.81"
-h: "Display help usage message"
```

| Service Startup Command | GPUs | LLM Runtime Backend | LLM model |
| --------------------------------------------------------------------------------------- | -----|--------------------------| -------------------------------- |
| ```bash ./run.sh -c cloud -i 0 -b default``` | 1 | OpenAI API | OpenAI API |
| ```bash ./run.sh -c local -i 0 -b default``` | 1 | FasterTransformer | Qwen-7B-QAnything |
| ```bash ./run.sh -c local -i 0 -b hf -m MiniChat-2-3B -t minichat``` | 1 | Huggingface Transformers | Public LLM (e.g., MiniChat-2-3B) |
| ```bash ./run.sh -c local -i 0 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.81``` | 1 | vllm | Public LLM (e.g., MiniChat-2-3B) |
| ```bash ./run.sh -c local -i 0,1 -b default``` | 2 | FasterTransformer | Qwen-7B-QAnything |
| ```bash ./run.sh -c local -i 0,1 -b hf -m MiniChat-2-3B -t minichat``` | 2 | Huggingface Transformers | Public LLM (e.g., MiniChat-2-3B) |
| ```bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.81``` | 2 | vllm | Public LLM (e.g., MiniChat-2-3B) |
| ```bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 2 -r 0.81``` | 2 | vllm | Public LLM (e.g., MiniChat-2-3B) |

```bash
Note: You can choose the most suitable Service Startup Command based on your own device conditions.
(1) Local Embedding/Rerank will run on device gpu_id_1 when "-i 0,1" is set; otherwise gpu_id_0 is used by default.
(2) Setting "-c cloud" uses local Embedding/Rerank together with the OpenAI LLM API, which only requires about 4GB VRAM (recommended for GPU devices with VRAM <= 8GB).
(3) When you use the OpenAI LLM API, you will be prompted to enter {OPENAI_API_KEY, OPENAI_API_BASE, OPENAI_API_MODEL_NAME, OPENAI_API_CONTEXT_LENGTH} immediately.
(4) "-b hf" is the recommended way to run public LLM inference thanks to its broad compatibility, but it has the lowest performance.
(5) When you choose a public Chat LLM for the QAnything system, adjust the **PROMPT_TEMPLATE** setting (/path/to/QAnything/qanything_kernel/configs/model_config.py) to suit the chosen LLM model.
```
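For cloud mode, the four variables named in the `-c` option description and note (3) are read from the project's `.env` file. A minimal sketch, assuming placeholder values — the key names come from the usage text above, every value below is illustrative and must be replaced with your own:

```bash
# .env (sketch only — placeholder values, not defaults shipped with QAnything)
OPENAI_API_KEY=sk-...                       # your own API key
OPENAI_API_BASE=https://api.openai.com/v1   # or any OpenAI-compatible endpoint
OPENAI_API_MODEL_NAME=gpt-3.5-turbo         # assumed example model name
OPENAI_API_CONTEXT_LENGTH=4096              # assumed example context length
```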

## Supported Public LLM using FastChat API with Huggingface Transformers/vllm runtime backend

| model_name | conv_template | Supported Public LLM List |
|-------------------------------------------|---------------------|---------------------------------------------------------------------------------|
| Qwen-7B-QAnything | qwen-7b-qanything | [Qwen-7B-QAnything](https://huggingface.co/netease-youdao/Qwen-7B-QAnything) |
| Qwen-1.8B-Chat/Qwen-7B-Chat/Qwen-14B-Chat | qwen-7b-chat | [Qwen](https://huggingface.co/Qwen) |
| Baichuan2-7B-Chat/Baichuan2-13B-Chat | baichuan2-chat | [Baichuan2](https://huggingface.co/baichuan-inc) |
| MiniChat-2-3B | minichat | [MiniChat](https://huggingface.co/GeneZC/MiniChat-2-3B) |
| deepseek-llm-7b-chat | deepseek-chat | [Deepseek](https://huggingface.co/deepseek-ai/deepseek-llm-7b-chat) |
| Yi-6B-Chat | Yi-34b-chat | [Yi](https://huggingface.co/01-ai/Yi-6B-Chat) |
| chatglm3-6b | chatglm3 | [ChatGLM3](https://huggingface.co/THUDM/chatglm3-6b) |
| ... ```check or add conv_template for more LLMs in "/path/to/QAnything/third_party/FastChat/fastchat/conversation.py"``` |
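As an illustration of the table above (an assumption that follows the same download-then-start pattern as the MiniChat walkthrough below), Qwen-7B-Chat would be served with its matching `qwen-7b-chat` conversation template:

```bash
## Sketch: download a public LLM listed in the table into assets/custom_models ...
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/Qwen/Qwen-7B-Chat

## ... then start QAnything with the model name and conv_template taken from the table.
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b hf -m Qwen-7B-Chat -t qwen-7b-chat
```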

### 1. Run QAnything using FastChat API with **Huggingface transformers** runtime backend (recommended for GPU devices with VRAM <= 16GB).
```bash
## Step 1. Download the public LLM model (e.g., MiniChat-2-3B) and save to "/path/to/QAnything/assets/custom_models"
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/GeneZC/MiniChat-2-3B

## Step 2. Execute the service startup command.
## Here we use "-b hf" to specify the Huggingface transformers backend, which loads the model in 8-bit but does bf16 inference by default to save VRAM.
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b hf -m MiniChat-2-3B -t minichat

```

### 2. Run QAnything using FastChat API with **vllm** runtime backend (recommended for GPU devices with sufficient VRAM).

```bash
## Step 1. Download the public LLM model (e.g., MiniChat-2-3B) and save to "/path/to/QAnything/assets/custom_models"
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/GeneZC/MiniChat-2-3B

## Step 2. Execute the service startup command.
## Here we use "-b vllm" to specify the vllm backend, which does bf16 inference by default.
## Note: adjust gpu_memory_utilization according to the model size to avoid out-of-memory errors (gpu_memory_utilization=0.81 is the default for 7B models; here it is set to 0.5 via "-r 0.5").
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.5

## (Optional) Step 2. Execute the service startup command with "-i 0,1 -p 2" to run the vllm backend in tensor parallel mode across 2 GPUs for faster inference.
## bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 2 -r 0.5

```

## Tricks for saving GPU VRAM
```bash
## Trick 1. (Recommended for VRAM <= 12GB) Run the PaddleOCR server in CPU mode by setting **use_gpu=False** in '/path/to/QAnything/qanything_kernel/dependent_server/ocr_serve/ocr_server.py'.
# Note that **use_gpu=False** must be set when using an RTX 1080Ti GPU; with **use_gpu=True**, PaddleOCR always returns an **empty ocr result** on that card.
ocr_engine = PaddleOCR(use_angle_cls=True, lang="ch", use_gpu=False, show_log=False)

## Trick 2. Try 1.8B/3B size LLM, such as Qwen-1.8B-Chat and MiniChat-2-3B.

## Trick 3. Try to limit the max length of the context window by decreasing the value of **token_window** and increasing that of **offcut_token** in the following files:
# /path/to/QAnything/qanything_kernel/connector/llm/llm_for_fastchat.py
# /path/to/QAnything/qanything_kernel/connector/llm/llm_for_local.py

## Trick 4. Try INT4 weight-only quantization methods such as GPTQ/AWQ, but take care with the sampling parameters given the possible loss of accuracy.

```


## Coming Soon
<details><summary>Feature Request</summary>

- Support the one-api interface (https://github.com/songquanpeng/one-api) to integrate more commercial LLM APIs.
- Support more runtime backends, such as llama.cpp (https://github.com/ggerganov/llama.cpp) and sglang (https://github.com/sgl-project/sglang).
- ...

</details>
17 changes: 15 additions & 2 deletions qanything_kernel/configs/model_config.py
@@ -27,6 +27,9 @@
请根据上述参考信息回答我的问题或回复我的指令。前面的参考信息可能有用,也可能没用,你需要从我给出的参考信息中选出与我的问题最相关的那些,来为你的回答提供依据。回答一定要忠于原文,简洁但不丢信息,不要胡乱编造。我的问题或指令是什么语种,你就用什么语种回复,
你的回复:"""

# For LLM Chat w/o Retrieval context
# PROMPT_TEMPLATE = """{question}"""

QUERY_PROMPT_TEMPLATE = """{question}"""

# 缓存知识库数量
@@ -57,20 +60,30 @@

# MILVUS向量数据库地址
MILVUS_HOST_LOCAL = 'milvus-standalone-local'
MILVUS_HOST_ONLINE = '10.55.163.98' # gpu63
MILVUS_HOST_ONLINE = 'milvus-standalone-local'
MILVUS_PORT = 19530

MYSQL_HOST_LOCAL = 'mysql-container-local'
MYSQL_HOST_ONLINE = '10.55.163.98'
MYSQL_HOST_ONLINE = 'mysql-container-local'
MYSQL_PORT = 3306
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_DATABASE = 'qanything'

llm_api_serve_model = os.getenv('LLM_API_SERVE_MODEL')
llm_api_serve_port = os.getenv('LLM_API_SERVE_PORT')
rerank_port = os.getenv('RERANK_PORT')
embed_port = os.getenv('EMBED_PORT')

print("llm_api_serve_port:", llm_api_serve_port)
print("rerank_port:", rerank_port)
print("embed_port:", embed_port)


LOCAL_LLM_SERVICE_URL = f"localhost:{llm_api_serve_port}"
LOCAL_LLM_MODEL_NAME = llm_api_serve_model
LOCAL_LLM_MAX_LENGTH = 4096

LOCAL_RERANK_SERVICE_URL = f"localhost:{rerank_port}"
LOCAL_RERANK_MODEL_NAME = 'rerank'
LOCAL_RERANK_MAX_LENGTH = 512
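As shown in the diff above, `model_config.py` now builds its local service URLs from environment variables instead of hard-coded hosts. A hedged sketch of the variables it expects — the names come from the diff, while the values below are placeholders, not documented defaults:

```bash
# Sketch only — assumed to be exported by the startup scripts before the kernel starts.
export LLM_API_SERVE_MODEL=Qwen-7B-QAnything   # model name served by the local LLM API
export LLM_API_SERVE_PORT=7802                 # placeholder port
export RERANK_PORT=8001                        # placeholder port
export EMBED_PORT=9001                         # placeholder port
```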
13 changes: 11 additions & 2 deletions qanything_kernel/connector/llm/__init__.py
@@ -1,2 +1,11 @@
from .llm_for_online import OpenAILLM
from .llm_for_local import ZiyueLLM
import os
from dotenv import load_dotenv
from .llm_for_openai_api import OpenAILLM

load_dotenv()
RUNTIME_BACKEND = os.getenv("RUNTIME_BACKEND")

if RUNTIME_BACKEND == "default":
from .llm_for_local import ZiyueLLM
else: # hf/vllm
from .llm_for_fastchat import OpenAICustomLLM as ZiyueLLM
