Support public LLMs and OpenAI API as an LLM service in QAnything #78

Merged 10 commits on Jan 28, 2024
13 changes: 9 additions & 4 deletions README.md
@@ -163,7 +163,11 @@ If you need to use it for commercial purposes, please follow the license of Qwen
git clone https://github.com/netease-youdao/QAnything.git
```
### step2: Enter the project root directory and execute the startup script.
If you are on Windows 11, you need to enter the WSL environment.
* [📖 QAnything_Startup_Usage](docs/QAnything_Startup_Usage_README.md)
* Get detailed usage of the LLM interface with ```bash ./run.sh -h```


If you are on Windows 11, you need to enter the **WSL** environment.
```shell
cd QAnything
bash run.sh # Start on GPU 0 by default.
@@ -174,7 +178,7 @@ bash run.sh # Start on GPU 0 by default.

```shell
cd QAnything
bash run.sh 0 # gpu id 0
bash ./run.sh -c local -i 0 -b default # gpu id 0
```
</details>

@@ -183,7 +187,7 @@

```shell
cd QAnything
bash run.sh 0,1 # gpu ids: 0,1. Please confirm how many GPUs are available; at most two cards are supported for startup.
bash ./run.sh -c local -i 0,1 -b default # gpu ids: 0,1. Please confirm how many GPUs are available; at most two cards are supported for startup.
```
</details>

@@ -265,7 +269,8 @@ Reach out to the maintainer at one of the following places:
`QAnything` adopts dependencies from the following:
- Thanks to our [BCEmbedding](https://github.com/netease-youdao/BCEmbedding) for the excellent embedding and rerank model.
- Thanks to [Qwen](https://github.com/QwenLM/Qwen) for strong base language models.
- Thanks to [Triton Inference Server](https://github.com/triton-inference-server/server) for providing great open source inference serving.
- Thanks to [Triton Inference Server](https://github.com/triton-inference-server/server) and [vllm](https://github.com/vllm-project/vllm) for providing excellent open-source inference serving.
- Thanks to [FastChat](https://github.com/lm-sys/FastChat) for providing a fully OpenAI-compatible API server.
- Thanks to [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) for the highly optimized LLM inference backend.
- Thanks to [Langchain](https://github.com/langchain-ai/langchain) for the wonderful LLM application framework.
- Thanks to [Langchain-Chatchat](https://github.com/chatchat-space/Langchain-Chatchat) for the inspiration provided on local knowledge base Q&A.
11 changes: 8 additions & 3 deletions README_zh.md
@@ -154,7 +154,10 @@ QAnything's retrieval component [BCEmbedding](https://github.com/netease-youdao/BC
git clone https://github.com/netease-youdao/QAnything.git
```
### step2: 进入项目根目录执行启动脚本
If you are on Windows, first enter the wsl environment
* [📖 QAnything_Startup_Usage](docs/QAnything_Startup_Usage_README.md)
* Run ```bash ./run.sh -h``` to get detailed LLM service configuration options

If you are on Windows, first enter the **WSL** environment
```shell
cd QAnything
bash run.sh # Start on GPU 0 by default
@@ -165,7 +168,7 @@ bash run.sh # Start on GPU 0 by default

```shell
cd QAnything
bash run.sh 0 # Start on GPU 0; GPU ids start from 0; Windows machines usually have only one card, so only GPU 0 can be specified
bash ./run.sh -c local -i 0 -b default # Start on GPU 0; GPU ids start from 0; Windows machines usually have only one card, so only GPU 0 can be specified
```
</details>

@@ -174,7 +177,7 @@ bash run.sh 0 # Start on GPU 0; GPU ids start from 0; Windows machines usually

```shell
cd QAnything
bash run.sh 0,1 # Start on GPUs 0,1; please confirm multiple GPUs are available; at most two cards are supported
bash ./run.sh -c local -i 0,1 -b default # Start on GPUs 0,1; please confirm multiple GPUs are available; at most two cards are supported
```
</details>

@@ -248,6 +251,8 @@ [email protected]
- [BCEmbedding](https://github.com/netease-youdao/BCEmbedding)
- [Qwen](https://github.com/QwenLM/Qwen)
- [Triton Inference Server](https://github.com/triton-inference-server/server)
- [vllm](https://github.com/vllm-project/vllm)
- [FastChat](https://github.com/lm-sys/FastChat)
- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)
- [Langchain](https://github.com/langchain-ai/langchain)
- [Langchain-Chatchat](https://github.com/chatchat-space/Langchain-Chatchat)
Empty file added assets/custom_models/.gitignore
Empty file.
5 changes: 3 additions & 2 deletions docker-compose-linux.yaml
@@ -78,7 +78,7 @@ services:

qanything_local:
container_name: qanything-container-local
image: freeren/qanything:v1.0.9
image: freeren/qanything:v1.1.1
# runtime: nvidia
deploy:
resources:
@@ -87,11 +87,12 @@
- driver: nvidia
count: "all"
capabilities: ["gpu"]
command: /workspace/qanything_local/scripts/run_for_local.sh
command: /bin/bash -c 'if [ "${LLM_API}" = "local" ]; then /workspace/qanything_local/scripts/run_for_local_option.sh -c $LLM_API -i $DEVICE_ID -b $RUNTIME_BACKEND -m $MODEL_NAME -t $CONV_TEMPLATE -p $TP -r $GPU_MEM_UTILI; else /workspace/qanything_local/scripts/run_for_cloud_option.sh -c $LLM_API -i $DEVICE_ID -b $RUNTIME_BACKEND; fi; while true; do sleep 5; done'
privileged: true
shm_size: '8gb'
volumes:
- ${DOCKER_VOLUME_DIRECTORY:-.}/models:/model_repos/QAEnsemble
- ${DOCKER_VOLUME_DIRECTORY:-.}/assets/custom_models:/model_repos/CustomLLM
- ${DOCKER_VOLUME_DIRECTORY:-.}/:/workspace/qanything_local/
ports:
- "5052:5052"
5 changes: 3 additions & 2 deletions docker-compose-windows.yaml
@@ -78,7 +78,7 @@ services:

qanything_local:
container_name: qanything-container-local
image: freeren/qanything-win:v1.0.9
image: freeren/qanything-win:v1.1.1
# runtime: nvidia
deploy:
resources:
@@ -87,11 +87,12 @@
- driver: nvidia
count: "all"
capabilities: ["gpu"]
command: /workspace/qanything_local/scripts/run_for_local.sh
command: sh -c 'if [ "${LLM_API}" = "local" ]; then /workspace/qanything_local/scripts/run_for_local_option.sh -c $LLM_API -i $DEVICE_ID -b $RUNTIME_BACKEND -m $MODEL_NAME -t $CONV_TEMPLATE -p $TP -r $GPU_MEM_UTILI; else /workspace/qanything_local/scripts/run_for_cloud_option.sh -c $LLM_API -i $DEVICE_ID -b $RUNTIME_BACKEND; fi; while true; do sleep 5; done'
privileged: true
shm_size: '8gb'
volumes:
- ${DOCKER_VOLUME_DIRECTORY:-.}/models:/model_repos/QAEnsemble
- ${DOCKER_VOLUME_DIRECTORY:-.}/assets/custom_models:/model_repos/CustomLLM
- ${DOCKER_VOLUME_DIRECTORY:-.}/:/workspace/qanything_local/
ports:
- "5052:5052"
114 changes: 114 additions & 0 deletions docs/QAnything_Startup_Usage_README.md
@@ -0,0 +1,114 @@


## Table of Contents

- [QAnything Service Startup Command Usage](#QAnything-Service-Startup-Command-Usage)
- [Supported Public LLM using FastChat API](#Supported-Public-LLM-using-FastChat-API-with-Huggingface-Transformers/vllm-runtime-backend)
- [Tricks for saving GPU VRAM](#Tricks-for-saving-GPU-VRAM)
- [Coming Soon](#Coming-Soon)


## QAnything Service Startup Command Usage

```bash
Usage: bash run.sh [-c <llm_api>] [-i <device_id>] [-b <runtime_backend>] [-m <model_name>] [-t <conv_template>] [-p <tensor_parallel>] [-r <gpu_memory_utilization>]

-c <llm_api>: "Options {local, cloud} to specify the LLM API mode, default is 'local'. If set to '-c cloud', please manually set the environment variables {OPENAI_API_KEY, OPENAI_API_BASE, OPENAI_API_MODEL_NAME, OPENAI_API_CONTEXT_LENGTH} in .env first."
-i <device_id>: "Specify the GPU device_id."
-b <runtime_backend>: "Specify the LLM inference runtime backend, options={default, hf, vllm}"
-m <model_name>: "Specify the model name of the public LLM to load via the FastChat serve API, options={Qwen-7B-Chat, deepseek-llm-7b-chat, ...}"
-t <conv_template>: "Specify the conversation template matching the public LLM model when using the FastChat serve API, options={qwen-7b-chat, deepseek-chat, ...}"
-p <tensor_parallel>: "Set the tensor parallel size {1, 2} for the vllm backend when using the FastChat serve API, default tensor_parallel=1"
-r <gpu_memory_utilization>: "Specify gpu_memory_utilization in (0,1] for the vllm backend when using the FastChat serve API, default gpu_memory_utilization=0.81"
-h: "Display help usage message"
```
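
For `-c cloud`, the four OpenAI variables must already be in `.env` before startup. A minimal preflight sketch under that assumption (the `check_cloud_env` helper is illustrative, not part of QAnything):

```python
import os
from dotenv import load_dotenv  # python-dotenv, already used by QAnything

# Variable names taken from the usage text above.
REQUIRED_CLOUD_VARS = [
    "OPENAI_API_KEY",
    "OPENAI_API_BASE",
    "OPENAI_API_MODEL_NAME",
    "OPENAI_API_CONTEXT_LENGTH",
]

def check_cloud_env() -> None:
    """Fail fast if a cloud-mode setting is missing from .env."""
    load_dotenv()  # reads .env from the current working directory
    missing = [name for name in REQUIRED_CLOUD_VARS if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing cloud-mode settings in .env: {missing}")

if __name__ == "__main__":
    check_cloud_env()
```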

| Service Startup Command | GPUs | LLM Runtime Backend | LLM model |
| --------------------------------------------------------------------------------------- | -----|--------------------------| -------------------------------- |
| ```bash ./run.sh -c cloud -i 0 -b default``` | 1 | OpenAI API | OpenAI API |
| ```bash ./run.sh -c local -i 0 -b default``` | 1 | FasterTransformer | Qwen-7B-QAnything |
| ```bash ./run.sh -c local -i 0 -b hf -m MiniChat-2-3B -t minichat``` | 1 | Huggingface Transformers | Public LLM (e.g., MiniChat-2-3B) |
| ```bash ./run.sh -c local -i 0 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.81``` | 1 | vllm | Public LLM (e.g., MiniChat-2-3B) |
| ```bash ./run.sh -c local -i 0,1 -b default``` | 2 | FasterTransformer | Qwen-7B-QAnything |
| ```bash ./run.sh -c local -i 0,1 -b hf -m MiniChat-2-3B -t minichat``` | 2 | Huggingface Transformers | Public LLM (e.g., MiniChat-2-3B) |
| ```bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.81``` | 2 | vllm | Public LLM (e.g., MiniChat-2-3B) |
| ```bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 2 -r 0.81``` | 2 | vllm | Public LLM (e.g., MiniChat-2-3B) |

```bash
Note: Choose the most suitable service startup command for your own device.
(1) With "-i 0,1", the local Embedding/Rerank services run on device gpu_id_1; otherwise gpu_id_0 is used by default.
(2) "-c cloud" uses local Embedding/Rerank plus the OpenAI LLM API and requires only about 4GB of VRAM (recommended for GPU devices with VRAM <= 8GB).
(3) When using the OpenAI LLM API, you will be prompted to enter {OPENAI_API_KEY, OPENAI_API_BASE, OPENAI_API_MODEL_NAME, OPENAI_API_CONTEXT_LENGTH} immediately.
(4) "-b hf" is the most compatible way to run public LLM inference, but also the slowest.
(5) When choosing a public chat LLM for QAnything, set a suitable **PROMPT_TEMPLATE** (/path/to/QAnything/qanything_kernel/configs/model_config.py) for the specific LLM; a template sketch follows this block.
```
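
For note (5), a hypothetical PROMPT_TEMPLATE variant that mirrors the structure of the existing retrieval template; the `{context}`/`{question}` placeholders are assumptions based on QUERY_PROMPT_TEMPLATE in model_config.py, so match them to the real template before use:

```python
# A hypothetical PROMPT_TEMPLATE variant for terse chat LLMs.
# The {context}/{question} placeholders are assumptions; match them to the
# placeholders the real template in model_config.py uses before swapping in.
PROMPT_TEMPLATE = """Answer strictly based on the reference information below.
If the references are irrelevant, say you do not know instead of guessing.

Reference information:
{context}

My question or instruction:
{question}

Your reply:"""
```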

## Supported Public LLM using FastChat API with Huggingface Transformers/vllm runtime backend

| model_name                                 | conv_template       | Supported Public LLM List                                                         |
|-------------------------------------------|---------------------|---------------------------------------------------------------------------------|
| Qwen-7B-QAnything | qwen-7b-qanything | [Qwen-7B-QAnything](https://huggingface.co/netease-youdao/Qwen-7B-QAnything) |
| Qwen-1.8B-Chat/Qwen-7B-Chat/Qwen-14B-Chat | qwen-7b-chat | [Qwen](https://huggingface.co/Qwen) |
| Baichuan2-7B-Chat/Baichuan2-13B-Chat | baichuan2-chat | [Baichuan2](https://huggingface.co/baichuan-inc) |
| MiniChat-2-3B | minichat | [MiniChat](https://huggingface.co/GeneZC/MiniChat-2-3B) |
| deepseek-llm-7b-chat | deepseek-chat | [Deepseek](https://huggingface.co/deepseek-ai/deepseek-llm-7b-chat) |
| Yi-6B-Chat | Yi-34b-chat | [Yi](https://huggingface.co/01-ai/Yi-6B-Chat) |
| chatglm3-6b | chatglm3 | [ChatGLM3](https://huggingface.co/THUDM/chatglm3-6b) |
| ... ```check or add conv_template for more LLMs in "/path/to/QAnything/third_party/FastChat/fastchat/conversation.py"``` |

### 1. Run QAnything using FastChat API with the **Huggingface transformers** runtime backend (recommended for GPU devices with VRAM <= 16GB).
```bash
## Step 1. Download the public LLM model (e.g., MiniChat-2-3B) and save to "/path/to/QAnything/assets/custom_models"
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/GeneZC/MiniChat-2-3B

## Step 2. Execute the service startup command.
## "-b hf" selects the Huggingface transformers backend, which loads the model in 8-bit but runs bf16 inference by default to save VRAM.
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b hf -m MiniChat-2-3B -t minichat

```
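
As a quick sanity check that the downloaded weights are usable, they can be loaded directly with transformers. This sketch mirrors the 8-bit default described above; it assumes the bitsandbytes package and a CUDA device, and it is not QAnything's own loading code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/QAnything/assets/custom_models/MiniChat-2-3B"

tokenizer = AutoTokenizer.from_pretrained(model_path)
# load_in_8bit=True needs bitsandbytes; drop it to load in full precision.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    load_in_8bit=True,
)

inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```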

### 2. Run QAnything using FastChat API with the **vllm** runtime backend (recommended for GPU devices with enough VRAM).

```bash
## Step 1. Download the public LLM model (e.g., MiniChat-2-3B) and save to "/path/to/QAnything/assets/custom_models"
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/GeneZC/MiniChat-2-3B

## Step 2. Execute the service startup command.
## Here "-b vllm" selects the vllm backend, which runs bf16 inference by default.
## Note: adjust gpu_memory_utilization to the model size to avoid running out of memory (gpu_memory_utilization=0.81 is the default for 7B models; here it is set to 0.5 with "-r 0.5").
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.5

## (Optional) Step 2. Start the service with "-i 0,1 -p 2" to run the vllm backend in tensor-parallel mode across 2 GPUs for faster inference.
## bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 2 -r 0.5

```
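
The `-p`/`-r` flags map to vLLM's `tensor_parallel_size` and `gpu_memory_utilization` engine arguments. A minimal offline sketch with the same values (the model path and prompt are illustrative):

```python
from vllm import LLM, SamplingParams

# Mirrors "-p 1 -r 0.5": single-GPU tensor parallelism, 50% of VRAM preallocated.
llm = LLM(
    model="/path/to/QAnything/assets/custom_models/MiniChat-2-3B",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.5,
)

outputs = llm.generate(["Hello, who are you?"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```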

## Tricks for saving GPU VRAM
```bash
## Trick 1. (Recommended for VRAM <= 12GB) Run the PaddleOCR server in CPU mode by setting **use_gpu=False** in '/path/to/QAnything/qanything_kernel/dependent_server/ocr_serve/ocr_server.py'
# Note that **use_gpu=False** must be set when using an RTX 1080Ti GPU; otherwise PaddleOCR always returns an **empty OCR result** with **use_gpu=True**.
ocr_engine = PaddleOCR(use_angle_cls=True, lang="ch", use_gpu=False, show_log=False)

## Trick 2. Try 1.8B/3B size LLM, such as Qwen-1.8B-Chat and MiniChat-2-3B.

## Trick 3. Limit the maximum context window by decreasing **token_window** and increasing **offcut_token** in:
# /path/to/QAnything/qanything_kernel/connector/llm/llm_for_fastchat.py
# /path/to/QAnything/qanything_kernel/connector/llm/llm_for_local.py

## Trick 4. Try INT4 weight-only quantization methods such as GPTQ/AWQ. Take care with the sampling parameters, considering the possible loss of accuracy.

```
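
For Trick 3, the two values trade context length for VRAM: a smaller prompt budget means a smaller KV cache. A schematic of the relationship (the variable names come from the trick; the semantics and arithmetic here are an assumption, so check the two connector files for the real logic):

```python
# Assumed meaning: token_window is the total context the LLM may see, and
# offcut_token is the slack trimmed from it before the prompt is built.
token_window = 4096   # decrease this to shrink the KV cache and save VRAM
offcut_token = 50     # increase this to trim more of the retrieved context

prompt_budget = token_window - offcut_token
print(f"Tokens left for retrieved context plus question: {prompt_budget}")
```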


## Coming Soon
<details><summary>Feature Request</summary>

- Support the one-api interface (https://github.com/songquanpeng/one-api) to add more commercial LLM APIs.
- Support more runtime backends, such as llama.cpp (https://github.com/ggerganov/llama.cpp) and sglang (https://github.com/sgl-project/sglang).
- ...

</details>
17 changes: 15 additions & 2 deletions qanything_kernel/configs/model_config.py
@@ -27,6 +27,9 @@
请根据上述参考信息回答我的问题或回复我的指令。前面的参考信息可能有用,也可能没用,你需要从我给出的参考信息中选出与我的问题最相关的那些,来为你的回答提供依据。回答一定要忠于原文,简洁但不丢信息,不要胡乱编造。我的问题或指令是什么语种,你就用什么语种回复,
你的回复:"""

# For LLM Chat w/o Retrieval context
# PROMPT_TEMPLATE = """{question}"""

QUERY_PROMPT_TEMPLATE = """{question}"""

# 缓存知识库数量
@@ -57,20 +60,30 @@

# MILVUS向量数据库地址
MILVUS_HOST_LOCAL = 'milvus-standalone-local'
MILVUS_HOST_ONLINE = '10.55.163.98' # gpu63
MILVUS_HOST_ONLINE = 'milvus-standalone-local'
MILVUS_PORT = 19530

MYSQL_HOST_LOCAL = 'mysql-container-local'
MYSQL_HOST_ONLINE = '10.55.163.98'
MYSQL_HOST_ONLINE = 'mysql-container-local'
MYSQL_PORT = 3306
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_DATABASE = 'qanything'

llm_api_serve_model = os.getenv('LLM_API_SERVE_MODEL')
llm_api_serve_port = os.getenv('LLM_API_SERVE_PORT')
rerank_port = os.getenv('RERANK_PORT')
embed_port = os.getenv('EMBED_PORT')

print("llm_api_serve_port:", llm_api_serve_port)
print("rerank_port:", rerank_port)
print("embed_port:", embed_port)


LOCAL_LLM_SERVICE_URL = f"localhost:{llm_api_serve_port}"
LOCAL_LLM_MODEL_NAME = llm_api_serve_model
LOCAL_LLM_MAX_LENGTH = 4096

LOCAL_RERANK_SERVICE_URL = f"localhost:{rerank_port}"
LOCAL_RERANK_MODEL_NAME = 'rerank'
LOCAL_RERANK_MAX_LENGTH = 512
13 changes: 11 additions & 2 deletions qanything_kernel/connector/llm/__init__.py
@@ -1,2 +1,11 @@
from .llm_for_online import OpenAILLM
from .llm_for_local import ZiyueLLM
import os
from dotenv import load_dotenv
from .llm_for_openai_api import OpenAILLM

load_dotenv()
RUNTIME_BACKEND = os.getenv("RUNTIME_BACKEND")

if RUNTIME_BACKEND == "default":
from .llm_for_local import ZiyueLLM
else: # hf/vllm
from .llm_for_fastchat import OpenAICustomLLM as ZiyueLLM
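
With this change, callers keep importing a single `ZiyueLLM` name and the backend is resolved at import time from `RUNTIME_BACKEND`. A hedged usage sketch; the no-argument constructor is an assumption, since the diff does not show the class signatures:

```python
import os

# RUNTIME_BACKEND is read when the package is first imported, so set it
# (or put it in .env) before importing the connector.
os.environ.setdefault("RUNTIME_BACKEND", "vllm")

from qanything_kernel.connector.llm import ZiyueLLM  # noqa: E402

llm = ZiyueLLM()  # hypothetical construction; check the connector class for real args
```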