
Commit 9ead10e

Update documentation (#3975)
1 parent 571ddc6 · commit 9ead10e

13 files changed: +429 −130 lines changed

README.md

Lines changed: 4 additions & 11 deletions
```diff
@@ -57,8 +57,9 @@ FastDeploy supports inference deployment on **NVIDIA GPUs**, **Kunlunxin XPUs**,
 - [Iluvatar GPU](./docs/get_started/installation/iluvatar_gpu.md)
 - [Enflame GCU](./docs/get_started/installation/Enflame_gcu.md)
 - [Hygon DCU](./docs/get_started/installation/hygon_dcu.md)
+- [MetaX GPU](./docs/get_started/installation/metax_gpu.md.md)

-**Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU and MetaX GPU are currently under development and testing. Stay tuned for updates!
+**Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU are currently under development and testing. Stay tuned for updates!

 ## Get Started

@@ -68,20 +69,12 @@ Learn how to use FastDeploy through our documentation:
 - [ERNIE-4.5-VL Multimodal Model Deployment](./docs/get_started/ernie-4.5-vl.md)
 - [Offline Inference Development](./docs/offline_inference.md)
 - [Online Service Deployment](./docs/online_serving/README.md)
-- [Full Supported Models List](./docs/supported_models.md)
 - [Best Practices](./docs/best_practices/README.md)

 ## Supported Models

-| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
-|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
-|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 ||||||128K |
-|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 |||||| 128K |
-|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP || WIP || WIP |128K |
-|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 ||| WIP || WIP |128K |
-|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 ||||||128K |
-|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 ||||||128K |
-|ERNIE-4.5-0.3B | BF16/WINT8/FP8 |||||| 128K |
+Learn how to download models, enable support for Torch weights, calculate minimum resource requirements, and more:
+- [Full Supported Models List](./docs/supported_models.md)

 ## Advanced Usage

```
README_CN.md

Lines changed: 4 additions & 11 deletions
```diff
@@ -55,8 +55,9 @@ FastDeploy supports deployment on **NVIDIA GPU**, **Kunlunxin XPU**,
 - [Iluvatar CoreX](./docs/zh/get_started/installation/iluvatar_gpu.md)
 - [Enflame S60](./docs/zh/get_started/installation/Enflame_gcu.md)
 - [Hygon DCU](./docs/zh/get_started/installation/hygon_dcu.md)
+- [MetaX GPU](./docs/zh/get_started/installation/metax_gpu.md.md)

-**Note:** We are actively expanding hardware support. Other hardware platforms, including Ascend NPU and MetaX GPU, are currently under development and testing. Stay tuned for updates!
+**Note:** We are actively expanding hardware support. Other hardware platforms, including Ascend NPU, are currently under development and testing. Stay tuned for updates!

 ## Getting Started

@@ -66,20 +67,12 @@ FastDeploy supports deployment on **NVIDIA GPU**, **Kunlunxin XPU**,
 - [ERNIE-4.5-VL Deployment](./docs/zh/get_started/ernie-4.5-vl.md)
 - [Offline Inference](./docs/zh/offline_inference.md)
 - [Online Serving](./docs/zh/online_serving/README.md)
-- [Supported Models List](./docs/zh/supported_models.md)
 - [Best Practices](./docs/zh/best_practices/README.md)

 ## Supported Models

-| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
-|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
-|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 ||||||128K |
-|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 |||||| 128K |
-|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP || WIP || WIP |128K |
-|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 ||| WIP || WIP |128K |
-|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 ||||||128K |
-|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 ||||||128K |
-|ERNIE-4.5-0.3B | BF16/WINT8/FP8 |||||| 128K |
+See our documentation to learn how to download models, enable Torch weight support, calculate minimum deployment resources, and more:
+- [Supported Models List](./docs/zh/supported_models.md)

 ## Advanced Usage

```

docs/assets/images/favicon.ico

4.19 KB
Binary file not shown.

docs/assets/images/logo.jpg

13.6 KB
Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@

# Deploy QWEN3-0.6b in 10 Minutes

Before deployment, ensure your environment meets the following requirements:

- GPU Driver ≥ 535
- CUDA ≥ 12.3
- cuDNN ≥ 9.5
- Linux X86_64
- Python ≥ 3.10

This guide uses the lightweight QWEN3-0.6b model for demonstration, which can be deployed on most hardware configurations. Docker deployment is recommended.

For more information about how to install FastDeploy, refer to the [installation document](installation/README.md).
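Before moving on, you can sanity-check the host against the requirements above. The commands below are an illustrative sketch only (exact invocations depend on your setup, and `nvcc` may be absent if only the CUDA runtime is installed):

```shell
# Illustrative environment check: confirm driver, CUDA toolkit, and Python versions.
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # GPU driver (needs >= 535)
nvcc --version                                                 # CUDA toolkit (needs >= 12.3)
python --version                                               # Python (needs >= 3.10)
```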
## 1. Launch Service

After installing FastDeploy, execute the following command in the terminal to start the service. For how to configure the startup command, refer to [Parameter Description](../parameters.md).

> ⚠️ **Note:**
> When using HuggingFace models (Torch format), you need to enable `--load_choices "default_v1"`.
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
    --model Qwen/QWEN3-0.6b \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --load_choices "default_v1"
```
> 💡 Note: If the subdirectory given to ```--model``` does not exist in the current directory, FastDeploy checks whether AIStudio provides a preset model with that name (such as ```Qwen/QWEN3-0.6b```) and, if so, downloads it automatically. The default download path is ```~/xx```. For instructions on and configuration of automatic model download, see [Model Download](../supported_models.md).

```--max-model-len``` indicates the maximum number of tokens supported by the deployed service.
```--max-num-seqs``` indicates the maximum number of requests the deployed service can process concurrently.
**Related Documents**
- [Service Deployment](../online_serving/README.md)
- [Service Monitoring](../online_serving/metrics.md)
## 2. Request the Service

After starting the service, the following output indicates successful initialization:

```shell
api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
INFO: Started server process [13909]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
```
### Health Check

Verify service status (HTTP 200 indicates success):

```shell
curl -i http://0.0.0.0:8180/health
```
### cURL Request

Send requests to the service with the following command:

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write me a poem about large language model."}
    ],
    "stream": true
  }'
```
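The request above streams the reply as server-sent-event chunks. Because the endpoint follows the OpenAI chat-completions format, setting `"stream": false` should instead return the whole completion in a single JSON body; a minimal sketch:

```shell
# Non-streaming variant: the full completion is returned in one JSON response.
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write me a poem about large language model."}
    ],
    "stream": false
  }'
```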
### Python Client (OpenAI-compatible API)

FastDeploy's API is OpenAI-compatible. You can also use Python for requests:
```python
import openai

host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
        {"role": "user", "content": "Write me a poem about large language model."},
    ],
    stream=True,
)
for chunk in response:
    if chunk.choices and chunk.choices[0].delta and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
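The startup log in section 2 also advertises a text-completions endpoint at `/v1/completions`. Assuming it accepts the standard OpenAI completions fields (`prompt`, `max_tokens`), which is not confirmed by this guide, a quick probe might look like this:

```shell
# Illustrative probe of the /v1/completions endpoint shown in the startup log.
# Assumes standard OpenAI completions request fields (prompt, max_tokens).
curl -X POST "http://0.0.0.0:8180/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write me a poem about large language model.",
    "max_tokens": 128
  }'
```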

docs/index.md

Lines changed: 32 additions & 8 deletions
````diff
@@ -11,15 +11,39 @@

 ## Supported Models

-| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
+| Model | Data Type | [PD Disaggregation](./features/disaggregated.md) | [Chunked Prefill](./features/chunked_prefill.md) | [Prefix Caching](./features/prefix_caching.md) | [MTP](./features/speculative_decoding.md) | [CUDA Graph](./features/graph_optimization.md) | Maximum Context Length |
 |:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
-|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 ||||| WIP |128K |
-|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 ||||| WIP | 128K |
-|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP || WIP || WIP |128K |
-|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 ||| WIP || WIP |128K |
-|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 ||||||128K |
-|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 ||||||128K |
-|ERNIE-4.5-0.3B | BF16/WINT8/FP8 |||||| 128K |
+|ERNIE-4.5-300B-A47B|BF16/WINT4/WINT8/W4A8C8/WINT2/FP8|✅|✅|✅|✅|✅|128K|
+|ERNIE-4.5-300B-A47B-Base|BF16/WINT4/WINT8|✅|✅|✅|✅|✅|128K|
+|ERNIE-4.5-VL-424B-A47B|BF16/WINT4/WINT8|🚧|✅|🚧|✅|🚧|128K|
+|ERNIE-4.5-VL-28B-A3B|BF16/WINT4/WINT8|✅|✅|🚧|✅|🚧|128K|
+|ERNIE-4.5-21B-A3B|BF16/WINT4/WINT8/FP8|✅|✅|✅|✅|✅|128K|
+|ERNIE-4.5-21B-A3B-Base|BF16/WINT4/WINT8/FP8|✅|✅|✅|✅|✅|128K|
+|ERNIE-4.5-0.3B|BF16/WINT8/FP8|✅|✅|✅|✅|✅|128K|
+|QWEN3-MOE|BF16/WINT4/WINT8/FP8|✅|✅|✅|🚧|✅|128K|
+|QWEN3|BF16/WINT8/FP8|✅|✅|✅|🚧|✅|128K|
+|QWEN-VL|BF16/WINT8/FP8|✅|✅|✅|🚧|✅|128K|
+|QWEN2|BF16/WINT8/FP8|✅|✅|✅|🚧|✅|128K|
+|DEEPSEEK-V3|BF16/WINT4|✅|✅|✅|🚧|✅|128K|
+|DEEPSEEK-R1|BF16/WINT4|✅|✅|✅|🚧|✅|128K|
+
+```
+✅ Supported 🚧 In Progress ⛔ No Plan
+```
+
+## Supported Hardware
+
+| Model | [NVIDIA GPU](./get_started/installation/nvidia_gpu.md) | [Kunlunxin XPU](./get_started/installation/kunlunxin_xpu.md) | Ascend NPU | [Hygon DCU](./get_started/installation/hygon_dcu.md) | [Iluvatar GPU](./get_started/installation/iluvatar_gpu.md) | [MetaX GPU](./get_started/installation/metax_gpu.md.md) | [Enflame GCU](./get_started/installation/Enflame_gcu.md) |
+|:------|---------|------------|----------|-------------|-----------|-------------|-------------|
+| ERNIE4.5-VL-424B-A47B | ✅ | 🚧 | 🚧 | ✅ | ✅ | ✅ | ✅ |
+| ERNIE4.5-300B-A47B | ✅ | ✅ | 🚧 | ✅ | ✅ | 🚧 | ✅ |
+| ERNIE4.5-VL-28B-A3B | ✅ | 🚧 | 🚧 | ✅ | 🚧 | ✅ | ✅ |
+| ERNIE4.5-21B-A3B | ✅ | ✅ | 🚧 | ✅ | ✅ | ✅ | ✅ |
+| ERNIE4.5-0.3B | ✅ | ✅ | 🚧 | ✅ | ✅ | ✅ | ✅ |
+
+```
+✅ Supported 🚧 In Progress ⛔ No Plan
+```

 ## Documentation

````
docs/parameters.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -34,10 +34,10 @@ When using FastDeploy to deploy models (including offline inference and service
 | ```max_long_partial_prefills``` | `int` | When Chunked Prefill is enabled, maximum number of long requests in concurrent partial prefill batches, default: 1 |
 | ```long_prefill_token_threshold``` | `int` | When Chunked Prefill is enabled, requests with token count exceeding this value are considered long requests, default: max_model_len*0.04 |
 | ```static_decode_blocks``` | `int` | During inference, each request is forced to allocate corresponding number of blocks from Prefill's KVCache for Decode use, default: 2 |
-| ```reasoning_parser``` | `str` | Specify the reasoning parser to extract reasoning content from model output, refer [reasoning output](features/reasoning_output.md) for more details |
+| ```reasoning_parser``` | `str` | Specify the reasoning parser to extract reasoning content from model output |
 | ```use_cudagraph``` | `bool` | Whether to use cuda graph, default False. It is recommended to read [graph_optimization.md](./features/graph_optimization.md) carefully before opening. Custom all-reduce needs to be enabled at the same time in multi-card scenarios. |
 | ```graph_optimization_config``` | `dict[str]` | Can configure parameters related to calculation graph optimization, the default value is'{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }',Detailed description reference [graph_optimization.md](./features/graph_optimization.md)|
-| ```disable_custom_all_reduce``` | `bool` | Disable Custom all-reduce, default: False |
+| ```enable_custom_all_reduce``` | `bool` | Enable Custom all-reduce, default: False |
 | ```splitwise_role``` | `str` | Whether to enable splitwise inference, default value: mixed, supported parameters: ["mixed", "decode", "prefill"] |
 | ```innode_prefill_ports``` | `str` | Internal engine startup ports for prefill instances (only required for single-machine PD separation), default: None |
 | ```guided_decoding_backend``` | `str` | Specify the guided decoding backend to use, supports `auto`, `xgrammar`, `off`, default: `off` |
@@ -51,7 +51,7 @@ When using FastDeploy to deploy models (including offline inference and service
 | ```chat_template``` | `str` | Specify the template used for model concatenation, It supports both string input and file path input. The default value is None. If not specified, the model's default template will be used. |
 | ```tool_call_parser``` | `str` | Specify the function call parser to be used for extracting function call content from the model's output. |
 | ```tool_parser_plugin``` | `str` | Specify the file path of the tool parser to be registered, so as to register parsers that are not in the code repository. The code format within these parsers must adhere to the format used in the code repository. |
-| ```lm_head_fp32``` | `bool` | Specify the dtype of the lm_head layer as FP32. |
+| ```load_choices``` | `str` | By default, the "default" loader is used for weight loading. To load Torch weights or enable weight acceleration, "default_v1" must be used.|

 ## 1. Relationship between KVCache allocation, ```num_gpu_blocks_override``` and ```block_size```?

```