Skip to content

Commit 2f16499

Browse files
author
root
committed
up
1 parent 38f9175 commit 2f16499

File tree

2 files changed

+126
-132
lines changed

2 files changed

+126
-132
lines changed

benchmarks/README.md

Lines changed: 126 additions & 119 deletions
Original file line numberDiff line numberDiff line change
@@ -1,130 +1,137 @@
1-
---
2-
license: Apache License 2.0
3-
language:
4-
- Multilingual
5-
- Chinese
6-
- English
7-
tasks:
8-
- ERNIE Large Models
9-
- Large Language Models
10-
- Text Generation
11-
model_features:
12-
- 128k Context
13-
---
14-
15-
<div align="center" style="line-height: 1;">
16-
<a href="https://ernie.baidu.com/" target="_blank" style="margin: 2px;">
17-
<img alt="Chat" src="https://img.shields.io/badge/🤖_Chat-ERNIE_Bot-blue" style="display: inline-block; vertical-align: middle;"/>
18-
</a>
19-
<a href="https://huggingface.co/baidu" target="_blank" style="margin: 2px;">
20-
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Baidu-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
21-
</a>
22-
<a href="https://github.com/PaddlePaddle/ERNIE" target="_blank" style="margin: 2px;">
23-
<img alt="Github" src="https://img.shields.io/badge/GitHub-ERNIE-000?logo=github&color=0000FF" style="display: inline-block; vertical-align: middle;"/>
24-
</a>
25-
<a href="https://ernie.baidu.com/blog/ernie4.5" target="_blank" style="margin: 2px;">
26-
<img alt="Blog" src="https://img.shields.io/badge/🖖_Blog-ERNIE4.5-A020A0" style="display: inline-block; vertical-align: middle;"/>
27-
</a>
28-
<a href="https://discord.gg/JPmZXDsEEK" target="_blank" style="margin: 2px;">
29-
<img alt="Discord" src="https://img.shields.io/badge/Discord-ERNIE-5865F2?logo=discord&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
30-
</a>
31-
<a href="https://x.com/PaddlePaddle" target="_blank" style="margin: 2px;">
32-
<img alt="X" src="https://img.shields.io/badge/X-PaddlePaddle-6080F0"?logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
33-
</a>
34-
</div>
35-
36-
<div align="center" style="line-height: 1;">
37-
<a href="#license" style="margin: 2px;">
38-
<img alt="License" src="https://img.shields.io/badge/License-Apache2.0-A5de54" style="display: inline-block; vertical-align: middle;"/>
39-
</a>
40-
</div>
41-
42-
# ERNIE-4.5-0.3B-Base
43-
44-
> [!NOTE]
45-
> Note: "**-Paddle**" models use [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) weights, while "**-PT**" models use Transformer-style PyTorch weights.
46-
47-
> [!NOTE]
48-
> Note: The Base model only supports text completion. For evaluation, use the `completion` API (not `chat_completion`) in vLLM/FastDeploy.
49-
50-
## ERNIE 4.5 Highlights
51-
52-
The advanced capabilities of the ERNIE 4.5 models, particularly the MoE-based A47B and A3B series, are underpinned by several key technical innovations:
53-
54-
1. **Multimodal Heterogeneous MoE Pre-Training:** Our models are jointly trained on both textual and visual modalities to better capture the nuances of multimodal information and improve performance on tasks involving text understanding and generation, image understanding, and cross-modal reasoning. To achieve this without one modality hindering the learning of another, we designed a *heterogeneous MoE structure*, incorporated *modality-isolated routing*, and employed *router orthogonal loss* and *multimodal token-balanced loss*. These architectural choices ensure that both modalities are effectively represented, allowing for mutual reinforcement during training.
55-
56-
2. **Scaling-Efficient Infrastructure:** We propose a novel heterogeneous hybrid parallelism and hierarchical load balancing strategy for efficient training of ERNIE 4.5 models. By using intra-node expert parallelism, memory-efficient pipeline scheduling, FP8 mixed-precision training and finegrained recomputation methods, we achieve remarkable pre-training throughput. For inference, we propose *multi-expert parallel collaboration* method and *convolutional code quantization* algorithm to achieve 4-bit/2-bit lossless quantization. Furthermore, we introduce PD disaggregation with dynamic role switching for effective resource utilization to enhance inference performance for ERNIE 4.5 MoE models. Built on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), ERNIE 4.5 delivers high-performance inference across a wide range of hardware platforms.
57-
58-
3. **Modality-Specific Post-Training:** To meet the diverse requirements of real-world applications, we fine-tuned variants of the pre-trained model for specific modalities. Our LLMs are optimized for general-purpose language understanding and generation. The VLMs focuses on visuallanguage understanding and supports both thinking and non-thinking modes. Each model employed a combination of *Supervised Fine-tuning (SFT)*, *Direct Preference Optimization (DPO)* or a modified reinforcement learning method named *Unified Preference Optimization (UPO)* for post-training.
59-
60-
## Model Overview
61-
62-
ERNIE-4.5-0.3B-Base is a text dense Base model. The following are the model configuration details:
63-
64-
| Key | Value |
65-
| -------------- | ----------- |
66-
| Modality | Text |
67-
| Training Stage | Pretraining |
68-
| Params | 0.36B |
69-
| Layers | 18 |
70-
| Heads(Q/KV) | 16 / 2 |
71-
| Context Length | 131072 |
72-
73-
## Quickstart
74-
75-
### Using `transformers` library
76-
77-
**Note**: You'll need the `transformers` library (version 4.54.0 or newer) installed to use this model.
78-
79-
The following contains a code snippet illustrating how to use the model generate content based on given inputs.
80-
81-
```python
82-
import torch
83-
from transformers import AutoModelForCausalLM, AutoTokenizer
84-
85-
model_name = "baidu/ERNIE-4.5-0.3B-Base-PT"
86-
tokenizer = AutoTokenizer.from_pretrained(model_name)
87-
model = AutoModelForCausalLM.from_pretrained(
88-
model_name,
89-
device_map="auto",
90-
torch_dtype=torch.bfloat16,
91-
)
92-
93-
prompt = "Large language model is"
94-
model_inputs = tokenizer([prompt], add_special_tokens=False, return_tensors="pt").to(model.device)
95-
96-
generated_ids = model.generate(
97-
**model_inputs,
98-
max_new_tokens=1024
99-
)
100-
result = tokenizer.decode(generated_ids[0].tolist(), skip_special_tokens=True)
101-
print("result:", result)
102-
```
1+
### FastDeploy服务化性能压测工具
2+
3+
#### 数据集:
4+
5+
wget下载到本地用于性能测试
1036

104-
### vLLM inference
7+
<table style="width:100%; border-collapse: collapse;">
8+
<thead>
9+
<tr>
10+
<th style="width:15%; text-align: left;">Dataset</th>
11+
<th style="width:65%; text-align: left;">Data Path</th>
12+
</tr>
13+
</thead>
14+
<tbody>
15+
<tr>
16+
<td><strong>开源数据集 2k条</strong></td>
17+
<td><code>https://fastdeploy.bj.bcebos.com/eb_query/filtered_sharedgpt_2000_input_1136_output_200_fd.json</code></td>
18+
</tr>
19+
</tbody>
20+
</table>
21+
22+
#### 使用方式:
23+
24+
```
25+
# 安装依赖
26+
python -m pip install -r requirements.txt
27+
```
10528

106-
[vllm](https://github.com/vllm-project/vllm/tree/main) github library. Python-only [build](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#set-up-using-python-only-build-without-compilation).
29+
##### 参数说明
10730

10831
```bash
109-
vllm serve baidu/ERNIE-4.5-0.3B-Base-PT
32+
--backend openai-chat:压测使用的后端接口,指定为"openai-chat"使用chat/completion接口
33+
--model EB45T:模型名,任意取名,影响最后保存的结果文件名 EB45T \
34+
--endpoint /v1/chat/completions:endpoint,用于组url
35+
--host 0.0.0.0:服务ip地址,用于组url
36+
--port 9812:服务HTTP端口,用于组url
37+
--dataset-name EBChat:指定数据集类,指定为"EBChat"可读取转存的FD格式数据集
38+
--dataset-path ./eb45t_spv4_dataserver_1w_waigua_fd:压测数据集路径
39+
--hyperparameter-path EB45T.yaml:(可选)超参文件,请求时会更新进payload中,默认不带任何超参
40+
--percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len:性能结果中展示的指标集合
41+
--metric-percentiles 80,95,99,99.9,99.95,99.99:性能结果中展示的性能指标分位值
42+
--num-prompts 1:总计发送多少条请求
43+
--max-concurrency 1:压测并发数
44+
--save-result:开启结果保存,结果文件会存入json,默认False不保存
45+
--debug:开启debug模式,逐条打印payload和output内容,默认False
46+
--shuffle:是否打乱数据集,默认False不打乱
47+
--seed:打乱数据集时的随机种子,默认0
11048
```
11149

112-
## License
50+
##### /v1/chat/completions接口压测单条数据调试
51+
52+
```
53+
python benchmark_serving.py \
54+
--backend openai-chat \
55+
--model EB45T \
56+
--endpoint /v1/chat/completions \
57+
--host 0.0.0.0 \
58+
--port 9812 \
59+
--dataset-name EBChat \
60+
--dataset-path ./filtered_sharedgpt_2000_input_1136_output_200_fd.json \
61+
--hyperparameter-path yaml/request_yaml/eb45t-32k.yaml \
62+
--percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
63+
--metric-percentiles 80,95,99,99.9,99.95,99.99 \
64+
--num-prompts 1 \
65+
--max-concurrency 1 \
66+
--save-result
67+
```
11368

114-
The ERNIE 4.5 models are provided under the Apache License 2.0. This license permits commercial use, subject to its terms and conditions. Copyright (c) 2025 Baidu, Inc. All Rights Reserved.
69+
##### /v1/chat/completions接口完整100并发 2000条压测
11570

116-
## Citation
71+
```
72+
# 保存infer_log.txt
73+
python benchmark_serving.py \
74+
--backend openai-chat \
75+
--model EB45T \
76+
--endpoint /v1/chat/completions \
77+
--host 0.0.0.0 \
78+
--port 9812 \
79+
--dataset-name EBChat \
80+
--dataset-path ./filtered_sharedgpt_2000_input_1136_output_200_fd.json \
81+
--hyperparameter-path yaml/request_yaml/eb45t-32k.yaml \
82+
--percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
83+
--metric-percentiles 80,95,99,99.9,99.95,99.99 \
84+
--num-prompts 2000 \
85+
--max-concurrency 100 \
86+
--save-result > infer_log.txt 2>&1 &
87+
```
11788

118-
If you find ERNIE 4.5 useful or wish to use it in your projects, please kindly cite our technical report:
89+
##### /v1/completions接口压测
11990

120-
```bibtex
121-
@misc{ernie2025technicalreport,
122-
title={ERNIE 4.5 Technical Report},
123-
author={Baidu ERNIE Team},
124-
year={2025},
125-
eprint={},
126-
archivePrefix={arXiv},
127-
primaryClass={cs.CL},
128-
url={}
129-
}
91+
修改endpoint为/v1/completions,backend为openai,会对/v1/completions接口进行压测
92+
93+
```
94+
# 保存infer_log.txt
95+
python benchmark_serving.py \
96+
--backend openai \
97+
--model EB45T \
98+
--endpoint /v1/completions \
99+
--host 0.0.0.0 \
100+
--port 9812 \
101+
--dataset-name EBChat \
102+
--dataset-path ./filtered_sharedgpt_2000_input_1136_output_200_fd.json \
103+
--hyperparameter-path yaml/request_yaml/eb45t-32k.yaml \
104+
--percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
105+
--metric-percentiles 80,95,99,99.9,99.95,99.99 \
106+
--num-prompts 2000 \
107+
--max-concurrency 100 \
108+
--save-result > infer_log.txt 2>&1 &
109+
```
110+
111+
### 投机解码性能测试工具
112+
113+
#### 使用方式:
114+
115+
```bash
116+
python benchmarks/benchmark_mtp.py \
117+
--host 127.0.0.1 --port 8000 \
118+
--max-concurrency 16 32 64 96 --num-prompts 256 \
119+
--acceptance-rate 0.8 --draft-token-steps 1 2 3 \
120+
--s_itl-base-model 15.88 22.84 16.47 16.93 \
121+
--dataset-name EBChat \
122+
--dataset-path ./filtered_sharedgpt_2000_input_1136_output_200_fd.json
123+
```
124+
125+
#### 参数说明
126+
127+
```bash
128+
--host:服务ip地址,用于组url
129+
--port:服务HTTP端口,用于组url
130+
--max-concurrency:测试并发数
131+
--num-prompts:总计发送多少条请求
132+
--acceptance-rate:投机解码的模拟接受率
133+
--draft-token-steps:投机解码的步数
134+
--s_itl-base-model:主模型的解码延迟,可由上述的性能压测工具获得,与batch-size一一对应
135+
--dataset-name:指定数据集类,指定为"EBChat"可读取转存的FD格式数据集
136+
--dataset-path:测试数据集路径
130137
```

fastdeploy/splitwise/splitwise_connector.py

Lines changed: 0 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -504,19 +504,6 @@ def _handle_decode(self, payload):
504504
self.logger.debug(f"_handle_decode function receive {payload}")
505505
tasks = []
506506
for task in payload:
507-
# output = RequestOutput(
508-
# request_id=task["request_id"],
509-
# outputs=CompletionOutput(
510-
# index=task["outputs"]["index"],
511-
# send_idx=0,
512-
# token_ids=task["outputs"]["token_ids"],
513-
# draft_token_ids=task["outputs"]["draft_token_ids"],
514-
# ),
515-
# finished=True,
516-
# num_cached_tokens=task["num_cached_tokens"],
517-
# error_code=task["error_code"],
518-
# error_msg=task["error_msg"],
519-
# )
520507
output = RequestOutput.from_dict(task)
521508
tasks.append(output)
522509
self.engine_worker_queue.put_disaggregated_tasks(("decode", tasks))

0 commit comments

Comments
 (0)