root · root · commit 2f16499f9615 · 2025-11-04T07:21:05.000Z
diff --git a/benchmarks/README.md b/benchmarks/README.md
@@ -1,130 +1,137 @@
----
-license: Apache License 2.0
-language:
-- Multilingual
-- Chinese
-- English
-tasks:
-- ERNIE Large Models
-- Large Language Models
-- Text Generation
-model_features:
-- 128k Context
----
-
-<div align="center" style="line-height: 1;">
-  <a href="https://ernie.baidu.com/" target="_blank" style="margin: 2px;">
-    <img alt="Chat" src="https://img.shields.io/badge/🤖_Chat-ERNIE_Bot-blue" style="display: inline-block; vertical-align: middle;"/>
-  </a>
-  <a href="https://huggingface.co/baidu" target="_blank" style="margin: 2px;">
-    <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Baidu-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
-  </a>
-  <a href="https://github.com/PaddlePaddle/ERNIE" target="_blank" style="margin: 2px;">
-    <img alt="Github" src="https://img.shields.io/badge/GitHub-ERNIE-000?logo=github&color=0000FF" style="display: inline-block; vertical-align: middle;"/>
-  </a>
-  <a href="https://ernie.baidu.com/blog/ernie4.5" target="_blank" style="margin: 2px;">
-    <img alt="Blog" src="https://img.shields.io/badge/🖖_Blog-ERNIE4.5-A020A0" style="display: inline-block; vertical-align: middle;"/>
-  </a>
-  <a href="https://discord.gg/JPmZXDsEEK" target="_blank" style="margin: 2px;">
-    <img alt="Discord" src="https://img.shields.io/badge/Discord-ERNIE-5865F2?logo=discord&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
-  </a>
-  <a href="https://x.com/PaddlePaddle" target="_blank" style="margin: 2px;">
-    <img alt="X" src="https://img.shields.io/badge/X-PaddlePaddle-6080F0"?logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
-  </a>
-</div>
-
-<div align="center" style="line-height: 1;">
-  <a href="#license" style="margin: 2px;">
-    <img alt="License" src="https://img.shields.io/badge/License-Apache2.0-A5de54" style="display: inline-block; vertical-align: middle;"/>
-  </a>
-</div>
-
-# ERNIE-4.5-0.3B-Base
-
-> [!NOTE]
-> Note: "**-Paddle**" models use [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) weights, while "**-PT**" models use Transformer-style PyTorch weights.
-
-> [!NOTE]
-> Note: The Base model only supports text completion. For evaluation, use the `completion` API (not `chat_completion`) in vLLM/FastDeploy.
-
-## ERNIE 4.5 Highlights
-
-The advanced capabilities of the ERNIE 4.5 models, particularly the MoE-based A47B and A3B series, are underpinned by several key technical innovations:
-
-1. **Multimodal Heterogeneous MoE Pre-Training:** Our models are jointly trained on both textual and visual modalities to better capture the nuances of multimodal information and improve performance on tasks involving text understanding and generation, image understanding, and cross-modal reasoning. To achieve this without one modality hindering the learning of another, we designed a *heterogeneous MoE structure*, incorporated *modality-isolated routing*, and employed *router orthogonal loss* and *multimodal token-balanced loss*. These architectural choices ensure that both modalities are effectively represented, allowing for mutual reinforcement during training.
-
-2. **Scaling-Efficient Infrastructure:** We propose a novel heterogeneous hybrid parallelism and hierarchical load balancing strategy for efficient training of ERNIE 4.5 models. By using intra-node expert parallelism, memory-efficient pipeline scheduling, FP8 mixed-precision training and finegrained recomputation methods, we achieve remarkable pre-training throughput. For inference, we propose *multi-expert parallel collaboration* method and *convolutional code quantization* algorithm to achieve 4-bit/2-bit lossless quantization. Furthermore, we introduce PD disaggregation with dynamic role switching for effective resource utilization to enhance inference performance for ERNIE 4.5 MoE models. Built on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), ERNIE 4.5 delivers high-performance inference across a wide range of hardware platforms.
-
-3. **Modality-Specific Post-Training:** To meet the diverse requirements of real-world applications, we fine-tuned variants of the pre-trained model for specific modalities. Our LLMs are optimized for general-purpose language understanding and generation. The VLMs focuses on visuallanguage understanding and supports both thinking and non-thinking modes. Each model employed a combination of *Supervised Fine-tuning (SFT)*, *Direct Preference Optimization (DPO)* or a modified reinforcement learning method named *Unified Preference Optimization (UPO)* for post-training.
-
-## Model Overview
-
-ERNIE-4.5-0.3B-Base is a text dense Base model. The following are the model configuration details:
-
-| Key            | Value       |
-| -------------- | ----------- |
-| Modality       | Text        |
-| Training Stage | Pretraining |
-| Params         | 0.36B       |
-| Layers         | 18          |
-| Heads(Q/KV)    | 16 / 2      |
-| Context Length | 131072      |
-
-## Quickstart
-
-### Using `transformers` library
-
-**Note**: You'll need the `transformers` library (version 4.54.0 or newer) installed to use this model.
-
-The following contains a code snippet illustrating how to use the model generate content based on given inputs.
-
-```python
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-model_name = "baidu/ERNIE-4.5-0.3B-Base-PT"
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    device_map="auto",
-    torch_dtype=torch.bfloat16,
-)
-
-prompt = "Large language model is"
-model_inputs = tokenizer([prompt], add_special_tokens=False, return_tensors="pt").to(model.device)
-
-generated_ids = model.generate(
-    **model_inputs,
-    max_new_tokens=1024
-)
-result = tokenizer.decode(generated_ids[0].tolist(), skip_special_tokens=True)
-print("result:", result)
-```
+### FastDeploy Serving Benchmark Tool
+
+#### Dataset
+
+Download the dataset locally with wget for performance testing.
 
-### vLLM inference
+<table style="width:100%; border-collapse: collapse;">
+  <thead>
+    <tr>
+      <th style="width:15%; text-align: left;">Dataset</th>
+      <th style="width:65%; text-align: left;">Data Path</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td><strong>Open-source dataset (2k samples)</strong></td>
+      <td><code>https://fastdeploy.bj.bcebos.com/eb_query/filtered_sharedgpt_2000_input_1136_output_200_fd.json</code></td>
+    </tr>
+  </tbody>
+</table>
+
+#### Usage
+
+```
+# Install dependencies
+python -m pip install -r requirements.txt
+```
 
-[vllm](https://github.com/vllm-project/vllm/tree/main) github library. Python-only [build](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#set-up-using-python-only-build-without-compilation).
+##### Parameters
 
 ```bash
-vllm serve baidu/ERNIE-4.5-0.3B-Base-PT
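+--backend openai-chat: backend interface used for the benchmark; "openai-chat" selects the chat/completions API
+--model EB45T: model name; any name works, and it determines the name of the saved result file
+--endpoint /v1/chat/completions: endpoint, used to build the URL
+--host 0.0.0.0: service IP address, used to build the URL
+--port 9812: service HTTP port, used to build the URL
+--dataset-name EBChat: dataset class; "EBChat" reads datasets converted to the FD format
+--dataset-path ./eb45t_spv4_dataserver_1w_waigua_fd: path to the benchmark dataset
+--hyperparameter-path EB45T.yaml: (optional) hyperparameter file whose values are merged into the request payload; by default no hyperparameters are sent
+--percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len: metrics shown in the performance results
+--metric-percentiles 80,95,99,99.9,99.95,99.99: percentiles reported for the performance metrics
+--num-prompts 1: total number of requests to send
+--max-concurrency 1: benchmark concurrency
+--save-result: save the results to a JSON file; default False (not saved)
+--debug: enable debug mode, printing the payload and output of each request; default False
+--shuffle: shuffle the dataset; default False (not shuffled)
+--seed: random seed used when shuffling the dataset; default 0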
 ```
 
-## License
+##### Debugging the /v1/chat/completions endpoint with a single request
+
+```
+python benchmark_serving.py \
+  --backend openai-chat \
+  --model EB45T \
+  --endpoint /v1/chat/completions \
+  --host 0.0.0.0 \
+  --port 9812 \
+  --dataset-name EBChat \
+  --dataset-path ./filtered_sharedgpt_2000_input_1136_output_200_fd.json \
+  --hyperparameter-path yaml/request_yaml/eb45t-32k.yaml \
+  --percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
+  --metric-percentiles 80,95,99,99.9,99.95,99.99 \
+  --num-prompts 1 \
+  --max-concurrency 1 \
+  --save-result
+```
 
-The ERNIE 4.5 models are provided under the Apache License 2.0. This license permits commercial use, subject to its terms and conditions. Copyright (c) 2025 Baidu, Inc. All Rights Reserved.
+##### Full benchmark of the /v1/chat/completions endpoint: 100 concurrent requests, 2000 requests in total
 
-## Citation
+```
+# Save the output to infer_log.txt
+python benchmark_serving.py \
+  --backend openai-chat \
+  --model EB45T \
+  --endpoint /v1/chat/completions \
+  --host 0.0.0.0 \
+  --port 9812 \
+  --dataset-name EBChat \
+  --dataset-path ./filtered_sharedgpt_2000_input_1136_output_200_fd.json \
+  --hyperparameter-path yaml/request_yaml/eb45t-32k.yaml \
+  --percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
+  --metric-percentiles 80,95,99,99.9,99.95,99.99 \
+  --num-prompts 2000 \
+  --max-concurrency 100 \
+  --save-result > infer_log.txt 2>&1 &
+```
 
-If you find ERNIE 4.5 useful or wish to use it in your projects, please kindly cite our technical report:
+##### Benchmarking the /v1/completions endpoint
 
-```bibtex
-@misc{ernie2025technicalreport,
-      title={ERNIE 4.5 Technical Report},
-      author={Baidu ERNIE Team},
-      year={2025},
-      eprint={},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL},
-      url={}
-}
+Set the endpoint to /v1/completions and the backend to openai to benchmark the /v1/completions endpoint.
+
+```
+# Save the output to infer_log.txt
+python benchmark_serving.py \
+  --backend openai \
+  --model EB45T \
+  --endpoint /v1/completions \
+  --host 0.0.0.0 \
+  --port 9812 \
+  --dataset-name EBChat \
+  --dataset-path ./filtered_sharedgpt_2000_input_1136_output_200_fd.json \
+  --hyperparameter-path yaml/request_yaml/eb45t-32k.yaml \
+  --percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
+  --metric-percentiles 80,95,99,99.9,99.95,99.99 \
+  --num-prompts 2000 \
+  --max-concurrency 100 \
+  --save-result > infer_log.txt 2>&1 &
+```
+
+### Speculative Decoding Benchmark Tool
+
+#### Usage
+
+```bash
+python benchmarks/benchmark_mtp.py \
+  --host 127.0.0.1 --port 8000 \
+  --max-concurrency 16 32 64 96 --num-prompts 256 \
+  --acceptance-rate 0.8 --draft-token-steps 1 2 3 \
+  --s_itl-base-model 15.88 22.84 16.47 16.93 \
+  --dataset-name EBChat \
+  --dataset-path ./filtered_sharedgpt_2000_input_1136_output_200_fd.json
+```
+
+#### Parameters
+
+```bash
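+--host: service IP address, used to build the URL
+--port: service HTTP port, used to build the URL
+--max-concurrency: concurrency levels to test
+--num-prompts: total number of requests to send
+--acceptance-rate: simulated acceptance rate for speculative decoding
+--draft-token-steps: number of speculative decoding steps
+--s_itl-base-model: decoding latency of the main model, obtainable from the benchmark tool above; values correspond one-to-one with batch-size
+--dataset-name: dataset class; "EBChat" reads datasets converted to the FD format
+--dataset-path: path to the test dataset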
 ```
diff --git a/fastdeploy/splitwise/splitwise_connector.py b/fastdeploy/splitwise/splitwise_connector.py
@@ -504,19 +504,6 @@ def _handle_decode(self, payload):
         self.logger.debug(f"_handle_decode function receive {payload}")
         tasks = []
         for task in payload:
-            # output = RequestOutput(
-            #         request_id=task["request_id"],
-            #         outputs=CompletionOutput(
-            #             index=task["outputs"]["index"],
-            #             send_idx=0,
-            #             token_ids=task["outputs"]["token_ids"],
-            #             draft_token_ids=task["outputs"]["draft_token_ids"],
-            #         ),
-            #         finished=True,
-            #         num_cached_tokens=task["num_cached_tokens"],
-            #         error_code=task["error_code"],
-            #         error_msg=task["error_msg"],
-            #     )
             output = RequestOutput.from_dict(task)
             tasks.append(output)
         self.engine_worker_queue.put_disaggregated_tasks(("decode", tasks))