Commit 37293e4

blog: add qwen3 disagg perf metrics (#5822)
1 parent fbb4cc7 commit 37293e4

2 files changed, +15 -0 lines changed

docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md

Lines changed: 15 additions & 0 deletions
@@ -18,6 +18,8 @@ By NVIDIA TensorRT-LLM Team
 - [ISL 4400 - OSL 1200 (Machine Translation Dataset)](#ISL-4400---OSL-1200-Machine-Translation-Dataset)
 - [ISL 8192 - OSL 256 (Synthetic Dataset)](#ISL-8192---OSL-256-Synthetic-Dataset)
 - [ISL 4096 - OSL 1024 (Machine Translation Dataset)](#ISL-4096---OSL-1024-Machine-Translation-Dataset)
+- [Qwen 3](#Qwen-3)
+- [ISL 8192 - OSL 1024 (Machine Translation Dataset)](#ISL-8192---OSL-1024-Machine-Translation-Dataset)
 - [Reproducing Steps](#Reproducing-Steps)
 - [Future Work](#Future-Work)
 - [Acknowledgement](#Acknowledgement)
@@ -260,6 +262,19 @@ In Figure 13 and 14, the E2E Pareto curves for aggregated serving and disaggrega
 
 For Pareto curves with MTP = 1, 2, 3, it can be observed that disaggregated results show a **1.7x** improvement over aggregated results at 50 tokens/sec/user (20 ms latency). Enabling MTP provides a larger speedup at higher concurrencies.
 
+### Qwen 3
+
+#### ISL 8192 - OSL 1024 (Machine Translation Dataset)
+
+<div align="center">
+<figure>
+<img src="https://github.com/Shixiaowei02/TensorRT-LLM/blob/user/xiaoweis/blog/docs/source/blogs/media/tech_blog5_Picture15.png" width="640" height="auto" alt="Qwen 3 Pareto curves">
+</figure>
+</div>
+<p align="center"><sub><em>Figure 15. Qwen 3 Pareto curves.</em></sub></p>
+
+We also conducted performance evaluations of Qwen 3 on GB200 GPUs. The data indicate that the speedups achieved by disaggregation over aggregation range from 1.7x to 6.11x.
+
 ### Reproducing Steps
 
 We provide a set of scripts to reproduce the performance data presented in this paper. Please refer to the usage instructions described in [this document](https://github.com/NVIDIA/TensorRT-LLM/tree/main/docs/source/scripts/disaggregated).
