[1/N][Feat] Xlite Qwen3 MoE Support #5951
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.
Code Review
This pull request adds support for Qwen3 MoE models in Xlite graph mode. The changes include updating documentation, dependencies, and tests. The core logic is in vllm_ascend/xlite/xlite.py, where a new QwenMoeXliteModel class is introduced to handle MoE-specific configurations and model building. The existing LlamaXliteModel has been refactored to facilitate this extension. My review found a critical issue in the new MoE model configuration logic that could lead to a crash when expert parallelism is disabled. A fix has been suggested.
    def _build_model_config(self, vllm_config: VllmConfig) -> ModelConfig:
        config = super()._build_model_config(vllm_config)
        hf_config = vllm_config.model_config.hf_text_config
        ep_group = get_ep_group()
        config.n_layers = hf_config.max_window_layers
        config.n_dense_layers = 0
        config.n_routed_experts = hf_config.num_experts
        config.n_shared_experts = 0
        config.n_act_experts = hf_config.num_experts_per_tok
        config.def_dp_size = vllm_config.parallel_config.data_parallel_size
        config.moe_ep_size = ep_group.world_size if vllm_config.parallel_config.enable_expert_parallel else 1
        config.moe_tp_size = 1 if vllm_config.parallel_config.enable_expert_parallel else ep_group.world_size
        config.experts_weight_transpose = True
        config.moe_intermediate_size = hf_config.moe_intermediate_size
        config.norm_topk_prob = hf_config.norm_topk_prob
        config.scoring_func = ScoringFuncSoftmax
        return config
The method _build_model_config in QwenMoeXliteModel unconditionally calls get_ep_group() on line 208. This will cause a crash with an AssertionError if expert parallelism is not enabled, as the expert parallel group (_EP_GROUP) will not be initialized. The call to get_ep_group() should be conditional on vllm_config.parallel_config.enable_expert_parallel being true. Additionally, the logic for setting moe_tp_size in the else branch is incorrect as it also relies on ep_group which would not be available.
    def _build_model_config(self, vllm_config: VllmConfig) -> ModelConfig:
        config = super()._build_model_config(vllm_config)
        hf_config = vllm_config.model_config.hf_text_config
        config.n_layers = hf_config.max_window_layers
        config.n_dense_layers = 0
        config.n_routed_experts = hf_config.num_experts
        config.n_shared_experts = 0
        config.n_act_experts = hf_config.num_experts_per_tok
        config.def_dp_size = vllm_config.parallel_config.data_parallel_size
        if vllm_config.parallel_config.enable_expert_parallel:
            ep_group = get_ep_group()
            config.moe_ep_size = ep_group.world_size
            config.moe_tp_size = 1
        else:
            config.moe_ep_size = 1
            config.moe_tp_size = 1
        config.experts_weight_transpose = True
        config.moe_intermediate_size = hf_config.moe_intermediate_size
        config.norm_topk_prob = hf_config.norm_topk_prob
        config.scoring_func = ScoringFuncSoftmax
        return config
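The guard proposed in the review can be exercised in isolation. The following is only a minimal sketch: `FakeGroup`, the module-level `_EP_GROUP`, the stub `get_ep_group`, and `resolve_moe_sizes` are hypothetical stand-ins written for illustration, not the vLLM or vLLM Ascend implementations. It assumes, as the review states, that fetching the expert parallel group fails with an `AssertionError` when the group was never initialized.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class FakeGroup:
    """Hypothetical stand-in for a distributed group handle."""
    world_size: int


# Hypothetical module state; None models "expert parallel group never initialized".
_EP_GROUP: Optional[FakeGroup] = None


def get_ep_group() -> FakeGroup:
    """Stub that fails the way the review describes when the group is missing."""
    if _EP_GROUP is None:
        raise AssertionError("expert parallel group is not initialized")
    return _EP_GROUP


def resolve_moe_sizes(
    enable_expert_parallel: bool,
    get_group: Callable[[], FakeGroup] = get_ep_group,
) -> Tuple[int, int]:
    """Return (moe_ep_size, moe_tp_size), fetching the group only when EP is on."""
    if enable_expert_parallel:
        return get_group().world_size, 1
    # EP disabled: never touch the group, so an uninitialized group cannot crash us.
    return 1, 1


if __name__ == "__main__":
    print(resolve_moe_sizes(False))                       # group is never queried
    print(resolve_moe_sizes(True, lambda: FakeGroup(4)))  # group supplied explicitly
```

Passing the group getter as a parameter is a choice made here only so the sketch can be tested without a distributed runtime; the suggested change in the review calls `get_ep_group()` directly inside the `if` branch.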
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: changdawei1 <changdawei3@huawei.com>
Signed-off-by: changdawei1 <changdawei3@huawei.com> Co-authored-by: LVYANGGUO <275926687@qq.com> Co-authored-by: lulina <lina.lulina@huawei.com>
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (24 commits)
  add dispath_ffn_combine_bf16 (vllm-project#5866)
  [BugFix] Fix input parameter bug of dispatch_gmm_combine_decode[RFC: issue 5476] (vllm-project#5932)
  [1/N][Feat] Xlite Qwen3 MoE Support (vllm-project#5951)
  [Bugfix] Fix setting of `speculative_config.enforce_eager` for dsv32 (vllm-project#5945)
  [bugfix][mm] change get_num_encoder_tokens to get_num_encoder_embeds in recompute_schedule.py (vllm-project#5132)
  [Bugfix] fix pcp qwen full graph FIA bug (vllm-project#6037)
  [Bugfix]Fixed precision issues caused by pooled request pooling (vllm-project#6049)
  【main】【bugfix】Resolved memory deallocation failure in the pooling layer under re-computation workloads. (vllm-project#6045)
  [main][Bugfix] Fixed an problem related to embeddings sharing (vllm-project#5967)
  [Feature]refactor the npugraph_ex config, support online-infer with static kernel (vllm-project#5775)
  [CI][Lint] Show lint diff on failure (vllm-project#5956)
  [CI] Add wait logic for each individual case (vllm-project#6036)
  [CI] Add DeepSeek-V3.2-W8A8 nightly ci test (vllm-project#4633)
  model runner v2 support triton of penalty (vllm-project#5854)
  [Docs][Model] Support Qwen3-VL-Embedding & Qwen3-VL-Reranker (vllm-project#6034)
  [Tests] move qwen3 performance test from nightly to e2e (vllm-project#5980)
  [Bugfix] fix bug of pcp+mtp+async scheduler (vllm-project#5994)
  [Main2Main] Upgrade vllm commit to releases/v0.14.0 (vllm-project#5988)
  [Ops] Add layernorm for qwen3Next (vllm-project#5765)
  [Doc] Add layer_sharding additional config for DeepSeek-V3.2-W8A8 (vllm-project#5921)
  ...
### What this PR does / why we need it?

This patch adds support for the Qwen3-MoE model in Xlite. For more details about Xlite, please refer to the following link: https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md.

Qwen3-MoE TODO List:
- [ ] Qwen3-235B-A22B support
- [ ] Qwen3-MoE weights NZ support
- [ ] Qwen3-MoE data parallel support

## Qwen3-30B-A3B-Instruct-2507 910B3(A2) Online Inference Performance Comparison

- aclgraph: main(69b170b)
- xlite-full: main + xlite-full
- xlite-decode-only: main + xlite-decode-only
- diff1: Performance comparison between xlite-full and aclgraph
- diff2: Performance comparison between xlite-decode-only and aclgraph

| maxconcurrency | item | TTFT Avg (ms) | TTFT P99 (ms) | TPOT Avg (ms) | TPOT P99 (ms) | QPS (req/s) | OutputSpeed (token/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | baseline-aclgraph | 205.07 | 287.29 | 12.34 | 12.65 | 0.14 | 78.81 |
| 1 | xlite-full | 66.40 | 113.69 | 11.71 | 12.40 | 0.15 | 84.73 |
| 1 | xlite-decode-only | 221.15 | 316.40 | 12.16 | 12.91 | 0.14 | 79.70 |
| 1 | diff1 | -67.62% | -60.43% | -5.11% | -1.98% | 7.14% | 7.51% |
| 1 | diff2 | 7.84% | 10.13% | -1.46% | 2.06% | 0.00% | 1.13% |
| 16 | baseline-aclgraph | 1892.16 | 13916.86 | 22.78 | 39.28 | 1.15 | 589.89 |
| 16 | xlite-full | 1355.40 | 8907.45 | 15.96 | 25.15 | 1.65 | 850.21 |
| 16 | xlite-decode-only | 1519.42 | 8711.64 | 19.23 | 29.73 | 1.38 | 711.60 |
| 16 | diff1 | -28.37% | -36.00% | -29.94% | -35.97% | 43.48% | 44.13% |
| 16 | diff2 | -19.70% | -37.40% | -15.58% | -24.31% | 20.00% | 20.63% |
| 32 | baseline-aclgraph | 673.80 | 3914.90 | 32.20 | 37.95 | 1.80 | 928.54 |
| 32 | xlite-full | 481.65 | 2710.50 | 19.95 | 25.35 | 2.91 | 1506.67 |
| 32 | xlite-decode-only | 372.22 | 1095.25 | 25.19 | 28.47 | 2.33 | 1202.82 |
| 32 | diff1 | -28.52% | -30.76% | -38.04% | -33.20% | 61.67% | 62.26% |
| 32 | diff2 | -44.76% | -72.02% | -21.77% | -24.98% | 29.44% | 29.54% |
| 48 | baseline-aclgraph | 583.18 | 3277.65 | 41.02 | 46.05 | 2.17 | 1115.08 |
| 48 | xlite-full | 973.42 | 8237.33 | 23.29 | 30.50 | 3.71 | 1908.09 |
| 48 | xlite-decode-only | 480.79 | 2026.98 | 31.48 | 35.41 | 2.83 | 1453.75 |
| 48 | diff1 | 66.92% | 151.32% | -43.22% | -33.77% | 70.97% | 71.12% |
| 48 | diff2 | -17.56% | -38.16% | -23.26% | -23.11% | 30.41% | 30.37% |
| 64 | baseline-aclgraph | 742.74 | 5953.39 | 47.79 | 53.15 | 2.48 | 1272.37 |
| 64 | xlite-full | 545.22 | 3941.34 | 25.09 | 30.41 | 4.64 | 2376.44 |
| 64 | xlite-decode-only | 752.40 | 4534.29 | 38.67 | 43.28 | 3.06 | 1567.94 |
| 64 | diff1 | -26.59% | -33.80% | -47.50% | -42.78% | 87.10% | 86.77% |
| 64 | diff2 | 1.30% | -23.84% | -19.08% | -18.57% | 23.39% | 23.23% |
| 100 | baseline-aclgraph | 565.52 | 1716.81 | 60.89 | 68.69 | 3.08 | 1580.64 |
| 100 | xlite-full | 398.14 | 2328.88 | 30.70 | 32.45 | 6.01 | 3086.42 |
| 100 | xlite-decode-only | 712.53 | 4875.94 | 52.71 | 60.78 | 3.53 | 1813.58 |
| 100 | diff1 | -29.60% | 35.65% | -49.58% | -52.76% | 95.13% | 95.26% |
| 100 | diff2 | 26.00% | 184.01% | -13.43% | -11.52% | 14.61% | 14.74% |
| 150 | baseline-aclgraph | 842.42 | 5175.01 | 73.60 | 88.18 | 3.80 | 1952.26 |
| 150 | xlite-full | 568.52 | 4204.33 | 37.90 | 40.01 | 7.27 | 3734.72 |
| 150 | xlite-decode-only | 654.43 | 2504.06 | 67.40 | 77.00 | 4.18 | 2145.11 |
| 150 | diff1 | -32.51% | -18.76% | -48.51% | -54.63% | 91.32% | 91.30% |
| 150 | diff2 | -22.32% | -51.61% | -8.42% | -12.68% | 10.00% | 9.88% |
| 200 | baseline-aclgraph | 750.63 | 3049.91 | 88.26 | 101.95 | 4.28 | 2189.72 |
| 200 | xlite-full | 558.48 | 3791.98 | 45.54 | 49.04 | 8.17 | 4175.52 |
| 200 | xlite-decode-only | 807.09 | 4254.95 | 85.18 | 101.79 | 4.44 | 2271.52 |
| 200 | diff1 | -25.60% | 24.33% | -48.40% | -51.90% | 90.89% | 90.69% |
| 200 | diff2 | 7.52% | 39.51% | -3.49% | -0.16% | 3.74% | 3.74% |

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2c24bc6

---------

Signed-off-by: changdawei1 <changdawei3@huawei.com>
Co-authored-by: LVYANGGUO <275926687@qq.com>
Co-authored-by: lulina <lina.lulina@huawei.com>
Signed-off-by: huangning1995 <huangning12@huawei.com>
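The diff1 and diff2 rows read as relative changes against the aclgraph baseline, so a negative value is an improvement for TTFT and TPOT and a positive value is an improvement for QPS and OutputSpeed. The sketch below assumes that interpretation and recomputes the diff1 entries for three columns of the maxconcurrency 1 group from the reported numbers; `relative_change_pct` and the dictionary keys are names chosen here for illustration and are not part of the PR.

```python
def relative_change_pct(baseline: float, candidate: float) -> float:
    """Percentage change of candidate relative to baseline, rounded to 2 decimals."""
    return round((candidate - baseline) / baseline * 100.0, 2)


# Values copied from the maxconcurrency = 1 rows of the table.
baseline_aclgraph = {"ttft_avg_ms": 205.07, "qps": 0.14, "output_tok_s": 78.81}
xlite_full = {"ttft_avg_ms": 66.40, "qps": 0.15, "output_tok_s": 84.73}

diff1 = {
    metric: relative_change_pct(baseline_aclgraph[metric], xlite_full[metric])
    for metric in baseline_aclgraph
}

if __name__ == "__main__":
    for metric, pct in diff1.items():
        print(f"{metric}: {pct:+.2f}%")
```

Under this assumption the recomputed values agree with the table's diff1 row for these three columns (-67.62%, 7.14%, 7.51%), which suggests the percentages were derived from the rounded figures shown rather than from raw measurements.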
This reverts commit 95e053e.
What this PR does / why we need it?
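This patch adds support for the Qwen3-MoE model in Xlite. For more details about Xlite, please refer to the following link: https://atomgit.com/openeuler/GVirt/blob/master/xlite/README.md.
Qwen3-MoE TODO List:
- [ ] Qwen3-235B-A22B support
- [ ] Qwen3-MoE weights NZ support
- [ ] Qwen3-MoE data parallel support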
Qwen3-30B-A3B-Instruct-2507 910B3(A2) Online Inference Performance Comparison
online server config:
test_config:
Does this PR introduce any user-facing change?
How was this patch tested?