
[Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. [RFC: issue 5476]#5755

Merged
wangxiyuan merged 1 commit into vllm-project:main from wangqiankun13:adapt_scale_dtype
Jan 19, 2026

Conversation

@wangqiankun13 wangqiankun13 commented Jan 9, 2026

What this PR does / why we need it?

[Feature] Adapt the DispathGmmCombineDecode operator to align with the weight scale dtype of small operators.

  • Before: the weight scale must be float32.
  • After: the weight scale can be float32/float16 when x is float16, and float32/bfloat16 when x is float32/bfloat16. The w1 scale may also use a different dtype than the w2 scale.

For more information about this operator, see RFC issue #5476.
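The dtype rules above can be sketched as a standalone validity check. This is an illustrative predicate under the assumption of string-named dtypes; `scale_dtype_ok` is not part of the operator's actual validation code:

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch (not the actual kernel code) of the dtype rule described
// above: fp32 scales are always accepted; fp16 scales pair with fp16
// activations; bf16 scales pair with fp32/bf16 activations. w1 and w2 scales
// would be checked independently, so they may use different dtypes.
bool scale_dtype_ok(const std::string& x_dtype, const std::string& scale_dtype) {
    if (scale_dtype == "float32") {
        return true;  // fp32 scale is valid for every activation dtype
    }
    if (x_dtype == "float16") {
        return scale_dtype == "float16";
    }
    if (x_dtype == "float32" || x_dtype == "bfloat16") {
        return scale_dtype == "bfloat16";
    }
    return false;
}
```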

Does this PR introduce any user-facing change?

How was this patch tested?

Perf

When the scale is fp16 or bf16, it is cast to fp32 internally within the operator, and the subsequent computations remain unchanged. This PR therefore introduces an additional cast operation but halves the memory-copy volume for the scale. Since the scale data is only a few KB in size and participates in relatively few computations, its impact is nearly negligible compared to major operations such as matrix multiplication, so the theoretical performance change should be minimal.
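As a rough sanity check on the "few KB" and "halved copy" claims, a back-of-envelope sketch, assuming per-channel scales over a 4096-channel hidden dimension (illustrative arithmetic only, not taken from the kernel):

```cpp
#include <cstddef>

// Illustrative arithmetic: bytes moved per scale tensor before and after this
// PR, assuming one scale value per channel across 4096 channels.
constexpr std::size_t kChannels = 4096;
constexpr std::size_t kFp32ScaleBytes = kChannels * 4;  // 16 KiB moved before
constexpr std::size_t kFp16ScaleBytes = kChannels * 2;  // 8 KiB moved after
static_assert(kFp16ScaleBytes * 2 == kFp32ScaleBytes,
              "fp16 scale copy is half the fp32 copy volume");
```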

Single-operator test cases from Qwen3-235B:

  • single A3 node(ep16), 64 moe experts, 4 experts / die (like qwen3-235b ep32)
  • batch=18/32, token_hidden_size 4096, moe_intermediate_size 1536

The test was conducted for 100 rounds, and the average of the last 95 rounds was taken.

| | bs18 (us) | bs32 (us) |
| ----- | ----- | ----- |
| Without this PR | 96.28 | 108.83 |
| With this PR | 96.06 | 107.90 |

Note: Single-operator benchmarks represent an ideal scenario. They are usually only useful for referencing relative changes and may not fully align with performance data observed within the full model.

Acc

Tested Qwen3-235B EPLB on a single A3 node (ep16) with dispatch_gmm_combine_decode enabled.

| dataset | version | metric | mode | vllm-api-stream-chat |
| ----- | ----- | ----- | ----- | ----- |
| aime2024 | 604a78 | accuracy | gen | 83.33 |


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adapts the DispathGmmCombineDecode operator to support different data types for weight scales, enhancing flexibility. The changes are consistently applied across C++ kernel definitions, templates, and the corresponding Python code. While the overall direction is good, I've identified a critical issue in two C++ headers related to uninitialized tensor access that could lead to undefined behavior, and a high-severity issue with incorrect static assertions for buffer sizing.

Comment on lines +304 to +305
auto &ubRawScale = ubRawScaleList[ubListId];
auto layoutRawUbScale = LayoutScale::template MakeLayoutInUb<ElementRawScale>(scaleTileShape);

critical

There is a critical issue here. The ubRawScaleList member is not initialized in the constructor when ElementRawScale is the same as ElementFp32Scale (e.g., float). Accessing it here via ubRawScaleList[ubListId] leads to using an uninitialized LocalTensor, which is undefined behavior.

To fix this, you should move the declarations of ubRawScale and layoutRawUbScale inside the if constexpr blocks where they are actually used (at lines 307 and 334).
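The suggested fix can be sketched with simplified stand-in types (this is not the actual AscendC kernel code; `load_scale` and the plain scalar types are illustrative stand-ins for the LocalTensor buffers):

```cpp
#include <type_traits>

// Illustrative sketch of the fix: declare the raw-scale buffer only inside the
// if constexpr branch that uses it, so nothing uninitialized is touched when
// ElementRawScale and ElementFp32Scale are the same type.
template <typename ElementRawScale, typename ElementFp32Scale>
ElementFp32Scale load_scale(ElementRawScale raw) {
    if constexpr (!std::is_same_v<ElementRawScale, ElementFp32Scale>) {
        // Raw-dtype staging buffer exists only on this path; cast to fp32 here.
        ElementRawScale ubRawScale = raw;
        return static_cast<ElementFp32Scale>(ubRawScale);
    } else {
        // Same dtype: the fp32 value is used directly; no raw buffer is declared.
        return raw;
    }
}
```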

Comment on lines +222 to +223
auto &ubRawScale = ubRawScaleList[ubListId];
auto layoutRawUbScale = LayoutScale::template MakeLayoutInUb<ElementRawScale>(scaleTileShape);

critical

There is a critical issue here. The ubRawScaleList member is not initialized in the constructor when ElementRawScale is the same as ElementFp32Scale (e.g., float). Accessing it here via ubRawScaleList[ubListId] leads to using an uninitialized LocalTensor, which is undefined behavior.

To fix this, you should move the declarations of ubRawScale and layoutRawUbScale inside the if constexpr blocks where ubRawScale is actually used (at lines 225 and 251).

Comment on lines 76 to 80
static_assert((UB_STAGES * (TileShape::COUNT * sizeof(ElementC) + TileShape::COLUMN * (sizeof(ElementRawScale) + sizeof(ElementFp32Scale)) +
TileShape::ROW * sizeof(ElementPerTokenScale) + TileShape::COUNT * sizeof(ElementD)) +
(TileShape::COUNT + TileShape::COUNT) * sizeof(float) + TileShape::ROW * BYTE_PER_BLK) <=
ArchTag::UB_SIZE,
"TileShape is too large to fit in UB");

high

The static_assert for UB size is incorrect. When ElementRawScale is the same as ElementFp32Scale (i.e., float), only ubFp32ScaleList is allocated. However, this static_assert calculates the size as if both ubRawScaleList and ubFp32ScaleList are allocated, by using sizeof(ElementRawScale) + sizeof(ElementFp32Scale). This makes the check overly restrictive and may cause compilation to fail for valid configurations.

    static_assert((UB_STAGES * (TileShape::COUNT * sizeof(ElementC) + (std::is_same_v<ElementRawScale, ElementFp32Scale> ? 0 : TileShape::COLUMN * sizeof(ElementRawScale)) + TileShape::COLUMN * sizeof(ElementFp32Scale) +
                                TileShape::ROW * sizeof(ElementPerTokenScale) + TileShape::COUNT * sizeof(ElementD)) +
                   (TileShape::COUNT + TileShape::COUNT) * sizeof(float) + TileShape::ROW * BYTE_PER_BLK) <=
                      ArchTag::UB_SIZE,
                  "TileShape is too large to fit in UB");

Comment on lines 80 to 84
static_assert((UB_STAGES * (TileShape::COUNT * sizeof(ElementC) + TileShape::COLUMN * (sizeof(ElementRawScale) + sizeof(ElementFp32Scale)) +
TileShape::ROW * sizeof(ElementPerTokenScale) + TileShape::COUNT * sizeof(ElementD)) +
(TileShape::COUNT + TileShape::COUNT) * sizeof(float) + TileShape::ROW * BYTE_PER_BLK) <=
ArchTag::UB_SIZE,
"TileShape is too large to fit in UB");

high

The static_assert for UB size is incorrect. When ElementRawScale is the same as ElementFp32Scale (i.e., float), only ubFp32ScaleList is allocated. However, this static_assert calculates the size as if both ubRawScaleList and ubFp32ScaleList are allocated, by using sizeof(ElementRawScale) + sizeof(ElementFp32Scale). This makes the check overly restrictive and may cause compilation to fail for valid configurations.

    static_assert((UB_STAGES * (TileShape::COUNT * sizeof(ElementC) + (std::is_same_v<ElementRawScale, ElementFp32Scale> ? 0 : TileShape::COLUMN * sizeof(ElementRawScale)) + TileShape::COLUMN * sizeof(ElementFp32Scale) +
                                TileShape::ROW * sizeof(ElementPerTokenScale) + TileShape::COUNT * sizeof(ElementD)) +
                   (TileShape::COUNT + TileShape::COUNT) * sizeof(float) + TileShape::ROW * BYTE_PER_BLK) <=
                      ArchTag::UB_SIZE,
                  "TileShape is too large to fit in UB");


github-actions bot commented Jan 9, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message following the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@wangqiankun13 wangqiankun13 changed the title [Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators [Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. RFC: issue https://github.com/vllm-project/vllm-ascend/issues/5476 Jan 9, 2026
@wangqiankun13 wangqiankun13 changed the title [Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. RFC: issue https://github.com/vllm-project/vllm-ascend/issues/5476 [Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. [RFC](https://github.com/vllm-project/vllm-ascend/issues/5476) Jan 9, 2026
@wangqiankun13 wangqiankun13 changed the title [Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. [RFC](https://github.com/vllm-project/vllm-ascend/issues/5476) [Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. [RFC: 5476] Jan 9, 2026
@wangqiankun13 wangqiankun13 changed the title [Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. [RFC: 5476] [Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. [RFC: issue 5476] Jan 9, 2026
@wangqiankun13 wangqiankun13 force-pushed the adapt_scale_dtype branch 3 times, most recently from ae3fe03 to 6a281a5 Compare January 14, 2026 11:02
@wangxiyuan wangxiyuan enabled auto-merge (squash) January 15, 2026 08:36
@wangxiyuan wangxiyuan added the ready and ready-for-test labels Jan 15, 2026
auto-merge was automatically disabled January 15, 2026 12:04

Head branch was pushed to by a user without write access

@wangqiankun13 wangqiankun13 force-pushed the adapt_scale_dtype branch 2 times, most recently from 114a8e3 to 6254e81 Compare January 16, 2026 02:59
…scale dtype of small operators

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
@wangxiyuan wangxiyuan merged commit ebb9406 into vllm-project:main Jan 19, 2026
20 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 19, 2026
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (110 commits)
  [Performance] Remove index opetation when VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1 (vllm-project#5936)
  [main][bugfix] fix mooncake kv cache transfer when one P has multi nodes (vllm-project#5960)
  [Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. [RFC: issue 5476] (vllm-project#5755)
  [Refactor] Move AttentionSpec initialization to Attention module (vllm-project#5834)
  [EPLB][Bugfix] policy_swift_balancer bugfix and renaming (vllm-project#5897)
  [CI]fix for lint CI (vllm-project#5982)
  [Fusion] [Graph]Add Matmul Allreduce Rmsnorm fusion Pass (vllm-project#5034)
  [Refactor] Migrate profiler config from env vars to explicit ProfilerConfig (vllm-project#5928)
  [EPLB][Bugfix] Dispatch Allgather use log2phy if enable eplb (vllm-project#5933)
  [EPLB][Nightly][Bugfix] Get expert from moe layer only (vllm-project#5908)
  [Bugfix][MM] Fix multi-modal inference OOM issues by setting `expandable_segments:True` (vllm-project#5855)
  [doc]Table split  (vllm-project#5929)
  [Doc] Upgrade outdated ut doc (vllm-project#5937)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(Batch vllm-project#2) (vllm-project#5977)
  Eagle3 mm support, enablement on qwen3vl (vllm-project#4848)
  [Doc] Remove Chinese characters from the icons in the doc. (vllm-project#5959)
  [P/D]The issue of solving the force-free secondary release request, which causes the node to crash. (vllm-project#5968)
  [Feature] Support fine-grained shared expert overlap (vllm-project#5482)
  [Bugfix] fix cpu offload hang with tp=1 (vllm-project#5963)
  [Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (vllm-project#5776)
  ...
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 21, 2026
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (637 commits)
  ...
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
…scale dtype of small operators. [RFC: issue 5476] (vllm-project#5755)

### What this PR does / why we need it?

[Feature] Adapt DispathGmmCombineDecode opertor to align with weight
scale dtype of small operators.
- **Before**: weight scale must be float32
- **After**: weight scale can be float32/float16 when x is float16,
float32/bfloat16 when x is float32/bfloat16. And w1 scale can use
different dtype with w2 scale.

More info about this operator, please refer to RFC: issue
vllm-project#5476

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
#### Perf

> When scale is of type fp16 or bf16, it will be cast to fp32 internally
within the operator, while the subsequent computations remain unchanged.
Therefore, this PR will introduce an additional cast operation but halve
the memory copy operations for scale . Furthermore, since the scale data
is only a few KB in size and participates in relatively few
computations, its impact is almost negligible compared to major
operations like matrix multiplication. Thus, the theoretical performance
change should be minimal.

test single operator cases from qwen3-235b,
- single A3 node(ep16), 64 moe experts, 4 experts / die (like qwen3-235b
ep32)
- batch=18/32, token_hidden_size 4096, moe_intermediate_size 1536

The test was conducted for 100 rounds, and the average of the last 95
rounds was taken.
| | bs18(us)| bs32(us)|
| -----| -----| -----|
|Without this PR|96.28|108.83|
|With this PR|96.06|107.90|

Note: Single-operator benchmarks represent an ideal scenario. They are
usually only useful for referencing relative changes and may not fully
align with performance data observed within the full model.

#### Acc
test qwen3-235b eplb on a single A3 node(ep16),
with dispatch_gmm_combine_decode
| dataset | version | metric | mode | vllm-api-stream-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: wangqiankun <wangqiankun13@huawei.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
…scale dtype of small operators. [RFC: issue 5476] (vllm-project#5755)

maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
…scale dtype of small operators. [RFC: issue 5476] (vllm-project#5755)

ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
…scale dtype of small operators. [RFC: issue 5476] (vllm-project#5755)

LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
…scale dtype of small operators. [RFC: issue 5476] (vllm-project#5755)


Labels

module:quantization, module:tests, ready, ready-for-test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants