[BugFix][SpecDecode] Align extract_hidden_states proposer DP/SP batch… by learning-sketch · Pull Request #9689 · vllm-project/vllm-ascend

learning-sketch · 2026-05-29T02:43:52Z

What this PR does / why we need it?

AscendExtractHiddenStatesProposer inherits the upstream ExtractHiddenStatesProposer._determine_batch_execution_and_padding unchanged, which on Ascend causes two distinct failures when running extract_hidden_states on a MoE target model with DP > 1 and sequence parallelism enabled (e.g. MiniMax-M2 with VLLM_ASCEND_ENABLE_FLASHCOMM1=1):

gloo shape mismatch on the DP cpu_group:
```
what(): [enforce fail at .../gloo/transport/tcp/pair.cc:456]
op.preamble.length <= op.nbytes. 8 vs 4.
Received data size doesn't match expected size.
Is there a distributed collective mismatch in your code?
```
Upstream coordinate_batch_across_dp posts a [4, dp_size] int32 tensor to the DP cpu_group, while Ascend's main runner uses _sync_metadata_across_dp with a [2, dp_size] tensor on the same cpu_group. The two shapes collide within one step.
reduce_scatter shape-not-divisible assertion on the idle DP rank:
```
File ".../vllm_ascend/ops/linear_op.py", line 574, in matmul_and_reduce
    output = tensor_model_parallel_reduce_scatter(output_parallel, 0)
File ".../base_device_communicator.py", line 234, in reduce_scatter
    assert input_tensor.shape[0] % world_size == 0
AssertionError
```
The proposer's own cudagraph_dispatcher is initialized as PIECEWISE/NONE only (never FULL), so dispatch(num_tokens=6) returns 6 as-is (no SP padding). That 6 enters DP sync, the synced max stays 6, and the idle DP rank's main MoE forward then crashes in SP reduce_scatter because 6 % TP=4 != 0.

Eagle3/MTP do not reproduce this because Ascend's AscendEagleProposer uses runner.cudagraph_dispatcher.dispatch(...) which dispatches against the runner's FULL-mode capture sizes (always TP-aligned).

Fix

Override AscendExtractHiddenStatesProposer._determine_batch_execution_and_padding:

SP-pad num_tokens via runner._pad_for_sequence_parallelism before dispatch, so the contribution to DP sync is always TP-aligned. Mirrors what the runner's main path does at model_runner_v1.py:_determine_batch_execution_and_padding.
Use runner._sync_metadata_across_dp (packed_tensor shape [2, dp_size]) for DP coordination instead of upstream coordinate_batch_across_dp (shape [4, dp_size]), so all DP collectives in a single step that hit the cpu_group use a consistent tensor shape.
Fail fast at the entry of the override with a clear AssertionError if the proposer was constructed without a runner reference, instead of letting the unguarded runner._pad_for_sequence_parallelism call raise a confusing AttributeError.
Document the is_draft_model=True semantics: it intentionally makes should_skip_allreduce_across_dp_group short-circuit (the cache-only "draft" here is not MoE), so the call degenerates to a local broadcast. The actual cross-DP all_reduce has already been done by the main runner earlier in the step; the SP padding above is what keeps the value TP-aligned regardless.

Does this PR introduce any user-facing change?

No user-facing API change. Fixes a runtime crash for users running extract_hidden_states speculative decoding with
--data-parallel-size > 1 on MoE target models on Ascend NPU. Single-DP runs and dense target models (e.g. Qwen3-8B) are unaffected.

How was this patch tested?

Reproduced the crash and verified the fix on MiniMax-M2:

2x8 NPU, --tensor-parallel-size 8 --data-parallel-size 2
--enable-expert-parallel
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'
--speculative-config '{"method": "extract_hidden_states", "num_speculative_tokens": 1, "draft_model_config": {"hf_config": {"eagle_aux_hidden_state_layer_ids": [2, 18, 34]}}}'
--kv-transfer-config '{"kv_connector": "ExampleHiddenStatesConnector", "kv_role": "kv_producer", ...}'
Sending one /v1/completions request with max_tokens=1:
- Before fix: idle DP rank crashes on first execute_dummy_batch with the AssertionError shown above.
- After fix: request returns 200 OK, hidden_states .safetensors file is written with the expected (prompt_len, len(layer_ids), hidden_size) shape.

Also verified the existing dense Qwen3-8B + extract_hidden_states path still works unchanged.

Unit tests added in
tests/ut/spec_decode/test_extract_hidden_states_proposer.py:

test_determine_batch_execution_and_padding_asserts_when_runner_is_none: regression guard for the AttributeError that would otherwise be raised on the unguarded self.runner._pad_for_sequence_parallelism call at the entry of the override.
test_determine_batch_execution_and_padding_dp1_sp_pads_and_skips_sync: with DP=1 the runner's _pad_for_sequence_parallelism is still consulted (so cache_only forward gets an SP-aligned input) but _sync_metadata_across_dp is not called.
test_determine_batch_execution_and_padding_dp2_uses_runner_sync: with DP>1 the override calls runner._sync_metadata_across_dp with the SP-padded num_tokens and is_draft_model=True, and does NOT call the upstream coordinate_batch_across_dp (regression guard for the gloo 8 vs 4 shape mismatch).
test_determine_batch_execution_and_padding_dp2_keeps_tp_aligned_for_main_forward: regression guard for the reduce_scatter input.shape[0] % world_size == 0 assertion 闁?the final num_tokens_padded returned to the caller is always TP-aligned.

Related upstream PR: vllm-project/vllm#39949 (introduced extract_hidden_states speculative method).

vLLM version: v0.20.2
vLLM main: vllm-project/vllm@39910f2

… padding with Ascend model runner ### What this PR does / why we need it? `AscendExtractHiddenStatesProposer` inherits the upstream `ExtractHiddenStatesProposer._determine_batch_execution_and_padding` unchanged, which on Ascend causes two distinct failures when running `extract_hidden_states` on a MoE target model with DP > 1 and sequence parallelism enabled (e.g. MiniMax-M2 with `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`): 1. **gloo shape mismatch on the DP cpu_group**: what(): [enforce fail at .../gloo/transport/tcp/pair.cc:456] op.preamble.length <= op.nbytes. 8 vs 4. Received data size doesn't match expected size. Is there a distributed collective mismatch in your code? Upstream `coordinate_batch_across_dp` posts a `[4, dp_size]` int32 tensor to the DP cpu_group, while Ascend's main runner uses `_sync_metadata_across_dp` with a `[2, dp_size]` tensor on the same cpu_group. The two shapes collide within one step. 2. **reduce_scatter shape-not-divisible assertion on the idle DP rank**: File ".../vllm_ascend/ops/linear_op.py", line 574, in matmul_and_reduce output = tensor_model_parallel_reduce_scatter(output_parallel, 0) File ".../base_device_communicator.py", line 234, in reduce_scatter assert input_tensor.shape[0] % world_size == 0 AssertionError The proposer's own `cudagraph_dispatcher` is initialized as PIECEWISE/NONE only (never FULL), so `dispatch(num_tokens=6)` returns 6 as-is (no SP padding). That 6 enters DP sync, the synced max stays 6, and the idle DP rank's main MoE forward then crashes in SP reduce_scatter because `6 % TP=4 != 0`. Eagle3/MTP do not reproduce this because Ascend's `AscendEagleProposer` uses `runner.cudagraph_dispatcher.dispatch(...)` which dispatches against the runner's FULL-mode capture sizes (always TP-aligned). ### Fix Override `AscendExtractHiddenStatesProposer._determine_batch_execution_and_padding`: 1. SP-pad `num_tokens` via `runner._pad_for_sequence_parallelism` before dispatch, so the contribution to DP sync is always TP-aligned. Mirrors what the runner's main path does at `model_runner_v1.py:_determine_batch_execution_and_padding`. 2. Use `runner._sync_metadata_across_dp` (packed_tensor shape `[2, dp_size]`) for DP coordination instead of upstream `coordinate_batch_across_dp` (shape `[4, dp_size]`), so all DP collectives in a single step that hit the cpu_group use a consistent tensor shape. 3. Fail fast at the entry of the override with a clear `AssertionError` if the proposer was constructed without a `runner` reference, instead of letting the unguarded `runner._pad_for_sequence_parallelism` call raise a confusing `AttributeError`. 4. Document the `is_draft_model=True` semantics: it intentionally makes `should_skip_allreduce_across_dp_group` short-circuit (the cache-only "draft" here is not MoE), so the call degenerates to a local broadcast. The actual cross-DP all_reduce has already been done by the main runner earlier in the step; the SP padding above is what keeps the value TP-aligned regardless. ### Does this PR introduce _any_ user-facing change? No user-facing API change. Fixes a runtime crash for users running `extract_hidden_states` speculative decoding with `--data-parallel-size > 1` on MoE target models on Ascend NPU. Single-DP runs and dense target models (e.g. Qwen3-8B) are unaffected. ### How was this patch tested? Reproduced the crash and verified the fix on MiniMax-M2: - 2x8 NPU, `--tensor-parallel-size 4 --data-parallel-size 2` - `--enable-expert-parallel` - `--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'` - `--speculative-config '{"method": "extract_hidden_states", "num_speculative_tokens": 1, "draft_model_config": {"hf_config": {"eagle_aux_hidden_state_layer_ids": [2, 18, 34]}}}'` - `--kv-transfer-config '{"kv_connector": "ExampleHiddenStatesConnector", "kv_role": "kv_producer", ...}'` - Sending one `/v1/completions` request with `max_tokens=1`: - Before fix: idle DP rank crashes on first `execute_dummy_batch` with the `AssertionError` shown above. - After fix: request returns 200 OK, hidden_states `.safetensors` file is written with the expected `(prompt_len, len(layer_ids), hidden_size)` shape. Also verified the existing dense Qwen3-8B + extract_hidden_states path still works unchanged. Unit tests added in `tests/ut/spec_decode/test_extract_hidden_states_proposer.py`: - `test_determine_batch_execution_and_padding_asserts_when_runner_is_none`: regression guard for the `AttributeError` that would otherwise be raised on the unguarded `self.runner._pad_for_sequence_parallelism` call at the entry of the override. - `test_determine_batch_execution_and_padding_dp1_sp_pads_and_skips_sync`: with DP=1 the runner's `_pad_for_sequence_parallelism` is still consulted (so cache_only forward gets an SP-aligned input) but `_sync_metadata_across_dp` is not called. - `test_determine_batch_execution_and_padding_dp2_uses_runner_sync`: with DP>1 the override calls `runner._sync_metadata_across_dp` with the SP-padded `num_tokens` and `is_draft_model=True`, and does NOT call the upstream `coordinate_batch_across_dp` (regression guard for the gloo `8 vs 4` shape mismatch). - `test_determine_batch_execution_and_padding_dp2_keeps_tp_aligned_for_main_forward`: regression guard for the `reduce_scatter` `input.shape[0] % world_size == 0` assertion 闁?the final `num_tokens_padded` returned to the caller is always TP-aligned. Related upstream PR: vllm-project/vllm#39949 (introduced extract_hidden_states speculative method). Signed-off-by: learning-sketch <learning-sketch@users.noreply.github.com>

gemini-code-assist · 2026-05-29T02:44:00Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses runtime crashes encountered when using speculative decoding with extract_hidden_states on Ascend NPUs with data parallelism enabled. By overriding the batch execution and padding logic, the proposer now correctly aligns tokens for sequence parallelism and uses a consistent metadata synchronization path across DP ranks, ensuring compatibility with the main model runner's execution flow.

Highlights

Sequence Parallelism Alignment: Implemented SP-padding for num_tokens in AscendExtractHiddenStatesProposer to ensure inputs are always TP-aligned, preventing crashes in reduce_scatter operations.
DP Coordination Fix: Switched to using runner._sync_metadata_across_dp for data parallel coordination, resolving gloo shape mismatches caused by inconsistent tensor shapes.
Regression Testing: Added comprehensive unit tests to verify SP-padding behavior, correct DP sync usage, and fail-fast assertions when the runner reference is missing.

New Features

🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

github-actions · 2026-05-29T02:44:11Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

A PR should do only one thing, smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

gemini-code-assist

Code Review

Suggested PR Title:

[vLLM-Ascend][Ops][Feature] Implement _determine_batch_execution_and_padding for AscendExtractHiddenStatesProposer

Suggested PR Summary:

### What this PR does / why we need it?
This pull request implements the `_determine_batch_execution_and_padding` method in `AscendExtractHiddenStatesProposer`. This method ensures sequence parallelism (SP) padding is applied before data parallel (DP) synchronization, matching the behavior of the main model runner. It also reuses the runner's DP synchronization mechanism to prevent shape mismatches during DP coordination.

Feedback from the review suggests:
- Adding checks to verify if the runner supports `_pad_for_sequence_parallelism` and `_sync_metadata_across_dp` to avoid `AttributeError` when used with the v2 runner (`NPUModelRunner`), raising a clear `NotImplementedError` instead.
- Removing a redundant in-place assignment to `num_tokens_across_dp` to prevent potential unnecessary device synchronization.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The changes are covered by new unit tests added in `tests/ut/spec_decode/test_extract_hidden_states_proposer.py` which mock the runner and verify padding, DP synchronization, and error handling.

… proposer - Guard `runner._pad_for_sequence_parallelism` and `runner._sync_metadata_across_dp` with `hasattr` checks, raising a clear `NotImplementedError` when paired with a runner (e.g. the v2 NPUModelRunner) that does not implement them, instead of a confusing `AttributeError`. - Drop the redundant in-place `num_tokens_across_dp[self.dp_rank] = ...` assignment: the value was just read from that slot and asserted equal to `batch_desc.num_tokens`, and the write can trigger an unnecessary NPU device sync. - Reformat the unit test file to satisfy `ruff-format`. Signed-off-by: learning-sketch <learning-sketch@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com>

learning-sketch requested a review from wangxiyuan as a code owner May 29, 2026 02:43

github-actions Bot added the module:tests label May 29, 2026

gemini-code-assist Bot reviewed May 29, 2026

View reviewed changes

Comment thread vllm_ascend/spec_decode/extract_hidden_states_proposer.py Outdated

Comment thread vllm_ascend/spec_decode/extract_hidden_states_proposer.py

Comment thread vllm_ascend/spec_decode/extract_hidden_states_proposer.py Outdated

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BugFix][SpecDecode] Align extract_hidden_states proposer DP/SP batch…#9689

[BugFix][SpecDecode] Align extract_hidden_states proposer DP/SP batch…#9689
learning-sketch wants to merge 2 commits into
vllm-project:mainfrom
learning-sketch:dev

learning-sketch commented May 29, 2026 •

edited by github-actions Bot

Loading

Uh oh!

gemini-code-assist Bot commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

learning-sketch commented May 29, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it?

Fix

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

gemini-code-assist Bot commented May 29, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

learning-sketch commented May 29, 2026 •

edited by github-actions Bot

Loading