[BugFix][SpecDecode] Align extract_hidden_states proposer DP/SP batch…#9689

Open
learning-sketch wants to merge 2 commits into vllm-project:main from learning-sketch:dev

@learning-sketch learning-sketch commented May 29, 2026

What this PR does / why we need it?

AscendExtractHiddenStatesProposer inherits the upstream ExtractHiddenStatesProposer._determine_batch_execution_and_padding unchanged, which on Ascend causes two distinct failures when running extract_hidden_states on a MoE target model with DP > 1 and sequence parallelism enabled (e.g. MiniMax-M2 with VLLM_ASCEND_ENABLE_FLASHCOMM1=1):

  1. gloo shape mismatch on the DP cpu_group:

    what(): [enforce fail at .../gloo/transport/tcp/pair.cc:456]
    op.preamble.length <= op.nbytes. 8 vs 4.
    Received data size doesn't match expected size.
    Is there a distributed collective mismatch in your code?
    

    Upstream coordinate_batch_across_dp posts a [4, dp_size] int32 tensor to the DP cpu_group, while Ascend's main runner uses _sync_metadata_across_dp with a [2, dp_size] tensor on the same cpu_group. The two shapes collide within one step.

  2. reduce_scatter shape-not-divisible assertion on the idle DP rank:

    File ".../vllm_ascend/ops/linear_op.py", line 574, in matmul_and_reduce
        output = tensor_model_parallel_reduce_scatter(output_parallel, 0)
    File ".../base_device_communicator.py", line 234, in reduce_scatter
        assert input_tensor.shape[0] % world_size == 0
    AssertionError
    

    The proposer's own cudagraph_dispatcher is initialized as PIECEWISE/NONE only (never FULL), so dispatch(num_tokens=6) returns 6 as-is (no SP padding). That 6 enters DP sync, the synced max stays 6, and the idle DP rank's main MoE forward then crashes in SP reduce_scatter because 6 % 4 != 0 (with TP = 4).

    Eagle3/MTP do not reproduce this because Ascend's AscendEagleProposer uses runner.cudagraph_dispatcher.dispatch(...) which dispatches against the runner's FULL-mode capture sizes (always TP-aligned).
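To make the arithmetic behind failure 2 concrete, here is a minimal sketch of rounding a token count up to a multiple of the TP size. `pad_to_multiple` and `can_reduce_scatter` are hypothetical helpers written for illustration only; they are not the body of `runner._pad_for_sequence_parallelism`, whose actual implementation may differ.

```python
def pad_to_multiple(num_tokens: int, tp_size: int) -> int:
    """Round num_tokens up to the nearest multiple of tp_size (hypothetical helper)."""
    if tp_size <= 0:
        raise ValueError("tp_size must be positive")
    # Ceiling division, then scale back up to a token count.
    return -(-num_tokens // tp_size) * tp_size


def can_reduce_scatter(num_tokens: int, world_size: int) -> bool:
    """Same divisibility condition as the assert quoted in the traceback above."""
    return num_tokens % world_size == 0


unpadded = 6
padded = pad_to_multiple(unpadded, tp_size=4)
print(can_reduce_scatter(unpadded, 4))        # False: 6 % 4 == 2
print(padded, can_reduce_scatter(padded, 4))  # 8 True
```

With the numbers from the description, the unpadded value 6 fails the divisibility check, while the padded value 8 passes it.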

Fix

Override AscendExtractHiddenStatesProposer._determine_batch_execution_and_padding:

  1. SP-pad num_tokens via runner._pad_for_sequence_parallelism before dispatch, so the contribution to DP sync is always TP-aligned. Mirrors what the runner's main path does at model_runner_v1.py:_determine_batch_execution_and_padding.

  2. Use runner._sync_metadata_across_dp (packed_tensor shape [2, dp_size]) for DP coordination instead of upstream coordinate_batch_across_dp (shape [4, dp_size]), so all DP collectives in a single step that hit the cpu_group use a consistent tensor shape.

  3. Fail fast at the entry of the override with a clear AssertionError if the proposer was constructed without a runner reference, instead of letting the unguarded runner._pad_for_sequence_parallelism call raise a confusing AttributeError.

  4. Document the is_draft_model=True semantics: it intentionally makes should_skip_allreduce_across_dp_group short-circuit (the cache-only "draft" here is not MoE), so the call degenerates to a local broadcast. The actual cross-DP all_reduce has already been done by the main runner earlier in the step; the SP padding above is what keeps the value TP-aligned regardless.
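The order of operations in the four points above can be sketched as a toy control flow. Everything here is a stand-in under stated assumptions: `ToyRunner`, `ToyProposer`, the method signatures, and the return values are hypothetical and only mimic the sequence described in this PR. The cudagraph dispatch step is omitted, and this is not the vllm-ascend implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyRunner:
    """Hypothetical stand-in for the Ascend model runner."""
    tp_size: int
    calls: list = field(default_factory=list)

    def _pad_for_sequence_parallelism(self, num_tokens: int) -> int:
        # Assumed behaviour: round up to a multiple of the TP size.
        self.calls.append("sp_pad")
        return -(-num_tokens // self.tp_size) * self.tp_size

    def _sync_metadata_across_dp(self, num_tokens: int, is_draft_model: bool) -> int:
        # Assumed behaviour: with is_draft_model=True no cross-DP all_reduce
        # is issued, so the local value comes back unchanged.
        self.calls.append(("dp_sync", num_tokens, is_draft_model))
        return num_tokens


@dataclass
class ToyProposer:
    """Hypothetical stand-in for the proposer override."""
    runner: Optional[ToyRunner]
    dp_size: int

    def determine_padding(self, num_tokens: int) -> int:
        # Point 3: fail fast when no runner reference was supplied.
        assert self.runner is not None, "proposer requires a runner reference"
        # Point 1: SP-pad before anything enters DP sync.
        padded = self.runner._pad_for_sequence_parallelism(num_tokens)
        # Point 2: reuse the runner's DP sync path, only when DP > 1.
        if self.dp_size > 1:
            padded = self.runner._sync_metadata_across_dp(padded, is_draft_model=True)
        return padded


runner = ToyRunner(tp_size=4)
print(ToyProposer(runner, dp_size=2).determine_padding(6))  # 8
```

Because padding happens first, the value handed to the sync step is already TP-aligned, whichever path the sync takes.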

Does this PR introduce any user-facing change?

No user-facing API change. Fixes a runtime crash for users running extract_hidden_states speculative decoding with
--data-parallel-size > 1 on MoE target models on Ascend NPU. Single-DP runs and dense target models (e.g. Qwen3-8B) are unaffected.

How was this patch tested?

Reproduced the crash and verified the fix on MiniMax-M2:

  • 2x8 NPU, --tensor-parallel-size 8 --data-parallel-size 2
  • --enable-expert-parallel
  • --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'
  • --speculative-config '{"method": "extract_hidden_states", "num_speculative_tokens": 1, "draft_model_config": {"hf_config": {"eagle_aux_hidden_state_layer_ids": [2, 18, 34]}}}'
  • --kv-transfer-config '{"kv_connector": "ExampleHiddenStatesConnector", "kv_role": "kv_producer", ...}'
  • Sending one /v1/completions request with max_tokens=1:
    • Before fix: idle DP rank crashes on first execute_dummy_batch with the AssertionError shown above.
    • After fix: request returns 200 OK, hidden_states .safetensors file is written with the expected (prompt_len, len(layer_ids), hidden_size) shape.

Also verified the existing dense Qwen3-8B + extract_hidden_states path still works unchanged.

Unit tests added in
tests/ut/spec_decode/test_extract_hidden_states_proposer.py:

  • test_determine_batch_execution_and_padding_asserts_when_runner_is_none: regression guard for the AttributeError that would otherwise be raised on the unguarded self.runner._pad_for_sequence_parallelism call at the entry of the override.
  • test_determine_batch_execution_and_padding_dp1_sp_pads_and_skips_sync: with DP=1 the runner's _pad_for_sequence_parallelism is still consulted (so cache_only forward gets an SP-aligned input) but _sync_metadata_across_dp is not called.
  • test_determine_batch_execution_and_padding_dp2_uses_runner_sync: with DP>1 the override calls runner._sync_metadata_across_dp with the SP-padded num_tokens and is_draft_model=True, and does NOT call the upstream coordinate_batch_across_dp (regression guard for the gloo 8 vs 4 shape mismatch).
  • test_determine_batch_execution_and_padding_dp2_keeps_tp_aligned_for_main_forward: regression guard for the reduce_scatter input.shape[0] % world_size == 0 assertion; the final num_tokens_padded returned to the caller is always TP-aligned.
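The style of assertion these tests rely on can be sketched with `unittest.mock`. This is an illustrative sketch, not the contents of the test file: `determine_padding` is a hypothetical stand-in for the override, and the mocked return values are assumptions chosen to match the example numbers in the description.

```python
from unittest.mock import MagicMock


def determine_padding(runner, dp_size: int, num_tokens: int) -> int:
    """Hypothetical stand-in for the override under test."""
    assert runner is not None, "proposer requires a runner reference"
    padded = runner._pad_for_sequence_parallelism(num_tokens)
    if dp_size > 1:
        padded = runner._sync_metadata_across_dp(padded, is_draft_model=True)
    return padded


def test_dp1_sp_pads_and_skips_sync():
    runner = MagicMock()
    runner._pad_for_sequence_parallelism.return_value = 8
    assert determine_padding(runner, dp_size=1, num_tokens=6) == 8
    # SP padding is consulted even with DP=1 ...
    runner._pad_for_sequence_parallelism.assert_called_once_with(6)
    # ... but the DP sync path is never touched.
    runner._sync_metadata_across_dp.assert_not_called()


def test_dp2_uses_runner_sync():
    runner = MagicMock()
    runner._pad_for_sequence_parallelism.return_value = 8
    runner._sync_metadata_across_dp.return_value = 8
    assert determine_padding(runner, dp_size=2, num_tokens=6) == 8
    # The sync receives the SP-padded count, flagged as a draft model.
    runner._sync_metadata_across_dp.assert_called_once_with(8, is_draft_model=True)


test_dp1_sp_pads_and_skips_sync()
test_dp2_uses_runner_sync()
```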

Related upstream PR: vllm-project/vllm#39949 (introduced extract_hidden_states speculative method).

… padding with Ascend model runner

### What this PR does / why we need it?

`AscendExtractHiddenStatesProposer` inherits the upstream
`ExtractHiddenStatesProposer._determine_batch_execution_and_padding`
unchanged, which on Ascend causes two distinct failures when running
`extract_hidden_states` on a MoE target model with DP > 1 and
sequence parallelism enabled (e.g. MiniMax-M2 with
`VLLM_ASCEND_ENABLE_FLASHCOMM1=1`):

1. **gloo shape mismatch on the DP cpu_group**:

       what(): [enforce fail at .../gloo/transport/tcp/pair.cc:456]
       op.preamble.length <= op.nbytes. 8 vs 4.
       Received data size doesn't match expected size.
       Is there a distributed collective mismatch in your code?

   Upstream `coordinate_batch_across_dp` posts a `[4, dp_size]` int32
   tensor to the DP cpu_group, while Ascend's main runner uses
   `_sync_metadata_across_dp` with a `[2, dp_size]` tensor on the
   same cpu_group. The two shapes collide within one step.

2. **reduce_scatter shape-not-divisible assertion on the idle DP rank**:

       File ".../vllm_ascend/ops/linear_op.py", line 574, in matmul_and_reduce
           output = tensor_model_parallel_reduce_scatter(output_parallel, 0)
       File ".../base_device_communicator.py", line 234, in reduce_scatter
           assert input_tensor.shape[0] % world_size == 0
       AssertionError

   The proposer's own `cudagraph_dispatcher` is initialized as
   PIECEWISE/NONE only (never FULL), so `dispatch(num_tokens=6)`
   returns 6 as-is (no SP padding). That 6 enters DP sync, the synced
   max stays 6, and the idle DP rank's main MoE forward then crashes
   in SP reduce_scatter because `6 % 4 != 0` (with TP = 4).

   Eagle3/MTP do not reproduce this because Ascend's `AscendEagleProposer`
   uses `runner.cudagraph_dispatcher.dispatch(...)` which dispatches
   against the runner's FULL-mode capture sizes (always TP-aligned).

### Fix

Override `AscendExtractHiddenStatesProposer._determine_batch_execution_and_padding`:

1. SP-pad `num_tokens` via `runner._pad_for_sequence_parallelism`
   before dispatch, so the contribution to DP sync is always
   TP-aligned. Mirrors what the runner's main path does at
   `model_runner_v1.py:_determine_batch_execution_and_padding`.

2. Use `runner._sync_metadata_across_dp` (packed_tensor shape
   `[2, dp_size]`) for DP coordination instead of upstream
   `coordinate_batch_across_dp` (shape `[4, dp_size]`), so all DP
   collectives in a single step that hit the cpu_group use a
   consistent tensor shape.

3. Fail fast at the entry of the override with a clear `AssertionError`
   if the proposer was constructed without a `runner` reference,
   instead of letting the unguarded `runner._pad_for_sequence_parallelism`
   call raise a confusing `AttributeError`.

4. Document the `is_draft_model=True` semantics: it intentionally
   makes `should_skip_allreduce_across_dp_group` short-circuit (the
   cache-only "draft" here is not MoE), so the call degenerates to a
   local broadcast. The actual cross-DP all_reduce has already been
   done by the main runner earlier in the step; the SP padding above
   is what keeps the value TP-aligned regardless.

### Does this PR introduce _any_ user-facing change?

No user-facing API change. Fixes a runtime crash for users running
`extract_hidden_states` speculative decoding with
`--data-parallel-size > 1` on MoE target models on Ascend NPU.
Single-DP runs and dense target models (e.g. Qwen3-8B) are unaffected.

### How was this patch tested?

Reproduced the crash and verified the fix on MiniMax-M2:

- 2x8 NPU, `--tensor-parallel-size 4 --data-parallel-size 2`
- `--enable-expert-parallel`
- `--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'`
- `--speculative-config '{"method": "extract_hidden_states", "num_speculative_tokens": 1, "draft_model_config": {"hf_config": {"eagle_aux_hidden_state_layer_ids": [2, 18, 34]}}}'`
- `--kv-transfer-config '{"kv_connector": "ExampleHiddenStatesConnector", "kv_role": "kv_producer", ...}'`
- Sending one `/v1/completions` request with `max_tokens=1`:
    - Before fix: idle DP rank crashes on first `execute_dummy_batch`
      with the `AssertionError` shown above.
    - After fix: request returns 200 OK, hidden_states `.safetensors`
      file is written with the expected
      `(prompt_len, len(layer_ids), hidden_size)` shape.

Also verified the existing dense Qwen3-8B + extract_hidden_states path
still works unchanged.

Unit tests added in
`tests/ut/spec_decode/test_extract_hidden_states_proposer.py`:

- `test_determine_batch_execution_and_padding_asserts_when_runner_is_none`:
  regression guard for the `AttributeError` that would otherwise be
  raised on the unguarded `self.runner._pad_for_sequence_parallelism`
  call at the entry of the override.
- `test_determine_batch_execution_and_padding_dp1_sp_pads_and_skips_sync`:
  with DP=1 the runner's `_pad_for_sequence_parallelism` is still
  consulted (so cache_only forward gets an SP-aligned input) but
  `_sync_metadata_across_dp` is not called.
- `test_determine_batch_execution_and_padding_dp2_uses_runner_sync`:
  with DP>1 the override calls `runner._sync_metadata_across_dp`
  with the SP-padded `num_tokens` and `is_draft_model=True`, and does
  NOT call the upstream `coordinate_batch_across_dp` (regression
  guard for the gloo `8 vs 4` shape mismatch).
- `test_determine_batch_execution_and_padding_dp2_keeps_tp_aligned_for_main_forward`:
  regression guard for the `reduce_scatter`
  `input.shape[0] % world_size == 0` assertion; the final
  `num_tokens_padded` returned to the caller is always TP-aligned.

Related upstream PR: vllm-project/vllm#39949 (introduced
extract_hidden_states speculative method).

Signed-off-by: learning-sketch <learning-sketch@users.noreply.github.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses runtime crashes encountered when using speculative decoding with extract_hidden_states on Ascend NPUs with data parallelism enabled. By overriding the batch execution and padding logic, the proposer now correctly aligns tokens for sequence parallelism and uses a consistent metadata synchronization path across DP ranks, ensuring compatibility with the main model runner's execution flow.

Highlights

  • Sequence Parallelism Alignment: Implemented SP-padding for num_tokens in AscendExtractHiddenStatesProposer to ensure inputs are always TP-aligned, preventing crashes in reduce_scatter operations.
  • DP Coordination Fix: Switched to using runner._sync_metadata_across_dp for data parallel coordination, resolving gloo shape mismatches caused by inconsistent tensor shapes.
  • Regression Testing: Added comprehensive unit tests to verify SP-padding behavior, correct DP sync usage, and fail-fast assertions when the runner reference is missing.

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

Suggested PR Title:

[vLLM-Ascend][Ops][Feature] Implement _determine_batch_execution_and_padding for AscendExtractHiddenStatesProposer

Suggested PR Summary:

### What this PR does / why we need it?
This pull request implements the `_determine_batch_execution_and_padding` method in `AscendExtractHiddenStatesProposer`. This method ensures sequence parallelism (SP) padding is applied before data parallel (DP) synchronization, matching the behavior of the main model runner. It also reuses the runner's DP synchronization mechanism to prevent shape mismatches during DP coordination.

Feedback from the review suggests:
- Adding checks to verify if the runner supports `_pad_for_sequence_parallelism` and `_sync_metadata_across_dp` to avoid `AttributeError` when used with the v2 runner (`NPUModelRunner`), raising a clear `NotImplementedError` instead.
- Removing a redundant in-place assignment to `num_tokens_across_dp` to prevent potential unnecessary device synchronization.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The changes are covered by new unit tests added in `tests/ut/spec_decode/test_extract_hidden_states_proposer.py` which mock the runner and verify padding, DP synchronization, and error handling.

Comment thread vllm_ascend/spec_decode/extract_hidden_states_proposer.py Outdated
Comment thread vllm_ascend/spec_decode/extract_hidden_states_proposer.py
Comment thread vllm_ascend/spec_decode/extract_hidden_states_proposer.py Outdated
… proposer

- Guard `runner._pad_for_sequence_parallelism` and
  `runner._sync_metadata_across_dp` with `hasattr` checks, raising a clear
  `NotImplementedError` when paired with a runner (e.g. the v2 NPUModelRunner)
  that does not implement them, instead of a confusing `AttributeError`.
- Drop the redundant in-place `num_tokens_across_dp[self.dp_rank] = ...`
  assignment: the value was just read from that slot and asserted equal to
  `batch_desc.num_tokens`, and the write can trigger an unnecessary NPU
  device sync.
- Reformat the unit test file to satisfy `ruff-format`.
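The `hasattr` guard from the first bullet could look roughly like the sketch below. `check_runner_supports_sp_dp`, `REQUIRED_RUNNER_METHODS`, and `RunnerWithoutSpPadding` are hypothetical names introduced for illustration; the actual check in the PR may be structured differently.

```python
REQUIRED_RUNNER_METHODS = ("_pad_for_sequence_parallelism", "_sync_metadata_across_dp")


def check_runner_supports_sp_dp(runner) -> None:
    """Raise NotImplementedError if the runner lacks a required method (hypothetical helper)."""
    missing = [name for name in REQUIRED_RUNNER_METHODS if not hasattr(runner, name)]
    if missing:
        raise NotImplementedError(
            f"{type(runner).__name__} does not implement: {', '.join(missing)}"
        )


class RunnerWithoutSpPadding:
    """Hypothetical runner that only implements the DP sync method."""

    def _sync_metadata_across_dp(self, num_tokens, is_draft_model):
        return num_tokens


try:
    check_runner_supports_sp_dp(RunnerWithoutSpPadding())
except NotImplementedError as exc:
    print(exc)  # RunnerWithoutSpPadding does not implement: _pad_for_sequence_parallelism
```

Checking up front names the missing method and the runner type in the error, which an `AttributeError` raised deep in the call would not do as directly.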

Signed-off-by: learning-sketch <learning-sketch@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>