[BugFix][SpecDecode] Align extract_hidden_states proposer DP/SP batch…#9689
[BugFix][SpecDecode] Align extract_hidden_states proposer DP/SP batch…#9689learning-sketch wants to merge 2 commits into
Conversation
… padding with Ascend model runner
### What this PR does / why we need it?
`AscendExtractHiddenStatesProposer` inherits the upstream
`ExtractHiddenStatesProposer._determine_batch_execution_and_padding`
unchanged, which on Ascend causes two distinct failures when running
`extract_hidden_states` on a MoE target model with DP > 1 and
sequence parallelism enabled (e.g. MiniMax-M2 with
`VLLM_ASCEND_ENABLE_FLASHCOMM1=1`):
1. **gloo shape mismatch on the DP cpu_group**:
what(): [enforce fail at .../gloo/transport/tcp/pair.cc:456]
op.preamble.length <= op.nbytes. 8 vs 4.
Received data size doesn't match expected size.
Is there a distributed collective mismatch in your code?
Upstream `coordinate_batch_across_dp` posts a `[4, dp_size]` int32
tensor to the DP cpu_group, while Ascend's main runner uses
`_sync_metadata_across_dp` with a `[2, dp_size]` tensor on the
same cpu_group. The two shapes collide within one step.
2. **reduce_scatter shape-not-divisible assertion on the idle DP rank**:
File ".../vllm_ascend/ops/linear_op.py", line 574, in matmul_and_reduce
output = tensor_model_parallel_reduce_scatter(output_parallel, 0)
File ".../base_device_communicator.py", line 234, in reduce_scatter
assert input_tensor.shape[0] % world_size == 0
AssertionError
The proposer's own `cudagraph_dispatcher` is initialized as
PIECEWISE/NONE only (never FULL), so `dispatch(num_tokens=6)`
returns 6 as-is (no SP padding). That 6 enters DP sync, the synced
max stays 6, and the idle DP rank's main MoE forward then crashes
in SP reduce_scatter because `6 % TP=4 != 0`.
Eagle3/MTP do not reproduce this because Ascend's `AscendEagleProposer`
uses `runner.cudagraph_dispatcher.dispatch(...)` which dispatches
against the runner's FULL-mode capture sizes (always TP-aligned).
### Fix
Override `AscendExtractHiddenStatesProposer._determine_batch_execution_and_padding`:
1. SP-pad `num_tokens` via `runner._pad_for_sequence_parallelism`
before dispatch, so the contribution to DP sync is always
TP-aligned. Mirrors what the runner's main path does at
`model_runner_v1.py:_determine_batch_execution_and_padding`.
2. Use `runner._sync_metadata_across_dp` (packed_tensor shape
`[2, dp_size]`) for DP coordination instead of upstream
`coordinate_batch_across_dp` (shape `[4, dp_size]`), so all DP
collectives in a single step that hit the cpu_group use a
consistent tensor shape.
3. Fail fast at the entry of the override with a clear `AssertionError`
if the proposer was constructed without a `runner` reference,
instead of letting the unguarded `runner._pad_for_sequence_parallelism`
call raise a confusing `AttributeError`.
4. Document the `is_draft_model=True` semantics: it intentionally
makes `should_skip_allreduce_across_dp_group` short-circuit (the
cache-only "draft" here is not MoE), so the call degenerates to a
local broadcast. The actual cross-DP all_reduce has already been
done by the main runner earlier in the step; the SP padding above
is what keeps the value TP-aligned regardless.
### Does this PR introduce _any_ user-facing change?
No user-facing API change. Fixes a runtime crash for users running
`extract_hidden_states` speculative decoding with
`--data-parallel-size > 1` on MoE target models on Ascend NPU.
Single-DP runs and dense target models (e.g. Qwen3-8B) are unaffected.
### How was this patch tested?
Reproduced the crash and verified the fix on MiniMax-M2:
- 2x8 NPU, `--tensor-parallel-size 4 --data-parallel-size 2`
- `--enable-expert-parallel`
- `--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'`
- `--speculative-config '{"method": "extract_hidden_states", "num_speculative_tokens": 1, "draft_model_config": {"hf_config": {"eagle_aux_hidden_state_layer_ids": [2, 18, 34]}}}'`
- `--kv-transfer-config '{"kv_connector": "ExampleHiddenStatesConnector", "kv_role": "kv_producer", ...}'`
- Sending one `/v1/completions` request with `max_tokens=1`:
- Before fix: idle DP rank crashes on first `execute_dummy_batch`
with the `AssertionError` shown above.
- After fix: request returns 200 OK, hidden_states `.safetensors`
file is written with the expected
`(prompt_len, len(layer_ids), hidden_size)` shape.
Also verified the existing dense Qwen3-8B + extract_hidden_states path
still works unchanged.
Unit tests added in
`tests/ut/spec_decode/test_extract_hidden_states_proposer.py`:
- `test_determine_batch_execution_and_padding_asserts_when_runner_is_none`:
regression guard for the `AttributeError` that would otherwise be
raised on the unguarded `self.runner._pad_for_sequence_parallelism`
call at the entry of the override.
- `test_determine_batch_execution_and_padding_dp1_sp_pads_and_skips_sync`:
with DP=1 the runner's `_pad_for_sequence_parallelism` is still
consulted (so cache_only forward gets an SP-aligned input) but
`_sync_metadata_across_dp` is not called.
- `test_determine_batch_execution_and_padding_dp2_uses_runner_sync`:
with DP>1 the override calls `runner._sync_metadata_across_dp`
with the SP-padded `num_tokens` and `is_draft_model=True`, and does
NOT call the upstream `coordinate_batch_across_dp` (regression
guard for the gloo `8 vs 4` shape mismatch).
- `test_determine_batch_execution_and_padding_dp2_keeps_tp_aligned_for_main_forward`:
regression guard for the `reduce_scatter`
`input.shape[0] % world_size == 0` assertion 闁?the final
`num_tokens_padded` returned to the caller is always TP-aligned.
Related upstream PR: vllm-project/vllm#39949 (introduced
extract_hidden_states speculative method).
Signed-off-by: learning-sketch <learning-sketch@users.noreply.github.com>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses runtime crashes encountered when using speculative decoding with Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
There was a problem hiding this comment.
Code Review
Suggested PR Title:
[vLLM-Ascend][Ops][Feature] Implement _determine_batch_execution_and_padding for AscendExtractHiddenStatesProposerSuggested PR Summary:
### What this PR does / why we need it?
This pull request implements the `_determine_batch_execution_and_padding` method in `AscendExtractHiddenStatesProposer`. This method ensures sequence parallelism (SP) padding is applied before data parallel (DP) synchronization, matching the behavior of the main model runner. It also reuses the runner's DP synchronization mechanism to prevent shape mismatches during DP coordination.
Feedback from the review suggests:
- Adding checks to verify if the runner supports `_pad_for_sequence_parallelism` and `_sync_metadata_across_dp` to avoid `AttributeError` when used with the v2 runner (`NPUModelRunner`), raising a clear `NotImplementedError` instead.
- Removing a redundant in-place assignment to `num_tokens_across_dp` to prevent potential unnecessary device synchronization.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The changes are covered by new unit tests added in `tests/ut/spec_decode/test_extract_hidden_states_proposer.py` which mock the runner and verify padding, DP synchronization, and error handling.… proposer - Guard `runner._pad_for_sequence_parallelism` and `runner._sync_metadata_across_dp` with `hasattr` checks, raising a clear `NotImplementedError` when paired with a runner (e.g. the v2 NPUModelRunner) that does not implement them, instead of a confusing `AttributeError`. - Drop the redundant in-place `num_tokens_across_dp[self.dp_rank] = ...` assignment: the value was just read from that slot and asserted equal to `batch_desc.num_tokens`, and the write can trigger an unnecessary NPU device sync. - Reformat the unit test file to satisfy `ruff-format`. Signed-off-by: learning-sketch <learning-sketch@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com>
What this PR does / why we need it?
AscendExtractHiddenStatesProposerinherits the upstreamExtractHiddenStatesProposer._determine_batch_execution_and_paddingunchanged, which on Ascend causes two distinct failures when runningextract_hidden_stateson a MoE target model with DP > 1 and sequence parallelism enabled (e.g. MiniMax-M2 withVLLM_ASCEND_ENABLE_FLASHCOMM1=1):gloo shape mismatch on the DP cpu_group:
Upstream
coordinate_batch_across_dpposts a[4, dp_size]int32 tensor to the DP cpu_group, while Ascend's main runner uses_sync_metadata_across_dpwith a[2, dp_size]tensor on the same cpu_group. The two shapes collide within one step.reduce_scatter shape-not-divisible assertion on the idle DP rank:
The proposer's own
cudagraph_dispatcheris initialized as PIECEWISE/NONE only (never FULL), sodispatch(num_tokens=6)returns 6 as-is (no SP padding). That 6 enters DP sync, the synced max stays 6, and the idle DP rank's main MoE forward then crashes in SP reduce_scatter because6 % TP=4 != 0.Eagle3/MTP do not reproduce this because Ascend's
AscendEagleProposerusesrunner.cudagraph_dispatcher.dispatch(...)which dispatches against the runner's FULL-mode capture sizes (always TP-aligned).Fix
Override
AscendExtractHiddenStatesProposer._determine_batch_execution_and_padding:SP-pad
num_tokensviarunner._pad_for_sequence_parallelismbefore dispatch, so the contribution to DP sync is always TP-aligned. Mirrors what the runner's main path does atmodel_runner_v1.py:_determine_batch_execution_and_padding.Use
runner._sync_metadata_across_dp(packed_tensor shape[2, dp_size]) for DP coordination instead of upstreamcoordinate_batch_across_dp(shape[4, dp_size]), so all DP collectives in a single step that hit the cpu_group use a consistent tensor shape.Fail fast at the entry of the override with a clear
AssertionErrorif the proposer was constructed without arunnerreference, instead of letting the unguardedrunner._pad_for_sequence_parallelismcall raise a confusingAttributeError.Document the
is_draft_model=Truesemantics: it intentionally makesshould_skip_allreduce_across_dp_groupshort-circuit (the cache-only "draft" here is not MoE), so the call degenerates to a local broadcast. The actual cross-DP all_reduce has already been done by the main runner earlier in the step; the SP padding above is what keeps the value TP-aligned regardless.Does this PR introduce any user-facing change?
No user-facing API change. Fixes a runtime crash for users running
extract_hidden_statesspeculative decoding with--data-parallel-size > 1on MoE target models on Ascend NPU. Single-DP runs and dense target models (e.g. Qwen3-8B) are unaffected.How was this patch tested?
Reproduced the crash and verified the fix on MiniMax-M2:
--tensor-parallel-size 8 --data-parallel-size 2--enable-expert-parallel--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'--speculative-config '{"method": "extract_hidden_states", "num_speculative_tokens": 1, "draft_model_config": {"hf_config": {"eagle_aux_hidden_state_layer_ids": [2, 18, 34]}}}'--kv-transfer-config '{"kv_connector": "ExampleHiddenStatesConnector", "kv_role": "kv_producer", ...}'/v1/completionsrequest withmax_tokens=1:execute_dummy_batchwith theAssertionErrorshown above..safetensorsfile is written with the expected(prompt_len, len(layer_ids), hidden_size)shape.Also verified the existing dense Qwen3-8B + extract_hidden_states path still works unchanged.
Unit tests added in
tests/ut/spec_decode/test_extract_hidden_states_proposer.py:test_determine_batch_execution_and_padding_asserts_when_runner_is_none: regression guard for theAttributeErrorthat would otherwise be raised on the unguardedself.runner._pad_for_sequence_parallelismcall at the entry of the override.test_determine_batch_execution_and_padding_dp1_sp_pads_and_skips_sync: with DP=1 the runner's_pad_for_sequence_parallelismis still consulted (so cache_only forward gets an SP-aligned input) but_sync_metadata_across_dpis not called.test_determine_batch_execution_and_padding_dp2_uses_runner_sync: with DP>1 the override callsrunner._sync_metadata_across_dpwith the SP-paddednum_tokensandis_draft_model=True, and does NOT call the upstreamcoordinate_batch_across_dp(regression guard for the gloo8 vs 4shape mismatch).test_determine_batch_execution_and_padding_dp2_keeps_tp_aligned_for_main_forward: regression guard for thereduce_scatterinput.shape[0] % world_size == 0assertion 闁?the finalnum_tokens_paddedreturned to the caller is always TP-aligned.Related upstream PR: vllm-project/vllm#39949 (introduced extract_hidden_states speculative method).