Skip to content

[Perf] Skip blocking GPU->CPU sync of num_accepted_tokens in hybrid+a…#42574

Open
mamingyuan-nv wants to merge 1 commit into
vllm-project:mainfrom
mamingyuan-nv:skip-mamba-postprocess-blocking-sync
Open

[Perf] Skip blocking GPU->CPU sync of num_accepted_tokens in hybrid+a…#42574
mamingyuan-nv wants to merge 1 commit into
vllm-project:mainfrom
mamingyuan-nv:skip-mamba-postprocess-blocking-sync

Conversation

@mamingyuan-nv
Copy link
Copy Markdown

@mamingyuan-nv mamingyuan-nv commented May 13, 2026

…lign mode when no mamba block boundary can be crossed

In _update_states_after_model_execute, the per-step .cpu().numpy() on num_accepted_tokens.gpu[:num_reqs] (line ~1486) blocks the EngineCore CPU for the time it takes the GPU to finish the target forward + sampler + sum kernel. For hybrid models running MTP/EAGLE with mamba_cache_mode == "align", this stall happens every decode step and feeds mamba_utils.postprocess_mamba.

postprocess_mamba only does work when a request crosses a mamba block boundary in this iteration:

aligned_new_computed_tokens >= num_tokens_running_state

num_accepted_tokens is bounded by num_speculative_tokens + 1 (the shape of output_token_ids). For typical hybrid configs (mamba_block_size = 4336, num_speculative_tokens = 3), the worst case adds at most 4 tokens per cycle, so the boundary is provably uncrossable in ~98% of decode steps. In that regime we can:

  • Issue an async (non-blocking) device-to-host copy of num_accepted_tokens into the existing pinned buffer.
  • Record num_accepted_tokens_event.
  • Skip the postprocess_mamba call entirely (it would be a no-op).
  • Let the existing event.synchronize() in _prepare_inputs (which fires after the draft forwards in the next iteration) absorb the wait. By that point the GPU has long since finished the copy, so the synchronize is essentially free.

When the skip condition cannot be proven (boundary may be crossed), we fall back to the original blocking .cpu() + postprocess_mamba path, so this is purely a per-step "skip when provably redundant" optimization.

Benchmark (Nemotron-Super-120B-A12B-NVFP4, MTP=3, GB300 single GPU, aiperf 480-req, synthetic_acceptance_length=3):

overall TPS:        65,945 -> 77,411   (+17.4%)
decode TPS:          2,153 ->  2,495   (+15.9%)
inter-token latency: 7.43ms -> 6.41ms  (-13.7%)
avg req latency:    8,700ms -> 7,412ms (-14.8%)
bench wall:          265s   ->   226s  (-14.8%)
mean_acceptance_length:     3.0 -> 3.0  (unchanged, correctness)

nsys (32-req short profile) confirms the mechanism is sync-deferral, not GPU-work reduction:

slow cudaMemcpyAsync count (>1ms):  1,081 -> 80   (-92.6%)
cudaMemcpyAsync host time total:    15.99s -> 6.98s
cudaEventSynchronize host time:     0.041s -> 4.106s (wait deferred here)
GPU kernel time:                    47.46s -> 47.47s (UNCHANGED)
GPU kernel count:                   1,588,009 -> 1,592,716 (+0.3%, noise)

Validation:

  • Unit tests for the skip condition (math): 14/14 pass.
  • In-engine assertion run (P3 + blocking .cpu() + assert aligned < n_running for every skip): 0 failures across the full 480-req bench, confirming the math holds on real workload.
  • GSM8K accuracy (real rejection sampler, all 1319 problems, greedy): P3 90.22% — to be cross-checked against baseline.
  • postprocess_mamba side-effect audit: all mutations are gated by the inner if aligned >= n_running: block, which is provably unreachable when the skip condition returns True.

The patch reads cache_config.mamba_block_size and self.num_spec_tokens from config (no hardcoded MTP=3, no hardcoded block size). It is generic for any hybrid + mamba_cache_mode == "align" spec-decode configuration.

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

@mamingyuan-nv mamingyuan-nv requested a review from njhill as a code owner May 13, 2026 21:48
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions
Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@mergify mergify Bot added the v1 label May 13, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request optimizes the GPU-to-CPU synchronization for Mamba models when the cache mode is set to "align". It introduces a new method, _can_skip_mamba_postprocess, which uses CPU-side state to determine if the Mamba post-processing step will be a no-op. If the worst-case token growth does not cross a Mamba block boundary, the synchronization is deferred and handled asynchronously to reduce overhead. I have no feedback to provide as there were no review comments.

@mamingyuan-nv mamingyuan-nv force-pushed the skip-mamba-postprocess-blocking-sync branch from 9b7e7d5 to 6e303a3 Compare May 13, 2026 21:51
@mamingyuan-nv
Copy link
Copy Markdown
Author

mamingyuan-nv commented May 13, 2026

Optimizations by @Kh4L

1. Co-locate the predicate with postprocess_mamba in mamba_utils.py

The skip-safety argument hinges entirely on a property of postprocess_mamba's inner branch:

if aligned_new_computed_tokens >= num_tokens_running_state:
    # the only path that mutates state

If that condition is ever tweaked (different alignment policy, new edge case for boundary slots), the skip predicate has to move in lockstep — otherwise a step that needs state mutation gets silently skipped. Defining can_skip_mamba_postprocess as a module-level function in mamba_utils.py immediately above postprocess_mamba keeps both pieces of logic on the same screen, which is the strongest hint a future reader can have to update them together. As a method on GPUModelRunner they end up ~3000 lines apart.

The refactor also drops the dependence on self — signature becomes (scheduler_output, input_batch, requests, mamba_block_size) -> bool — which is unit-testable in isolation without instantiating a model runner.

2. Use per-request num_draft_tokens instead of the global num_spec_tokens

max_num_accepted = self.num_spec_tokens + 1

num_spec_tokens is the global config; the actual draft count for request i this iteration is len(scheduler_output.scheduled_spec_decode_tokens.get(req_id, ())) — which can be 0 for prefill iters where no speculation happened, or fewer than the global when partial drafts were scheduled. Using the per-req count gives a tighter bound: prefill iters and any spec-degraded request become eligible to skip, whereas the global bound stays on the slow path. Negligible in steady-state decode (where the headline speedup lives) but principled and free.

n_draft = len(spec_decode.get(req_id, ()))
max_new = n_running + n_draft
if (max_new // block_size) * block_size >= n_running:
    return False

@mamingyuan-nv
Copy link
Copy Markdown
Author

Accuracy validation (lm-evaluation-harness)

Ran the same accuracy validation pipeline that vLLM uses in CI: lm-evaluation-harness (EleutherAI) against vLLM's
OpenAI-compatible /v1/completions endpoint, with real rejection sampling (no synthetic_acceptance_length).

Setup

Framework lm-eval-harness via --model local-completions
Model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
Engine flags --enable-prefix-caching --mamba-ssm-cache-dtype float16 --max-model-len 8192 --max-num-batched-tokens 16384 --gpu-memory-utilization 0.9 --speculative-config '{"model":...,"method":"mtp","num_speculative_tokens":3}' (no synthetic AL)
Hardware NVIDIA GB300, TP=1
Concurrency num_concurrent=16, batch_size=16
Tokenizer huggingface (nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4)

GSM8K (5-shot CoT, 1,319 problems)

Metric Baseline This PR Δ ±2σ window
exact_match · strict-match 0.9075 ± 0.0080 0.9189 ± 0.0075 +1.14 pp ±1.6 pp
exact_match · flexible-extract 0.9113 ± 0.0078 0.9265 ± 0.0072 +1.52 pp ±1.6 pp
Correct (strict) 1197 / 1319 1212 / 1319 +15

GPQA main (zero-shot, 448 problems)

Metric Baseline This PR Δ ±2σ window
acc 0.4420 ± 0.0235 0.4375 ± 0.0235 −0.45 pp ±4.7 pp
acc_norm 0.4420 ± 0.0235 0.4375 ± 0.0235 −0.45 pp ±4.7 pp
Correct 198 / 448 196 / 448 −2

Statistical interpretation

Task |Δ| 2σ threshold Verdict
GSM8K (strict) 1.14 pp 1.6 pp within 2σ — noise
GSM8K (flexible) 1.52 pp 1.6 pp within 2σ — noise
GPQA 0.45 pp 4.7 pp <<1σ — noise

If this PR had introduced a correctness regression (e.g., mamba state drift), we would expect a systematic same-direction drop across both tasks beyond 2σ. Instead, GSM8K is slightly up and GPQA is slightly down — both within run-to-run noise of vLLM's concurrent inference (continuous-batch composition order, FlashInfer atomic-reduction ordering). The patch is statistically indistinguishable from baseline on both
benchmarks.

Reproduce

```bash
lm_eval
--model local-completions
--model_args base_url=http://localhost:8000/v1/completions,model=nemotron120b,num_concurrent=16,timeout=300,max_retries=2,tokenizer_backend=huggingface,tokenizer=nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
--tasks gsm8k,gpqa_main_zeroshot
--batch_size 16
--output_path lmeval_results
--log_samples
```

Copy link
Copy Markdown

@Kh4L Kh4L left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional regression analysis

Independent reproduction on the same workload (Nemotron-Super-120B-A12B-NVFP4, GB300 single GPU, MTP K=3, synthetic_acceptance_length=3, ISL≈34837 — 32K user-context + 2K synthetic, OSL=1024, aiperf with warmup ratio 1/3 and count 30×BS).

Denoised perf, BS=16, n=3+3 interleaved P/B/P/B/P/B (for node-drift control)

Metric Patched (n=3) Baseline (n=3) Δ Welch t
OutThroughput tok/s 2057.7 ± 17.9 1766.7 ± 22.8 +16.47% 17.4
ITL avg ms 6.525 ± 0.093 7.877 ± 0.079 −17.17% −19.2
TTFT avg ms 1186 ± 149 1094 ± 63 +8.4% 0.91 (n.s.)
Total tok/s 72063 61871 +16.47%

Per-arm throughput spread: 1.73% (P) / 2.45% (B). The effect is ~9× the within-arm noise floor; TTFT does not regress significantly.

Scaling check, BS=8, patched only, n=3

Metric Patched (n=3)
OutThroughput tok/s 1319.9 ± 7.96
ITL avg ms 5.325 ± 0.013
TTFT avg ms 684 ± 24
Total tok/s 46224

Per-arm spread: 0.6% on throughput. The optimization is stable across batch sizes.

Per-step logprob equivalence

Captured greedy top-k=5 logprobs for 5 short prompts × 32 tokens under three configs: P (skip=1), P2 (re-run on the same engine — FP-noise baseline), and B (skip=0, after engine restart).

Comparison Same-token Exact logprob match max |Δlogprob|
P vs P2 (same engine, two runs) 155/160 154/160 2.0
B vs P (env toggled, restart) 154/160 154/160 2.0
B vs P2 155/160 154/160 1.0

B-vs-P divergence is statistically indistinguishable from the P-vs-P2 FP-noise baseline. Mismatches happen at the same rate as same-engine repeat runs and always involve competitor tokens with close logits at late steps — consistent with FP non-determinism rather than a math change. First 29 of 32 tokens are bit-identical across all comparisons including logprobs to 6 decimals.

Mechanism check

Skip rate from a per-iteration counter we added locally: stable at 99.43% on this workload (block_size=4336, MTP K=3) — confirms the predicate fires at the expected rate.

@mamingyuan-nv
Copy link
Copy Markdown
Author

extra accuracy tests:

Screenshot 2026-05-13 at 10 58 53 PM

@mamingyuan-nv mamingyuan-nv force-pushed the skip-mamba-postprocess-blocking-sync branch from 6e303a3 to feb99c8 Compare May 14, 2026 16:22
@mamingyuan-nv
Copy link
Copy Markdown
Author

mamingyuan-nv commented May 14, 2026


Screenshot 2026-05-13 at 12 33 10 PM

@njhill njhill requested a review from tdoublep May 14, 2026 21:43
@mamingyuan-nv mamingyuan-nv force-pushed the skip-mamba-postprocess-blocking-sync branch from feb99c8 to 5db89ef Compare May 14, 2026 22:45
@njhill
Copy link
Copy Markdown
Member

njhill commented May 15, 2026

Thanks @mamingyuan-nv, LGTM with the cleanup, would be good for @tdoublep to take a quick look too, I think he had mentioned a plan to do this, not sure whether it was the same approach.

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label May 15, 2026
@mamingyuan-nv
Copy link
Copy Markdown
Author

thank you @njhill

@tdoublep
Copy link
Copy Markdown
Member

tdoublep commented May 15, 2026

Thanks @mamingyuan-nv, LGTM with the cleanup, would be good for @tdoublep to take a quick look too, I think he had mentioned a plan to do this, not sure whether it was the same approach.

Yes we have this PR under review for this: #40172

It is ready to land imo. The approach is different though, removing the need for the sync entirely.

@mamingyuan-nv
Copy link
Copy Markdown
Author

I cross-validated #40172 on our target workload.

Methodology: bind-mounted the two patched files from PR40172 HEAD 662cd43 (whole-file replacement) into the same pinned docker image (vllm/vllm-openai@sha256:bab6eca6…, vllm 28ee78a) used for this PR. Same engine flags, workload, metric pipeline, correctness gate (mean_AL = 3.0) across all three runs.

Caveat: PR40172's authoring base lacks #39487's create_custom_proposer import; no other commits touch either patched file between that base and 28ee78a, so the postprocess hot path is materially identical — but a fully upstream-clean comparison would build a docker image from PR40172's branch.

Results: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4, MTP=3, 1xGB300 (278GB)

┌──────────┬─────────────────┬──────────────────┬─────────┐
│          │   Overall TPS   │     ITL avg      │ mean_AL │
├──────────┼─────────────────┼──────────────────┼─────────┤
│ #40172   │ 65,967          │ 7.70 ms.         │ 3.0     │
├──────────┼─────────────────┼──────────────────┼─────────┤
│ this PR  │ 77,411.         │ 6.41 ms.         │ 3.0     │
└──────────┴─────────────────┴──────────────────┴─────────┘

The workload is concurrency=16, 32k prefix cache, 2k new ISL, 1k OSL

Based on my understanding, both PRs target the same num_accepted_tokens.cpu() blocking sync for hybrid + spec_decode + mamba_cache_mode="align".

#40172 replaces postprocess_mamba with a Triton kernel that runs every step plus per-step populate of GPU metadata buffers in _prepare_inputs.

However this PR is working on different perspective, proving postprocess_mamba is a no-op for the current step and skips the sync entirely; falls back to the original blocking path otherwise.

which means I think two PRs are complimentary from two perspectives - when (skip or not to skip) and where (GPU or CPU)

@ZJY0516 ZJY0516 removed the ready ONLY add when PR is ready to merge/full CI is needed label May 16, 2026
@vadiklyutiy
Copy link
Copy Markdown
Member

@mamingyuan-nv Could you pls measure perf difference for ISL/OSL=1/1024 and BS={512,1024}

@mamingyuan-nv
Copy link
Copy Markdown
Author

@vadiklyutiy

yes,

                      OutTPS   ITL_ms   ΔTPS    ΔITL   ΔTTFT   ΔreqLat
  baseline_bs512       7971    50.08     —       —      —        —
  pr42574_bs512        7981    50.34   +0.14%  +0.52%  -3.00%  -0.02%
  pr40172_bs512        8064    49.39   +1.17%  -1.37%  -0.96%  -1.31%

  baseline_bs1024      8004    50.22     —       —      —        —
  pr42574_bs1024       8050    50.16   +0.57%  -0.11%  -0.81%  -0.50%
  pr40172_bs1024       8148    49.57   +1.80%  -1.30%  -1.84%  -1.60%

…lign mode when no mamba block boundary can be crossed

In `_update_states_after_model_execute`, the per-step `.cpu().numpy()` on
`num_accepted_tokens.gpu[:num_reqs]` (line ~1486) blocks the EngineCore CPU
for the time it takes the GPU to finish the target forward + sampler +
sum kernel. For hybrid models running MTP/EAGLE with
`mamba_cache_mode == "align"`, this stall happens every decode step and
feeds `mamba_utils.postprocess_mamba`.

`postprocess_mamba` only does work when a request crosses a mamba block
boundary in this iteration:

    aligned_new_computed_tokens >= num_tokens_running_state

`num_accepted_tokens` is bounded by `num_speculative_tokens + 1` (the
shape of `output_token_ids`). For typical hybrid configs
(`mamba_block_size = 4336`, `num_speculative_tokens = 3`), the worst case
adds at most 4 tokens per cycle, so the boundary is provably uncrossable
in ~98% of decode steps. In that regime we can:

  - Issue an async (non-blocking) device-to-host copy of
    `num_accepted_tokens` into the existing pinned buffer.
  - Record `num_accepted_tokens_event`.
  - Skip the `postprocess_mamba` call entirely (it would be a no-op).
  - Let the existing `event.synchronize()` in `_prepare_inputs` (which
    fires after the draft forwards in the next iteration) absorb the
    wait. By that point the GPU has long since finished the copy, so the
    synchronize is essentially free.

When the skip condition cannot be proven (boundary may be crossed), we
fall back to the original blocking `.cpu()` + `postprocess_mamba` path,
so this is purely a per-step "skip when provably redundant" optimization.

Benchmark (Nemotron-Super-120B-A12B-NVFP4, MTP=3, GB300 single GPU,
aiperf 480-req, synthetic_acceptance_length=3):

    overall TPS:        65,945 -> 77,411   (+17.4%)
    decode TPS:          2,153 ->  2,495   (+15.9%)
    inter-token latency: 7.43ms -> 6.41ms  (-13.7%)
    avg req latency:    8,700ms -> 7,412ms (-14.8%)
    bench wall:          265s   ->   226s  (-14.8%)
    mean_acceptance_length:     3.0 -> 3.0  (unchanged, correctness)

nsys (32-req short profile) confirms the mechanism is sync-deferral, not
GPU-work reduction:

    slow cudaMemcpyAsync count (>1ms):  1,081 -> 80   (-92.6%)
    cudaMemcpyAsync host time total:    15.99s -> 6.98s
    cudaEventSynchronize host time:     0.041s -> 4.106s (wait deferred here)
    GPU kernel time:                    47.46s -> 47.47s (UNCHANGED)
    GPU kernel count:                   1,588,009 -> 1,592,716 (+0.3%, noise)

Validation:
  - Unit tests for the skip condition (math): 14/14 pass.
  - In-engine assertion run (P3 + blocking .cpu() + assert
    `aligned < n_running` for every skip): 0 failures across the full
    480-req bench, confirming the math holds on real workload.
  - GSM8K accuracy (real rejection sampler, all 1319 problems, greedy):
    P3 90.22% — to be cross-checked against baseline.
  - `postprocess_mamba` side-effect audit: all mutations are gated by the
    inner `if aligned >= n_running:` block, which is provably unreachable
    when the skip condition returns True.

The patch reads `cache_config.mamba_block_size` and
`self.num_spec_tokens` from config (no hardcoded MTP=3, no hardcoded
block size). It is generic for any hybrid + `mamba_cache_mode == "align"`
spec-decode configuration.

Signed-off-by: Mingyuan Ma <minma@nvidia.com>
@mamingyuan-nv mamingyuan-nv force-pushed the skip-mamba-postprocess-blocking-sync branch from 5db89ef to d3d965f Compare May 20, 2026 03:27
askliar pushed a commit to askliar/vllm that referenced this pull request May 21, 2026
…de when no mamba block boundary can be crossed

Re-applies vllm-project#42574 on top of the PR vllm-project#41233 series. The
upstream patch targets the pre-vllm-project#41233 two-branch (`is_align` / else)
structure where postprocess_mamba was only called in the align branch.
vllm-project#41233 unified the flow so postprocess_mamba is always called. This
commit adapts the same skip optimisation to the unified structure:

  - Add `can_skip_mamba_postprocess` helper in mamba_utils.py (verbatim
    from PR vllm-project#42574).
  - In `_update_states_after_model_execute`, when in align mode, decide
    on CPU whether any request can cross a mamba block boundary. If
    not, defer the device-to-host `.cpu().numpy()` sync via the
    existing non_blocking copy + event.record() path that the else
    branch uses, and early-return to skip the no-op
    `postprocess_mamba`. Otherwise fall through to the original
    blocking sync + postprocess_mamba call.

Benchmarks from the upstream PR description (Nemotron-Super-120B-A12B-
NVFP4, MTP=3, GB300 single GPU): +17% overall TPS, -13.7% ITL, slow
cudaMemcpyAsync count -92.6%, GPU kernel time unchanged.

Co-Authored-By: Mingyuan Ma <minma@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mamingyuan-nv.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants