
cherry-pick: upstream stabilization tier-1 (4 picks) #10

Merged
Natfii merged 5 commits into main from cherry-pick/upstream-stabilization-tier1 on May 7, 2026
Conversation

Natfii commented May 7, 2026

Summary

Stacks 4 upstream vLLM picks targeting our SM120/DGX Spark stability arcs. The order was chosen to land cleanly with minimal blast radius; the riskier f44afef6d custom-op-strings pick is intentionally deferred to its own PR.

Picks

| # | upstream SHA | PR | what / why |
| --- | --- | --- | --- |
| 1 | 884b5ae34 (cherry-pick of c51df4300) | #41524 | Disable FlashInfer autotune by default. Matches our manual workaround on SM120 (feedback_flashinfer_autotune_sm120). Clean cherry-pick, 1-file diff. |
| 2 | b383774ad | partial of #39064 | FLA write-side guard tightened (`final_state_idx >= 0` → `> 0`) in fused_recurrent.py and fused_sigmoid_gating.py. Local 3ac2fc0b7 already covered the read-side and packed-decode hunks; this completes the fix. |
| 3 | 9e3a48cd8 | manual port of #40737 | KV cache stride canonicalization for TMA alignment. New canonicalize_singleton_dim_strides() in torch_utils.py, applied across the flash_attn, flash_attn_diffkv, and flashinfer backends. Skipped upstream's cpu_resource_utils.py one-line type-ignore (file doesn't exist locally; change is unrelated). Includes the upstream test file verbatim. |
| 4 | f3b4d3d09 | manual port of #39450 | Gemma4 EAGLE-3 support: EagleModelMixin on Gemma4Model, SupportsEagle3 on Gemma4ForCausalLM/Gemma4ForConditionalGeneration, aux_hidden_states plumbing in forward(), "gemma4" registered in SpeculativeConfig, and SlidingWindowManager's eagle-pop assert replaced with a re-alignment loop. Skipped the eagle.py hunk; local 8ad0aef4d already added Gemma4ForConditionalGeneration to the image-token list. |

Verification (so far)

Test plan

  • Docker build (nvllm:gb10-tier1cp from docker/Dockerfile.gb10) — kicked off in tmux
  • GSM8K 50 (seed=42, max_tokens=512) against ig1/Qwen3.5-27B-NVFP4 via /v1/completions — gate ≥30/50 (kernel-change baseline per feedback_post_quant_sanity)
  • Single ShareGPT slice replay (max_tokens=128, matching wo_split soak pattern)
  • Append serve readiness, GSM8K result JSON, ShareGPT outputs to this PR
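The GSM8K gate in the plan is a plain threshold check. A minimal sketch of how it could be applied to a result JSON (the `correct`/`total` field names are assumptions, not the actual runner's schema):

```python
import json

def gsm8k_gate(result_json: str, gate: int = 30, total: int = 50) -> bool:
    """Gate per feedback_post_quant_sanity: >= 30/50 after kernel changes."""
    result = json.loads(result_json)
    assert result.get("total", total) == total, "unexpected question count"
    return result["correct"] >= gate  # "correct" is an assumed field name

print(gsm8k_gate('{"correct": 48, "total": 50}'))  # True: 48 >= 30
print(gsm8k_gate('{"correct": 29, "total": 50}'))  # False: below gate
```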

AI assistance disclosure

Per AGENTS.md §1: this PR was prepared with AI assistance (Claude Code, Opus 4.7). All cherry-picks were diffed against the upstream commits to verify byte-equivalence (or to document intentional divergence). The submitting human is reviewing each commit and the smoke results before merge; this is not a code-only AI PR.

Not duplicating

This is a per-area cherry-pick batch; no duplicate PR exists in Navi-AI-Lab/nvllm. Upstream PRs are merged commits, not active work.

🤖 Generated with Claude Code

wzhao18 and others added 5 commits May 7, 2026 08:23
…lm-project#41524)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
(cherry picked from commit c51df43)
Local 3ac2fc0 fixed the read-side and packed-decode guards
(`state_idx <= 0` already returns), but the INPLACE_FINAL_STATE
write-side still allowed `final_state_idx == 0` to overwrite the
null block. This tightens the comparison to `> 0` in both
fused_recurrent_gated_delta_rule_fwd_kernel and
fused_sigmoid_gating_delta_rule_update_kernel.

Partial port of upstream vLLM PR vllm-project#39064 (commit d4cb783):
"Bugfix: Fix GDN FLA kernel crashes with NULL_BLOCK_ID=0 CUDA
graph padding". Read-site and packed-decode hunks already match
locally and are intentionally not re-applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
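The off-by-one this commit fixes can be illustrated in plain Python. This is a hypothetical stand-in for the Triton kernels, not the real code; `NULL_BLOCK_ID`, the list-of-states representation, and the function name are illustrative only:

```python
# With CUDA-graph padding, block id 0 is the NULL block: padded slots
# carry final_state_idx == 0, so a `>= 0` write guard let them clobber
# block 0 while real slots (idx > 0) wrote their own state.
NULL_BLOCK_ID = 0

def write_final_state(final_states, final_state_idx, new_state):
    # Old guard was `final_state_idx >= 0`, which also admitted the
    # padded idx == 0 case; the pick tightens it to `> 0`.
    if final_state_idx > 0:
        final_states[final_state_idx] = new_state

states = [None, None, None]
write_final_state(states, 0, "padded")  # dropped: null block untouched
write_final_state(states, 2, "real")    # real slot still written
print(states)  # [None, None, 'real']
```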
Manual port of upstream vLLM PR vllm-project#40737 (commit 66dfee7):
"Bugfix: Fix degenerate KV cache stride causing TMA
cudaErrorIllegalInstruction".

Adds canonicalize_singleton_dim_strides() to torch_utils, and
applies it across flash_attn, flash_attn_diffkv, and flashinfer
backends after kv_cache layout transforms. Fixes degenerate
strides on size-1 dims (e.g. num_kv_heads=1 with high TP),
which TMA on H100+ rejects with cudaErrorIllegalInstruction.

Skipped upstream's cpu_resource_utils.py change (1-line type:
ignore unrelated to this fix; file does not exist locally).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
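A minimal sketch of what a singleton-stride canonicalization can look like, using plain Python lists instead of torch tensors. The function name matches the port, but the exact rule upstream applies may differ; treat this as an assumption-laden illustration:

```python
def canonicalize_singleton_dim_strides(sizes, strides):
    """Give every size-1 dim the stride it would have in a contiguous
    layout. A size-1 dim's stride is never used for addressing, so
    views/transposes can leave arbitrary ("degenerate") values there,
    which stricter consumers such as TMA descriptors reject."""
    out = list(strides)
    for i in range(len(sizes) - 1, -1, -1):
        if sizes[i] == 1:
            # Stride of the dim just inside us, scaled by its extent
            # (innermost dim canonically gets stride 1).
            out[i] = out[i + 1] * sizes[i + 1] if i + 1 < len(sizes) else 1
    return out

# e.g. num_kv_heads=1 under high TP can leave a junk stride on that dim:
print(canonicalize_singleton_dim_strides([2, 1, 4], [4, 9999, 1]))  # [4, 4, 1]
```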
Manual port of upstream vLLM PR vllm-project#39450 (commit e7cfd7c):
"Add Gemma4 Eagle3 support".

Adds EagleModelMixin to Gemma4Model and SupportsEagle3 to
Gemma4ForCausalLM / Gemma4ForConditionalGeneration, registers
"gemma4" in SpeculativeConfig's aux_hidden_states_supported list,
wires _maybe_add_hidden_state into Gemma4Model.forward, and
relaxes SlidingWindowManager's assert in favor of a re-alignment
loop so block_size != alignment_tokens (Gemma4 hybrid layout)
no longer panics with eagle.

Builds on local 8ad0aef ("feat: Gemma4 speculative decoding
support") which already added Gemma4ForConditionalGeneration to
eagle.py's image_token_index list — that hunk was intentionally
skipped to avoid duplication.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
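The aux-hidden-state plumbing described above follows a common EAGLE-3 pattern: the forward pass stashes selected intermediate layer outputs alongside the final hidden state for the draft head. A toy sketch, where integer increments stand in for transformer blocks and `TinyEagle3Model` is entirely hypothetical (the real wiring goes through `_maybe_add_hidden_state` in Gemma4Model.forward):

```python
class TinyEagle3Model:
    def __init__(self, num_layers, aux_layer_ids):
        self.num_layers = num_layers
        # Layers whose outputs the EAGLE-3 draft head consumes.
        self.aux_layer_ids = set(aux_layer_ids)

    def forward(self, hidden):
        aux_hidden_states = []
        for i in range(self.num_layers):
            hidden = hidden + 1  # stand-in for decoder layer i
            if i in self.aux_layer_ids:
                # Mirrors _maybe_add_hidden_state: collect selected
                # intermediate activations during the forward pass.
                aux_hidden_states.append(hidden)
        return hidden, aux_hidden_states

model = TinyEagle3Model(num_layers=4, aux_layer_ids=[0, 2])
final, aux = model.forward(0)
print(final, aux)  # 4 [1, 3]
```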
Apples-to-apples GSM8K 50 + ShareGPT 30-conv + 2048-token longdecode +
2-concurrent probe comparing nvllm:gb10-tier1cp (image dev102+gf3b4d3d09)
against the wo1 baseline of the wo_split production soak (image dev69+gf79cf418b).

Result: bit-clean. 48/50 vs 48/50 (zero answer divergences across 50 questions),
total wall -0.46% (within thermal noise). No errors in ShareGPT, longdecode,
or 2-concurrent. Not a perf claim — no nsys captured.

See summary.md for the full table and reproduction commands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii (Author) commented May 7, 2026

Smoke verdict: ok to merge.

Apples-to-apples vs wo1 baseline (same evidence-PR runner, same image-stack except the 4 picks):

| metric | baseline (nvllm:gb10, dev69+gf79cf418b) | tier1cp (nvllm:gb10-tier1cp, dev102+gf3b4d3d09) | Δ |
| --- | --- | --- | --- |
| GSM8K 50 | 48/50 (96%) | 48/50 (96%) | 0 |
| total wall | 3737.8 s | 3720.6 s | −17.2 s (−0.46%) |
| answer divergences | | | 0/50 |

Plus ShareGPT 30-conv, longdecode 2048-tok (finish=length, 986.7 s), and 2-concurrent probe — no errors.

Picks are bit-clean on the production decode path. Sub-1 % wall delta is not claimed as a perf win (within thermal noise; no nsys captured per AGENTS.md §4).

Evidence committed at benchmarks/nvllm/traces/upstream_stabilization_tier1/2026-05-07-tier1cp-smoke/ (see summary.md).

🤖 Generated with Claude Code

@Natfii Natfii merged commit c761434 into main May 7, 2026
@Natfii Natfii deleted the cherry-pick/upstream-stabilization-tier1 branch May 7, 2026 19:17