
cherry-pick: upstream stabilization tier-1 (4 picks) #10

Merged
Natfii merged 5 commits into main from cherry-pick/upstream-stabilization-tier1 on May 7, 2026
Conversation

Natfii commented May 7, 2026

Summary

Stacks 4 upstream vLLM picks targeting our SM120/DGX Spark stability arcs. The order was chosen to land cleanly with minimal blast radius; the riskier f44afef6d custom-op-strings pick is intentionally deferred to its own PR.

Picks

| # | upstream SHA | PR | what / why |
| --- | --- | --- | --- |
| 1 | 884b5ae34 (cherry-pick of c51df4300) | #41524 | Disable FlashInfer autotune by default. Matches our manual workaround on SM120 (feedback_flashinfer_autotune_sm120). Clean cherry-pick, 1-file diff. |
| 2 | b383774ad | partial of #39064 | FLA write-side guard tightened (`final_state_idx >= 0` → `> 0`) in fused_recurrent.py and fused_sigmoid_gating.py. Local 3ac2fc0b7 already covered the read-side and packed-decode hunks; this completes the fix. |
| 3 | 9e3a48cd8 | manual port of #40737 | KV cache stride canonicalization for TMA alignment. New canonicalize_singleton_dim_strides() in torch_utils.py, applied across the flash_attn, flash_attn_diffkv, and flashinfer backends. Skipped upstream's cpu_resource_utils.py one-line type-ignore (file doesn't exist locally; change is unrelated). Includes the upstream test file verbatim. |
| 4 | f3b4d3d09 | manual port of #39450 | Gemma4 EAGLE-3 support: EagleModelMixin on Gemma4Model, SupportsEagle3 on Gemma4ForCausalLM/Gemma4ForConditionalGeneration, aux_hidden_states plumbing in forward(), "gemma4" registered in SpeculativeConfig, and SlidingWindowManager's eagle-pop assert replaced with a re-alignment loop. Skipped the eagle.py hunk; local 8ad0aef4d already added Gemma4ForConditionalGeneration to the image-token list. |

Verification (so far)

Test plan

  • Docker build (nvllm:gb10-tier1cp from docker/Dockerfile.gb10) — kicked off in tmux
  • GSM8K 50 (seed=42, max_tokens=512) against ig1/Qwen3.5-27B-NVFP4 via /v1/completions — gate ≥30/50 (kernel-change baseline per feedback_post_quant_sanity)
  • Single ShareGPT slice replay (max_tokens=128, matching wo_split soak pattern)
  • Append serve readiness, GSM8K result JSON, ShareGPT outputs to this PR
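The GSM8K gate in the plan is a plain threshold check. A minimal sketch of how it could be applied to a result JSON (the `correct`/`total` field names are assumptions, not the actual runner's schema):

```python
import json

def gsm8k_gate(result_json: str, gate: int = 30, total: int = 50) -> bool:
    """Gate per feedback_post_quant_sanity: >= 30/50 after kernel changes."""
    result = json.loads(result_json)
    assert result.get("total", total) == total, "unexpected question count"
    return result["correct"] >= gate  # "correct" is an assumed field name

print(gsm8k_gate('{"correct": 48, "total": 50}'))  # True: 48 >= 30
print(gsm8k_gate('{"correct": 29, "total": 50}'))  # False: below gate
```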

AI assistance disclosure

Per AGENTS.md §1: this PR was prepared with AI assistance (Claude Code, Opus 4.7). All cherry-picks were diffed against the upstream commits to verify byte-equivalence (or to document intentional divergence). The submitting human is reviewing each commit and the smoke results before merge; this is not a code-only AI PR.

Not duplicating

This is a per-area cherry-pick batch; no duplicate PR exists in Navi-AI-Lab/nvllm. Upstream PRs are merged commits, not active work.

🤖 Generated with Claude Code

wzhao18 and others added 5 commits May 7, 2026 08:23
…lm-project#41524)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
(cherry picked from commit c51df43)
Local 3ac2fc0 fixed the read-side and packed-decode guards
(`state_idx <= 0` already returns), but the INPLACE_FINAL_STATE
write-side still allowed `final_state_idx == 0` to overwrite the
null block. This tightens the comparison to `> 0` in both
fused_recurrent_gated_delta_rule_fwd_kernel and
fused_sigmoid_gating_delta_rule_update_kernel.

Partial port of upstream vLLM PR vllm-project#39064 (commit d4cb783):
"Bugfix: Fix GDN FLA kernel crashes with NULL_BLOCK_ID=0 CUDA
graph padding". Read-site and packed-decode hunks already match
locally and are intentionally not re-applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
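The off-by-one this commit fixes can be illustrated in plain Python. This is a hypothetical stand-in for the Triton kernels, not the real code; `NULL_BLOCK_ID`, the list-of-states representation, and the function name are illustrative only:

```python
# With CUDA-graph padding, block id 0 is the NULL block: padded slots
# carry final_state_idx == 0, so a `>= 0` write guard let them clobber
# block 0 while real slots (idx > 0) wrote their own state.
NULL_BLOCK_ID = 0

def write_final_state(final_states, final_state_idx, new_state):
    # Old guard was `final_state_idx >= 0`, which also admitted the
    # padded idx == 0 case; the pick tightens it to `> 0`.
    if final_state_idx > 0:
        final_states[final_state_idx] = new_state

states = [None, None, None]
write_final_state(states, 0, "padded")  # dropped: null block untouched
write_final_state(states, 2, "real")    # real slot still written
print(states)  # [None, None, 'real']
```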
Manual port of upstream vLLM PR vllm-project#40737 (commit 66dfee7):
"Bugfix: Fix degenerate KV cache stride causing TMA
cudaErrorIllegalInstruction".

Adds canonicalize_singleton_dim_strides() to torch_utils, and
applies it across flash_attn, flash_attn_diffkv, and flashinfer
backends after kv_cache layout transforms. Fixes degenerate
strides on size-1 dims (e.g. num_kv_heads=1 with high TP),
which TMA on H100+ rejects with cudaErrorIllegalInstruction.

Skipped upstream's cpu_resource_utils.py change (1-line type:
ignore unrelated to this fix; file does not exist locally).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
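A minimal sketch of what a singleton-stride canonicalization can look like, using plain Python lists instead of torch tensors. The function name matches the port, but the exact rule upstream applies may differ; treat this as an assumption-laden illustration:

```python
def canonicalize_singleton_dim_strides(sizes, strides):
    """Give every size-1 dim the stride it would have in a contiguous
    layout. A size-1 dim's stride is never used for addressing, so
    views/transposes can leave arbitrary ("degenerate") values there,
    which stricter consumers such as TMA descriptors reject."""
    out = list(strides)
    for i in range(len(sizes) - 1, -1, -1):
        if sizes[i] == 1:
            # Stride of the dim just inside us, scaled by its extent
            # (innermost dim canonically gets stride 1).
            out[i] = out[i + 1] * sizes[i + 1] if i + 1 < len(sizes) else 1
    return out

# e.g. num_kv_heads=1 under high TP can leave a junk stride on that dim:
print(canonicalize_singleton_dim_strides([2, 1, 4], [4, 9999, 1]))  # [4, 4, 1]
```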
Manual port of upstream vLLM PR vllm-project#39450 (commit e7cfd7c):
"Add Gemma4 Eagle3 support".

Adds EagleModelMixin to Gemma4Model and SupportsEagle3 to
Gemma4ForCausalLM / Gemma4ForConditionalGeneration, registers
"gemma4" in SpeculativeConfig's aux_hidden_states_supported list,
wires _maybe_add_hidden_state into Gemma4Model.forward, and
relaxes SlidingWindowManager's assert in favor of a re-alignment
loop so block_size != alignment_tokens (Gemma4 hybrid layout)
no longer panics with eagle.

Builds on local 8ad0aef ("feat: Gemma4 speculative decoding
support") which already added Gemma4ForConditionalGeneration to
eagle.py's image_token_index list — that hunk was intentionally
skipped to avoid duplication.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
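The aux-hidden-state plumbing described above follows a common EAGLE-3 pattern: the forward pass stashes selected intermediate layer outputs alongside the final hidden state for the draft head. A toy sketch, where integer increments stand in for transformer blocks and `TinyEagle3Model` is entirely hypothetical (the real wiring goes through `_maybe_add_hidden_state` in Gemma4Model.forward):

```python
class TinyEagle3Model:
    def __init__(self, num_layers, aux_layer_ids):
        self.num_layers = num_layers
        # Layers whose outputs the EAGLE-3 draft head consumes.
        self.aux_layer_ids = set(aux_layer_ids)

    def forward(self, hidden):
        aux_hidden_states = []
        for i in range(self.num_layers):
            hidden = hidden + 1  # stand-in for decoder layer i
            if i in self.aux_layer_ids:
                # Mirrors _maybe_add_hidden_state: collect selected
                # intermediate activations during the forward pass.
                aux_hidden_states.append(hidden)
        return hidden, aux_hidden_states

model = TinyEagle3Model(num_layers=4, aux_layer_ids=[0, 2])
final, aux = model.forward(0)
print(final, aux)  # 4 [1, 3]
```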
Apples-to-apples GSM8K 50 + ShareGPT 30-conv + 2048-token longdecode +
2-concurrent probe comparing nvllm:gb10-tier1cp (image dev102+gf3b4d3d09)
against the wo1 baseline of the wo_split production soak (image dev69+gf79cf418b).

Result: bit-clean. 48/50 vs 48/50 (zero answer divergences across 50 questions),
total wall -0.46% (within thermal noise). No errors in ShareGPT, longdecode,
or 2-concurrent. Not a perf claim — no nsys captured.

See summary.md for the full table and reproduction commands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii (Author) commented May 7, 2026

Smoke verdict: ok to merge.

Apples-to-apples vs wo1 baseline (same evidence-PR runner, same image-stack except the 4 picks):

| metric | baseline (nvllm:gb10, dev69+gf79cf418b) | tier1cp (nvllm:gb10-tier1cp, dev102+gf3b4d3d09) | Δ |
| --- | --- | --- | --- |
| GSM8K 50 | 48/50 (96%) | 48/50 (96%) | 0 |
| total wall | 3737.8 s | 3720.6 s | −17.2 s (−0.46%) |
| answer divergences | | | 0/50 |

Plus ShareGPT 30-conv, longdecode 2048-tok (finish=length, 986.7 s), and 2-concurrent probe — no errors.

Picks are bit-clean on the production decode path. Sub-1 % wall delta is not claimed as a perf win (within thermal noise; no nsys captured per AGENTS.md §4).

Evidence committed at benchmarks/nvllm/traces/upstream_stabilization_tier1/2026-05-07-tier1cp-smoke/ (see summary.md).

🤖 Generated with Claude Code

@Natfii Natfii merged commit c761434 into main May 7, 2026
@Natfii Natfii deleted the cherry-pick/upstream-stabilization-tier1 branch May 7, 2026 19:17