cherry-pick: upstream stabilization tier-1 (4 picks) #10
Merged
Conversation
…lm-project#41524) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> (cherry picked from commit c51df43)
Local 3ac2fc0 fixed the read-side and packed-decode guards (`state_idx <= 0` already returns), but the INPLACE_FINAL_STATE write-side still allowed `final_state_idx == 0` to overwrite the null block. This tightens the comparison to `> 0` in both `fused_recurrent_gated_delta_rule_fwd_kernel` and `fused_sigmoid_gating_delta_rule_update_kernel`. Partial port of upstream vLLM PR vllm-project#39064 (commit d4cb783): "Bugfix: Fix GDN FLA kernel crashes with NULL_BLOCK_ID=0 CUDA graph padding". Read-side and packed-decode hunks already match locally and are intentionally not re-applied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
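The guard change itself is tiny; here is a pure-Python sketch of the intended predicate (the `NULL_BLOCK_ID` value and kernel behavior are taken from the commit message — the real check lives inside the Triton kernels, not in a helper like this):

```python
NULL_BLOCK_ID = 0  # CUDA-graph padding slots point at block id 0

def should_write_final_state(final_state_idx: int) -> bool:
    """Write-side guard for the INPLACE_FINAL_STATE path (illustrative).

    The old comparison (final_state_idx >= 0) let padded requests with
    final_state_idx == NULL_BLOCK_ID == 0 overwrite the null block.
    Tightening to > 0 skips them, matching the read-side guard, which
    already returns early on state_idx <= 0.
    """
    return final_state_idx > 0

# Padded slot (idx 0) must be skipped; real slots still write.
assert not should_write_final_state(NULL_BLOCK_ID)
assert should_write_final_state(7)
```

The asymmetry worth noting: the read side was already safe locally, so only the write side needed tightening — hence the partial pick.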
Manual port of upstream vLLM PR vllm-project#40737 (commit 66dfee7): "Bugfix: Fix degenerate KV cache stride causing TMA cudaErrorIllegalInstruction". Adds `canonicalize_singleton_dim_strides()` to `torch_utils`, and applies it across the `flash_attn`, `flash_attn_diffkv`, and `flashinfer` backends after kv_cache layout transforms. Fixes degenerate strides on size-1 dims (e.g. `num_kv_heads=1` with high TP), which TMA on H100+ rejects with `cudaErrorIllegalInstruction`. Skipped upstream's `cpu_resource_utils.py` change (1-line `type: ignore` unrelated to this fix; file does not exist locally). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
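The stride rule is easiest to see on plain (shape, strides) tuples. A minimal pure-Python sketch of the canonicalization (the real `canonicalize_singleton_dim_strides()` operates on torch tensors as a zero-copy view change; this rewrite is safe because a size-1 dim's stride never affects addressing):

```python
def canonicalize_singleton_strides(shape, strides):
    # A size-1 dim is never stepped over, so layout transforms can leave
    # its stride degenerate (e.g. 1). TMA descriptor creation on H100+
    # rejects such layouts. Rewrite each size-1 dim's stride to the
    # contiguous value: the number of elements to its right.
    strides = list(strides)
    for i, dim in enumerate(shape):
        if dim == 1:
            inner = 1
            for d in shape[i + 1:]:
                inner *= d
            strides[i] = inner
    return tuple(strides)

# num_kv_heads=1 example from the verification notes: stride 1 -> 2048 (16*128).
print(canonicalize_singleton_strides((4, 1, 16, 128), (2048, 1, 128, 1)))
# -> (2048, 2048, 128, 1)
```

Inputs with no size-1 dims (or already-canonical strides) pass through unchanged, which matches the "canonical-input no-op" check in the verification list below.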
Manual port of upstream vLLM PR vllm-project#39450 (commit e7cfd7c): "Add Gemma4 Eagle3 support". Adds `EagleModelMixin` to `Gemma4Model` and `SupportsEagle3` to `Gemma4ForCausalLM` / `Gemma4ForConditionalGeneration`, registers `"gemma4"` in `SpeculativeConfig`'s `aux_hidden_states_supported` list, wires `_maybe_add_hidden_state` into `Gemma4Model.forward`, and relaxes `SlidingWindowManager`'s assert in favor of a re-alignment loop so `block_size != alignment_tokens` (Gemma4 hybrid layout) no longer panics with eagle. Builds on local 8ad0aef ("feat: Gemma4 speculative decoding support"), which already added `Gemma4ForConditionalGeneration` to `eagle.py`'s `image_token_index` list — that hunk was intentionally skipped to avoid duplication. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
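The assert-to-loop relaxation has roughly this shape. A hypothetical sketch only — the function name and exact policy are illustrative, not the actual vLLM code — assuming the manager previously asserted that a window boundary was already block-aligned:

```python
def realign(boundary: int, block_size: int, alignment_tokens: int) -> int:
    # Old behavior (roughly): assert boundary % block_size == 0, which
    # panics for hybrid layouts like Gemma4's where block_size and
    # alignment_tokens differ. New behavior: walk the boundary down
    # until it satisfies both granularities instead of asserting.
    while boundary % block_size != 0 or boundary % alignment_tokens != 0:
        boundary -= 1
    return boundary

print(realign(100, 16, 12))  # -> 96, first value aligned to both 16 and 12
print(realign(64, 16, 16))   # -> 64, already-aligned input is a no-op
```

The design trade: a loop is a few extra iterations on a cold path, in exchange for never crashing when the two granularities disagree.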
Apples-to-apples GSM8K 50 + ShareGPT 30-conv + 2048-token longdecode + 2-concurrent probe comparing nvllm:gb10-tier1cp (image dev102+gf3b4d3d09) against the wo1 baseline of the wo_split production soak (image dev69+gf79cf418b). Result: bit-clean. 48/50 vs 48/50 (zero answer divergences across 50 questions), total wall -0.46% (within thermal noise). No errors in ShareGPT, longdecode, or 2-concurrent. Not a perf claim — no nsys captured. See summary.md for the full table and reproduction commands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
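For reference, "bit-clean" here means per-question exact match, not correctness. A trivial sketch of the divergence count behind the 48/50 vs 48/50 comparison (the data and harness below are illustrative assumptions, not the actual evidence-PR runner):

```python
def count_divergences(baseline_answers, candidate_answers):
    # "Zero answer divergences" means every question produced the exact
    # same answer under both images, whether or not that answer was right.
    # Both runs scoring 48/50 with a shared pair of wrong answers still
    # counts as bit-clean.
    assert len(baseline_answers) == len(candidate_answers)
    return sum(b != c for b, c in zip(baseline_answers, candidate_answers))

base = ["42", "17", "wrong", "8"]
cand = ["42", "17", "wrong", "8"]
print(count_divergences(base, cand))  # -> 0, i.e. bit-clean
```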
Author
Smoke verdict: ok to merge. Apples-to-apples vs wo1 baseline (same evidence-PR runner, same image-stack except the 4 picks):
Plus ShareGPT 30-conv, longdecode 2048-tok (finish=length, 986.7 s), and 2-concurrent probe — no errors. Picks are bit-clean on the production decode path. Sub-1% wall delta is not claimed as a perf win (within thermal noise; no nsys captured per AGENTS.md §4). Evidence committed at
🤖 Generated with Claude Code
Summary
Stacks 4 upstream vLLM picks targeting our SM120/DGX Spark stability arcs. Order chosen to land cleanly with minimal blast radius; the riskier `f44afef6d` custom-op-strings pick is intentionally deferred to its own PR.

Picks

- `884b5ae34` (cherry-pick of `c51df4300`) `feedback_flashinfer_autotune_sm120`). Clean cherry-pick, 1-file diff.
- `b383774ad` (`final_state_idx >= 0` → `> 0`) in `fused_recurrent.py` and `fused_sigmoid_gating.py`. Local `3ac2fc0b7` already covered the read-side and packed-decode hunks; this completes the fix.
- `9e3a48cd8` `canonicalize_singleton_dim_strides()` in `torch_utils.py`, applied across the `flash_attn`, `flash_attn_diffkv`, and `flashinfer` backends. Skipped upstream's `cpu_resource_utils.py` 1-line type-ignore (file doesn't exist locally; change is unrelated). Includes the upstream test file verbatim.
- `f3b4d3d09` `EagleModelMixin` on `Gemma4Model`, `SupportsEagle3` on `Gemma4ForCausalLM` / `Gemma4ForConditionalGeneration`, `aux_hidden_states` plumbing in `forward()`, `"gemma4"` registered in `SpeculativeConfig`, and `SlidingWindowManager`'s eagle-pop assert replaced with a re-alignment loop. Skipped the `eagle.py` hunk — local `8ad0aef4d` already added `Gemma4ForConditionalGeneration` to the image-token list.

Verification (so far)
- `git cherry-pick -x` for pick #1 — clean
- `diff` of patch bodies for picks #2/#3/#4 vs upstream commits — byte-identical except for our intentional partial-pick decisions (write-side only for #2, no `cpu_resource_utils.py` for #3, no `"minimax_m2"`/`eagle.py` for #4)
- `Gemma4Model`/`Gemma4ForCausalLM`/`Gemma4ForConditionalGeneration` class bases — `EagleModelMixin` and `SupportsEagle3` correctly added
- `canonicalize_singleton_dim_strides()` — stride 1 → 2048 (16×128), zero-copy preserved, canonical-input no-op

Test plan
- `nvllm:gb10-tier1cp` from `docker/Dockerfile.gb10`) — kicked off in tmux
- ig1/`Qwen3.5-27B-NVFP4` via `/v1/completions` — gate ≥30/50 (kernel-change baseline per `feedback_post_quant_sanity`)
Per AGENTS.md §1: this PR was prepared with AI assistance (Claude Code, Opus 4.7). All cherry-picks were diffed against the upstream commits to verify byte-equivalence (or to document intentional divergence). The submitting human is reviewing each commit and the smoke results before merge. This is not a code-only AI PR.

Not duplicating
This is a per-area cherry-pick batch; no duplicate PR exists in `Navi-AI-Lab/nvllm`. Upstream PRs are merged commits, not active work.

🤖 Generated with Claude Code