Phase B + C: Qwen3.5 own-the-stack — fork-owned model class + layers #3

Merged
Natfii merged 2 commits into main from feat/own-the-stack-phase-c on Apr 20, 2026

Conversation

Natfii commented Apr 20, 2026

Summary

Lands the Phase B + C own-the-stack refactor for Qwen3.5 on the fork:

  • Phase B (2b2d67c3b, already on main): the Qwen3_5Attention model class
    moved from upstream vllm/model_executor/models/qwen3_next.py to fork-owned
    vllm/nvllm/models/qwen3_5.py, with a 15-line from vllm.nvllm.models.qwen3_5 import *
    shim in the upstream-tracking file (sketched after this list). Adds the
    attach_fusion(parent_layer) API on the CuTe paged backend.

  • Phase C (this PR, 2 commits):

    • cbfadb6a9 refactor(nvllm): Phase C own-the-stack — layers into vllm/nvllm/layers/
    • 6434802d6 chore(nvllm): Phase C — trace evidence + audit fixes

    Moves Qwen3_5RMSNorm and Qwen3_5MLP out of upstream-tracking files into
    fork-owned vllm/nvllm/layers/layernorm.py and vllm/nvllm/layers/mlp.py.
    Qwen3_5RMSNorm registers as CustomOp "qwen3_5_rms_norm"; Qwen3_5MLP
    is a dense-only copy of Qwen2MoeMLP that drops expert_gate/reduce_results.
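
For concreteness, a minimal sketch of the upstream-path shim named above — the
real file's contents aren't reproduced in this PR, so treat the body as
illustrative:

```python
# vllm/model_executor/models/qwen3_5.py (upstream path) — sketch only; the
# real shim is 15 lines including the fork's SPDX header.
# Re-export the fork-owned implementations so existing upstream-path imports
# keep resolving unchanged.
from vllm.nvllm.models.qwen3_5 import *  # noqa: F401,F403
```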

Why

The fused-MLP kernel work (Phase D, separate PR off feat/unreal-kernel-phase-d)
needs a place to live that's fork-owned — otherwise every upstream sync drops
our fusion hooks. Own-the-stack gives:

  1. Fork-owned model class — fusion binding evolves without editing upstream
    files (see the sketch after this list).
  2. Fork-owned layer classes — uber-kernel can fuse in-class ops without touching
    upstream Qwen2MoeMLP / GemmaRMSNorm.
  3. Shim files in upstream paths — existing
    from vllm.model_executor.models.qwen3_5 import * imports still work.
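
A rough sketch of the fusion-binding hook named in Phase B — the actual method
body isn't shown in this PR, so the stored state below is an assumption:

```python
# Sketch only — the real implementation lives on the CuTe paged backend.
class CutePagedAttentionImpl:
    def attach_fusion(self, parent_layer) -> None:
        # Keep a handle to the fork-owned Qwen3_5Attention layer so fusion
        # binding can read its weights/config without editing upstream files.
        self._fusion_parent = parent_layer  # hypothetical attribute name
```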

Test plan

  • GSM8K 8/8 on Qwen3.5-27B-NVFP4 via CuTe paged backend (validated 2026-04-17
    at landing time; recorded in memory:project_own_the_stack)
  • Qwen3NextAttention upstream-clean after Phase B off-ramp (matches upstream
    commit 494636b29)
  • Fusion binding via CutePagedAttentionImpl.attach_fusion(parent_layer)
    works for Qwen3.5 (Phase B shipped + smoke-tested)

No model-forward behavior changes — this is pure code-ownership movement, with
shim files keeping all existing imports resolving.

AI assistance

Assembled with AI assistance (Claude Opus 4.7). Every changed line was reviewed
by the submitter. AI-assisted commits carry
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> trailers.

Notes

  • Base: main on Navi-AI-Lab/nvllm (this fork), NOT upstream vllm-project/vllm.
  • Phase D follows in a separate PR off feat/unreal-kernel-phase-d after this lands.

Natfii and others added 2 commits April 17, 2026 14:34
- vllm/nvllm/layers/layernorm.py: Qwen3_5RMSNorm (copy of the GemmaRMSNorm body,
  registered as "qwen3_5_rms_norm" to avoid a CustomOp name collision; the
  registration is sketched below)
- vllm/nvllm/layers/mlp.py: Qwen3_5MLP (copy of dense Qwen2MoeMLP body,
  drops expert_gate + reduce_results kwargs; not used by 27B dense)
- vllm/nvllm/models/qwen3_5.py: 8-line import diff, 3 call-site renames,
  0 logic restructured
- tools/pre_commit/mypy.py: add vllm/nvllm/layers to EXCLUDE
- notebooks/nvllm/layers_smoke_tests.py: 5 host-side Tier-1 tests (5/5 pass)

Ship gate: GSM8K 8/8 on natfii/Qwen3.5-27B-NVFP4-Opus-GB10 with
PIECEWISE CUDA graphs, matching Phase B baseline (commit 4110dc7).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
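
A hedged sketch of the registration just described, assuming vLLM's
CustomOp.register decorator (the import path can vary across vLLM versions);
the copied GemmaRMSNorm body is elided:

```python
# vllm/nvllm/layers/layernorm.py — illustrative sketch.
from vllm.model_executor.custom_op import CustomOp


@CustomOp.register("qwen3_5_rms_norm")  # fork-unique name: avoids colliding
class Qwen3_5RMSNorm(CustomOp):         # with the upstream RMSNorm entry
    # Body is a copy of GemmaRMSNorm (forward_native / forward_cuda, ...),
    # elided here.
    ...
```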
- benchmarks/nvllm/traces/cute_fusion/2026-04-17-phase-c/:
  - summary.md: Tier-3 ship-gate evidence (GSM8K 8/8, commit cbfadb6,
    image nvllm:gb10-ots-phaseC)
  - decode_log.txt: 32-line trimmed CUTE_DEBUG_FUSION math (first
    phaseB+phaseC pair per fusion-active layer 3..63, matches Phase B
    shape at 2026-04-17-own-the-stack/)
- .gitignore: add benchmarks/nvllm/traces/**/*.full.txt rule so the
  unfiltered 1.4MB per-decode dumps stay local-only
- notebooks/nvllm/layers_smoke_tests.py: add missing nvllm fork SPDX
  line (code-review audit suggestion)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii merged commit b7b1e04 into main Apr 20, 2026
Natfii added a commit that referenced this pull request Apr 22, 2026
analyze.py carries the shortlisting logic (importable by both notebooks
and the B.2.2 code generator). gemm_microbench_analysis.ipynb renders
per-shape heatmaps plus the top-3 per (shape x M-bucket) and exports
shortlist.json. The notebook is committed with outputs pre-rendered so a
future heuristic session can open it without re-running the sweep.
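
The shortlisting step itself is small; a sketch under assumed column names
(tflops, shape, m_bucket, config — the real logic lives in analyze.py):

```python
import json

import pandas as pd


def shortlist(df: pd.DataFrame, out_path: str = "shortlist.json") -> pd.DataFrame:
    # Keep the top-3 configs per (shape, M-bucket) cell by measured TFLOP/s,
    # then export the union of winning config names for the B.2.2 dispatcher.
    top3 = (df.sort_values("tflops", ascending=False)
              .groupby(["shape", "m_bucket"], sort=False)
              .head(3))
    with open(out_path, "w") as f:
        json.dump(sorted(top3["config"].unique().tolist()), f, indent=2)
    return top3
```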

Shortlist: 12 unique configs across 4 shapes x 4 M-buckets (16/16 cells
populated). Tile 128x128x256_*_Pers dominates gate_up_proj / down_proj
and o_proj mid-M; 128x256x128 wins qkv_proj / o_proj at small M;
256x128x128_TmaWSCoop_Pers wins o_proj at M=192-256. Pers tile scheduler
wins nearly every bucket — SK only appears as a #3 fallback. smoke_M256
(the baseline) still places in qkv_proj at large M, showing that the
remaining tuning headroom there is tight.

Next (B.2.2): register shortlisted configs in the C++ dispatcher behind
NVLLM_FP4_GEMM_CONFIG_M256 env var, rebuild, per-config E2E traces.

Refs: docs/superpowers/plans/2026-04-21-gemm-sweep.md Task B.2.1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii deleted the feat/own-the-stack-phase-c branch April 22, 2026 08:38
Natfii added a commit that referenced this pull request Apr 25, 2026
Each of the 16 full_attention layers in Qwen3.5-27B attaches its own
PhaseE_Beta_Kernel instance with its own `self._compiled_phase_coop_full
= None`, so `cute.compile()` fires once per layer on first request —
16 × ~23 s ≈ ~6 min cold-start stall.

Fix: module-level `_PHASE_E_COOP_FULL_COMPILE_CACHE` keyed by the tuple
of all 22 `self.` constexprs read inside `_jit_launch_phase_0_to_4`
(audited via grep; key covers them all + 12 safe-redundant derived
fields). Instances with matching config share one compiled kernel.
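
A condensed sketch of the caching pattern (the real key has 22 constexpr fields
plus 12 derived ones; compile_fn stands in for the per-config cute.compile()
call):

```python
# Module-level, so every PhaseE_Beta_Kernel instance in the process shares it.
_PHASE_E_COOP_FULL_COMPILE_CACHE: dict[tuple, object] = {}


def get_or_compile(key: tuple, compile_fn) -> object:
    # `key` must cover every `self.` constexpr read inside
    # _jit_launch_phase_0_to_4; instances with equal keys share one kernel.
    compiled = _PHASE_E_COOP_FULL_COMPILE_CACHE.get(key)
    if compiled is None:
        compiled = compile_fn()  # fires cute.compile() once per unique config
        _PHASE_E_COOP_FULL_COMPILE_CACHE[key] = compiled
    return compiled
```

Callers would still assign the result to self._compiled_phase_coop_full so the
back-compat instance attribute keeps getting populated.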

Evidence (`benchmarks/nvllm/traces/phase_e_1/2026-04-24-coop-compile-cache/`):
- 16 β-coop attachments → 1 compile event (was 16).
- Cold Q1 = 79.4 s (compile + decode); warm Q2-Q8 = 22.7-23.2 s each.
- Projected savings ≈ 310 s (~5 min) shaved off first-request latency.
- GSM8K sanity PASS 7/8 (Q2 is a regex-extractor artifact on '120/12',
  not a kernel regression — reproduces on baseline without this fix).

Unit tests (`tests/kernels/cute/test_phase_e_compile_cache.py`):
- 6 new tests covering dict existence, key equivalence for matching
  configs, key distinctness for different configs, 16-instance → 1-compile
  behavior, distinct-config → N-compiles, and back-compat instance attr
  population.
- 33/33 existing Phase E tests still pass.

Next in Phase E.1: #3 record_function spans (this PR), #2 β-coop SMEM
shrink + #4 matched-concurrency baseline bench (follow-up session),
#5 cudaProfilerApi hook (infra).

Base: 7bc5773

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii added a commit that referenced this pull request Apr 25, 2026
Torch profiler traces of the CuTe backend lumped all 16 full_attention
layers together because the β-coop and β-lite call sites in
`_backend.forward()` emitted no span markers. Per-layer attribution was
only inferrable from kernel names, not from the profiler row labels.

Wrap each call site in `torch.profiler.record_function`:

- β-coop (line ~1144): `PhaseE_Beta.coop.{_layer_name}`
- β-lite (line ~1219): `PhaseE_Beta.lite.{_layer_name}`

`record_function` is a no-op when no profiler is active, so there is
zero steady-state cost. In profile captures the spans give one row per
layer per path in chrome://tracing.
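
A sketch of the wrapping at the β-coop site (run_beta_coop_full and _layer_name
are the names given in this commit; the wrapper function and its argument list
are invented for illustration):

```python
from torch.profiler import record_function


def run_coop_with_span(backend, q, kv_cache):
    # One profiler row per layer per path; record_function is a no-op when
    # no profiler is active, so the steady-state cost is zero.
    with record_function(f"PhaseE_Beta.coop.{backend._layer_name}"):
        return backend.run_beta_coop_full(q, kv_cache)
```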

Unit tests (`tests/kernels/cute/test_phase_e_record_function_spans.py`):
- record_function is imported.
- β-coop branch wraps the run_beta_coop_full call with a span labelled
  PhaseE_Beta.coop.{_layer_name}.
- β-lite branch wraps the _mlp_kernel call with a span labelled
  PhaseE_Beta.lite.{_layer_name}.
- Span labels distinct between paths.

4/4 new tests pass, 43/43 total Phase E tests pass. Integration verification
of the spans' trace output is deferred to the next live profile capture
(the wrap syntax is covered by the unit tests; the runtime behaviour of
record_function is owned by torch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii added a commit that referenced this pull request May 4, 2026
Original Task 9 plan ("parameterize host Phase 1 mask helpers on
wo_split") was already subsumed by the Task 4+5 combined dispatch.
Repurposed to address the kernel-side cleanups flagged by the Task 8
spec+quality review:

#2 (Important): R11 timing/spin/exit gates now use wo_split_const
    instead of self.wo_split, matching the W_O block (Task 8). Both
    are bound from int(self.wo_split) in the same JIT compile call,
    but mixing the two in the same kernel body forced readers to
    verify equivalence. Now uniform across the kernel body.

#3 (Minor): Hoisted a single pre_wo_consumer_active = (bx>0 &&
    bx<wo_split_const && by<num_kv_heads) above the R11 entry; reused
    at entry timing, spin gate, and exit timing (see the sketch after
    this list). Removes the duplicate pre_wo_consumer_active2
    copy-paste artifact.

#4 (Minor): Dropped "# NEW:" prefix from the wo_split cache-key inline
    comment — the marker would go stale at PR.

#5 (Real, fixed in the same diff via the L253 comment block): the
    bound-restriction comment now points to
    docs/research/2026-05-03-w-o-k-parallel-harness/torch_reference.py
    (the committed path) instead of
    /tmp/wo_split_repro_workdir/torch_reference.py (a machine-local
    transient).

#6 (Minor): Added a 3-line comment block before the new
    pre_wo_consumer_active declaration explaining that bx==0 producers
    skip R11 because their attn_output reads are intra-CTA — the
    cross-CTA safety derivation the spec reviewer flagged as
    undocumented.
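
An illustrative sketch of the #2/#3 shape after the cleanup — the predicate
expression comes from the commit, the gate helpers are stand-ins:

```python
def r11_gates(bx, by, wo_split_const, num_kv_heads,
              entry_timing, spin_gate, exit_timing):
    # Hoisted once (item #3); previously re-derived at each R11 site, with a
    # pre_wo_consumer_active2 copy-paste duplicate.
    pre_wo_consumer_active = (bx > 0 and bx < wo_split_const
                              and by < num_kv_heads)
    if pre_wo_consumer_active:
        entry_timing()  # R11 entry timing
    if pre_wo_consumer_active:
        spin_gate()     # R11 spin gate
    if pre_wo_consumer_active:
        exit_timing()   # R11 exit timing
```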

Deferred to merge-prep (per user direction):
- #1: total_ctas_per_seq_attn dead-arg cleanup (Task 12 PR-prep)
- #7: cutlass.const_expr gate on wo_split=1 producer fence/atomic
       (revisit if Task 10/11 evidence shows wo_split=1 overhead matters)

Pure refactor — bit-exact gate at wo_split=1 AND wo_split=8 still
passes with max_abs == 0.0 against reference_split_order. Cache MISS
on first launch (wo_split_const reference and mask hoist change the
PTX even though numerics are identical at runtime).

Task 9 of 12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>