Phase B + C: Qwen3.5 own-the-stack — fork-owned model class + layers #3
Merged
Conversation
- vllm/nvllm/layers/layernorm.py: Qwen3_5RMSNorm (copy of the GemmaRMSNorm body, registered as "qwen3_5_rms_norm" to avoid a CustomOp collision)
- vllm/nvllm/layers/mlp.py: Qwen3_5MLP (copy of the dense Qwen2MoeMLP body; drops the expert_gate and reduce_results kwargs, neither used by the 27B dense model)
- vllm/nvllm/models/qwen3_5.py: 8-line import diff, 3 call-site renames, no logic restructured
- tools/pre_commit/mypy.py: add vllm/nvllm/layers to EXCLUDE
- notebooks/nvllm/layers_smoke_tests.py: 5 host-side Tier-1 tests (5/5 pass)

Ship gate: GSM8K 8/8 on natfii/Qwen3.5-27B-NVFP4-Opus-GB10 with PIECEWISE CUDA graphs, matching the Phase B baseline (commit 4110dc7).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- benchmarks/nvllm/traces/cute_fusion/2026-04-17-phase-c/:
  - summary.md: Tier-3 ship-gate evidence (GSM8K 8/8, commit cbfadb6, image nvllm:gb10-ots-phaseC)
  - decode_log.txt: 32-line trimmed CUTE_DEBUG_FUSION math (first phaseB+phaseC pair per fusion-active layer 3..63; matches the Phase B shape at 2026-04-17-own-the-stack/)
- .gitignore: add a benchmarks/nvllm/traces/**/*.full.txt rule so the unfiltered 1.4 MB per-decode dumps stay local-only
- notebooks/nvllm/layers_smoke_tests.py: add the missing nvllm fork SPDX line (code-review audit suggestion)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii added a commit that referenced this pull request on Apr 22, 2026
analyze.py carries the shortlisting logic (importable by both notebooks and the B.2.2 code generator). gemm_microbench_analysis.ipynb renders per-shape heatmaps plus top-3 per (shape x M-bucket) and exports shortlist.json. The notebook is committed with outputs pre-rendered so a future heuristic session can open it without re-running the sweep.

Shortlist: 12 unique configs across 4 shapes x 4 M-buckets (16/16 cells populated). Tile 128x128x256_*_Pers dominates gate_up_proj / down_proj and o_proj at mid-M; 128x256x128 wins qkv_proj / o_proj at small M; 256x128x128_TmaWSCoop_Pers wins o_proj at M=192-256. The Pers tile scheduler wins nearly every bucket; SK only appears as a #3 fallback. smoke_M256 (the baseline) still places in qkv_proj at large M, showing the remaining tuning headroom there is tight.

Next (B.2.2): register the shortlisted configs in the C++ dispatcher behind the NVLLM_FP4_GEMM_CONFIG_M256 env var, rebuild, and capture per-config E2E traces.

Refs: docs/superpowers/plans/2026-04-21-gemm-sweep.md Task B.2.1
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii added a commit that referenced this pull request on Apr 25, 2026
Each of the 16 full_attention layers in Qwen3.5-27B attaches its own PhaseE_Beta_Kernel instance with its own `self._compiled_phase_coop_full = None`, so `cute.compile()` fires once per layer on the first request: 16 × ~23 s ≈ ~6 min of cold-start stall.

Fix: a module-level `_PHASE_E_COOP_FULL_COMPILE_CACHE` keyed by the tuple of all 22 `self.` constexprs read inside `_jit_launch_phase_0_to_4` (audited via grep; the key covers them all, plus 12 safe-redundant derived fields). Instances with matching config share one compiled kernel.

Evidence (`benchmarks/nvllm/traces/phase_e_1/2026-04-24-coop-compile-cache/`):
- 16 β-coop attachments → 1 compile event (was 16).
- Cold Q1 = 79.4 s (compile + decode); warm Q2-Q8 = 22.7-23.2 s each.
- Projected savings ≈ 310 s (~5 min) shaved off first-request latency.
- GSM8K sanity PASS 7/8 (Q2 is a regex-extractor artifact on '120/12', not a kernel regression; it reproduces on the baseline without this fix).

Unit tests (`tests/kernels/cute/test_phase_e_compile_cache.py`):
- 6 new tests covering dict existence, key equivalence for matching configs, key distinctness for different configs, 16-instance → 1-compile behavior, distinct-config → N-compiles, and back-compat instance-attr population.
- 33/33 existing Phase E tests still pass.

Next in Phase E.1: #3 record_function spans (this PR), #2 β-coop SMEM shrink + #4 matched-concurrency baseline bench (follow-up session), #5 cudaProfilerApi hook (infra).

Base: 7bc5773
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii added a commit that referenced this pull request on Apr 25, 2026
Torch profiler traces of the CuTe backend lumped all 16 full_attention
layers together because the β-coop and β-lite call sites in
`_backend.forward()` emitted no span markers. Per-layer attribution was
only inferrable from kernel names, not from the profiler row labels.
Wrap each call site in `torch.profiler.record_function`:
- β-coop (line ~1144): `PhaseE_Beta.coop.{_layer_name}`
- β-lite (line ~1219): `PhaseE_Beta.lite.{_layer_name}`
`record_function` is a no-op when no profiler is active, so there is
zero steady-state cost. In profile captures the spans give one row per
layer per path in chrome://tracing.
Unit tests (`tests/kernels/cute/test_phase_e_record_function_spans.py`):
- record_function is imported.
- β-coop branch wraps the run_beta_coop_full call with a span labelled
PhaseE_Beta.coop.{_layer_name}.
- β-lite branch wraps the _mlp_kernel call with a span labelled
PhaseE_Beta.lite.{_layer_name}.
- Span labels distinct between paths.
4/4 new tests pass, 43/43 total Phase E tests pass. Integration
verification of the spans' trace output is deferred to the next live
profile capture (the wrap syntax is covered by the unit tests; the
runtime behaviour of record_function is owned by torch).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
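The wrapping pattern described above can be sketched as follows. The class and method names are hypothetical stand-ins for the real `_backend.forward()` call sites, and a no-op stub replaces `torch.profiler.record_function` when torch is absent:

```python
# Sketch of the span-wrapping pattern (hypothetical names). When no
# profiler is attached, record_function is a no-op, so the spans add
# zero steady-state cost; under an active profiler each call produces
# one labelled row per layer per path in chrome://tracing.
try:
    from torch.profiler import record_function
except ImportError:
    from contextlib import contextmanager

    @contextmanager
    def record_function(name):  # stand-in for environments without torch
        yield

class BackendForward:
    def __init__(self, layer_name):
        self._layer_name = layer_name

    def forward(self, x, use_coop):
        if use_coop:
            # β-coop call site: span labelled PhaseE_Beta.coop.<layer>
            with record_function(f"PhaseE_Beta.coop.{self._layer_name}"):
                return self._run_beta_coop_full(x)
        # β-lite call site: distinct label PhaseE_Beta.lite.<layer>
        with record_function(f"PhaseE_Beta.lite.{self._layer_name}"):
            return self._mlp_kernel(x)

    def _run_beta_coop_full(self, x):
        return ("coop", x)

    def _mlp_kernel(self, x):
        return ("lite", x)
```

Because `record_function` is a context manager, the wrapped call runs unchanged; only the profiler row label is added.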
Natfii added a commit that referenced this pull request on May 4, 2026
Original Task 9 plan ("parameterize host Phase 1 mask helpers on
wo_split") was already subsumed by the Task 4+5 combined dispatch.
Repurposed to address the kernel-side cleanups flagged by the Task 8
spec+quality review:
#2 (Important): R11 timing/spin/exit gates now use wo_split_const
instead of self.wo_split, matching the W_O block (Task 8). Both
are bound from int(self.wo_split) in the same JIT compile call,
but mixing the two in the same kernel body forced readers to
verify equivalence. Now uniform across the kernel body.
#3 (Minor): Hoisted a single pre_wo_consumer_active = (bx>0 &&
bx<wo_split_const && by<num_kv_heads) above the R11 entry; reused
at entry timing, spin gate, and exit timing. Removes the duplicate
pre_wo_consumer_active2 copy-paste artifact.
#4 (Minor): Dropped the "# NEW:" prefix from the wo_split cache-key
inline comment — the marker would have gone stale after merge.
#5 (Real, fixed in same diff via the L253 comment block): bound-
restriction comment now points to docs/research/2026-05-03-w-o-k-
parallel-harness/torch_reference.py (the committed path) instead
of /tmp/wo_split_repro_workdir/torch_reference.py (machine-local
transient).
#6 (Minor): Added 3-line comment block before the new pre_wo_consumer_active
declaration explaining bx==0 producers skip R11 because their
attn_output reads are intra-CTA — the cross-CTA safety derivation
that the spec reviewer pointed out was undocumented.
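Items #2 and #3 amount to the following pattern, shown here as a plain-Python sketch with hypothetical names (the real kernel body is CuTe DSL): bind wo_split once as a compile-time constant and hoist the shared consumer predicate above all three gates.

```python
# Sketch: wo_split_const is bound once at "compile" time, and the
# consumer predicate is computed once and reused at the entry-timing,
# spin-gate, and exit-timing sites instead of being re-derived thrice.
def make_kernel(wo_split, num_kv_heads):
    wo_split_const = int(wo_split)  # single binding, uniform in the body

    def kernel(bx, by):
        # Hoisted once; bx == 0 producers skip R11 because their
        # attn_output reads are intra-CTA.
        pre_wo_consumer_active = (
            bx > 0 and bx < wo_split_const and by < num_kv_heads
        )
        events = []
        if pre_wo_consumer_active:
            events.append("entry_timing")
        # ... R11 body ...
        if pre_wo_consumer_active:
            events.append("spin_gate")
        # ... W_O block ...
        if pre_wo_consumer_active:
            events.append("exit_timing")
        return events

    return kernel
```

One predicate name at three sites removes the duplicated copy-paste expression the review flagged, without changing which CTAs participate.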
Deferred to merge-prep (per user direction):
- #1: total_ctas_per_seq_attn dead-arg cleanup (Task 12 PR-prep)
- #7: cutlass.const_expr gate on wo_split=1 producer fence/atomic
(revisit if Task 10/11 evidence shows wo_split=1 overhead matters)
Pure refactor — bit-exact gate at wo_split=1 AND wo_split=8 still
passes with max_abs == 0.0 against reference_split_order. Cache MISS
on first launch (wo_split_const reference and mask hoist change the
PTX even though numerics are identical at runtime).
Task 9 of 12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Lands the Phase B + C own-the-stack refactor for Qwen3.5 on the fork:
Phase B (2b2d67c3b, already on main): the Qwen3_5Attention model class moved from upstream vllm/model_executor/models/qwen3_next.py to fork-owned vllm/nvllm/models/qwen3_5.py, with a 15-line `from vllm.nvllm.models.qwen3_5 import *` shim in the upstream-tracking file. Adds an attach_fusion(parent_layer) API on the CuTe paged backend.
Phase C (this PR, 2 commits):
- cbfadb6a9 refactor(nvllm): Phase C own-the-stack — layers into vllm/nvllm/layers/
- 6434802d6 chore(nvllm): Phase C — trace evidence + audit fixes
Moves Qwen3_5RMSNorm and Qwen3_5MLP out of upstream-tracking files into fork-owned vllm/nvllm/layers/layernorm.py and vllm/nvllm/layers/mlp.py. Qwen3_5RMSNorm registers as CustomOp "qwen3_5_rms_norm"; Qwen3_5MLP is a dense-only copy of Qwen2MoeMLP that drops expert_gate/reduce_results.
Why
The fused-MLP kernel work (Phase D, separate PR off feat/unreal-kernel-phase-d) needs a place to live that's fork-owned — otherwise every upstream sync drops our fusion hooks. Own-the-stack gives:
- upstream Qwen2MoeMLP / GemmaRMSNorm.
- `from vllm.model_executor.models.qwen3_5 import *` imports still work.
Test plan
(at landing time; recorded in memory:project_own_the_stack)
- Qwen3NextAttention upstream-clean after the Phase B off-ramp (matches upstream commit 494636b29)
- CutePagedAttentionImpl.attach_fusion(parent_layer) works for Qwen3.5 (Phase B shipped + smoke-tested)
No model-forward behavior changes — pure code ownership movement with shim files
keeping all existing imports resolving.
AI assistance
Assembled with AI assistance (Claude Opus 4.7). Every changed line was reviewed
by the submitter. AI-assisted commits carry
`Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>` trailers.
Notes
- main on Navi-AI-Lab/nvllm (this fork), NOT upstream vllm-project/vllm.
- feat/unreal-kernel-phase-d after this lands.