Phase B + C: Qwen3.5 own-the-stack — fork-owned model class + layers #3

Merged
Natfii merged 2 commits into main from feat/own-the-stack-phase-c on Apr 20, 2026

Conversation

Natfii commented Apr 20, 2026

Summary

Lands the Phase B + C own-the-stack refactor for Qwen3.5 on the fork:

  • Phase B (2b2d67c3b, already on main): the Qwen3_5Attention model class
    moved from upstream vllm/model_executor/models/qwen3_next.py to fork-owned
    vllm/nvllm/models/qwen3_5.py, with a 15-line from vllm.nvllm.models.qwen3_5 import *
    shim in the upstream-tracking file (sketched after this list). Adds the
    attach_fusion(parent_layer) API on the CuTe paged backend.

  • Phase C (this PR, 2 commits):

    • cbfadb6a9 refactor(nvllm): Phase C own-the-stack — layers into vllm/nvllm/layers/
    • 6434802d6 chore(nvllm): Phase C — trace evidence + audit fixes

    Moves Qwen3_5RMSNorm and Qwen3_5MLP out of upstream-tracking files into
    fork-owned vllm/nvllm/layers/layernorm.py and vllm/nvllm/layers/mlp.py.
    Qwen3_5RMSNorm registers as CustomOp "qwen3_5_rms_norm"; Qwen3_5MLP
    is a dense-only copy of Qwen2MoeMLP that drops expert_gate/reduce_results.
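
For concreteness, a minimal sketch of the upstream-path shim named above — the
real file's contents aren't reproduced in this PR, so treat the body as
illustrative:

```python
# vllm/model_executor/models/qwen3_5.py (upstream path) — sketch only; the
# real shim is 15 lines including the fork's SPDX header.
# Re-export the fork-owned implementations so existing upstream-path imports
# keep resolving unchanged.
from vllm.nvllm.models.qwen3_5 import *  # noqa: F401,F403
```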

Why

The fused-MLP kernel work (Phase D, separate PR off feat/unreal-kernel-phase-d)
needs a place to live that's fork-owned — otherwise every upstream sync drops
our fusion hooks. Own-the-stack gives:

  1. Fork-owned model class — fusion binding evolves without editing upstream
    files (see the sketch after this list).
  2. Fork-owned layer classes — uber-kernel can fuse in-class ops without touching
    upstream Qwen2MoeMLP / GemmaRMSNorm.
  3. Shim files in upstream paths — existing
    from vllm.model_executor.models.qwen3_5 import * imports still work.
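
A rough sketch of the fusion-binding hook named in Phase B — the actual method
body isn't shown in this PR, so the stored state below is an assumption:

```python
# Sketch only — the real implementation lives on the CuTe paged backend.
class CutePagedAttentionImpl:
    def attach_fusion(self, parent_layer) -> None:
        # Keep a handle to the fork-owned Qwen3_5Attention layer so fusion
        # binding can read its weights/config without editing upstream files.
        self._fusion_parent = parent_layer  # hypothetical attribute name
```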

Test plan

  • GSM8K 8/8 on Qwen3.5-27B-NVFP4 via CuTe paged backend (validated 2026-04-17
    at landing time; recorded in memory:project_own_the_stack)
  • Qwen3NextAttention upstream-clean after Phase B off-ramp (matches upstream
    commit 494636b29)
  • Fusion binding via CutePagedAttentionImpl.attach_fusion(parent_layer)
    works for Qwen3.5 (Phase B shipped + smoke-tested)

No model-forward behavior changes — this is pure code-ownership movement, with
shim files keeping all existing imports resolving.

AI assistance

Assembled with AI assistance (Claude Opus 4.7). Every changed line was reviewed
by the submitter. AI-assisted commits carry
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> trailers.

Notes

  • Base: main on Navi-AI-Lab/nvllm (this fork), NOT upstream vllm-project/vllm.
  • Phase D follows in a separate PR off feat/unreal-kernel-phase-d after this lands.

Natfii and others added 2 commits April 17, 2026 14:34
- vllm/nvllm/layers/layernorm.py: Qwen3_5RMSNorm (copy of the GemmaRMSNorm body,
  registered as "qwen3_5_rms_norm" to avoid a CustomOp name collision; the
  registration is sketched below)
- vllm/nvllm/layers/mlp.py: Qwen3_5MLP (copy of dense Qwen2MoeMLP body,
  drops expert_gate + reduce_results kwargs; not used by 27B dense)
- vllm/nvllm/models/qwen3_5.py: 8-line import diff, 3 call-site renames,
  0 logic restructured
- tools/pre_commit/mypy.py: add vllm/nvllm/layers to EXCLUDE
- notebooks/nvllm/layers_smoke_tests.py: 5 host-side Tier-1 tests (5/5 pass)

Ship gate: GSM8K 8/8 on natfii/Qwen3.5-27B-NVFP4-Opus-GB10 with
PIECEWISE CUDA graphs, matching Phase B baseline (commit 4110dc7).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
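
A hedged sketch of the registration just described, assuming vLLM's
CustomOp.register decorator (the import path can vary across vLLM versions);
the copied GemmaRMSNorm body is elided:

```python
# vllm/nvllm/layers/layernorm.py — illustrative sketch.
from vllm.model_executor.custom_op import CustomOp


@CustomOp.register("qwen3_5_rms_norm")  # fork-unique name: avoids colliding
class Qwen3_5RMSNorm(CustomOp):         # with the upstream RMSNorm entry
    # Body is a copy of GemmaRMSNorm (forward_native / forward_cuda, ...),
    # elided here.
    ...
```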
- benchmarks/nvllm/traces/cute_fusion/2026-04-17-phase-c/:
  - summary.md: Tier-3 ship-gate evidence (GSM8K 8/8, commit cbfadb6,
    image nvllm:gb10-ots-phaseC)
  - decode_log.txt: 32-line trimmed CUTE_DEBUG_FUSION math (first
    phaseB+phaseC pair per fusion-active layer 3..63, matches Phase B
    shape at 2026-04-17-own-the-stack/)
- .gitignore: add benchmarks/nvllm/traces/**/*.full.txt rule so the
  unfiltered 1.4MB per-decode dumps stay local-only
- notebooks/nvllm/layers_smoke_tests.py: add missing nvllm fork SPDX
  line (code-review audit suggestion)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii merged commit b7b1e04 into main Apr 20, 2026
Natfii added a commit that referenced this pull request Apr 22, 2026
analyze.py carries the shortlisting logic (importable by both notebooks
and the B.2.2 code generator). gemm_microbench_analysis.ipynb renders
per-shape heatmaps plus the top-3 per (shape x M-bucket) and exports
shortlist.json. The notebook is committed with outputs pre-rendered so a
future heuristic session can open it without re-running the sweep.
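
The shortlisting step itself is small; a sketch under assumed column names
(tflops, shape, m_bucket, config — the real logic lives in analyze.py):

```python
import json

import pandas as pd


def shortlist(df: pd.DataFrame, out_path: str = "shortlist.json") -> pd.DataFrame:
    # Keep the top-3 configs per (shape, M-bucket) cell by measured TFLOP/s,
    # then export the union of winning config names for the B.2.2 dispatcher.
    top3 = (df.sort_values("tflops", ascending=False)
              .groupby(["shape", "m_bucket"], sort=False)
              .head(3))
    with open(out_path, "w") as f:
        json.dump(sorted(top3["config"].unique().tolist()), f, indent=2)
    return top3
```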

Shortlist: 12 unique configs across 4 shapes x 4 M-buckets (16/16 cells
populated). Tile 128x128x256_*_Pers dominates gate_up_proj / down_proj
and o_proj mid-M; 128x256x128 wins qkv_proj / o_proj at small M;
256x128x128_TmaWSCoop_Pers wins o_proj at M=192-256. Pers tile scheduler
wins nearly every bucket — SK only appears as a #3 fallback. smoke_M256
(the baseline) still places in qkv_proj at large M, showing that the
remaining tuning headroom there is tight.

Next (B.2.2): register shortlisted configs in the C++ dispatcher behind
NVLLM_FP4_GEMM_CONFIG_M256 env var, rebuild, per-config E2E traces.

Refs: docs/superpowers/plans/2026-04-21-gemm-sweep.md Task B.2.1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii deleted the feat/own-the-stack-phase-c branch April 22, 2026 08:38
Natfii added a commit that referenced this pull request Apr 25, 2026
Each of the 16 full_attention layers in Qwen3.5-27B attaches its own
PhaseE_Beta_Kernel instance with its own `self._compiled_phase_coop_full
= None`, so `cute.compile()` fires once per layer on first request —
16 × ~23 s ≈ ~6 min cold-start stall.

Fix: module-level `_PHASE_E_COOP_FULL_COMPILE_CACHE` keyed by the tuple
of all 22 `self.` constexprs read inside `_jit_launch_phase_0_to_4`
(audited via grep; key covers them all + 12 safe-redundant derived
fields). Instances with matching config share one compiled kernel.
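
A condensed sketch of the caching pattern (the real key has 22 constexpr fields
plus 12 derived ones; compile_fn stands in for the per-config cute.compile()
call):

```python
# Module-level, so every PhaseE_Beta_Kernel instance in the process shares it.
_PHASE_E_COOP_FULL_COMPILE_CACHE: dict[tuple, object] = {}


def get_or_compile(key: tuple, compile_fn) -> object:
    # `key` must cover every `self.` constexpr read inside
    # _jit_launch_phase_0_to_4; instances with equal keys share one kernel.
    compiled = _PHASE_E_COOP_FULL_COMPILE_CACHE.get(key)
    if compiled is None:
        compiled = compile_fn()  # fires cute.compile() once per unique config
        _PHASE_E_COOP_FULL_COMPILE_CACHE[key] = compiled
    return compiled
```

Callers would still assign the result to self._compiled_phase_coop_full so the
back-compat instance attribute keeps getting populated.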

Evidence (`benchmarks/nvllm/traces/phase_e_1/2026-04-24-coop-compile-cache/`):
- 16 β-coop attachments → 1 compile event (was 16).
- Cold Q1 = 79.4 s (compile + decode); warm Q2-Q8 = 22.7-23.2 s each.
- Projected savings ≈ 310 s (~5 min) shaved off first-request latency.
- GSM8K sanity PASS 7/8 (Q2 is a regex-extractor artifact on '120/12',
  not a kernel regression — reproduces on baseline without this fix).

Unit tests (`tests/kernels/cute/test_phase_e_compile_cache.py`):
- 6 new tests covering dict existence, key equivalence for matching
  configs, key distinctness for different configs, 16-instance → 1-compile
  behavior, distinct-config → N-compiles, and back-compat instance attr
  population.
- 33/33 existing Phase E tests still pass.

Next in Phase E.1: #3 record_function spans (this PR), #2 β-coop SMEM
shrink + #4 matched-concurrency baseline bench (follow-up session),
#5 cudaProfilerApi hook (infra).

Base: 7bc5773

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii added a commit that referenced this pull request Apr 25, 2026
Torch profiler traces of the CuTe backend lumped all 16 full_attention
layers together because the β-coop and β-lite call sites in
`_backend.forward()` emitted no span markers. Per-layer attribution was
only inferrable from kernel names, not from the profiler row labels.

Wrap each call site in `torch.profiler.record_function`:

- β-coop (line ~1144): `PhaseE_Beta.coop.{_layer_name}`
- β-lite (line ~1219): `PhaseE_Beta.lite.{_layer_name}`

`record_function` is a no-op when no profiler is active, so there is
zero steady-state cost. In profile captures the spans give one row per
layer per path in chrome://tracing.
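
A sketch of the wrapping at the β-coop site (run_beta_coop_full and _layer_name
are the names given in this commit; the wrapper function and its argument list
are invented for illustration):

```python
from torch.profiler import record_function


def run_coop_with_span(backend, q, kv_cache):
    # One profiler row per layer per path; record_function is a no-op when
    # no profiler is active, so the steady-state cost is zero.
    with record_function(f"PhaseE_Beta.coop.{backend._layer_name}"):
        return backend.run_beta_coop_full(q, kv_cache)
```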

Unit tests (`tests/kernels/cute/test_phase_e_record_function_spans.py`):
- record_function is imported.
- β-coop branch wraps the run_beta_coop_full call with a span labelled
  PhaseE_Beta.coop.{_layer_name}.
- β-lite branch wraps the _mlp_kernel call with a span labelled
  PhaseE_Beta.lite.{_layer_name}.
- Span labels distinct between paths.

4/4 new tests pass, 43/43 total Phase E tests pass. Integration verification
of the spans' trace output is deferred to the next live profile capture
(the wrap syntax is covered by the unit tests; the runtime behaviour of
record_function is owned by torch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii added a commit that referenced this pull request May 4, 2026
Original Task 9 plan ("parameterize host Phase 1 mask helpers on
wo_split") was already subsumed by the Task 4+5 combined dispatch.
Repurposed to address the kernel-side cleanups flagged by the Task 8
spec+quality review:

#2 (Important): R11 timing/spin/exit gates now use wo_split_const
    instead of self.wo_split, matching the W_O block (Task 8). Both
    are bound from int(self.wo_split) in the same JIT compile call,
    but mixing the two in the same kernel body forced readers to
    verify equivalence. Now uniform across the kernel body.

#3 (Minor): Hoisted a single pre_wo_consumer_active = (bx>0 &&
    bx<wo_split_const && by<num_kv_heads) above the R11 entry; reused
    at entry timing, spin gate, and exit timing (see the sketch after
    this list). Removes the duplicate pre_wo_consumer_active2
    copy-paste artifact.

#4 (Minor): Dropped "# NEW:" prefix from the wo_split cache-key inline
    comment — the marker would go stale at PR.

#5 (Real, fixed in the same diff via the L253 comment block): the
    bound-restriction comment now points to
    docs/research/2026-05-03-w-o-k-parallel-harness/torch_reference.py
    (the committed path) instead of
    /tmp/wo_split_repro_workdir/torch_reference.py (a machine-local
    transient).

#6 (Minor): Added a 3-line comment block before the new
    pre_wo_consumer_active declaration explaining that bx==0 producers
    skip R11 because their attn_output reads are intra-CTA — the
    cross-CTA safety derivation the spec reviewer flagged as
    undocumented.
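
An illustrative sketch of the #2/#3 shape after the cleanup — the predicate
expression comes from the commit, the gate helpers are stand-ins:

```python
def r11_gates(bx, by, wo_split_const, num_kv_heads,
              entry_timing, spin_gate, exit_timing):
    # Hoisted once (item #3); previously re-derived at each R11 site, with a
    # pre_wo_consumer_active2 copy-paste duplicate.
    pre_wo_consumer_active = (bx > 0 and bx < wo_split_const
                              and by < num_kv_heads)
    if pre_wo_consumer_active:
        entry_timing()  # R11 entry timing
    if pre_wo_consumer_active:
        spin_gate()     # R11 spin gate
    if pre_wo_consumer_active:
        exit_timing()   # R11 exit timing
```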

Deferred to merge-prep (per user direction):
- #1: total_ctas_per_seq_attn dead-arg cleanup (Task 12 PR-prep)
- #7: cutlass.const_expr gate on wo_split=1 producer fence/atomic
       (revisit if Task 10/11 evidence shows wo_split=1 overhead matters)

Pure refactor — bit-exact gate at wo_split=1 AND wo_split=8 still
passes with max_abs == 0.0 against reference_split_order. Cache MISS
on first launch (wo_split_const reference and mask hoist change the
PTX even though numerics are identical at runtime).

Task 9 of 12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>