evidence(W_O K-parallel): validation harness + 8.39x sweep + NCU memory-bound #7
Merged
Conversation
…d endpoint

Standalone validation harness for the W_O K-parallel lever proposed by PR #6's per-region timing breakdown. Built in isolation from production to satisfy PR #6's matrix gate ("Conditional. Pursue only if NCU shows memory-bound classification").

Harness (docs/research/2026-05-03-w-o-k-parallel-harness/):
- microkernel.py: W_O kernel with wo_split Constexpr and per-K-group block-scale dequant; production-equivalent FP4 + FP8 scale path
- torch_reference.py: per-variant FP32 reference mirroring each variant's reduction tree (chained for split=1, slot-id-ordered for splits)
- run_harness.py: configurable sweep + NCU re-exec; enforces the correctness gate before timing
- run_sweep.sh: docker-wrapped 4-variant sweep (CTAs in {4, 8, 16, 32})

Sweep evidence (benchmarks/.../2026-05-03-w-o-k-parallel-harness/):
- 50 launches per variant, all bit-exact (max_abs=0.0) against reference_split_order(wo_split)
- Means: 13754 / 5176 / 2693 / 1639 us -> speedups 1.00x / 2.66x / 5.11x / 8.39x

NCU classification (kernel_cutlass__wo_kernel_body_________________0):
- wo_split=1 (4 W_O CTAs): latency-limited; 8.06% peak DRAM bw, 1.17% SM busy, 98.33% no-eligible warps
- wo_split=8 (32 W_O CTAs): memory-bound; 55.95% peak DRAM bw, 6.51% SM busy, 91.86% no-eligible warps
- The W_O lever converts a latency-limited kernel into a memory-bound one. The matrix gate now passes at the scaled endpoint.

NCU run log:
- First attempt aborted at 5h31m (unfiltered profiling generated a 2.7 GB report). Aborted-attempt metadata preserved at ncu/ncu_unfiltered_aborted/wo_split_1/ (.ncu-rep discarded).
- The rerun used --kernel-name regex:wo_kernel_body and --launch-count 1, and dropped --target-processes all. Both endpoints completed in ~60s; reports are 324K and 525K.
- Replay-mode kernel-vs-application failures from the PR #6 follow-up attempts are also recorded; logs at docs/research/2026-05-02-beta-region-breakdown/ncu-attempt{1,2}/.

The production parity gap is recorded as an audit follow-up, not a conclusion: the slope is valid evidence for the W_O lever, but the absolute-timing mismatch needs a separate denominator + launch-shape audit.

Refs: PR #6 (Navi-AI-Lab/nvllm), summary at benchmarks/nvllm/traces/cute_paged_attn/2026-05-03-w-o-k-parallel-harness/summary.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
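For orientation, here is a minimal sketch (not the committed torch_reference.py) of what a split-order-aware FP32 reference like reference_split_order(wo_split) has to do to stay bit-exact against the kernel's reduction tree. The tensor shapes, the function name, and the assumption that W_O arrives already dequantized are illustrative only.

```python
# Minimal sketch, not the committed torch_reference.py: shows why the FP32
# reference must mirror each variant's reduction tree. Shapes, names, and the
# slot-id ordering are assumptions for illustration.
import torch

def reference_split_order_sketch(x, w_dequant, wo_split):
    """x: [M, K] activations, w_dequant: [K, N] already-dequantized W_O.

    Accumulates the K dimension in wo_split groups and combines the partial
    sums in ascending slot-id order, so FP32 rounding matches a kernel that
    reduces its per-K-group partials in that same order.
    """
    K = x.shape[1]
    group = K // wo_split
    partials = []
    for s in range(wo_split):           # one partial per K-group (per W_O CTA slot)
        ks = slice(s * group, (s + 1) * group)
        partials.append(x[:, ks].float() @ w_dequant[ks, :].float())
    out = partials[0]
    for p in partials[1:]:              # slot-id-ordered combine; chained when wo_split == 1
        out = out + p
    return out
```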
Lets the 2026-05-03 parity audit hold active_wo_ctas constant (= wo_split * num_kv_heads) while varying total cooperative-grid size (slice_ctas * num_kv_heads). The GATHER_CTAS effective-bytes term scales with this value.

Used by benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-parity-gap/ to factorialize:
- A: slice_ctas=16, wo_split=1 → 64 grid / 4 active W_O
- B: slice_ctas=8, wo_split=1 → 32 grid / 4 active W_O
- C: slice_ctas=8, wo_split=8 → 32 grid / 32 active W_O

Validated via parser.error: slice_ctas >= wo_split, slice_ctas >= 1. config.json now records slice_ctas, gather_ctas, total_grid_ctas_per_seq, active_wo_ctas; the effective-bytes formula docstring uses the run-actual gather_ctas.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
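A minimal sketch of the validation and config.json bookkeeping this commit describes. The CLI flag names, the num_kv_heads=4 value (inferred from the A/B/C grid arithmetic), and the gather_ctas definition are assumptions, not the committed harness code.

```python
# Sketch only: argument validation via parser.error and the config.json fields
# the commit says are now recorded. Flag names and num_kv_heads are assumed.
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--wo-split", type=int, default=1)
parser.add_argument("--slice-ctas", type=int, default=8)
args = parser.parse_args()

if args.slice_ctas < 1:
    parser.error("slice_ctas must be >= 1")
if args.slice_ctas < args.wo_split:
    parser.error("slice_ctas must be >= wo_split")

num_kv_heads = 4  # assumed; makes slice_ctas=16 -> a 64-CTA grid as in config A
config = {
    "slice_ctas": args.slice_ctas,
    "gather_ctas": args.slice_ctas * num_kv_heads,       # assumed definition
    "total_grid_ctas_per_seq": args.slice_ctas * num_kv_heads,
    "active_wo_ctas": args.wo_split * num_kv_heads,
}
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```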
Parity-gap audit configs:
- A: 4 active W_O CTAs in a production-shaped 64-CTA grid (slice_ctas=16)
- B: 4 active W_O CTAs in the harness 32-CTA grid (slice_ctas=8)
- C: 32 active W_O CTAs in the harness 32-CTA grid (wo_split=8)

A↔B isolates the total-grid effect at fixed active count; B↔C isolates the active-count effect at fixed total grid.

Each config's nsys .nsys-rep, device-event timing.csv, and bit-exact correctness gate (max_abs=0 vs reference_split_order) are committed under benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-parity-gap/.

Audit verdict: the harness B→C slope is a stable 8.34× median; production gather is single-CTA gated, so absolute harness μs does not transfer linearly to production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
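If the committed timing.csv files follow a per-config layout, the B→C slope could be recomputed along these lines; the per-config directory names and the latency_us column are assumptions, not the audit's actual schema.

```python
# Illustrative sketch: recompute the B->C median slope from device-event
# timing CSVs. Directory layout and column name are assumed.
import csv
import statistics

def median_latency_us(path):
    with open(path) as f:
        rows = list(csv.DictReader(f))
    return statistics.median(float(r["latency_us"]) for r in rows)

base = "benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-parity-gap"
b = median_latency_us(f"{base}/B/timing.csv")  # 4 active W_O CTAs, 32-CTA grid
c = median_latency_us(f"{base}/C/timing.csv")  # 32 active W_O CTAs, 32-CTA grid
print(f"B->C slope: {b / c:.2f}x")             # commit reports a stable 8.34x median
```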
Natfii added a commit that referenced this pull request on May 4, 2026
Original Task 9 plan ("parameterize host Phase 1 mask helpers on
wo_split") was already subsumed by the Task 4+5 combined dispatch.
Repurposed to address the kernel-side cleanups flagged by the Task 8
spec+quality review:
#2 (Important): R11 timing/spin/exit gates now use wo_split_const
instead of self.wo_split, matching the W_O block (Task 8). Both
are bound from int(self.wo_split) in the same JIT compile call,
but mixing the two in the same kernel body forced readers to
verify equivalence. Now uniform across the kernel body.
#3 (Minor): Hoisted single pre_wo_consumer_active = (bx>0 &&
bx<wo_split_const && by<num_kv_heads) above the R11 entry; reused
at entry timing, spin gate, and exit timing. Removes the duplicate
pre_wo_consumer_active2 copy-paste artifact.
#4 (Minor): Dropped the "# NEW:" prefix from the wo_split cache-key inline
comment; the marker would go stale once the PR lands.
#5 (Real, fixed in same diff via the L253 comment block): bound-
restriction comment now points to docs/research/2026-05-03-w-o-k-
parallel-harness/torch_reference.py (the committed path) instead
of /tmp/wo_split_repro_workdir/torch_reference.py (machine-local
transient).
#6 (Minor): Added 3-line comment block before the new pre_wo_consumer_active
declaration explaining bx==0 producers skip R11 because their
attn_output reads are intra-CTA — the cross-CTA safety derivation
that the spec reviewer pointed out was undocumented.
Deferred to merge-prep (per user direction):
- #1: total_ctas_per_seq_attn dead-arg cleanup (Task 12 PR-prep)
- #7: cutlass.const_expr gate on wo_split=1 producer fence/atomic
(revisit if Task 10/11 evidence shows wo_split=1 overhead matters)
Pure refactor — bit-exact gate at wo_split=1 AND wo_split=8 still
passes with max_abs == 0.0 against reference_split_order. Cache MISS
on first launch (wo_split_const reference and mask hoist change the
PTX even though numerics are identical at runtime).
Task 9 of 12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
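Illustrative only, and not the committed CuTe-DSL kernel body: a schematic of the item #3 hoist above, with a single pre_wo_consumer_active predicate reused at the three R11 sites that previously mixed a duplicated copy. The surrounding R11 timing/spin/exit logic is reduced to placeholder comments.

```python
# Schematic sketch of the hoisted predicate described in items #3 and #6.
def wo_kernel_body_sketch(bx, by, wo_split_const, num_kv_heads):
    # Hoisted once: bx == 0 producers skip R11 because their attn_output
    # reads are intra-CTA (the cross-CTA safety note from item #6).
    pre_wo_consumer_active = (bx > 0) and (bx < wo_split_const) and (by < num_kv_heads)

    if pre_wo_consumer_active:
        pass  # R11 entry timing (placeholder)
    if pre_wo_consumer_active:
        pass  # R11 spin gate (placeholder)
    if pre_wo_consumer_active:
        pass  # R11 exit timing (placeholder)
```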
Summary
All four split-order variants are bit-exact (max_abs=0.0) against the per-variant FP32 reference; the sweep reaches an 8.39× speedup at 32 W_O CTAs (vs the 4-CTA baseline).

Curated evidence: benchmarks/nvllm/traces/cute_paged_attn/2026-05-03-w-o-k-parallel-harness/summary.md

Scope

The two failed NCU replay modes from the PR #6 follow-up are recorded: --replay-mode kernel (deadlocks the cooperative grid barrier) and --replay-mode application (the long-running vllm serve blocks application replay), with the decision-record line that R2-R4 coupling carries W_O to validation despite the matrix gate not yet being met at that point.

- docs/research/2026-05-02-beta-region-breakdown/ncu-attempt{1,2}/: provenance for the failed in-process attempts that drove the pivot.
- docs/research/2026-05-03-w-o-k-parallel-harness/:
  - microkernel.py: W_O kernel with wo_split Constexpr + per-K-group block-scale dequant (production-equivalent FP4 + FP8 path)
  - torch_reference.py: three FP32 references; reference_split_order(wo_split) is AUTHORITATIVE per variant (mirrors each kernel's reduction tree)
  - run_harness.py: configurable sweep + NCU re-exec, fail-fast correctness gate
  - run_sweep.sh: docker-wrapped 4-variant sweep (CTAs ∈ {4, 8, 16, 32})
  - README.md: design contract + ratification record
- benchmarks/nvllm/traces/cute_paged_attn/2026-05-03-w-o-k-parallel-harness/:
  - ncu/wo_split_1/ and ncu/wo_split_8/: bounded NCU runs (filtered by --kernel-name regex:wo_kernel_body)
  - ncu/ncu_unfiltered_aborted/wo_split_1/: first-attempt metadata only (the 2.7 GB partial .ncu-rep was discarded; the first attempt was aborted at 5h31m after unfiltered profiling of non-target kernels)
  - summary.md: curated headline numbers + classification verdicts

Total new evidence: ~1.1 MB (NCU reports 324 KB + 525 KB; everything else is text).
Headline numbers

Sweep means: 13754 / 5176 / 2693 / 1639 μs at 4 / 8 / 16 / 32 W_O CTAs, i.e. speedups of 1.00× / 2.66× / 5.11× / 8.39×.
NCU at the two endpoints (same kernel symbol, same grid, only the bx < wo_split gate differs):

- wo_split=1 (4 W_O CTAs): latency-limited; 8.06% peak DRAM bandwidth, 1.17% SM busy, 98.33% no-eligible warps
- wo_split=8 (32 W_O CTAs): memory-bound; 55.95% peak DRAM bandwidth, 6.51% SM busy, 91.86% no-eligible warps
Test plan

- Correctness gate: all four variants bit-exact (max_abs=0.0 against reference_split_order(wo_split)).
- Repo hygiene: no stray .ncu-rep in tree (the aborted 2.7 GB report was discarded), no __pycache__ in the harness tree.
- Reproduce: bash docs/research/2026-05-03-w-o-k-parallel-harness/run_sweep.sh; see summary.md § How to reproduce.
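A minimal sketch, under assumed names, of the fail-fast gate plus bounded NCU re-exec that run_harness.py is described as performing; the --wo-split flag, the function signature, and the report path are illustrative, not the committed script.

```python
# Sketch only: fail fast on the bit-exact gate, then re-exec a bounded NCU run
# filtered to the target kernel and a single launch (the fix that brought the
# profile down from 5h31m to ~60s per endpoint).
import subprocess
import torch

def gate_then_profile(kernel_out: torch.Tensor, ref_out: torch.Tensor,
                      wo_split: int, report_dir: str) -> None:
    # Correctness gate before any timing or profiling.
    max_abs = (kernel_out.float() - ref_out.float()).abs().max().item()
    if max_abs != 0.0:
        raise SystemExit(f"correctness gate failed: max_abs={max_abs}")

    subprocess.run(
        ["ncu", "--kernel-name", "regex:wo_kernel_body", "--launch-count", "1",
         "-o", f"{report_dir}/wo_split_{wo_split}",
         "python", "run_harness.py", "--wo-split", str(wo_split)],
        check=True,
    )
```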
Notes

- See docs/research/2026-05-03-w-o-k-parallel-harness/README.md §1b.

🤖 Generated with Claude Code