evidence(W_O K-parallel): validation harness + 8.39x sweep + NCU memory-bound #7
Merged
Conversation
…d endpoint

Standalone validation harness for the W_O K-parallel lever proposed by PR #6's per-region timing breakdown. Built in isolation from production to satisfy PR #6's matrix gate ("Conditional. Pursue only if NCU shows memory-bound classification").

Harness (docs/research/2026-05-03-w-o-k-parallel-harness/):
- microkernel.py: W_O kernel with wo_split Constexpr and per-K-group block-scale dequant; production-equivalent FP4 + FP8 scale path
- torch_reference.py: per-variant FP32 reference mirroring each variant's reduction tree (chained for split=1, slot-id-ordered for splits)
- run_harness.py: configurable sweep + NCU re-exec; enforces the correctness gate before timing
- run_sweep.sh: docker-wrapped 4-variant sweep (CTAs in {4, 8, 16, 32})

Sweep evidence (benchmarks/.../2026-05-03-w-o-k-parallel-harness/):
- 50 launches per variant, all bit-exact (max_abs=0.0) against reference_split_order(wo_split)
- Means: 13754 / 5176 / 2693 / 1639 us -> speedups 1.00x / 2.66x / 5.11x / 8.39x

NCU classification (kernel_cutlass__wo_kernel_body_________________0):
- wo_split=1 (4 W_O CTAs): latency-limited; 8.06% peak DRAM bw, 1.17% SM busy, 98.33% no-eligible warps
- wo_split=8 (32 W_O CTAs): memory-bound; 55.95% peak DRAM bw, 6.51% SM busy, 91.86% no-eligible warps
- The W_O lever converts a latency-limited kernel into a memory-bound one. The matrix gate now passes at the scaled endpoint.

NCU run log:
- First attempt aborted at 5h31m (unfiltered profiling generated a 2.7 GB report). Aborted-attempt metadata preserved at ncu/ncu_unfiltered_aborted/wo_split_1/ (.ncu-rep discarded).
- The rerun used --kernel-name regex:wo_kernel_body and --launch-count 1, and dropped --target-processes all. Both endpoints completed in ~60s; reports are 324K and 525K.
- Replay-mode kernel-vs-application failures from the PR #6 follow-up attempts are also recorded; logs at docs/research/2026-05-02-beta-region-breakdown/ncu-attempt{1,2}/.

The production parity gap is recorded as an audit follow-up, not a conclusion: the slope is valid evidence for the W_O lever, but the absolute-timing mismatch needs a separate denominator + launch-shape audit.

Refs: PR #6 (Navi-AI-Lab/nvllm), summary at benchmarks/nvllm/traces/cute_paged_attn/2026-05-03-w-o-k-parallel-harness/summary.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
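For orientation, here is a minimal sketch (not the committed torch_reference.py) of what a split-order-aware FP32 reference like reference_split_order(wo_split) has to do to stay bit-exact against the kernel's reduction tree. The tensor shapes, the function name, and the assumption that W_O arrives already dequantized are illustrative only.

```python
# Minimal sketch, not the committed torch_reference.py: shows why the FP32
# reference must mirror each variant's reduction tree. Shapes, names, and the
# slot-id ordering are assumptions for illustration.
import torch

def reference_split_order_sketch(x, w_dequant, wo_split):
    """x: [M, K] activations, w_dequant: [K, N] already-dequantized W_O.

    Accumulates the K dimension in wo_split groups and combines the partial
    sums in ascending slot-id order, so FP32 rounding matches a kernel that
    reduces its per-K-group partials in that same order.
    """
    K = x.shape[1]
    group = K // wo_split
    partials = []
    for s in range(wo_split):           # one partial per K-group (per W_O CTA slot)
        ks = slice(s * group, (s + 1) * group)
        partials.append(x[:, ks].float() @ w_dequant[ks, :].float())
    out = partials[0]
    for p in partials[1:]:              # slot-id-ordered combine; chained when wo_split == 1
        out = out + p
    return out
```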
Lets the 2026-05-03 parity audit hold active_wo_ctas constant (= wo_split * num_kv_heads) while varying total cooperative-grid size (slice_ctas * num_kv_heads). The GATHER_CTAS effective-bytes term scales with this value.

Used by benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-parity-gap/ to factorialize:
- A: slice_ctas=16, wo_split=1 → 64 grid / 4 active W_O
- B: slice_ctas=8, wo_split=1 → 32 grid / 4 active W_O
- C: slice_ctas=8, wo_split=8 → 32 grid / 32 active W_O

Validated via parser.error: slice_ctas >= wo_split, slice_ctas >= 1. config.json now records slice_ctas, gather_ctas, total_grid_ctas_per_seq, active_wo_ctas; the effective-bytes formula docstring uses the run-actual gather_ctas.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
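A minimal sketch of the validation and config.json bookkeeping this commit describes. The CLI flag names, the num_kv_heads=4 value (inferred from the A/B/C grid arithmetic), and the gather_ctas definition are assumptions, not the committed harness code.

```python
# Sketch only: argument validation via parser.error and the config.json fields
# the commit says are now recorded. Flag names and num_kv_heads are assumed.
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--wo-split", type=int, default=1)
parser.add_argument("--slice-ctas", type=int, default=8)
args = parser.parse_args()

if args.slice_ctas < 1:
    parser.error("slice_ctas must be >= 1")
if args.slice_ctas < args.wo_split:
    parser.error("slice_ctas must be >= wo_split")

num_kv_heads = 4  # assumed; makes slice_ctas=16 -> a 64-CTA grid as in config A
config = {
    "slice_ctas": args.slice_ctas,
    "gather_ctas": args.slice_ctas * num_kv_heads,       # assumed definition
    "total_grid_ctas_per_seq": args.slice_ctas * num_kv_heads,
    "active_wo_ctas": args.wo_split * num_kv_heads,
}
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```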
Parity-gap audit configs:
- A: 4 active W_O CTAs in a production-shaped 64-CTA grid (slice_ctas=16)
- B: 4 active W_O CTAs in the harness 32-CTA grid (slice_ctas=8)
- C: 32 active W_O CTAs in the harness 32-CTA grid (wo_split=8)

A↔B isolates the total-grid effect at fixed active count; B↔C isolates the active-count effect at fixed total grid.

Each config's nsys .nsys-rep, device-event timing.csv, and bit-exact correctness gate (max_abs=0 vs reference_split_order) are committed under benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-parity-gap/.

Audit verdict: the harness B→C slope is a stable 8.34× median; production gather is single-CTA gated, so absolute harness μs does not transfer linearly to production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
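If the committed timing.csv files follow a per-config layout, the B→C slope could be recomputed along these lines; the per-config directory names and the latency_us column are assumptions, not the audit's actual schema.

```python
# Illustrative sketch: recompute the B->C median slope from device-event
# timing CSVs. Directory layout and column name are assumed.
import csv
import statistics

def median_latency_us(path):
    with open(path) as f:
        rows = list(csv.DictReader(f))
    return statistics.median(float(r["latency_us"]) for r in rows)

base = "benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-parity-gap"
b = median_latency_us(f"{base}/B/timing.csv")  # 4 active W_O CTAs, 32-CTA grid
c = median_latency_us(f"{base}/C/timing.csv")  # 32 active W_O CTAs, 32-CTA grid
print(f"B->C slope: {b / c:.2f}x")             # commit reports a stable 8.34x median
```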
Natfii added a commit that referenced this pull request on May 4, 2026
Original Task 9 plan ("parameterize host Phase 1 mask helpers on
wo_split") was already subsumed by the Task 4+5 combined dispatch.
Repurposed to address the kernel-side cleanups flagged by the Task 8
spec+quality review:
#2 (Important): R11 timing/spin/exit gates now use wo_split_const
instead of self.wo_split, matching the W_O block (Task 8). Both
are bound from int(self.wo_split) in the same JIT compile call,
but mixing the two in the same kernel body forced readers to
verify equivalence. Now uniform across the kernel body.
#3 (Minor): Hoisted single pre_wo_consumer_active = (bx>0 &&
bx<wo_split_const && by<num_kv_heads) above the R11 entry; reused
at entry timing, spin gate, and exit timing. Removes the duplicate
pre_wo_consumer_active2 copy-paste artifact.
#4 (Minor): Dropped the "# NEW:" prefix from the wo_split cache-key inline
comment; the marker would go stale once the PR lands.
#5 (Real, fixed in same diff via the L253 comment block): bound-
restriction comment now points to docs/research/2026-05-03-w-o-k-
parallel-harness/torch_reference.py (the committed path) instead
of /tmp/wo_split_repro_workdir/torch_reference.py (machine-local
transient).
#6 (Minor): Added 3-line comment block before the new pre_wo_consumer_active
declaration explaining bx==0 producers skip R11 because their
attn_output reads are intra-CTA — the cross-CTA safety derivation
that the spec reviewer pointed out was undocumented.
Deferred to merge-prep (per user direction):
- #1: total_ctas_per_seq_attn dead-arg cleanup (Task 12 PR-prep)
- #7: cutlass.const_expr gate on wo_split=1 producer fence/atomic
(revisit if Task 10/11 evidence shows wo_split=1 overhead matters)
Pure refactor — bit-exact gate at wo_split=1 AND wo_split=8 still
passes with max_abs == 0.0 against reference_split_order. Cache MISS
on first launch (wo_split_const reference and mask hoist change the
PTX even though numerics are identical at runtime).
Task 9 of 12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
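Illustrative only, and not the committed CuTe-DSL kernel body: a schematic of the item #3 hoist above, with a single pre_wo_consumer_active predicate reused at the three R11 sites that previously mixed a duplicated copy. The surrounding R11 timing/spin/exit logic is reduced to placeholder comments.

```python
# Schematic sketch of the hoisted predicate described in items #3 and #6.
def wo_kernel_body_sketch(bx, by, wo_split_const, num_kv_heads):
    # Hoisted once: bx == 0 producers skip R11 because their attn_output
    # reads are intra-CTA (the cross-CTA safety note from item #6).
    pre_wo_consumer_active = (bx > 0) and (bx < wo_split_const) and (by < num_kv_heads)

    if pre_wo_consumer_active:
        pass  # R11 entry timing (placeholder)
    if pre_wo_consumer_active:
        pass  # R11 spin gate (placeholder)
    if pre_wo_consumer_active:
        pass  # R11 exit timing (placeholder)
```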
Summary
All four split-order variants are bit-exact (max_abs=0.0) against the per-variant FP32 reference; the sweep reaches an 8.39× speedup at 32 W_O CTAs (vs the 4-CTA baseline).

Curated evidence: benchmarks/nvllm/traces/cute_paged_attn/2026-05-03-w-o-k-parallel-harness/summary.md

Scope

The two failed NCU replay modes from the PR #6 follow-up are recorded: --replay-mode kernel (deadlocks the cooperative grid barrier) and --replay-mode application (the long-running vllm serve blocks application replay), with the decision-record line that R2-R4 coupling carries W_O to validation despite the matrix gate not yet being met at that point.

- docs/research/2026-05-02-beta-region-breakdown/ncu-attempt{1,2}/: provenance for the failed in-process attempts that drove the pivot.
- docs/research/2026-05-03-w-o-k-parallel-harness/:
  - microkernel.py: W_O kernel with wo_split Constexpr + per-K-group block-scale dequant (production-equivalent FP4 + FP8 path)
  - torch_reference.py: three FP32 references; reference_split_order(wo_split) is AUTHORITATIVE per variant (mirrors each kernel's reduction tree)
  - run_harness.py: configurable sweep + NCU re-exec, fail-fast correctness gate
  - run_sweep.sh: docker-wrapped 4-variant sweep (CTAs ∈ {4, 8, 16, 32})
  - README.md: design contract + ratification record
- benchmarks/nvllm/traces/cute_paged_attn/2026-05-03-w-o-k-parallel-harness/:
  - ncu/wo_split_1/ and ncu/wo_split_8/: bounded NCU runs (filtered by --kernel-name regex:wo_kernel_body)
  - ncu/ncu_unfiltered_aborted/wo_split_1/: first-attempt metadata only (the 2.7 GB partial .ncu-rep was discarded; the first attempt was aborted at 5h31m after unfiltered profiling of non-target kernels)
  - summary.md: curated headline numbers + classification verdicts

Total new evidence: ~1.1 MB (NCU reports 324 KB + 525 KB; everything else is text).
Headline numbers

Sweep means: 13754 / 5176 / 2693 / 1639 μs at 4 / 8 / 16 / 32 W_O CTAs, i.e. speedups of 1.00× / 2.66× / 5.11× / 8.39×.
NCU at the two endpoints (same kernel symbol, same grid, only the bx < wo_split gate differs):

- wo_split=1 (4 W_O CTAs): latency-limited; 8.06% peak DRAM bandwidth, 1.17% SM busy, 98.33% no-eligible warps
- wo_split=8 (32 W_O CTAs): memory-bound; 55.95% peak DRAM bandwidth, 6.51% SM busy, 91.86% no-eligible warps
Test plan

- Correctness gate: all four variants bit-exact (max_abs=0.0 against reference_split_order(wo_split)).
- Repo hygiene: no stray .ncu-rep in tree (the aborted 2.7 GB report was discarded), no __pycache__ in the harness tree.
- Reproduce: bash docs/research/2026-05-03-w-o-k-parallel-harness/run_sweep.sh; see summary.md § How to reproduce.
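A minimal sketch, under assumed names, of the fail-fast gate plus bounded NCU re-exec that run_harness.py is described as performing; the --wo-split flag, the function signature, and the report path are illustrative, not the committed script.

```python
# Sketch only: fail fast on the bit-exact gate, then re-exec a bounded NCU run
# filtered to the target kernel and a single launch (the fix that brought the
# profile down from 5h31m to ~60s per endpoint).
import subprocess
import torch

def gate_then_profile(kernel_out: torch.Tensor, ref_out: torch.Tensor,
                      wo_split: int, report_dir: str) -> None:
    # Correctness gate before any timing or profiling.
    max_abs = (kernel_out.float() - ref_out.float()).abs().max().item()
    if max_abs != 0.0:
        raise SystemExit(f"correctness gate failed: max_abs={max_abs}")

    subprocess.run(
        ["ncu", "--kernel-name", "regex:wo_kernel_body", "--launch-count", "1",
         "-o", f"{report_dir}/wo_split_{wo_split}",
         "python", "run_harness.py", "--wo-split", str(wo_split)],
        check=True,
    )
```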
Notes

- See docs/research/2026-05-03-w-o-k-parallel-harness/README.md §1b.

🤖 Generated with Claude Code