
evidence(W_O K-parallel): validation harness + 8.39x sweep + NCU memory-bound#7

Merged
Natfii merged 3 commits into main from evidence/wo-k-parallel-harness on May 4, 2026

Conversation


Natfii commented May 3, 2026

Summary

  • W_O K-parallel validation passed. Bit-exact correctness (max_abs=0.0) for all four split-order variants against per-variant FP32 reference; sweep reaches 8.39× speedup at 32 W_O CTAs (vs 4 baseline).
  • NCU classification: the scaled endpoint becomes memory-bound (55.95% peak DRAM bw, 6.51% SM busy) while the baseline is latency-limited (8.06% / 1.17%). The W_O lever converts a latency-limited kernel into a memory-bound one — the matrix gate from PR #6 ("evidence(β-coop region breakdown): 36% K-reducible, W_O is the bottleneck") now passes at the scaled endpoint.
  • Production parity gap recorded as audit follow-up, not a conclusion — the slope is valid evidence; absolute-timing mismatch needs a separate denominator + launch-shape audit.

Curated evidence: benchmarks/nvllm/traces/cute_paged_attn/2026-05-03-w-o-k-parallel-harness/summary.md.

Scope

  • Updated the PR #6 ("evidence(β-coop region breakdown): 36% K-reducible, W_O is the bottleneck") breakdown summaries (mirror pair) — adds an "NCU pivot" subsection recording the structural failures of --replay-mode kernel (deadlocks the cooperative grid barrier) and --replay-mode application (the long-running vllm serve blocks application replay), with a decision-record line noting that R2–R4 coupling carries W_O to validation despite the matrix gate not yet being met at that point.
  • NCU attempt logs under docs/research/2026-05-02-beta-region-breakdown/ncu-attempt{1,2}/ — provenance for the failed in-process attempts that drove the pivot.
  • New harness under docs/research/2026-05-03-w-o-k-parallel-harness/:
    • microkernel.py — W_O kernel with wo_split Constexpr + per-K-group block-scale dequant (production-equivalent FP4 + FP8 path)
    • torch_reference.py — three FP32 references; reference_split_order(wo_split) is AUTHORITATIVE per variant (mirrors each kernel's reduction tree)
    • run_harness.py — configurable sweep + NCU re-exec, fail-fast correctness gate
    • run_sweep.sh — docker-wrapped 4-variant sweep (CTAs ∈ {4, 8, 16, 32})
    • README.md — design contract + ratification record
  • Evidence artifacts under benchmarks/nvllm/traces/cute_paged_attn/2026-05-03-w-o-k-parallel-harness/:
    • 4 variant scratchpads (50 launches each, timing.csv + 4 correctness JSONs)
    • ncu/wo_split_1/ and ncu/wo_split_8/ — bounded NCU runs (filtered by --kernel-name regex:wo_kernel_body)
    • ncu/ncu_unfiltered_aborted/wo_split_1/ — metadata only from the first attempt, which was aborted at 5h31m because unfiltered profiling captured non-target kernels; the 2.7 GB partial .ncu-rep was discarded
    • summary.md — curated headline numbers + classification verdicts
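
The fail-fast correctness gate described above can be sketched as follows. This is a minimal illustration (using NumPy in place of the harness's actual tensor types; the function name `correctness_gate` is hypothetical, not the harness's real API) of a gate that refuses to proceed to timing unless the kernel output is bit-exact against its per-variant reference:

```python
import numpy as np

def correctness_gate(kernel_out: np.ndarray, ref_out: np.ndarray) -> float:
    """Fail-fast bit-exact gate: abort before any timing if outputs differ.

    Each variant is compared against its own reduction-order reference
    (reference_split_order), so the expected max_abs is exactly 0.0.
    """
    max_abs = float(np.max(np.abs(kernel_out - ref_out)))
    if max_abs != 0.0:
        raise SystemExit(f"correctness gate FAILED: max_abs={max_abs}")
    return max_abs

# Identical outputs pass the gate with max_abs == 0.0.
out = np.arange(12, dtype=np.float32).reshape(3, 4)
assert correctness_gate(out, out.copy()) == 0.0
```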

Total new evidence: ~1.1 MB (NCU reports 324 KB + 525 KB; everything else is text).

Headline numbers

| wo_split | W_O CTAs | Mean elapsed (μs) | Speedup | NCU classification |
|----------|----------|-------------------|---------|--------------------|
| 1        | 4        | 13754             | 1.00×   | latency-limited    |
| 2        | 8        | 5176              | 2.66×   | (unprofiled)       |
| 4        | 16       | 2693              | 5.11×   | (unprofiled)       |
| 8        | 32       | 1639              | 8.39×   | memory-bound       |
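
The speedup column follows directly from the mean elapsed times, with wo_split=1 as the baseline:

```python
# Mean elapsed times (μs) from the 4-variant sweep, keyed by wo_split.
means_us = {1: 13754, 2: 5176, 4: 2693, 8: 1639}

baseline = means_us[1]
speedups = {s: round(baseline / t, 2) for s, t in means_us.items()}
print(speedups)  # {1: 1.0, 2: 2.66, 4: 5.11, 8: 8.39}
```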

NCU at the two endpoints (same kernel symbol, same grid, only the bx < wo_split gate differs):

| Metric                        | wo_split=1 | wo_split=8 |
|-------------------------------|------------|------------|
| Max Bandwidth (% peak DRAM)   | 8.06%      | 55.95%     |
| SM Busy                       | 1.17%      | 6.51%      |
| No-Eligible Warps (% cycles)  | 98.33%     | 91.86%     |
| Eligible Warps / Scheduler    | 0.02       | 0.08       |
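
The classification logic behind these verdicts can be sketched as below. The thresholds here are illustrative assumptions, not NCU's actual rules: the point is only that a kernel far below peak DRAM bandwidth with near-idle SMs is stalling on latency, while one pushing past half of peak DRAM bandwidth is limited by memory throughput:

```python
def classify(dram_bw_pct: float, sm_busy_pct: float,
             bw_threshold: float = 50.0) -> str:
    """Illustrative bound classification (thresholds are assumptions).

    High DRAM bandwidth utilization -> memory-bound; low bandwidth AND
    low SM occupancy -> latency-limited; otherwise compute-bound.
    """
    if dram_bw_pct >= bw_threshold:
        return "memory-bound"
    if sm_busy_pct < 10.0:
        return "latency-limited"
    return "compute-bound"

assert classify(8.06, 1.17) == "latency-limited"   # wo_split=1 endpoint
assert classify(55.95, 6.51) == "memory-bound"     # wo_split=8 endpoint
```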

Test plan

  • Sweep correctness gate at all 4 splits (max_abs=0.0 against reference_split_order(wo_split))
  • NCU on baseline endpoint (wo_split=1)
  • NCU on scaled endpoint (wo_split=8)
  • Verify aborted-attempt metadata preserved (no 2.7 GB .ncu-rep in tree)
  • DRAM headroom corrected to ~44% (matches 55.95% peak)
  • No __pycache__ in harness tree
  • Reviewer reproduce: bash docs/research/2026-05-03-w-o-k-parallel-harness/run_sweep.sh
  • Reviewer reproduce: NCU command in summary.md § How to reproduce

Notes

  • AI-assisted: harness design + implementation + NCU run iteration done with Claude Opus 4.7. Curated commit message + summary.md authored under direct user direction with explicit framing constraints (slope vs absolute parity, classification wording).
  • No production code changed. This is harness + evidence only. Production integration (β-coop W_O K-parallel) is the next ticket; design notes in docs/research/2026-05-03-w-o-k-parallel-harness/README.md §1b.

🤖 Generated with Claude Code

Natfii and others added 3 commits May 3, 2026 16:48
…d endpoint

Standalone validation harness for the W_O K-parallel lever proposed by
PR #6's per-region timing breakdown. Built in isolation from production
to satisfy PR #6's matrix gate ("Conditional. Pursue only if NCU shows
memory-bound classification").

Harness (docs/research/2026-05-03-w-o-k-parallel-harness/):
- microkernel.py: W_O kernel with wo_split Constexpr and per-K-group
  block-scale dequant; production-equivalent FP4 + FP8 scale path
- torch_reference.py: per-variant FP32 reference mirroring each
  variant's reduction tree (chained for split=1, slot-id-ordered for
  splits)
- run_harness.py: configurable sweep + NCU re-exec; enforces correctness
  gate before timing
- run_sweep.sh: docker-wrapped 4-variant sweep (CTAs in {4,8,16,32})

Sweep evidence (benchmarks/.../2026-05-03-w-o-k-parallel-harness/):
- 50 launches per variant, all bit-exact (max_abs=0.0) against
  reference_split_order(wo_split)
- Means: 13754 / 5176 / 2693 / 1639 us -> speedups 1.00x / 2.66x / 5.11x
  / 8.39x

NCU classification (kernel_cutlass__wo_kernel_body_________________0):
- wo_split=1 (4 W_O CTAs): latency-limited
  8.06% peak DRAM bw, 1.17% SM busy, 98.33% no-eligible warps
- wo_split=8 (32 W_O CTAs): memory-bound
  55.95% peak DRAM bw, 6.51% SM busy, 91.86% no-eligible warps
- The W_O lever converts a latency-limited kernel into a memory-bound
  one. Matrix gate now passes at the scaled endpoint.

NCU run log:
- First attempt aborted at 5h31m (unfiltered profiling generated 2.7 GB
  report). Aborted-attempt metadata preserved at
  ncu/ncu_unfiltered_aborted/wo_split_1/ (.ncu-rep discarded).
- Rerun used --kernel-name regex:wo_kernel_body, --launch-count 1,
  dropped --target-processes all. Both endpoints in ~60s, reports
  324K and 525K.
- Replay-mode kernel-vs-application failures from PR #6 follow-up
  attempts also recorded; logs at
  docs/research/2026-05-02-beta-region-breakdown/ncu-attempt{1,2}/.

Production parity gap recorded as audit follow-up, not a conclusion.
The slope is valid evidence for the W_O lever; absolute-timing
mismatch needs a separate denominator + launch-shape audit.

Refs: PR #6 (Navi-AI-Lab/nvllm), summary at
benchmarks/nvllm/traces/cute_paged_attn/2026-05-03-w-o-k-parallel-harness/summary.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets the 2026-05-03 parity audit hold active_wo_ctas constant
(= wo_split * num_kv_heads) while varying total cooperative-grid
size (slice_ctas * num_kv_heads). GATHER_CTAS effective-bytes term
scales with this value.

Used by benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-parity-gap/
to factorialize:
  A: slice_ctas=16, wo_split=1 → 64 grid / 4 active W_O
  B: slice_ctas=8,  wo_split=1 → 32 grid / 4 active W_O
  C: slice_ctas=8,  wo_split=8 → 32 grid / 32 active W_O

Validated via parser.error: slice_ctas >= wo_split, slice_ctas >= 1.
config.json now records slice_ctas, gather_ctas, total_grid_ctas_per_seq,
active_wo_ctas; effective-bytes formula doc string uses run-actual
gather_ctas.
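
The parser.error validation mentioned above can be sketched like this (flag names and the `validate` helper are illustrative, not necessarily the harness's exact CLI). parser.error prints usage and exits non-zero, so an invalid slice_ctas/wo_split combination can never reach a profiling run:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--slice-ctas", type=int, required=True)
parser.add_argument("--wo-split", type=int, required=True)

def validate(args: argparse.Namespace) -> None:
    # Reject impossible grid configurations before launching anything.
    if args.slice_ctas < 1:
        parser.error("slice_ctas must be >= 1")
    if args.slice_ctas < args.wo_split:
        parser.error("slice_ctas must be >= wo_split")

# Config C from the audit (slice_ctas=8, wo_split=8) passes the check.
args = parser.parse_args(["--slice-ctas", "8", "--wo-split", "8"])
validate(args)
```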

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A: 4 active W_O CTAs in production-shaped 64 CTA grid (slice_ctas=16)
B: 4 active W_O CTAs in harness 32 CTA grid (slice_ctas=8)
C: 32 active W_O CTAs in harness 32 CTA grid (wo_split=8)

A↔B isolates total-grid effect at fixed active count.
B↔C isolates active-count effect at fixed total grid.
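
The grid arithmetic behind the three configs is a sketch away (num_kv_heads = 4 is inferred from the stated grid sizes, e.g. 64 = 16 × 4; it is an assumption, not stated explicitly above):

```python
NUM_KV_HEADS = 4  # assumed from the A/B/C grid sizes (64 = 16 * 4, etc.)

def grid_shape(slice_ctas: int, wo_split: int) -> tuple[int, int]:
    """Return (total cooperative-grid CTAs, active W_O CTAs) per the
    factorization used by the parity-gap audit."""
    total_grid = slice_ctas * NUM_KV_HEADS
    active_wo = wo_split * NUM_KV_HEADS
    return total_grid, active_wo

assert grid_shape(16, 1) == (64, 4)   # A: production-shaped grid
assert grid_shape(8, 1) == (32, 4)    # B: harness grid, same active count
assert grid_shape(8, 8) == (32, 32)   # C: harness grid, full active count
```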

Each config's nsys .nsys-rep, device-event timing.csv, and bit-exact
correctness gate (max_abs=0 vs reference_split_order) committed under
benchmarks/nvllm/traces/wo_k_parallel_audit/2026-05-03-parity-gap/.

Audit verdict: harness B→C slope is 8.34× stable median; production
gather is single-CTA gated so absolute harness μs does not transfer
linearly to production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii merged commit 45476de into main May 4, 2026
Natfii added a commit that referenced this pull request May 4, 2026
Original Task 9 plan ("parameterize host Phase 1 mask helpers on
wo_split") was already subsumed by the Task 4+5 combined dispatch.
Repurposed to address the kernel-side cleanups flagged by the Task 8
spec+quality review:

#2 (Important): R11 timing/spin/exit gates now use wo_split_const
    instead of self.wo_split, matching the W_O block (Task 8). Both
    are bound from int(self.wo_split) in the same JIT compile call,
    but mixing the two in the same kernel body forced readers to
    verify equivalence. Now uniform across the kernel body.

#3 (Minor): Hoisted single pre_wo_consumer_active = (bx>0 &&
    bx<wo_split_const && by<num_kv_heads) above the R11 entry; reused
    at entry timing, spin gate, and exit timing. Removes the duplicate
    pre_wo_consumer_active2 copy-paste artifact.

#4 (Minor): Dropped "# NEW:" prefix from the wo_split cache-key inline
    comment — the marker would go stale at PR.

#5 (Real, fixed in same diff via the L253 comment block): bound-
    restriction comment now points to
    docs/research/2026-05-03-w-o-k-parallel-harness/torch_reference.py
    (the committed path) instead of
    /tmp/wo_split_repro_workdir/torch_reference.py (machine-local
    transient).

#6 (Minor): Added 3-line comment block before the new pre_wo_consumer_active
    declaration explaining bx==0 producers skip R11 because their
    attn_output reads are intra-CTA — the cross-CTA safety derivation
    that the spec reviewer pointed out was undocumented.
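
The #3 hoist can be sketched in Python as a stand-in for the CuTe DSL kernel body (the function name and return shape here are illustrative; only the predicate expression comes from the commit message). One predicate is computed once and reused at all three R11 gates, instead of duplicating it:

```python
def r11_gates(bx: int, by: int, wo_split_const: int, num_kv_heads: int):
    """Hoisted consumer predicate, reused at entry timing, spin gate,
    and exit timing.

    bx == 0 producers skip R11 because their attn_output reads are
    intra-CTA (the cross-CTA safety derivation noted in #6).
    """
    pre_wo_consumer_active = (
        bx > 0 and bx < wo_split_const and by < num_kv_heads
    )
    entry_timing = pre_wo_consumer_active
    spin_gate = pre_wo_consumer_active
    exit_timing = pre_wo_consumer_active
    return entry_timing, spin_gate, exit_timing

assert r11_gates(0, 0, 8, 4) == (False, False, False)  # bx==0 producer skips
assert r11_gates(3, 2, 8, 4) == (True, True, True)     # active consumer CTA
```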

Deferred to merge-prep (per user direction):
- #1: total_ctas_per_seq_attn dead-arg cleanup (Task 12 PR-prep)
- #7: cutlass.const_expr gate on wo_split=1 producer fence/atomic
       (revisit if Task 10/11 evidence shows wo_split=1 overhead matters)

Pure refactor — bit-exact gate at wo_split=1 AND wo_split=8 still
passes with max_abs == 0.0 against reference_split_order. Cache MISS
on first launch (wo_split_const reference and mask hoist change the
PTX even though numerics are identical at runtime).

Task 9 of 12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii deleted the evidence/wo-k-parallel-harness branch May 7, 2026 12:17