Conversation
stack-info: PR: #2455, branch: drisspg/stack/33
Force-pushed df5bcb8 to c370bc8
Let me add the actual data for provenance on how I found this heuristic and the testing I did.
stack-info: PR: #2455, branch: drisspg/stack/33 Made-with: Cursor
Force-pushed c370bc8 to 915faec
Force-pushed 915faec to d59734b
Are there scripts to reproduce the numbers, so that next time we improve CLC we can adjust the heuristic?
Force-pushed d59734b to df70abe
Added script.
Force-pushed df70abe to 5f748fa
@drisspg It seems curious that simply switching to the CLC scheduler can bump TFLOPS from 1131 to 1955 for …
@Edenzzzz Okay, good callout. I saw that measurement earlier and thought I had patched it. Basically it was doing upper-left causal attention instead of lower-right. We do see a nice 1.6x speedup, but TFLOPS are actually ~98.364; let me put up a PR. I can also add the visualizer script as well.
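For context on how such TFLOPS figures move, effective throughput is conventionally `algorithmic_flops / wall_clock_time`. A hypothetical helper along those lines (the `4 * B * H * Q * K * D` forward-attention FLOP count and the optional causal halving are standard conventions, not code from this PR):

```python
def attention_tflops(batch, heads, q_len, kv_len, head_dim,
                     latency_s, causal=False):
    """Effective throughput: algorithmic_flops / wall_clock_time.

    Forward attention is ~4 * B * H * Q * K * D FLOPs (the QK^T and
    PV matmuls, at 2 FLOPs per multiply-add). Conventions differ on
    whether masked-out work is discounted; this sketch optionally
    halves the count for lower-right causal masking, which skips
    roughly half the score matrix.
    """
    flops = 4 * batch * heads * q_len * kv_len * head_dim
    if causal:
        flops /= 2  # discount the masked half of the score matrix
    return flops / latency_s / 1e12
```

Whether causal FLOPs are discounted roughly doubles or halves the quoted number, which is why mask conventions matter when comparing TFLOPS across runs.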
CLC Scheduler Heuristic
Human note: I had Pi summarize all the things we ran. We pair-wrote compile + run scripts around the benchmark helper from Inductor / transformer-nuggets that flushes L2 and returns sample statistics, then bootstrapped a paired 95% confidence interval around
`speedup_on_vs_off`.

Shipped heuristic
Disable CLC (use STATIC scheduling) when either:
- the workload is varlen MHA (`q_heads == kv_heads`), or
- the workload is dense and noncausal.

Otherwise, keep the CLC path available.
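The shipped gate (disable CLC for varlen MHA and for dense noncausal workloads, per the net-impact summary below) can be sketched as a small predicate. This is an illustrative sketch, not the actual kernel-side API; the function and argument names are made up:

```python
# Hypothetical sketch of the shipped gate; names are illustrative,
# not the real scheduler-selection code.
def use_clc_scheduler(is_varlen: bool, is_causal: bool,
                      q_heads: int, kv_heads: int) -> bool:
    """Return True if the CLC scheduling path should stay enabled."""
    is_mha = q_heads == kv_heads
    if is_varlen and is_mha:
        return False  # varlen MHA: net regression in the sweep
    if not is_varlen and not is_causal:
        return False  # dense noncausal: net regression in the sweep
    return True  # otherwise keep the CLC path available
```

The predicate deliberately ignores sequence length, matching the tree refits below that drop sequence-length features to avoid a recompilation-dependent rule.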
Decision metric
All comparisons use `speedup_on_vs_off` (the paired speedup of the CLC-on run relative to the CLC-off run) and `ci95_excludes_1x` (whether the bootstrapped 95% confidence interval excludes 1.0x). A win is `speedup_on_vs_off > 1.0`, a loss is `speedup_on_vs_off < 1.0`, and a result counts as significant only when `ci95_excludes_1x = True`.

Aggregate summary by workload
Dense
Aggregate result
Dense is mixed overall, but the shipped heuristic explicitly gates the dense noncausal bucket.
Dense by causal
Dense causal is a strong net win (+14.02%), while dense noncausal is a net regression (-2.53%). The mixed aggregate (+5.74%) is the average of a very positive causal story and a clearly negative noncausal one.
Dense by head mode
Head modes swept: `mha`, `gqa2`, `gqa4`, `gqa8`, `mqa`.

All dense head modes are net positive in mean Δ%, but all have significant losses. MQA is the strongest positive; MHA is weakest but still positive because dense causal MHA wins dominate.
Representative dense slowdowns with raw latency / TFLOPS
These rows were rerun in isolation in the nightly env using the same nuggets stats path as the main profile flow.
- `mha_noncausal_q16384_k8192_h128`
- `mha_noncausal_q4096_k8192_h128`
- `mha_noncausal_q8192_k16384_h96`
- `mha_noncausal_h128_16k`
- `mha_noncausal_h128_8k`

Takeaway: the isolated nightly reruns still show real dense noncausal MHA regressions, but the exact magnitude varies by shape. The strongest quoted regressions remain in the long h128 cases.
Representative dense wins
These rows were also rerun in isolation in the nightly env using the nuggets stats path.
- `mqa_causal_q16384_k1024_h64`
- `mqa_causal_q16384_k1024_h96`
- `gqa8_causal_q16384_k1024_h64`
- `mqa_causal_q8192_k1024_h64`

Dense tree fit
To avoid learning a recompilation-dependent rule, this tree was refit without sequence-length features.
Features used: `is_mha`, `q_per_kv`, `d`, `is_causal`

Strict label: `ci95_excludes_1x and speedup_on_vs_off > 1.0`

Depth-2 tree:
Training accuracy:
Takeaway: once sequence length is removed, the dense tree collapses to a coarse causal vs noncausal split. That supports the shipped dense-noncausal gate, but is still too coarse to justify anything broader in this PR.
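A refit along these lines can be reproduced with scikit-learn. This is a sketch: the feature columns follow the list above, but the toy rows and labels are placeholders for the real sweep CSV, chosen only so the causal vs noncausal collapse is visible:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy stand-in rows: [is_mha, q_per_kv, d, is_causal]. Real inputs
# come from the sweep CSVs; sequence-length features are deliberately
# excluded so the tree cannot learn a recompilation-dependent rule.
X = np.array([
    [1, 1, 128, 1], [1, 1, 128, 0],
    [0, 8, 128, 1], [0, 8, 128, 0],
    [0, 4,  64, 1], [1, 1,  64, 0],
])
# Strict label: ci95_excludes_1x and speedup_on_vs_off > 1.0
# (here fabricated so causal rows win and noncausal rows lose).
y = np.array([1, 0, 1, 0, 1, 0])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["is_mha", "q_per_kv", "d", "is_causal"]))
print("training accuracy:", tree.score(X, y))
```

On data shaped like this, the depth-2 tree's first split lands on `is_causal`, mirroring the coarse causal vs noncausal collapse described above.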
Varlen
Aggregate result
Varlen is net positive overall, but the effect depends strongly on head mode, causal setting, and sequence pattern.
Varlen by head mode
Head modes swept: `mha`, `gqa2`, `gqa4`, `gqa8`, `mqa`.

MHA is the only varlen head-mode bucket that trends negative. This is the primary evidence for the shipped heuristic.
Varlen MHA by causal
Varlen MHA leans negative in both causal and noncausal settings. The heuristic (disable CLC for all varlen MHA) is correct regardless of causal flag.
Varlen by causal
Varlen by pattern
Patterns swept: `uniform`, `staircase`, `longtail`, `bimodal`, `spiky`, `loss_shape`.

Representative varlen wins with raw latency / TFLOPS
These rows were rerun in isolation in the nightly env using the nuggets stats path.
- `varlen_uniform_gqa4_causal_h128_b32_t32k_kv2x`
- `varlen_uniform_gqa4_noncausal_h128_b32_t32k_kv1x`
- `varlen_longtail_gqa8_noncausal_h128_b32_t32k_kv1x`
- `varlen_uniform_mqa_noncausal_h128_b32_t32k_kv1x`

Varlen tree fit
To avoid learning a recompilation-dependent rule, this tree was refit without sequence-length features.
Features used: `is_mha`, `q_per_kv`, `d`, `is_causal`, `is_uniform`, `is_loss_shape`, `is_spiky`

Strict label: `ci95_excludes_1x and speedup_on_vs_off > 1.0`

Depth-2 tree:
Training accuracy:
Takeaway: once sequence length is removed, the tree cleanly recovers the main heuristic signal: varlen MHA falls on the negative side, while non-MHA varlen is the positive side, especially for noncausal workloads. This is the best match to the shipped heuristic.
Block-sparse
Aggregate result
Block-sparse is close to neutral overall, but structured by mask family, head mode, and sparsity statistics.
Block-sparse by mask
Masks swept: `block_causal`, `block_diagonal`, `sliding_window`.

Block-sparse by head mode
Head modes swept: `mha`, `gqa4`, `mqa`.

Representative block-sparse wins with raw latency / TFLOPS
These rows were rerun in isolation in the nightly env using the block-sparse nuggets stats path.
- `block_causal_gqa4_h128_q1024_k1024_b64_sq256_tm128_tn128_nt384`
- `block_causal_mha_h128_q1024_k1024_b64_sq256_tm128_tn128_nt384`
- `block_causal_mqa_h128_q1024_k1024_b64_sq256_tm128_tn128_nt384`
- `block_causal_mha_h128_q2048_k2048_b32_sq256_tm128_tn128_nt384`

Representative block-sparse losses with raw latency / TFLOPS
These rows were rerun in isolation in the nightly env using the block-sparse nuggets stats path.
- `block_causal_mqa_h64_q256_k1024_b64_sq256_tm128_tn128_nt384`
- `block_causal_mha_h64_q8192_k32768_b2_sq256_tm128_tn128_nt384`
- `block_causal_mqa_h64_q8192_k32768_b2_sq256_tm128_tn128_nt384`
- `block_causal_gqa4_h64_q8192_k32768_b2_sq256_tm128_tn128_nt384`

Block-sparse tree fit
To avoid learning a recompilation-dependent rule, this tree was refit without sequence-length features.
Features used: `is_block_causal`, `is_sliding_window`, `is_block_diagonal`, `is_mha`, `q_per_kv`, `d`, `is_causal`, `is_w128`, `is_w1024`

Strict label: `ci95_excludes_1x and speedup_on_vs_off > 1.0`

Depth-2 tree:
Training accuracy:
What this means in the context of the masks we actually swept:
- `block_causal` is still the clearest negative mask overall (129 wins / 257 losses, mean -1.80%)
- `sliding_window` remains the clearest positive mask overall (225 wins / 147 losses, mean +0.77%)
- `block_diagonal` remains mild / mixed (73 wins / 47 losses, mean +0.18%)

Takeaway: block-sparse still looks too structured and mask-specific to reduce to one simple shipped heuristic.
The useful reviewer takeaway is not the exact `mask_blocks` threshold. It is:
- `block_causal` is where regressions cluster
- `sliding_window` is where wins are more common

Net heuristic impact
With the shipped heuristic (CLC disabled for varlen MHA and dense noncausal):
Geometric mean speedup across the 6135 CLC-enabled cases: 1.035x.
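That aggregate is a geometric mean of per-case speedup ratios, the appropriate average for multiplicative quantities like speedups; a minimal sketch:

```python
import math

def geomean(speedups):
    """Geometric mean of per-case speedup ratios. A 2x win and a
    0.5x loss cancel exactly, which an arithmetic mean would not."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# geomean([2.0, 0.5]) -> 1.0
```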
TFLOPS note
All TFLOPS values in this document are effective attention throughput (`algorithmic_flops / wall_clock_time`), not hardware tensor-core peak utilization. Small/fast cases with nontrivial FLOP counts can produce values above the GPU's rated peak; this reflects skipped work (causal masking, sparsity) or measurement overhead, not actual hardware utilization.

Raw data and verification
CSV links for the full sweeps and isolated reruns
All major quoted wins and losses in this document were rerun in isolation in the nightly env to verify that the reported direction and approximate magnitude still hold outside the full sweep.
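The significance machinery used for these comparisons (a paired bootstrap 95% confidence interval around `speedup_on_vs_off`) can be sketched as follows. This is a minimal numpy sketch with illustrative function and argument names, not the actual benchmark helper:

```python
import numpy as np

def bootstrap_speedup_ci(lat_off, lat_on, n_boot=10_000, seed=0):
    """Paired bootstrap 95% CI for speedup_on_vs_off.

    lat_off / lat_on: paired latency samples for the same workload
    with CLC off and on. Resamples pairs with replacement and
    recomputes the mean-latency ratio for each resample.
    """
    lat_off = np.asarray(lat_off, dtype=float)
    lat_on = np.asarray(lat_on, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(lat_off)
    idx = rng.integers(0, n, size=(n_boot, n))  # paired resampling
    boots = lat_off[idx].mean(axis=1) / lat_on[idx].mean(axis=1)
    lo, hi = np.percentile(boots, [2.5, 97.5])
    point = lat_off.mean() / lat_on.mean()
    return point, (lo, hi), bool(lo > 1.0 or hi < 1.0)
```

Resampling the off/on measurements as pairs (same index into both arrays) keeps per-shape variance coupled, which is what makes the interval a test on the ratio rather than on the two latencies independently.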
CSV gist bundle:
Conclusion
- Varlen MHA: 237 wins / 374 losses, mean -0.20%; negative in both causal and noncausal
- Dense noncausal: 67 wins / 269 losses, mean -2.53%; clear regression signal
- Dense causal: 257 wins / 100 losses, mean +14.02%

HUMAN: I want to basically just autotune over this flag. For block sparsity