
Add more optimal dk order optional arg#2512

Open
drisspg wants to merge 1 commit into main from drisspg/stack/37

Conversation

@drisspg

@drisspg drisspg commented Apr 28, 2026

Collaborator

Stacked PRs:


Add more optimal dk order optional arg.

What does this do

Builds on the last PR. In the last PR we only allowed one iteration order through column space, LtoR or RtoL (spt or not, where spt is as defined for causal).

This changes the way we express iteration order along kv columns in the backward pass to allow an arbitrary permutation. I like this for a few reasons. It's optional, and for common cases asc or desc is good. You can basically craft the optimal schedule for whatever block-sparse impl you have.
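As a rough illustration of what an arbitrary kv-column permutation allows, the sketch below builds three fixed orders named after the columns in the benchmark tables as plain Python lists. The helper names are hypothetical, and the `outside_in` definition (alternating from both ends toward the middle) is an assumption about what that label means, not the PR's implementation.

```python
from typing import List


def asc_order(num_kv_blocks: int) -> List[int]:
    """Left-to-right walk over kv columns."""
    return list(range(num_kv_blocks))


def desc_order(num_kv_blocks: int) -> List[int]:
    """Right-to-left walk over kv columns."""
    return list(range(num_kv_blocks - 1, -1, -1))


def outside_in_order(num_kv_blocks: int) -> List[int]:
    """ASSUMED meaning of 'outside_in': alternate from both ends inward."""
    order: List[int] = []
    lo, hi = 0, num_kv_blocks - 1
    while lo <= hi:
        order.append(lo)
        if lo != hi:
            order.append(hi)
        lo += 1
        hi -= 1
    return order


def is_permutation(order: List[int], num_kv_blocks: int) -> bool:
    """Any valid order must visit every kv column exactly once."""
    return sorted(order) == list(range(num_kv_blocks))


if __name__ == "__main__":
    for build in (asc_order, desc_order, outside_in_order):
        order = build(5)
        print(build.__name__, order, is_permutation(order, 5))
```

Any list that passes `is_permutation` would be a candidate order under this framing, which is what leaves room for a mask-specific schedule.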

I had Claude write a rough greedy scheduling heuristic based on presumed semaphore wait time, and we can see it does help. Sliding window, though, is still getting insanely slow deterministic times compared to non-deterministic times.

Sliding window perf issue

For the sliding window slowdown, this case is annoying but has a funnish solution that recovers a lot of the perf.

I did write this up, but honestly it's easier to read in shell form.
Reminder:

We currently always walk partial then full blocks. So for sliding_window:512, 4096x4096:

The semaphore value is defined for a particular dQ's m_block. And then we have:

  dQ_semaphore[m] = progress counter for dQ m_block
  rank(m, n)      = the value dQ_semaphore[m] must reach before column n can write its dQ[m] contribution

For the column n = 4, assuming we are using ascending order (spt = False):

partial:
  m = [0, 4]
  rank(m, 4) = [4, 0]
  # Right edge of window needs to wait for all previous blocks,
  # so we get rank(0, 4) = 4.
  # Left edge is first to go, so rank(4, 4) = 0 and goes first.

full:
  m = [1, 2, 3]
  rank(m, 4) = [4, 4, 2]
  # last full blocks and then the middle block,
  # would be cleaner if 128 did straddle kv columns

walk partial then full:

  1. (m=0, n=4), rank(0, 4)=4
  2. (m=4, n=4), rank(4, 4)=0
  3. (m=1, n=4), rank(1, 4)=4
  4. (m=2, n=4), rank(2, 4)=4
  5. (m=3, n=4), rank(3, 4)=2

So the CTA for n=4 first does:

  wait dQ_semaphore[0] == rank(0, 4) == 4

That means it cannot proceed until the contributors with ranks 0, 1, 2, and 3 for m=0 have already written.

But the same CTA's remaining work includes an item that is ready immediately and that other CTAs are waiting on:

  (m=4, n=4), rank(4, 4)=0
  wait dQ_semaphore[4] == 0

The problem is that the CTA is blocked at item 1:

  (m=0, n=4), rank=4

and therefore does not reach item 2 until later:

  (m=4, n=4), rank=0


#### A simplish fix
We could add even more metadata, but instead, if we mark all blocks as partial, we walk every block in the same list. I then had Codex sort the partial list in rank dependency order so that we walk in the most unblocked way. That gives the delta below:
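To make the reordering concrete, here is a minimal sketch using the (m, rank) pairs for column n = 4 from the walkthrough above. It is not the PR's code: the function names are hypothetical, and sorting by ascending rank is only one plausible reading of "rank dependency order".

```python
from typing import List, Tuple

Item = Tuple[int, int]  # (m_block, rank) for a fixed kv column n

# Values copied from the n = 4 walkthrough (sliding_window:512, ascending order).
PARTIAL: List[Item] = [(0, 4), (4, 0)]
FULL: List[Item] = [(1, 4), (2, 4), (3, 2)]


def walk_partial_then_full(partial: List[Item], full: List[Item]) -> List[Item]:
    """Current behavior: all partial blocks first, then all full blocks."""
    return list(partial) + list(full)


def walk_unified_by_rank(partial: List[Item], full: List[Item]) -> List[Item]:
    """ASSUMED fix: one merged list, lowest rank first (sorted() is stable)."""
    return sorted(list(partial) + list(full), key=lambda item: item[1])


def items_before_first_ready(walk: List[Item]) -> int:
    """Count items queued ahead of the first rank-0 item, which never waits."""
    for position, (_, rank) in enumerate(walk):
        if rank == 0:
            return position
    return len(walk)


if __name__ == "__main__":
    old = walk_partial_then_full(PARTIAL, FULL)
    new = walk_unified_by_rank(PARTIAL, FULL)
    print("partial then full:", old, "blocked ahead:", items_before_first_ready(old))
    print("unified by rank:  ", new, "blocked ahead:", items_before_first_ready(new))
```

In the merged walk the rank-0 item (m=4) moves to the front, so the CTA can signal `dQ_semaphore[4]` before it stalls on the rank-4 items.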

| shape | window | nondet | original best det | unified prototype best | original best vs nondet | unified best vs nondet | improvement |
|---|---:|---:|---:|---:|---:|---:|---:|
| 4096x8192 | 512 | 0.0508 | 0.300 | 0.0625 | +490.2% | +23.1% | 4.8x |
| 4096x8192 | 1024 | 0.0757 | 0.272 | 0.0913 | +259.4% | +20.6% | 3.0x |
| 4096x8192 | 2048 | 0.1221 | 0.268 | 0.1518 | +119.2% | +24.3% | 1.8x |
| 8192x8192 | 512 | 0.0838 | 0.497 | 0.1103 | +493.2% | +31.6% | 4.5x |
| 8192x8192 | 1024 | 0.1177 | 0.571 | 0.1688 | +384.9% | +43.4% | 3.4x |
| 8192x8192 | 2048 | 0.1852 | 0.492 | 0.2575 | +165.6% | +39.1% | 1.9x |


### The numbers

## sliding_window:512

| shape | nondet | asc | desc | greedy | outside_in | best deterministic | best deterministic vs nondet |
|---|---:|---:|---:|---:|---:|---|---:|
| 4096x4096 | 0.046 | 0.299 (+546.1%) | 0.381 (+723.5%) | 0.295 (+538.2%) | 0.295 (+538.2%) | outside_in | +538.2% |
| 4096x8192 | 0.051 | 0.303 (+497.8%) | 0.402 (+692.7%) | 0.300 (+490.2%) | 0.410 (+706.7%) | greedy | +490.2% |
| 8192x8192 | 0.084 | 0.606 (+623.4%) | 0.765 (+812.8%) | 0.608 (+625.8%) | 0.497 (+493.2%) | outside_in | +493.2% |

## sliding_window:1024

| shape | nondet | asc | desc | greedy | outside_in | best deterministic | best deterministic vs nondet |
|---|---:|---:|---:|---:|---:|---|---:|
| 4096x4096 | 0.064 | 0.271 (+324.5%) | 0.622 (+873.2%) | 0.257 (+301.6%) | 0.581 (+808.7%) | greedy | +301.6% |
| 4096x8192 | 0.076 | 0.289 (+281.8%) | 0.687 (+806.6%) | 0.272 (+259.4%) | 0.712 (+839.7%) | greedy | +259.4% |
| 8192x8192 | 0.118 | 0.581 (+393.5%) | 1.400 (+1089.6%) | 0.571 (+384.9%) | 0.973 (+726.7%) | greedy | +384.9% |

## sliding_window:2048

| shape | nondet | asc | desc | greedy | outside_in | best deterministic | best deterministic vs nondet |
|---|---:|---:|---:|---:|---:|---|---:|
| 4096x4096 | 0.093 | 0.225 (+141.6%) | 0.743 (+697.7%) | 0.790 (+748.2%) | 1.389 (+1391.0%) | asc | +141.6% |
| 4096x8192 | 0.122 | 0.268 (+119.2%) | 0.952 (+679.7%) | 0.833 (+582.1%) | 1.013 (+729.2%) | asc | +119.2% |
| 8192x8192 | 0.185 | 0.526 (+184.1%) | 2.304 (+1144.5%) | 0.492 (+165.6%) | 1.924 (+938.9%) | greedy | +165.6% |

## prefix_lm

| shape | nondet | asc | desc | greedy | outside_in | best deterministic | best deterministic vs nondet |
|---|---:|---:|---:|---:|---:|---|---:|
| 4096x4096 | 0.102 | 0.223 (+118.6%) | 0.136 (+33.2%) | 0.138 (+34.9%) | 0.203 (+99.2%) | desc | +33.2% |
| 4096x8192 | 0.104 | 0.224 (+115.1%) | 0.137 (+31.6%) | 0.141 (+35.9%) | 0.226 (+117.8%) | desc | +31.6% |
| 8192x8192 | 0.259 | 0.439 (+69.7%) | 0.333 (+28.7%) | 0.351 (+35.5%) | 0.476 (+83.8%) | desc | +28.7% |

## dilated_sliding_window

| shape | nondet | asc | desc | greedy | outside_in | best deterministic | best deterministic vs nondet |
|---|---:|---:|---:|---:|---:|---|---:|
| 4096x4096 | 0.035 | 0.226 (+553.9%) | 0.041 (+18.7%) | 0.228 (+559.1%) | 0.139 (+300.3%) | desc | +18.7% |
| 4096x8192 | 0.038 | 0.226 (+497.1%) | 0.044 (+15.9%) | 0.228 (+501.8%) | 0.227 (+500.1%) | desc | +15.9% |
| 8192x8192 | 0.060 | 0.437 (+622.9%) | 0.072 (+19.0%) | 0.442 (+630.5%) | 0.247 (+308.1%) | desc | +19.0% |

## document

| shape | nondet | asc | desc | greedy | outside_in | best deterministic | best deterministic vs nondet |
|---|---:|---:|---:|---:|---:|---|---:|
| 4096x4096 | 0.095 | 0.587 (+519.7%) | 0.221 (+133.3%) | 0.338 (+257.4%) | 0.197 (+108.2%) | outside_in | +108.2% |
| 4096x8192 | 0.322 | 0.453 (+40.8%) | 0.483 (+50.1%) | 0.323 (+0.4%) | 0.489 (+51.8%) | greedy | +0.4% |
| 8192x8192 | 0.264 | 1.095 (+314.3%) | 0.264 (-0.1%) | 0.612 (+131.4%) | 0.466 (+76.3%) | desc | -0.1% |
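The percentages in the tables above are the relative slowdown against the nondet time, and "best deterministic" is the order with the smallest elapsed time. A small sketch of that arithmetic, using the `sliding_window:512`, `4096x8192` values from the raw rows (the function names are hypothetical):

```python
from typing import Dict


def vs_nondet_pct(elapsed_ms: float, nondet_ms: float) -> float:
    """Relative slowdown versus the non-deterministic run, in percent."""
    return (elapsed_ms / nondet_ms - 1.0) * 100.0


def best_deterministic(times_ms: Dict[str, float]) -> str:
    """Name of the deterministic order with the lowest elapsed time."""
    return min(times_ms, key=times_ms.get)


# elapsed_ms values for sliding_window:512, 4096x8192 from the raw CSV rows.
NONDET_MS = 0.050767706717123724
DET_MS = {
    "asc": 0.30349216491228076,
    "desc": 0.40243451801801816,
    "greedy": 0.2996461637630655,
    "outside_in": 0.40956845662100466,
}

if __name__ == "__main__":
    best = best_deterministic(DET_MS)
    print(best, f"{vs_nondet_pct(DET_MS[best], NONDET_MS):+.1f}%")
```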

## Repeatability columns

The non-deterministic rows are expected to have `dq_repeat_equal=False`; deterministic orders should have repeat equality. See raw CSV below for the exact `dq/dk/dv` repeat flags.

## Raw rows

```csv
mask,shape,order,kv_mode,tile_m,tile_n,status,elapsed_ms,benchmark_backend,compile_or_error_s,dq_repeat_equal,dk_repeat_equal,dv_repeat_equal,dq_baseline_equal,dk_baseline_equal,dv_baseline_equal,vs_nondet_pct,error
sliding_window:512,4096x4096,nondet,mha,128,128,ok,0.04622267089843763,transformer-nuggets,0.4902233809698373,False,True,True,,,,,
sliding_window:512,4096x4096,asc,mha,128,128,ok,0.29864482456140357,transformer-nuggets,0.09140752186067402,True,True,True,False,True,True,546.1003199438612,
sliding_window:512,4096x4096,desc,mha,128,128,ok,0.3806309047619052,transformer-nuggets,0.01047898386605084,True,True,True,False,True,True,723.4723294078856,
sliding_window:512,4096x4096,greedy,mha,128,128,ok,0.29501161724137853,transformer-nuggets,0.13665749784559011,True,True,True,False,True,True,538.2400919444709,
sliding_window:512,4096x4096,outside_in,mha,128,128,ok,0.2949789897260275,transformer-nuggets,0.013363356003537774,True,True,True,False,True,True,538.1695042551036,
sliding_window:512,4096x8192,nondet,mha,128,128,ok,0.050767706717123724,transformer-nuggets,0.09303479106165469,False,True,True,,,,,
sliding_window:512,4096x8192,asc,mha,128,128,ok,0.30349216491228076,transformer-nuggets,0.09230755409225821,True,True,True,False,True,True,497.8055432035385,
sliding_window:512,4096x8192,desc,mha,128,128,ok,0.40243451801801816,transformer-nuggets,0.010005079908296466,True,True,True,False,True,True,692.6978468032293,
sliding_window:512,4096x8192,greedy,mha,128,128,ok,0.2996461637630655,transformer-nuggets,0.01310505298897624,True,True,True,False,True,True,490.22985897843637,
sliding_window:512,4096x8192,outside_in,mha,128,128,ok,0.40956845662100466,transformer-nuggets,0.012381092179566622,True,True,True,False,True,True,706.749965885024,
sliding_window:512,8192x8192,nondet,mha,128,128,ok,0.08379574230769274,transformer-nuggets,0.18733508489094675,False,True,True,,,,,
sliding_window:512,8192x8192,asc,mha,128,128,ok,0.6061971842105253,transformer-nuggets,0.18735832697711885,True,True,True,False,True,True,623.4224168390406,
sliding_window:512,8192x8192,desc,mha,128,128,ok,0.7648487622950821,transformer-nuggets,0.01758081209845841,True,True,True,False,True,True,812.7537285684579,
sliding_window:512,8192x8192,greedy,mha,128,128,ok,0.6081797631578947,transformer-nuggets,0.024311767891049385,True,True,True,False,True,True,625.7883830477884,
sliding_window:512,8192x8192,outside_in,mha,128,128,ok,0.49710897826086936,transformer-nuggets,0.023487363941967487,True,True,True,False,True,True,493.2389457635165,
sliding_window:1024,4096x4096,nondet,mha,128,128,ok,0.06392472892938522,transformer-nuggets,0.19574950612150133,False,True,True,,,,,
sliding_window:1024,4096x4096,asc,mha,128,128,ok,0.2713483601286173,transformer-nuggets,0.1963577908463776,True,True,True,False,True,True,324.48104931092263,
sliding_window:1024,4096x4096,desc,mha,128,128,ok,0.6221463221476504,transformer-nuggets,0.01095246896147728,True,True,True,False,True,True,873.248275850974,
sliding_window:1024,4096x4096,greedy,mha,128,128,ok,0.2567444846625772,transformer-nuggets,0.014698527986183763,True,True,True,False,True,True,301.63562515249976,
sliding_window:1024,4096x4096,outside_in,mha,128,128,ok,0.5808995974842771,transformer-nuggets,0.014374539954587817,True,True,True,False,True,True,808.7243813360099,
sliding_window:1024,4096x8192,nondet,mha,128,128,ok,0.07572937635705677,transformer-nuggets,0.20322511112317443,False,True,True,,,,,
sliding_window:1024,4096x8192,asc,mha,128,128,ok,0.28910568120805386,transformer-nuggets,0.2023914318997413,True,True,True,False,True,True,281.7616031128372,
sliding_window:1024,4096x8192,desc,mha,128,128,ok,0.6865319558823527,transformer-nuggets,0.010414100019261241,True,True,True,False,True,True,806.5596323485092,
sliding_window:1024,4096x8192,greedy,mha,128,128,ok,0.27214853205128187,transformer-nuggets,0.013259588973596692,True,True,True,False,True,True,259.3698312899443,
sliding_window:1024,4096x8192,outside_in,mha,128,128,ok,0.711646122137404,transformer-nuggets,0.013060010969638824,True,True,True,False,True,True,839.722676154179,
sliding_window:1024,8192x8192,nondet,mha,128,128,ok,0.11769248360655726,transformer-nuggets,0.4357586270198226,False,True,True,,,,,
sliding_window:1024,8192x8192,asc,mha,128,128,ok,0.5808448805031443,transformer-nuggets,0.4362804980482906,True,True,True,False,True,True,393.5275921654375,
sliding_window:1024,8192x8192,desc,mha,128,128,ok,1.4000846521739125,transformer-nuggets,0.018197747180238366,True,True,True,False,True,True,1089.612632234321,
sliding_window:1024,8192x8192,greedy,mha,128,128,ok,0.5707406894409939,transformer-nuggets,0.024889016058295965,True,True,True,False,True,True,384.9423446181698,
sliding_window:1024,8192x8192,outside_in,mha,128,128,ok,0.9729278350515459,transformer-nuggets,0.02433210308663547,True,True,True,False,True,True,726.6694739011685,
sliding_window:2048,4096x4096,nondet,mha,128,128,ok,0.09315740863309342,transformer-nuggets,0.4367756089195609,False,True,True,,,,,
sliding_window:2048,4096x4096,asc,mha,128,128,ok,0.22503002493074756,transformer-nuggets,0.4367867030669004,True,True,True,False,True,True,141.55891435005788,
sliding_window:2048,4096x4096,desc,mha,128,128,ok,0.7431265161290327,transformer-nuggets,0.013046219944953918,True,True,True,False,True,True,697.7105922470271,
sliding_window:2048,4096x4096,greedy,mha,128,128,ok,0.7901529243697492,transformer-nuggets,0.018390445038676262,True,True,True,False,True,True,748.1911808880584,
sliding_window:2048,4096x4096,outside_in,mha,128,128,ok,1.3890126666666658,transformer-nuggets,0.0176748251542449,True,True,True,False,True,True,1391.0383264710417,
sliding_window:2048,4096x8192,nondet,mha,128,128,ok,0.12214682653061215,transformer-nuggets,0.4753209240734577,False,True,True,,,,,
sliding_window:2048,4096x8192,asc,mha,128,128,ok,0.2677093069620257,transformer-nuggets,0.4748999569565058,True,True,True,False,True,True,119.17008780816181,
sliding_window:2048,4096x8192,desc,mha,128,128,ok,0.9523686600000011,transformer-nuggets,0.012378948973491788,True,True,True,False,True,True,679.6916932273477,
sliding_window:2048,4096x8192,greedy,mha,128,128,ok,0.8331287522123892,transformer-nuggets,0.014809184009209275,True,True,True,False,True,True,582.0715493607953,
sliding_window:2048,4096x8192,outside_in,mha,128,128,ok,1.0128369148936172,transformer-nuggets,0.014914975967258215,True,True,True,False,True,True,729.1962580294972,
sliding_window:2048,8192x8192,nondet,mha,128,128,ok,0.18516839534883728,transformer-nuggets,1.1646779170259833,False,True,True,,,,,
sliding_window:2048,8192x8192,asc,mha,128,128,ok,0.5261025172413787,transformer-nuggets,1.163378849858418,True,True,True,False,True,True,184.12111918465263,
sliding_window:2048,8192x8192,desc,mha,128,128,ok,2.304367333333334,transformer-nuggets,0.02008672198280692,True,True,True,False,True,True,1144.4711901251585,
sliding_window:2048,8192x8192,greedy,mha,128,128,ok,0.4918657500000002,transformer-nuggets,0.027896487154066563,True,True,True,False,True,True,165.6315885188605,
sliding_window:2048,8192x8192,outside_in,mha,128,128,ok,1.923656919999998,transformer-nuggets,0.028059690026566386,True,True,True,False,True,True,938.8689259719705,
prefix_lm,4096x4096,nondet,mha,128,128,ok,0.10216664383561623,transformer-nuggets,0.27894692216068506,False,True,True,,,,,
prefix_lm,4096x4096,asc,mha,128,128,ok,0.22336688857938736,transformer-nuggets,0.27876819390803576,True,True,True,False,True,True,118.62995611246615,
prefix_lm,4096x4096,desc,mha,128,128,ok,0.13612027528089846,transformer-nuggets,0.010232013883069158,True,True,True,False,True,True,33.233578172454045,
prefix_lm,4096x4096,greedy,mha,128,128,ok,0.13786308851224124,transformer-nuggets,0.01293408707715571,True,True,True,False,True,True,34.93943163490793,
prefix_lm,4096x4096,outside_in,mha,128,128,ok,0.2034941593830338,transformer-nuggets,0.020685909781605005,True,True,True,False,True,True,99.17866707107576,
prefix_lm,4096x8192,nondet,mha,128,128,ok,0.10392853753753753,transformer-nuggets,0.2784547309856862,False,True,True,,,,,
prefix_lm,4096x8192,asc,mha,128,128,ok,0.22359366937669423,transformer-nuggets,0.2790649039670825,True,True,True,False,True,True,115.14174515920166,
prefix_lm,4096x8192,desc,mha,128,128,ok,0.13677210019646371,transformer-nuggets,0.009699536953121424,True,True,True,False,True,True,31.602063722934194,
prefix_lm,4096x8192,greedy,mha,128,128,ok,0.14122528166351647,transformer-nuggets,0.012238900875672698,True,True,True,False,True,True,35.88691326721294,
prefix_lm,4096x8192,outside_in,mha,128,128,ok,0.22636513114754173,transformer-nuggets,0.021253932965919375,True,True,True,False,True,True,117.80844463993523,
prefix_lm,8192x8192,nondet,mha,128,128,ok,0.2588054844720501,transformer-nuggets,1.7804784018080682,False,True,True,,,,,
prefix_lm,8192x8192,asc,mha,128,128,ok,0.43931072549019634,transformer-nuggets,1.781405413057655,True,True,True,False,True,True,69.74552389659272,
prefix_lm,8192x8192,desc,mha,128,128,ok,0.332981631782946,transformer-nuggets,0.017654774012044072,True,True,True,False,True,True,28.6609642226908,
prefix_lm,8192x8192,greedy,mha,128,128,ok,0.3507314796747965,transformer-nuggets,0.02211369900032878,True,True,True,False,True,True,35.519338158644786,
prefix_lm,8192x8192,outside_in,mha,128,128,ok,0.4755588052631576,transformer-nuggets,0.0551132180262357,True,True,True,False,True,True,83.75144028855229,
dilated_sliding_window,4096x4096,nondet,mha,128,128,ok,0.03463334528688563,transformer-nuggets,0.044118347112089396,False,True,True,,,,,
dilated_sliding_window,4096x4096,asc,mha,128,128,ok,0.2264675482093663,transformer-nuggets,0.03599392808973789,True,True,True,False,True,True,553.9002984938945,
dilated_sliding_window,4096x4096,desc,mha,128,128,ok,0.04111081457800508,transformer-nuggets,0.010065943002700806,True,True,True,False,True,True,18.70298476067869,
dilated_sliding_window,4096x4096,greedy,mha,128,128,ok,0.22825427146814356,transformer-nuggets,0.013578269863501191,True,True,True,False,True,True,559.0592666616443,
dilated_sliding_window,4096x4096,outside_in,mha,128,128,ok,0.1386209962825277,transformer-nuggets,0.012799076037481427,True,True,True,False,True,True,300.25297912823436,
dilated_sliding_window,4096x8192,nondet,mha,128,128,ok,0.037896038753158987,transformer-nuggets,0.03552258387207985,False,True,True,,,,,
dilated_sliding_window,4096x8192,asc,mha,128,128,ok,0.22627937329700257,transformer-nuggets,0.03531468287110329,True,True,True,False,True,True,497.10560982614595,
dilated_sliding_window,4096x8192,desc,mha,128,128,ok,0.04393887001897544,transformer-nuggets,0.009657200891524553,True,True,True,False,True,True,15.945812450681874,
dilated_sliding_window,4096x8192,greedy,mha,128,128,ok,0.22806117260274003,transformer-nuggets,0.013363915961235762,True,True,True,False,True,True,501.80741868100665,
dilated_sliding_window,4096x8192,outside_in,mha,128,128,ok,0.22741495616438373,transformer-nuggets,0.013162227813154459,True,True,True,False,True,True,500.10218388703385,
dilated_sliding_window,8192x8192,nondet,mha,128,128,ok,0.06048756145833368,transformer-nuggets,0.06949487701058388,False,True,True,,,,,
dilated_sliding_window,8192x8192,asc,mha,128,128,ok,0.4372800922330098,transformer-nuggets,0.06926358910277486,True,True,True,False,True,True,622.9256423805848,
dilated_sliding_window,8192x8192,desc,mha,128,128,ok,0.07195734522293007,transformer-nuggets,0.017776072025299072,True,True,True,False,True,True,18.96221882328195,
dilated_sliding_window,8192x8192,greedy,mha,128,128,ok,0.44185895098039274,transformer-nuggets,0.024443425936624408,True,True,True,False,True,True,630.4955602893057,
dilated_sliding_window,8192x8192,outside_in,mha,128,128,ok,0.24684925806451616,transformer-nuggets,0.023590977070853114,True,True,True,False,True,True,308.0992060401643,
document,4096x4096,nondet,mha,128,128,ok,0.09469101749271108,transformer-nuggets,0.04899658286012709,False,True,True,,,,,
document,4096x4096,asc,mha,128,128,ok,0.5867700192307692,transformer-nuggets,0.4574677790515125,True,True,True,False,False,False,519.6680897171016,
document,4096x4096,desc,mha,128,128,ok,0.22089753532608694,transformer-nuggets,0.013949768850579858,True,True,True,False,False,False,133.28246033800482,
document,4096x4096,greedy,mha,128,128,ok,0.3384052109374998,transformer-nuggets,0.019038630183786154,True,True,True,False,False,False,257.37836586616976,
document,4096x4096,outside_in,mha,128,128,ok,0.1971344328358207,transformer-nuggets,0.01567365205846727,True,True,True,False,False,False,108.18704672911056,
document,4096x8192,nondet,mha,128,128,ok,0.32178129304029296,transformer-nuggets,0.3272648409474641,False,True,True,,,,,
document,4096x8192,asc,mha,128,128,ok,0.45292540909090945,transformer-nuggets,0.08122129016555846,True,True,True,False,False,False,40.75566817807361,
document,4096x8192,desc,mha,128,128,ok,0.4829109308510641,transformer-nuggets,0.015852599870413542,True,True,True,False,False,False,50.074271343858,
document,4096x8192,greedy,mha,128,128,ok,0.32293821804511263,transformer-nuggets,0.01712565217167139,True,True,True,False,False,False,0.35953768284311405,
document,4096x8192,outside_in,mha,128,128,ok,0.488562654054054,transformer-nuggets,0.01799341617152095,True,True,True,False,False,False,51.8306578477441,
document,8192x8192,nondet,mha,128,128,ok,0.2642979071207433,transformer-nuggets,0.17286459100432694,False,True,True,,,,,
document,8192x8192,asc,mha,128,128,ok,1.0949953793103446,transformer-nuggets,0.3275444330647588,True,True,True,False,False,False,314.3034620437236,
document,8192x8192,desc,mha,128,128,ok,0.26416322741433,transformer-nuggets,0.020844314014539123,True,True,True,False,False,False,-0.050957537984497314,
document,8192x8192,greedy,mha,128,128,ok,0.6116064105960264,transformer-nuggets,0.03223660006187856,True,True,True,False,False,False,131.4079658287331,
document,8192x8192,outside_in,mha,128,128,ok,0.46592000000000017,transformer-nuggets,0.027290958911180496,True,True,True,False,False,False,76.28592109401258,
```

drisspg added a commit that referenced this pull request Apr 28, 2026
stack-info: PR: #2512, branch: drisspg/stack/37
Comment thread flash_attn/cute/block_sparsity.py Outdated
drisspg added a commit to drisspg/flash-attention that referenced this pull request Apr 28, 2026
stack-info: PR: Dao-AILab#2512, branch: drisspg/stack/37
@drisspg drisspg marked this pull request as draft April 28, 2026 16:56
@drisspg drisspg changed the base branch from drisspg/stack/16 to main April 28, 2026 16:56
drisspg added a commit that referenced this pull request Apr 28, 2026
stack-info: PR: #2512, branch: drisspg/stack/37
@drisspg drisspg changed the base branch from main to drisspg/stack/16 April 28, 2026 16:56
@drisspg drisspg marked this pull request as ready for review April 28, 2026 16:56
@drisspg drisspg marked this pull request as draft April 28, 2026 17:03
@drisspg drisspg changed the base branch from drisspg/stack/16 to main April 28, 2026 17:03
drisspg added a commit that referenced this pull request Apr 28, 2026
stack-info: PR: #2512, branch: drisspg/stack/37
@drisspg drisspg changed the base branch from main to drisspg/stack/16 April 28, 2026 17:04
@drisspg drisspg marked this pull request as ready for review April 28, 2026 17:04

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bf9ab5eae8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread flash_attn/cute/block_sparsity.py
@drisspg drisspg marked this pull request as draft April 28, 2026 17:43
@drisspg drisspg changed the base branch from drisspg/stack/16 to main April 28, 2026 17:43
drisspg added a commit that referenced this pull request Apr 28, 2026
stack-info: PR: #2512, branch: drisspg/stack/37
@drisspg drisspg changed the base branch from main to drisspg/stack/16 April 28, 2026 17:43
@drisspg drisspg marked this pull request as ready for review April 28, 2026 17:43
@drisspg drisspg marked this pull request as draft April 28, 2026 17:47
@drisspg drisspg changed the base branch from drisspg/stack/16 to main April 28, 2026 17:47
drisspg added a commit that referenced this pull request Apr 28, 2026
stack-info: PR: #2512, branch: drisspg/stack/37
@drisspg drisspg changed the base branch from main to drisspg/stack/16 April 28, 2026 17:47
@drisspg drisspg marked this pull request as ready for review April 28, 2026 17:47
@drisspg drisspg marked this pull request as draft April 28, 2026 17:51
@drisspg drisspg changed the base branch from drisspg/stack/16 to main April 28, 2026 17:52
drisspg added a commit that referenced this pull request Apr 28, 2026
stack-info: PR: #2512, branch: drisspg/stack/37
@drisspg drisspg changed the base branch from main to drisspg/stack/16 April 28, 2026 17:52
@drisspg drisspg marked this pull request as ready for review April 28, 2026 17:52
@drisspg drisspg marked this pull request as draft April 28, 2026 18:01
@drisspg drisspg changed the base branch from drisspg/stack/16 to main April 28, 2026 18:01
@drisspg drisspg changed the base branch from main to drisspg/stack/16 April 28, 2026 18:01
@drisspg drisspg marked this pull request as ready for review April 28, 2026 18:01

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a1897d6400

Comment thread flash_attn/cute/block_sparsity.py
drisspg added a commit that referenced this pull request May 4, 2026
stack-info: PR: #2512, branch: drisspg/stack/37
@drisspg drisspg force-pushed the drisspg/stack/37 branch from a1897d6 to bdc1ef4 Compare May 4, 2026 22:31
@drisspg drisspg changed the base branch from drisspg/stack/16 to main May 4, 2026 22:31
drisspg added a commit to drisspg/flash-attention that referenced this pull request May 4, 2026
stack-info: PR: Dao-AILab#2512, branch: drisspg/stack/37
Comment thread flash_attn/cute/block_sparsity.py Outdated
values = (*values, None, None, None, None)
values = (*values, None, None, None, None, None)
elif len(values) == 4:
values = (*values, None, None)

Contributor

I think this is incorrect. For example, when full_block_cnt, full_block_idx, dq_write_order_full, and dq_kv_order are all None, we'd get

(mask_block_cnt, mask_block_idx, dq_write_order, None, None, None, None)

instead of

(mask_block_cnt, mask_block_idx, None, None, dq_write_order, None, None)
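A minimal sketch of the positional issue described in this comment, under the assumption that the seven slots are ordered as in the expected tuple above. `BlockSparseFields` and `pad_trailing` are hypothetical names for illustration, not identifiers from `block_sparsity.py`.

```python
from typing import Any, NamedTuple, Optional, Tuple


class BlockSparseFields(NamedTuple):
    """Hypothetical container; slot order ASSUMED from the expected tuple."""
    mask_block_cnt: Any
    mask_block_idx: Any
    full_block_cnt: Optional[Any] = None
    full_block_idx: Optional[Any] = None
    dq_write_order: Optional[Any] = None
    dq_write_order_full: Optional[Any] = None
    dq_kv_order: Optional[Any] = None


def pad_trailing(values: Tuple[Any, ...], width: int = 7) -> Tuple[Any, ...]:
    """Pad with trailing Nones; this keeps every given value in its input position."""
    return (*values, *([None] * (width - len(values))))


if __name__ == "__main__":
    short = ("mask_block_cnt", "mask_block_idx", "dq_write_order")
    # Trailing padding leaves dq_write_order in slot 2 (the full_block_cnt slot).
    print(pad_trailing(short))
    # Filling by name puts it in slot 4, with the full_block_* slots left as None.
    print(tuple(BlockSparseFields(
        mask_block_cnt="mask_block_cnt",
        mask_block_idx="mask_block_idx",
        dq_write_order="dq_write_order",
    )))
```

Filling fields by name rather than by length-based padding is one way to avoid the misplacement; whether it fits the existing call sites is not something this sketch can confirm.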

@geruome

geruome commented May 14, 2026

Contributor

I had the same idea here.

Instead of walking partial blocks first and then full blocks, we can merge them into one list and launch/visit them in a specified order. This feels closer to the dense-attention pattern.

stack-info: PR: #2512, branch: drisspg/stack/37
@drisspg drisspg force-pushed the drisspg/stack/37 branch from bdc1ef4 to b121a57 Compare June 8, 2026 18:58