Conversation
stack-info: PR: #2218, branch: drisspg/stack/8
f711f1d to d58e003
I like the direction; CLC is the right thing to do.

Cc @tzadouri, who's thinking about scheduling and persistence.
d58e003 to 784f382
784f382 to bb52b18
bb52b18 to 0f3f1e4
0f3f1e4 to 0c8bed0
@jayhshah this is what I'm getting with the benchmark. This is an AWS node, and I'm somewhat dubious about how badly it thermally throttles. But I also have some NCU compares against dense, and you can potentially see less tail latency with CLC in the PM samples; NCU reported roughly 3% faster in this particular case.

I threw up a branch, dynamic-persistent-with-semaphore, with the classic (pre-CLC) way of doing a dynamic persistent scheduler with a semaphore; it should be useful for ablations and eventually the SM90 kernel. The benchmark shows the expected improvement for small sequence lengths (holding batch * seqlen constant) but a regression against SingleTileLPTScheduler for large seqlen; this is more pronounced for MHA than GQA. It's also clear that the LPT swizzle needs to be tweaked for large seqlen; at the very least, batch and head should be treated differently.
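The semaphore scheme is only described in prose above, so here is a minimal toy model of it; this is a hypothetical illustration (not the branch's kernel code), assuming the classic design where all persistent workers claim their next tile from one shared atomic counter:

```python
import heapq

def dynamic_persistent_makespan(num_workers, tile_costs):
    """Toy model of the pre-CLC dynamic persistent scheme: every worker
    pulls tiles from one shared atomic counter, so whichever worker frees
    up first claims the next tile. Returns the makespan (finish time of
    the last worker)."""
    # min-heap of (time_worker_becomes_free, worker_id)
    free_at = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(free_at)
    for cost in tile_costs:  # tiles are claimed in counter order
        t, w = heapq.heappop(free_at)
        heapq.heappush(free_at, (t + cost, w))
    return max(t for t, _ in free_at)

# Skewed tile costs, e.g. causal attention where later rows do more work.
print(dynamic_persistent_makespan(4, [1, 2, 3, 4, 5, 6, 7, 8]))  # -> 12.0
```

For contrast, a static contiguous split of the same list over 4 workers would finish at 15 (the worker owning costs 7 + 8), which is why the dynamic counter helps on imbalanced shapes.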
Alright, got 2-CTA working, still forward only. One thing I noticed is that we are using 2-CTA even for M < 256, which doesn't seem right.
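The gating implied by the observation above could be as simple as this hypothetical helper; the name, the 128 M-tile size, and the 256 threshold are inferred from the comment, not taken from the actual kernel:

```python
def should_use_two_cta(m_extent: int, tile_m: int = 128) -> bool:
    """Hypothetical gate: a 2-CTA cluster only pays off when the M extent
    spans at least two M-tiles; below that, the second CTA has no tile of
    its own and one half of the pair sits idle."""
    return m_extent >= 2 * tile_m

print(should_use_two_cta(192))  # -> False: fall back to a single CTA
print(should_use_two_cta(512))  # -> True
```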
> os.environ["CUDA_VISIBLE_DEVICES"] = gpu_ids[worker_num % len(gpu_ids)]

> def pytest_collection_finish(session):
>     if not session.config.option.collectonly:

Cleans up the logs a little.
LGTM, let's merge when it's ready.
stack-info: PR: #2218, branch: drisspg/stack/8 Made-with: Cursor
Stacked PRs:
CLC work stealing
Not for land yet; the scheduling abstraction leaking all over the place is bad, and I'm going to find a better way to encapsulate it.
I had to help a lot on this one, even though Claude and Codex did most of the setup.
Example work steal "trace" recreated from printf logs

Perf run
We would expect the highest gain for the most imbalanced workloads under the current scheduling for flex, and we roughly see that: e.g. alibi + causal are the same and don't currently have the LPT schedule set. Document mask also sees a nice boost, which makes sense.
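For readers unfamiliar with the name, the LPT heuristic behind SingleTileLPTScheduler can be sketched generically; this is the textbook longest-processing-time-first greedy, not the kernel's actual swizzle:

```python
import heapq

def lpt_makespan(num_workers, tile_costs):
    """Greedy LPT: sort tiles by descending cost, then hand each tile to
    the currently least-loaded worker. Returns the makespan."""
    # min-heap of (current_load, worker_id)
    loads = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(loads)
    for cost in sorted(tile_costs, reverse=True):
        load, w = heapq.heappop(loads)
        heapq.heappush(loads, (load + cost, w))
    return max(load for load, _ in loads)

# The same skewed tile costs as an imbalanced (e.g. causal) workload.
print(lpt_makespan(4, [1, 2, 3, 4, 5, 6, 7, 8]))  # -> 9.0
```

On this toy input LPT hits the perfect balance of 9 (total 36 over 4 workers), whereas claiming tiles in natural order finishes at 12, which is the gap a good swizzle is trying to close.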
I don't really know why noop (fully dense FA4 with no sparsity and no score mod) takes a hit for hdim 64, but only on the non-GQA path; the pattern looks too regular to be chance.
What needs to be figured out before landing: