[Ai-assisted] CLC work stealing#2218

Merged
drisspg merged 1 commit into main from drisspg/stack/8
Mar 28, 2026
Conversation

@drisspg
Collaborator

@drisspg drisspg commented Jan 31, 2026

Stacked PRs:


CLC work stealing

Not for land yet; the scheduling abstraction leaks all over the place, which is bad. I'm going to find a better way to encapsulate it.

I had to help a lot on this one, even though Claude and Codex did most of the setup.

Example work steal "trace" recreated from printf logs
image

Perf run

We would expect the highest gain for the most imbalanced workloads under the current flex scheduling, and we roughly see that: e.g. alibi + causal perform the same and don't currently have an LPT schedule set. Document mask also sees a nice boost, which makes sense.

image
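To illustrate why LPT scheduling matters for these imbalanced cases, here is a minimal, self-contained sketch (not the FA4 scheduler; names and tile costs are illustrative) comparing a static round-robin tile assignment against longest-processing-time-first greedy assignment on a causal-like linearly growing cost profile:

```python
import heapq

def round_robin(costs, n_workers):
    # Static assignment: tile i always goes to worker i % n_workers.
    loads = [0.0] * n_workers
    for i, c in enumerate(costs):
        loads[i % n_workers] += c
    return max(loads)  # makespan = most-loaded worker

def lpt(costs, n_workers):
    # Longest-processing-time-first: sort tiles by cost descending and
    # always hand the next tile to the currently least-loaded worker.
    heap = [0.0] * n_workers
    heapq.heapify(heap)
    for c in sorted(costs, reverse=True):
        heapq.heappush(heap, heapq.heappop(heap) + c)
    return max(heap)

# Causal-attention-like skew: row block i attends to ~i key blocks,
# so per-tile cost grows linearly -> very imbalanced across tiles.
costs = [float(i + 1) for i in range(64)]
print(round_robin(costs, 8), lpt(costs, 8))
```

LPT lands essentially on the ideal balanced makespan here, while the static mapping leaves the last worker with all the longest causal rows; work stealing attacks the same imbalance dynamically instead of relying on a good static order.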

I don't really know why noop (fully dense FA4 with no sparsity and no score mod) takes a hit for hdim 64, but only on the non-GQA path; the pattern looks too regular to be chance.

What needs to be figured out before landing

  1. A more unified API and a mechanism for turning this on with the env var. I think we should universally enable it for flex use cases. Forward-only is fine for now, but we will likely want bwd integration.
  2. I have been debugging the weirdest race condition for the 128x128 test. I have narrowed it down somewhat. My current working theory is that response_ptr gets allocated with random smem data. We have num_tiles < num_sms, so only initial work is needed. We query CLC; if we print after the consumer_wait in warp 15, we see that CLC says there is no more work (all invalid). The other consumer warps are not properly syncing, and some of them end up pulling the random response data before it has actually been populated. Racecheck shows me some errors but isn't helping find the source of this race.
  3. The register spills for the no-op MHA case are weird. I also spent some time debugging this. NCU points pretty much to a huge register spill. However, there is no good reason for this to be happening in this case and not others (AFAIK). I dumped the PTX and then compiled with the 13.1 ptxas, and it showed no spills in the SASS (Claude was helping here). I'm not 100% convinced this is just a random ptxas edge case, but I'm leaning that way. Also, the ptxas patch thing didn't really seem to be working; something else to look at.
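The suspected race in item 2 has a simple shape that can be sketched host-side. This toy (Python threads standing in for consumer warps; all names hypothetical) models a response slot that starts with garbage and a producer that must publish it behind a synchronization point before any consumer reads:

```python
import threading

# Toy model of the suspected race: a scheduler response slot ("smem")
# starts with garbage, one producer populates it, and every consumer
# must synchronize before reading. Names are hypothetical.
GARBAGE, NO_MORE_WORK = -1, 0

class ResponseSlot:
    def __init__(self):
        self.value = GARBAGE           # freshly "allocated" smem: random bits
        self.full = threading.Event()  # stands in for the mbarrier phase flag

    def produce(self, value):
        self.value = value
        self.full.set()                # arrive: response is now valid

    def consume(self):
        self.full.wait()               # the consumer_wait every warp needs
        return self.value

slot = ResponseSlot()
seen = []
consumers = [threading.Thread(target=lambda: seen.append(slot.consume()))
             for _ in range(4)]
for t in consumers:
    t.start()
slot.produce(NO_MORE_WORK)             # CLC reply: no more work, all invalid
for t in consumers:
    t.join()
print(seen)  # every consumer observes NO_MORE_WORK, never GARBAGE
```

If some consumer warps skip the wait (or wait on the wrong barrier phase), they can read `GARBAGE` instead; that is exactly the symptom described above.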

drisspg added a commit that referenced this pull request Jan 31, 2026
stack-info: PR: #2218, branch: drisspg/stack/8
@drisspg drisspg mentioned this pull request Jan 31, 2026
@tridao
Member

tridao commented Jan 31, 2026

I like the direction, CLC is the right thing to do

@tridao
Member

tridao commented Jan 31, 2026

Cc @tzadouri who’s thinking about scheduling and persistence

@drisspg drisspg marked this pull request as draft February 1, 2026 22:23
@drisspg drisspg changed the base branch from drisspg/stack/7 to main February 1, 2026 22:23
drisspg added a commit that referenced this pull request Feb 1, 2026
stack-info: PR: #2218, branch: drisspg/stack/8
@drisspg drisspg changed the base branch from main to drisspg/stack/7 February 1, 2026 22:23
@drisspg drisspg marked this pull request as ready for review February 1, 2026 22:23
@drisspg drisspg marked this pull request as draft February 3, 2026 18:42
@drisspg drisspg changed the base branch from drisspg/stack/7 to main February 3, 2026 18:42
drisspg added a commit that referenced this pull request Feb 3, 2026
stack-info: PR: #2218, branch: drisspg/stack/8
@drisspg drisspg changed the base branch from main to drisspg/stack/7 February 3, 2026 18:43
@drisspg drisspg marked this pull request as ready for review February 3, 2026 18:43
@drisspg drisspg marked this pull request as draft February 3, 2026 21:53
@drisspg drisspg changed the base branch from drisspg/stack/7 to main February 3, 2026 21:53
drisspg added a commit that referenced this pull request Feb 3, 2026
stack-info: PR: #2218, branch: drisspg/stack/8
@drisspg drisspg changed the base branch from main to drisspg/stack/7 February 3, 2026 21:54
@drisspg drisspg marked this pull request as ready for review February 3, 2026 21:54
@drisspg drisspg marked this pull request as draft February 3, 2026 21:57
@drisspg drisspg changed the base branch from drisspg/stack/7 to main February 3, 2026 21:57
drisspg added a commit that referenced this pull request Feb 3, 2026
stack-info: PR: #2218, branch: drisspg/stack/8
@drisspg drisspg changed the base branch from main to drisspg/stack/7 February 3, 2026 21:57
@drisspg drisspg marked this pull request as ready for review February 3, 2026 21:57
@drisspg drisspg marked this pull request as draft February 3, 2026 21:59
@drisspg
Collaborator Author

drisspg commented Feb 5, 2026

@jayhshah this is what I'm getting with the benchmark:
image

This is an AWS node, and I'm somewhat dubious about how badly it thermally throttles. But I also have some NCU compares of dense, and you can potentially see less tail latency with CLC in the PM samples.

Normal schedule:
image

CLC:
image

NCU reported roughly 3% faster in this particular case

@jayhshah
Collaborator

jayhshah commented Feb 5, 2026

I threw up a branch, dynamic-persistent-with-semaphore, with the classic (pre-CLC) way of doing a dynamic persistent scheduler with a semaphore; it should be useful for ablations and eventually the SM90 kernel. The benchmark shows the expected improvement for small sequence length (holding batch * seqlen constant) but a regression against SingleTileLPTScheduler for large seqlen; this is more pronounced for MHA than GQA. It's also clear that the LPT swizzle needs to be tweaked for large seqlen; at minimum, batch and head should be treated differently.

dynamic_persistent_bench.txt
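The classic semaphore-based dynamic persistent scheduler mentioned above boils down to persistent workers draining one global atomic tile counter. A minimal host-side sketch (threads as stand-in "CTAs", a lock as stand-in for an atomicAdd on global memory; names hypothetical):

```python
import threading

# Sketch of a classic pre-CLC dynamic persistent scheduler: persistent
# CTAs grab the next tile index from one shared atomic counter instead
# of using a fixed tile<->CTA mapping. Names are hypothetical.
class TileCounter:
    def __init__(self, num_tiles):
        self.num_tiles = num_tiles
        self._next = 0
        self._lock = threading.Lock()  # stand-in for atomicAdd on gmem

    def fetch_next(self):
        with self._lock:
            tile = self._next
            self._next += 1
        return tile if tile < self.num_tiles else None  # None == retire

def persistent_worker(counter, out):
    # Each "CTA" loops until the counter runs dry; fast workers
    # naturally absorb more tiles, so imbalance self-corrects.
    while (tile := counter.fetch_next()) is not None:
        out.append(tile)

counter = TileCounter(num_tiles=37)
results = [[] for _ in range(4)]
threads = [threading.Thread(target=persistent_worker, args=(counter, out))
           for out in results]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(sum(results, [])))  # every tile 0..36 processed exactly once
```

CLC replaces the gmem atomic round-trip with a hardware scheduler query, but the work-distribution semantics are the same, which is what makes this branch a useful ablation baseline.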

@drisspg
Collaborator Author

drisspg commented Mar 7, 2026

Alright, got 2-CTA working, still forward only. One thing I noticed is that we are using 2-CTA even for M < 256, which doesn't seem right.
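The gate being suggested can be sketched in a few lines. This is a hypothetical heuristic (the tile size and function name are illustrative, not the FA4 API): only form a 2-CTA cluster when there are at least two M-tiles of work to split.

```python
# Hypothetical sketch: a 2-CTA cluster only pays off when M covers at
# least two M-tiles; below that, the second CTA in the cluster idles.
# tile_m=128 is illustrative, not the actual FA4 tile size.
def choose_cluster_m(M, tile_m=128, enable_2cta=True):
    if enable_2cta and M >= 2 * tile_m:
        return 2
    return 1

print(choose_cluster_m(192))  # 1: a 2-CTA cluster would be half idle
print(choose_cluster_m(512))  # 2
```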

@drisspg
Collaborator Author

drisspg commented Mar 10, 2026

Okay, a few follow-ups. One thing I found really helpful is these log messages plus a script to construct the trace from the logs. I think having FA4_LOG=0,1,2,3 with different types of logging makes sense, e.g. 1 is host-side logging that should have no perf impact, and 3 is full debug prints in the kernel using cute.printf. I can remove this from the PR, but I'm curious whether others agree / think this mechanism is worth adding.
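For concreteness, a minimal sketch of the host-side half of that mechanism (the level semantics are my reading of the proposal above, and the helper names are hypothetical):

```python
import os

# Sketch of the proposed FA4_LOG levels (assumed semantics):
# 0 = off, 1 = host-side only (no perf impact), 2 = schedule/launch
# events, 3 = full in-kernel debug prints via cute.printf.
def fa4_log_level():
    try:
        return max(0, min(3, int(os.environ.get("FA4_LOG", "0"))))
    except ValueError:
        return 0  # malformed value -> logging off

def log(level, msg):
    if fa4_log_level() >= level:
        print(f"[FA4_LOG:{level}] {msg}")

os.environ["FA4_LOG"] = "1"
log(1, "host: launching persistent kernel")   # printed at level 1
log(3, "kernel: warp 15 consumer_wait done")  # suppressed at level 1
```

Clamping and defaulting to 0 keeps a typo in the env var from accidentally enabling the perf-affecting kernel prints.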

Being able to explicitly disable 2-CTA and CLC is quite convenient; currently you need to opt in.

Also, I was bad with naming and used FA4_<var> instead of FLASH_ATTENTION_<var>; it is much easier to type :)

image

I think there are wins on dense cases.

But regardless, I think this is in a mostly landable state and worth some input.

This was referenced Mar 10, 2026
@drisspg
Collaborator Author

drisspg commented Mar 13, 2026

Okay
image

back to a better state -> install the 13 version of CuteDSL explicitly

@drisspg drisspg mentioned this pull request Mar 13, 2026
```python
os.environ["CUDA_VISIBLE_DEVICES"] = gpu_ids[worker_num % len(gpu_ids)]

def pytest_collection_finish(session):
    if not session.config.option.collectonly:
```
Collaborator Author

cleans up logs a little

@tridao
Member

tridao commented Mar 20, 2026

LGTM let's merge when it's ready

stack-info: PR: #2218, branch: drisspg/stack/8
Made-with: Cursor