Conversation
stack-info: PR: #2218, branch: drisspg/stack/8
f711f1d to d58e003
I like the direction; CLC is the right thing to do.

Cc @tzadouri, who's thinking about scheduling and persistence.
d58e003 to 784f382
784f382 to bb52b18
bb52b18 to 0f3f1e4
0f3f1e4 to 0c8bed0
@jayhshah this is what I'm getting with the benchmark. This is an AWS node, and I'm somewhat dubious about how badly it thermally throttles. But I also have some NCU compares against dense, and you can potentially see less tail latency with CLC in the PM samples; NCU reported roughly 3% faster in this particular case.

I threw up a branch, dynamic-persistent-with-semaphore, with the classic (pre-CLC) way of doing a dynamic persistent scheduler with a semaphore; it should be useful for ablations and eventually the SM90 kernel. The benchmark shows the expected improvement for small sequence lengths (holding batch * seqlen constant) but a regression against SingleTileLPTScheduler for large seqlen; this is more pronounced for MHA than GQA. It's also clear that the LPT swizzle needs to be tweaked for large seqlen; at the very least, batch and head should be treated differently.
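The semaphore scheme is only described in prose above, so here is a minimal toy model of it; this is a hypothetical illustration (not the branch's kernel code), assuming the classic design where all persistent workers claim their next tile from one shared atomic counter:

```python
import heapq

def dynamic_persistent_makespan(num_workers, tile_costs):
    """Toy model of the pre-CLC dynamic persistent scheme: every worker
    pulls tiles from one shared atomic counter, so whichever worker frees
    up first claims the next tile. Returns the makespan (finish time of
    the last worker)."""
    # min-heap of (time_worker_becomes_free, worker_id)
    free_at = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(free_at)
    for cost in tile_costs:  # tiles are claimed in counter order
        t, w = heapq.heappop(free_at)
        heapq.heappush(free_at, (t + cost, w))
    return max(t for t, _ in free_at)

# Skewed tile costs, e.g. causal attention where later rows do more work.
print(dynamic_persistent_makespan(4, [1, 2, 3, 4, 5, 6, 7, 8]))  # -> 12.0
```

For contrast, a static contiguous split of the same list over 4 workers would finish at 15 (the worker owning costs 7 + 8), which is why the dynamic counter helps on imbalanced shapes.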
Alright, got 2-CTA working, still forward only. One thing I noticed is that we are using 2-CTA even for M < 256, which doesn't seem right.
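The gating implied by the observation above could be as simple as this hypothetical helper; the name, the 128 M-tile size, and the 256 threshold are inferred from the comment, not taken from the actual kernel:

```python
def should_use_two_cta(m_extent: int, tile_m: int = 128) -> bool:
    """Hypothetical gate: a 2-CTA cluster only pays off when the M extent
    spans at least two M-tiles; below that, the second CTA has no tile of
    its own and one half of the pair sits idle."""
    return m_extent >= 2 * tile_m

print(should_use_two_cta(192))  # -> False: fall back to a single CTA
print(should_use_two_cta(512))  # -> True
```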
> os.environ["CUDA_VISIBLE_DEVICES"] = gpu_ids[worker_num % len(gpu_ids)]

> def pytest_collection_finish(session):
>     if not session.config.option.collectonly:

Cleans up the logs a little.
LGTM, let's merge when it's ready.
stack-info: PR: #2218, branch: drisspg/stack/8 Made-with: Cursor
Stacked PRs:
CLC work stealing
Not for land yet; the scheduling abstraction leaking all over the place is bad, and I'm going to find a better way to encapsulate it.
I had to help a lot on this one, even though Claude and Codex did most of the setup.
Example work steal "trace" recreated from printf logs

Perf run
We would expect the highest gain for the most imbalanced workloads under the current scheduling for flex, and we roughly see that: e.g. alibi + causal are the same and don't currently have the LPT schedule set. Document mask also sees a nice boost, which makes sense.
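For readers unfamiliar with the name, the LPT heuristic behind SingleTileLPTScheduler can be sketched generically; this is the textbook longest-processing-time-first greedy, not the kernel's actual swizzle:

```python
import heapq

def lpt_makespan(num_workers, tile_costs):
    """Greedy LPT: sort tiles by descending cost, then hand each tile to
    the currently least-loaded worker. Returns the makespan."""
    # min-heap of (current_load, worker_id)
    loads = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(loads)
    for cost in sorted(tile_costs, reverse=True):
        load, w = heapq.heappop(loads)
        heapq.heappush(loads, (load + cost, w))
    return max(load for load, _ in loads)

# The same skewed tile costs as an imbalanced (e.g. causal) workload.
print(lpt_makespan(4, [1, 2, 3, 4, 5, 6, 7, 8]))  # -> 9.0
```

On this toy input LPT hits the perfect balance of 9 (total 36 over 4 workers), whereas claiming tiles in natural order finishes at 12, which is the gap a good swizzle is trying to close.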
I don't really know why noop (fully dense FA4 with no sparsity and no score mod) takes a hit for hdim 64, but only on the non-GQA path; the pattern looks too regular to be chance.
What needs to be figured out before landing: