Conversation

@GoldenStain
Contributor

@GoldenStain GoldenStain commented Jan 7, 2026

The T.copy() operations I removed seem to be redundant. After removing them, I observed a slight performance improvement on L20.

Summary by CodeRabbit

  • Refactor
    • Improved Sparse MLA kernel forward pass efficiency by streamlining memory management. Removed intermediate buffer stages and now directs final computation results directly to output buffers, reducing memory overhead and enhancing kernel performance.

@github-actions

github-actions bot commented Jan 7, 2026

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Contributor

coderabbitai bot commented Jan 7, 2026

📝 Walkthrough

The forward Sparse MLA kernel was optimized by eliminating intermediate shared buffer writes. Final accumulator and log-sum-exp computation results now directly write to output buffers instead of copying through intermediate shared buffers, reducing memory traffic.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Sparse MLA Kernel Output Optimization: examples/deepseek_v32/sparse_mla_fwd.py | Removed intermediate shared buffer copies in tail of forward pass; final O_shared and Lse_shared results now write directly to Output and Lse buffers, bypassing unnecessary intermediate writes |
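To make the summarized change concrete, here is a toy model in plain Python. It is not TileLang code and does not reproduce the actual diff: Python lists stand in for GPU buffers, and the helper names `copy`, `store_staged`, and `store_direct` are hypothetical. It only illustrates the difference between staging a result through a shared buffer and writing it straight to the output.

```python
# Toy model only: plain Python lists stand in for GPU buffers.
# None of these helpers are TileLang APIs; the names are hypothetical.

def copy(src, dst, log, label):
    """Copy src into dst element-wise and record the transfer in log."""
    dst[:] = src
    log.append(label)


def store_staged(acc_o):
    """Old-style tail: accumulator -> shared staging buffer -> output."""
    log = []
    o_shared = [0.0] * len(acc_o)
    output = [0.0] * len(acc_o)
    copy(acc_o, o_shared, log, "accumulator -> shared")
    copy(o_shared, output, log, "shared -> global")
    return output, log


def store_direct(acc_o):
    """New-style tail: accumulator -> output, with no staging buffer."""
    log = []
    output = [0.0] * len(acc_o)
    copy(acc_o, output, log, "accumulator -> global")
    return output, log


acc_o = [0.5, 1.5, 2.5, 3.5]
staged_out, staged_log = store_staged(acc_o)
direct_out, direct_log = store_direct(acc_o)

# Both paths produce the same result; only the number of transfers differs.
print(staged_out == direct_out, len(staged_log), len(direct_log))
```

Whether the shorter path is actually faster on real hardware depends on how each copy is lowered, which is why it has to be measured on the target GPU.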

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 A path uncluttered, buffers swept clean,
No needless copies in between,
Direct to output, swift and lean,
Memory flows like a pristine stream! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Title check | ✅ Passed | The title accurately describes the main change: removing redundant T.copy() operations from the sparse_mla_fwd.py file to improve performance. |

@GoldenStain GoldenStain changed the title remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py [Example] remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py Jan 8, 2026
@GoldenStain GoldenStain changed the title [Example] remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py [Example]remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py Jan 8, 2026
@LeiWang1999
Member

Thanks for your report! Would you mind trying with T.copy(o_shared, O) enabled and checking the performance on L20? I think this can be somewhat faster on Hopper-like devices, since we can utilize TMA store.

@SiriusNEO SiriusNEO changed the title [Example]remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py [Example] Remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py Jan 8, 2026
@GoldenStain
Contributor Author

Tested on commit aca9218, using this config:

if __name__ == "__main__":
    test_sparse_mla_fwd(
        B=1,
        S=4096,
        SKV=4096,
        H=16,
        HKV=1,
        DQK=576,
        DV=512,
        topk=2048,
        dtype=torch.bfloat16,
        check_correctness=True,
        block_I=64,
        num_stages=1,
        threads=256,
    )

With both copies enabled:

Average time: 5.444 ms
fwd io bandwidth =  1.7752413908000924
fwd tflops =  53.65173981084723
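For readers wondering how these two figures relate to the config, the sketch below reproduces them from the reported average time. The formulas are an assumption inferred from the fact that they match the printed numbers to within the rounding of the time. The thread does not show the benchmark code, and `fwd_metrics` is a hypothetical helper name. The factor of 2 in the I/O term is presumably the byte size of a bfloat16 element, and the bandwidth unit is presumably TB/s.

```python
def fwd_metrics(ms, B, S, H, DQK, DV, topk):
    """Return (io_bandwidth, tflops) for one forward pass taking `ms` milliseconds.

    Assumed formulas, chosen because they match the numbers in the thread:
      flops    = B * S * (DQK + DV) * topk * 2 * H
      io_bytes = B * S * DQK * topk * 2
    """
    seconds = ms * 1e-3
    flops = B * S * (DQK + DV) * topk * 2 * H
    io_bytes = B * S * DQK * topk * 2
    return io_bytes / seconds / 1e12, flops / seconds / 1e12


# Config values from the test call above, with the reported 5.444 ms.
io_bandwidth, tflops = fwd_metrics(
    5.444, B=1, S=4096, H=16, DQK=576, DV=512, topk=2048
)
print(f"fwd io bandwidth ~ {io_bandwidth:.3f}")
print(f"fwd tflops ~ {tflops:.2f}")
```

Because the work per run is fixed under these formulas, both metrics scale inversely with the measured time, so the three runs can be compared on time alone.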

With both copies disabled:

assert_tensors_similar passed
Average time: 5.377 ms
fwd io bandwidth =  1.797354982309293
fwd tflops =  54.32006168756974

With only the acc_o copy enabled:

Average time: 5.395 ms
fwd io bandwidth =  1.7911188668007116
fwd tflops =  54.13159241886596
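To put the three timings side by side, the short script below computes how much time each variant saves relative to the run with both copies enabled. It uses only the average times reported above. The labels are descriptive names chosen here, and since each figure is a single reported average, the differences (roughly 1.2% and 0.9%) may be within run-to-run noise.

```python
# Average times in milliseconds, as reported in this comment.
timings_ms = {
    "both copies enabled": 5.444,
    "both copies disabled": 5.377,
    "only acc_o copy enabled": 5.395,
}

baseline = timings_ms["both copies enabled"]

# Percentage of time saved relative to the baseline run.
saved_pct = {
    name: (baseline - ms) / baseline * 100.0
    for name, ms in timings_ms.items()
}

for name, ms in timings_ms.items():
    print(f"{name}: {ms:.3f} ms ({saved_pct[name]:.2f}% less time than baseline)")
```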

@LeiWang1999
Member

I see, LGTM, Thanks!

@LeiWang1999 LeiWang1999 merged commit 5e347e3 into tile-ai:main Jan 12, 2026
3 checks passed