[Example] Remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py
#1634
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
📝 Walkthrough
The forward Sparse MLA kernel was optimized by eliminating intermediate shared buffer writes. The final accumulator and log-sum-exp results are now written directly to the output buffers instead of being copied through intermediate shared buffers, which reduces memory traffic.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks: ✅ 3 passed |
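The walkthrough's point, that a staged copy and a direct write produce the same output while the staged path writes every element one extra time, can be sketched in plain Python. This is only a rough model and not TileLang code: the lists stand in for the fragment, shared, and output buffers, and the names `store_via_shared`, `store_direct`, and `traffic` are hypothetical. The 16 x 512 shape borrows H=16 and DV=512 from the test configuration in this thread.

```python
# Toy model of the two store paths. Plain lists stand in for GPU buffers;
# all function and variable names here are hypothetical, not TileLang APIs.

def store_via_shared(acc_o, output, traffic):
    """Staged path: accumulator -> shared staging buffer -> output."""
    shared = [row[:] for row in acc_o]  # extra copy into the staging buffer
    traffic["shared_writes"] += sum(len(row) for row in shared)
    for i, row in enumerate(shared):
        output[i][:] = row
        traffic["output_writes"] += len(row)


def store_direct(acc_o, output, traffic):
    """Direct path: accumulator -> output, with no staging buffer."""
    for i, row in enumerate(acc_o):
        output[i][:] = row
        traffic["output_writes"] += len(row)


H, DV = 16, 512  # shape taken from the test configuration in this thread
acc_o = [[float(h * DV + d) for d in range(DV)] for h in range(H)]

out_staged = [[0.0] * DV for _ in range(H)]
out_direct = [[0.0] * DV for _ in range(H)]
traffic_staged = {"shared_writes": 0, "output_writes": 0}
traffic_direct = {"shared_writes": 0, "output_writes": 0}

store_via_shared(acc_o, out_staged, traffic_staged)
store_direct(acc_o, out_direct, traffic_direct)

# Both paths yield identical output; only the staged one touches "shared".
results_match = out_staged == out_direct
```

In this model the staged path performs `H * DV` additional element writes for the same result. Whether that translates into a measurable speedup on real hardware depends on the kernel and the device, which is why the thread relies on measurements.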
examples/deepseek_v32/sparse_mla_fwd.py
|
Thanks for your report! Would you mind trying to enable |
|
Tested on commit aca9218.
if __name__ == "__main__":
    test_sparse_mla_fwd(
        B=1,
        S=4096,
        SKV=4096,
        H=16,
        HKV=1,
        DQK=576,
        DV=512,
        topk=2048,
        dtype=torch.bfloat16,
        check_correctness=True,
        block_I=64,
        num_stages=1,
        threads=256,
    )
With both copies enabled:
With both copies disabled:
With only the acc_o copy enabled: |
|
I see, LGTM, Thanks! |
The T.copy() operations I removed seem to be redundant. After removing them, I observed a slight performance improvement on L20.