[Example] Remove redundant T.copy in `examples/deepseek_v32/sparse_mla_fwd.py` #1634

GoldenStain · 2026-01-07T18:11:04Z

The T.copy() operations I removed appear to be redundant. After removing them, I observed a slight performance improvement on L20.

Summary by CodeRabbit

Refactor
- Improved Sparse MLA kernel forward pass efficiency by streamlining memory management. Removed intermediate buffer stages and now directs final computation results directly to output buffers, reducing memory overhead and enhancing kernel performance.


github-actions · 2026-01-07T18:11:14Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run `pre-commit run --all-files` in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai · 2026-01-07T18:11:17Z

Walkthrough

The forward Sparse MLA kernel was optimized by eliminating intermediate shared buffer writes. Final accumulator and log-sum-exp computation results now directly write to output buffers instead of copying through intermediate shared buffers, reducing memory traffic.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Sparse MLA Kernel Output Optimization `examples/deepseek_v32/sparse_mla_fwd.py` | Removed intermediate shared buffer copies in tail of forward pass; final O_shared and Lse_shared results now write directly to Output and Lse buffers, bypassing unnecessary intermediate writes |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 A path uncluttered, buffers swept clean,
No needless copies in between,
Direct to output, swift and lean,
Memory flows like a pristine stream! ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Title check | ✅ Passed | The title accurately describes the main change: removing redundant T.copy() operations from the sparse_mla_fwd.py file to improve performance. |

LeiWang1999 · 2026-01-08T05:54:23Z

Thanks for your report! Would you mind trying with T.copy(o_shared, O) enabled and checking the performance on L20? I think this can be somewhat faster on Hopper-like devices, since we can utilize TMA store.

GoldenStain · 2026-01-08T10:49:18Z

Tested on commit aca9218, using this config:

if __name__ == "__main__":
    test_sparse_mla_fwd(
        B=1,
        S=4096,
        SKV=4096,
        H=16,
        HKV=1,
        DQK=576,
        DV=512,
        topk=2048,
        dtype=torch.bfloat16,
        check_correctness=True,
        block_I=64,
        num_stages=1,
        threads=256,
    )
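
As a rough sanity check on the `fwd tflops` figures in this comment, the sketch below derives a FLOP count from the config above. The formula is an assumption, not taken from the example script: it counts 2·DQK FLOPs for the QK product and 2·DV FLOPs for the PV product per (query, head, selected key). The helper names `sparse_mla_fwd_flops` and `tflops` are hypothetical.

```python
def sparse_mla_fwd_flops(B, S, H, DQK, DV, topk):
    """Assumed FLOP count for the sparse MLA forward pass.

    Each query token attends to `topk` selected keys per head:
    QK^T costs 2*DQK FLOPs and PV costs 2*DV FLOPs per (query, head, key).
    This formula is an assumption for illustration only.
    """
    return 2 * B * S * H * topk * (DQK + DV)


def tflops(flops, time_ms):
    """Convert a FLOP count and a runtime in milliseconds to TFLOP/s."""
    return flops / (time_ms * 1e-3) / 1e12


# Values copied from the config in this comment.
flops = sparse_mla_fwd_flops(B=1, S=4096, H=16, DQK=576, DV=512, topk=2048)

# Average times (ms) are the rounded values reported in this comment.
for label, ms in [("both copies enabled", 5.444), ("both copies disabled", 5.377)]:
    print(f"{label}: {tflops(flops, ms):.2f} TFLOP/s")
```

Under this assumption the results land within rounding distance of the reported numbers, which suggests the reported throughput changes simply mirror the timing changes.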

With both copies enabled:

Average time: 5.444 ms
fwd io bandwidth =  1.7752413908000924
fwd tflops =  53.65173981084723

With both copies disabled:

assert_tensors_similar passed
Average time: 5.377 ms
fwd io bandwidth =  1.797354982309293
fwd tflops =  54.32006168756974

With only the acc_o copy enabled:

Average time: 5.395 ms
fwd io bandwidth =  1.7911188668007116
fwd tflops =  54.13159241886596
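
To put the three timings side by side, here is a small sketch that computes each variant's improvement relative to the run with both copies enabled. The variable and function names are hypothetical, and the inputs are the rounded averages reported above, so the percentages are only approximate.

```python
# Average times in ms, copied from the three runs reported above.
timings_ms = {
    "both copies enabled": 5.444,
    "both copies disabled": 5.377,
    "only acc_o copy enabled": 5.395,
}


def improvement_pct(baseline_ms, candidate_ms):
    """Percentage reduction in runtime relative to the baseline."""
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0


baseline = timings_ms["both copies enabled"]
improvements = {
    name: improvement_pct(baseline, ms) for name, ms in timings_ms.items()
}

for name, pct in improvements.items():
    print(f"{name}: {pct:.2f}% faster than baseline")
```

The differences come out at roughly 1%. No run-to-run variance is reported in this thread, so this is best read as a small effect rather than a precise measurement.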

LeiWang1999 · 2026-01-12T04:47:49Z

I see, LGTM, Thanks!

remove redundant T.copy

8772944

GoldenStain changed the title ~~remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py~~ [Example] remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py Jan 8, 2026

GoldenStain changed the title ~~[Example] remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py~~ [Example]remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py Jan 8, 2026

LeiWang1999 approved these changes Jan 8, 2026

View reviewed changes

SiriusNEO changed the title ~~[Example]remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py~~ [Example] Remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py Jan 8, 2026

LeiWang1999 merged commit 5e347e3 into tile-ai:main Jan 12, 2026
3 checks passed

kurisu6912 mentioned this pull request Feb 11, 2026

[LoopVectorize] Loop Independent Var Optimization in IfThenElse Expr kurisu6912/tilelang#2

Closed
