Conversation
Code Review
This pull request introduces significant performance optimizations for Triton MLA, particularly for long context lengths. The changes include enabling CUDA graph support for decode, which can reduce kernel launch overhead. A more adaptive heuristic for calculating num_kv_splits is introduced, which should improve parallelism on different hardware. The core Triton attention kernel has been optimized by improving memory access patterns for better coalescing, reordering loads to hide latency, and adding cache modifier hints for the Triton compiler. These are solid, expert-level optimizations that should lead to the claimed performance improvements. The changes look correct and well-implemented.
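The adaptive num_kv_splits heuristic described above can be illustrated with a toy sketch (function name, thresholds, and signature are hypothetical, not vLLM's actual code): when few requests are in flight, each request's KV range is split across more program instances so that decode still saturates the GPU's SMs.

```python
def adaptive_num_kv_splits(batch_size, seq_len, num_sms, max_splits=64):
    """Hypothetical sketch: pick more KV splits when few blocks would
    otherwise be launched, so small batches still fill the SMs."""
    # Without splitting, roughly one program instance runs per request.
    blocks_without_split = batch_size
    if blocks_without_split >= num_sms:
        return 1  # already enough parallelism, splitting only adds overhead
    # Split each request's KV range until the launch grid covers the SMs,
    # capped by max_splits and by the number of splittable KV positions.
    splits = (num_sms + blocks_without_split - 1) // blocks_without_split
    splits = min(splits, max_splits, max(seq_len, 1))
    return max(splits, 1)
```

At batch size 1 on a GPU with 132 SMs this would request the full 64 splits, while a batch of 256 requests would use a single split per request.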
Hi @koush, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Benchmarks on Kimi k2.5
Force-pushed 76a0247 to 8e64508
Signed-off-by: Koushik Dutta <koushd@gmail.com>
Force-pushed 8e64508 to 0d93dc7
@tjtanaa I tested -dcp with CUDA graphs and the model worked fine, but I did notice a performance degradation compared to not using CUDA graphs; I'm not sure what the expected behavior is supposed to be. Since dcp is the recommended way to run this (it saves KV cache space by preventing data duplication) and CUDA graphs reduced performance, I backed out that part of my change.
Testing with -dcp 8 and 80k context
On the main branch, I'm seeing that dcp doesn't degrade as catastrophically at long context, whereas tp alone does. In any case, the patch still provides notable performance improvements, particularly when dcp is low or unset.
mgoin left a comment:
Okay, nice work; this looks reasonable to me now. Can you run an accuracy evaluation? A simple gsm8k is good. Another point would be to test GLM-4.7-Flash, since it has a unique head size compared to deepseek/kimi MLA.
I assume you mean this? https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_lm_eval.html
I'm running this like
Is this what you need?
Purpose
Triton MLA performance on sm120 degrades at batch size 1 as context length increases.
This perf issue has been bugging me for a while, as it made DeepSeek and Kimi K2 unusable; when Kimi K2.5 was released I finally got around to digging into it. I'm familiar with CUDA, but this is my first foray into Triton.
The primary issues are suboptimal KV splitting at low batch counts, which leaves SMs underutilized, and an unnecessary load of the Q vector, since c_kv contains both.
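One reading of the "c_kv contains both" point is the standard MLA decode property that the cached latent buffer serves as both key and value, so one load can be reused instead of issuing two. A toy NumPy sketch of that idea (shapes and names are illustrative, not the actual Triton kernel):

```python
import numpy as np

def mla_decode_step(q_latent, c_kv):
    """Toy absorbed-MLA decode for one query token.
    q_latent: (d,) query already projected into the latent space.
    c_kv: (seq, d) cached latent vectors, shared by K and V."""
    scores = c_kv @ q_latent            # the cached latent acts as K
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                # softmax over the KV positions
    return probs @ c_kv                 # the SAME buffer acts as V: no second load
```

In a real kernel the payoff is that each tile of c_kv is fetched from global memory once and used for both the score computation and the value accumulation.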
Test Plan
Test Kimi 2.5, Deepseek v2, Qwen 235B.
Test Result
Models load and generate correctly, Kimi 2.5 shows notable improvement as seen below.
Testing with -dcp 8 and 80k context