[Feature] Batch invariant: Enable TRITON_MLA without prefix-caching #29125
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Code Review
This pull request enables TRITON_MLA for batch invariance by disabling prefix caching, which is a reasonable workaround for the current limitation. The changes are well-contained, and the tests are updated accordingly. I have one suggestion to improve user experience by logging when prefix caching is disabled implicitly.
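The suggested log could look something like the following minimal sketch, using the standard `logging` module. The helper name `disable_prefix_caching_for_batch_invariance` and the duck-typed `cache_config` argument are illustrative assumptions, not actual vLLM API:

```python
import logging

logger = logging.getLogger("vllm.config")


def disable_prefix_caching_for_batch_invariance(cache_config, backend: str) -> bool:
    """Hypothetical helper: turn off prefix caching and tell the user why.

    `cache_config` is any object with an `enable_prefix_caching` attribute.
    Returns True if the setting was actually changed.
    """
    if not getattr(cache_config, "enable_prefix_caching", False):
        return False  # already off, nothing to report
    cache_config.enable_prefix_caching = False
    logger.warning(
        "Prefix caching disabled implicitly: batch-invariant mode with the "
        "%s backend does not support it yet.",
        backend,
    )
    return True
```

Logging at `WARNING` level makes the implicit behavior change visible without failing the run.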
vllm/v1/core/sched/scheduler.py (outdated)

```python
enable_caching = bool(self.cache_config.enable_prefix_caching)

# TODO(wentao): fix prefix caching for batch invariance of TRITON_MLA.
if vllm_is_batch_invariant() and envs.VLLM_ATTENTION_BACKEND == "TRITON_MLA":
```
We should not be checking the environment variable, as it isn't set when the backend is chosen automatically. I think this only works in your test because you set it explicitly, i.e. `monkeypatch.setenv("VLLM_ATTENTION_BACKEND", backend)`.

This might need to wait for Matt's work to make a proper AttentionConfig (#26315).

Also, this decision should not be happening in the scheduler, as that is too late. It should happen after attention backend selection, by updating `self.cache_config.enable_prefix_caching` itself.
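One way to express the reviewer's suggestion as code, sketched with stand-in names (`CacheConfig`, `finalize_cache_config`, and the stubbed `vllm_is_batch_invariant` are illustrative; this is not the actual vLLM config flow):

```python
from dataclasses import dataclass


@dataclass
class CacheConfig:
    """Minimal stand-in for vLLM's cache config."""
    enable_prefix_caching: bool = True


def vllm_is_batch_invariant() -> bool:
    # Stub for the real helper; assume batch-invariant mode is enabled.
    return True


def finalize_cache_config(cache_config: CacheConfig, selected_backend: str) -> CacheConfig:
    """Apply backend-dependent constraints AFTER attention backend selection.

    The scheduler then only ever reads the already-corrected flag, instead of
    re-deriving the decision from an environment variable at schedule time.
    """
    # TODO(wentao): fix prefix caching for batch invariance of TRITON_MLA.
    if vllm_is_batch_invariant() and selected_backend == "TRITON_MLA":
        cache_config.enable_prefix_caching = False
    return cache_config
```

The key point is that `selected_backend` is the resolved backend name, not `envs.VLLM_ATTENTION_BACKEND`, so automatic backend selection is handled correctly.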
This pull request has merge conflicts that must be resolved before it can be merged.
@mgoin CC, we hope to land this first so that users can have an option for MLA models.
…vllm-project#29125)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose

Enable `TRITON_MLA` without prefix-caching.

Supporting the kernel with prefix-caching would involve a very large change; I will file a follow-up issue for that.
Test
Other unit tests
R1
```shell
VLLM_TEST_MODEL=deepseek-ai/DeepSeek-R1 VLLM_TP_SIZE=8 pytest test_online_batch_invariance.py::test_logprobs_bitwise_batch_invariance_bs1_vs_bsN[TRITON_MLA]
```