
[Feature] Batch invariant: Enable TRITON_MLA without prefix-caching #29125

Merged
mgoin merged 15 commits into main from
wentao-TritonMLA-support-without-prefix-caching
Dec 9, 2025
Conversation

Member

@yewentao256 yewentao256 commented Nov 20, 2025

Purpose

Enable TRITON_MLA without prefix-caching

Supporting this kernel with prefix caching would require a very large change, so I will open a follow-up issue for that.

Test

Other unit tests

(wentao) wentao@dgxB200-09:~/vllm-source/tests/v1/determinism$ pytest
==================================== test session starts ====================================
platform linux -- Python 3.12.3, pytest-9.0.1, pluggy-1.6.0
rootdir: /home/wentao/vllm-source
configfile: pyproject.toml
plugins: anyio-4.11.0
collected 105 items                                                                         

test_batch_invariance.py .............                                                [ 12%]
test_online_batch_invariance.py ...                                                   [ 15%]
test_rms_norm_batch_invariant.py .................................................... [ 64%]
.....................................                                                 [100%]

======================= 105 passed, 22 warnings in 647.81s (0:10:47) ========================

R1

VLLM_TEST_MODEL=deepseek-ai/DeepSeek-R1 VLLM_TP_SIZE=8 pytest test_online_batch_invariance.py::test_logprobs_bitwise_batch_invariance_bs1_vs_bsN[TRITON_MLA]

test_online_batch_invariance.py .                                                     [100%]

========================= 1 passed, 2 warnings in 209.41s (0:03:29) =========================

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables TRITON_MLA for batch invariance by disabling prefix caching, which is a reasonable workaround for the current limitation. The changes are well-contained, and the tests are updated accordingly. I have one suggestion to improve user experience by logging when prefix caching is disabled implicitly.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 20, 2025
enable_caching = bool(self.cache_config.enable_prefix_caching)

# TODO(wentao): fix prefix caching for batch invariance of TRITON_MLA.
if vllm_is_batch_invariant() and envs.VLLM_ATTENTION_BACKEND == "TRITON_MLA":
Member

@mgoin mgoin Nov 21, 2025


We should not be checking the environment variable, as it isn't set when we choose the backend automatically. I think this only works in your test because you set the environment variable, i.e. monkeypatch.setenv("VLLM_ATTENTION_BACKEND", backend).
This might need to wait for Matt's work on a proper AttentionConfig (#26315).

Also, this decision should not happen in the scheduler, as that is too late. It should happen after attention backend selection, updating self.cache_config.enable_prefix_caching itself.
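The config-level approach described above can be sketched as a small, self-contained toy. The class and the finalize_cache_config helper are illustrative stand-ins, not vLLM's real code; only vllm_is_batch_invariant, TRITON_MLA, and cache_config.enable_prefix_caching come from this thread:

```python
# Hypothetical sketch: disable prefix caching at the config level, once,
# right after the attention backend has been resolved -- rather than reading
# the VLLM_ATTENTION_BACKEND env var inside the scheduler, which may be unset
# when the backend is chosen automatically.
from dataclasses import dataclass


@dataclass
class CacheConfig:
    enable_prefix_caching: bool = True


def vllm_is_batch_invariant() -> bool:
    # Stub for the real helper; assume batch-invariant mode is enabled.
    return True


def finalize_cache_config(cache_config: CacheConfig, resolved_backend: str) -> None:
    """Run after backend selection, so the decision uses the actually
    resolved backend name instead of a possibly-unset env var."""
    if vllm_is_batch_invariant() and resolved_backend == "TRITON_MLA":
        # TRITON_MLA does not yet support prefix caching in batch-invariant
        # mode, so turn it off on the config object itself; every downstream
        # consumer (including the scheduler) then sees a consistent value.
        cache_config.enable_prefix_caching = False


cfg = CacheConfig(enable_prefix_caching=True)
finalize_cache_config(cfg, resolved_backend="TRITON_MLA")
print(cfg.enable_prefix_caching)  # False
```

Because the flag is flipped on the shared config object before the scheduler is constructed, no component needs to repeat the backend check later.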

Member Author


I have fixed this; could you take another look? @mgoin
I don't think we need to wait for #26315, as this is needed for batch-invariant MLA.

@mergify

mergify bot commented Nov 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @yewentao256.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 21, 2025
@mergify mergify bot removed the needs-rebase label Nov 24, 2025
@mergify

mergify bot commented Nov 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @yewentao256.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 26, 2025
@mergify mergify bot removed the needs-rebase label Nov 26, 2025
@yewentao256
Member Author

cc @mgoin, we hope to land this first so that users have an option for MLA models

Member

@mgoin mgoin left a comment


LGTM

@mgoin mgoin merged commit d941709 into main Dec 9, 2025
52 checks passed
@mgoin mgoin deleted the wentao-TritonMLA-support-without-prefix-caching branch December 9, 2025 00:31
@mgoin mgoin restored the wentao-TritonMLA-support-without-prefix-caching branch December 9, 2025 00:32
@yewentao256 yewentao256 deleted the wentao-TritonMLA-support-without-prefix-caching branch December 9, 2025 00:44
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…vllm-project#29125)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1


3 participants