[Feature] Batch invariant: Enable TRITON_MLA without prefix-caching #29125
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Code Review
This pull request enables TRITON_MLA for batch invariance by disabling prefix caching, which is a reasonable workaround for the current limitation. The changes are well-contained, and the tests are updated accordingly. I have one suggestion to improve user experience by logging when prefix caching is disabled implicitly.
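The suggested log could look something like the following minimal sketch, using the standard `logging` module. The helper name `disable_prefix_caching_for_batch_invariance` and the duck-typed `cache_config` argument are illustrative assumptions, not actual vLLM API:

```python
import logging

logger = logging.getLogger("vllm.config")


def disable_prefix_caching_for_batch_invariance(cache_config, backend: str) -> bool:
    """Hypothetical helper: turn off prefix caching and tell the user why.

    `cache_config` is any object with an `enable_prefix_caching` attribute.
    Returns True if the setting was actually changed.
    """
    if not getattr(cache_config, "enable_prefix_caching", False):
        return False  # already off, nothing to report
    cache_config.enable_prefix_caching = False
    logger.warning(
        "Prefix caching disabled implicitly: batch-invariant mode with the "
        "%s backend does not support it yet.",
        backend,
    )
    return True
```

Logging at `WARNING` level makes the implicit behavior change visible without failing the run.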
vllm/v1/core/sched/scheduler.py (outdated)

```python
enable_caching = bool(self.cache_config.enable_prefix_caching)

# TODO(wentao): fix prefix caching for batch invariance of TRITON_MLA.
if vllm_is_batch_invariant() and envs.VLLM_ATTENTION_BACKEND == "TRITON_MLA":
```
We should not be checking the environment variable, as it isn't set when the backend is chosen automatically. I think this only works in your test because you set it explicitly, i.e. `monkeypatch.setenv("VLLM_ATTENTION_BACKEND", backend)`.

This might need to wait for Matt's work to make a proper AttentionConfig (#26315).

Also, this decision should not be happening in the scheduler, as that is too late. It should happen after attention backend selection, by updating `self.cache_config.enable_prefix_caching` itself.
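One way to express the reviewer's suggestion as code, sketched with stand-in names (`CacheConfig`, `finalize_cache_config`, and the stubbed `vllm_is_batch_invariant` are illustrative; this is not the actual vLLM config flow):

```python
from dataclasses import dataclass


@dataclass
class CacheConfig:
    """Minimal stand-in for vLLM's cache config."""
    enable_prefix_caching: bool = True


def vllm_is_batch_invariant() -> bool:
    # Stub for the real helper; assume batch-invariant mode is enabled.
    return True


def finalize_cache_config(cache_config: CacheConfig, selected_backend: str) -> CacheConfig:
    """Apply backend-dependent constraints AFTER attention backend selection.

    The scheduler then only ever reads the already-corrected flag, instead of
    re-deriving the decision from an environment variable at schedule time.
    """
    # TODO(wentao): fix prefix caching for batch invariance of TRITON_MLA.
    if vllm_is_batch_invariant() and selected_backend == "TRITON_MLA":
        cache_config.enable_prefix_caching = False
    return cache_config
```

The key point is that `selected_backend` is the resolved backend name, not `envs.VLLM_ATTENTION_BACKEND`, so automatic backend selection is handled correctly.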
This pull request has merge conflicts that must be resolved before it can be merged.
@mgoin CC, we hope to land this first so that users can have an option for MLA models.
…vllm-project#29125)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose

Enable `TRITON_MLA` without prefix-caching.

Supporting the kernel with prefix-caching would involve a very large change; I will file a follow-up issue for that.
Test
Other unit tests
R1
```shell
VLLM_TEST_MODEL=deepseek-ai/DeepSeek-R1 VLLM_TP_SIZE=8 pytest test_online_batch_invariance.py::test_logprobs_bitwise_batch_invariance_bs1_vs_bsN[TRITON_MLA]
```