[Bugfix] Fix sparse MLA metadata building #33579
Conversation
Code Review
This pull request addresses a bug in sparse Multi-Head Latent Attention (MLA) by correctly handling metadata for sparse backends. The changes introduce a separate logic path for sparse implementations, bypassing the prefill/decode split that caused issues. For sparse backends, all tokens are now correctly routed through the forward_mqa path, which aligns with their design. The refactoring is clear and effectively resolves the bug. The changes look good.
Hi @MatthewBonanni, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
cc @zou3519 as this failure will probably show up in vLLM benchmark runs on PyTorch CI until this PR is merged. Here is an example failure: https://github.com/pytorch/pytorch-integration-testing/actions/runs/21584614705/job/62189623833#step:19:28492
```python
if is_sparse_impl:
    has_decode = True
    has_prefill = False
    num_decode_tokens = q.size(0)
```
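As context for the snippet above, here is a self-contained sketch of why `q.size(0)` ends up counting every token as a decode token on the sparse path. The `FakeTensor` class is a hypothetical stand-in for `torch.Tensor`, and the shape is illustrative; only the `is_sparse_impl` branch mirrors the diff.

```python
class FakeTensor:
    """Minimal stand-in for torch.Tensor (illustration only)."""
    def __init__(self, shape):
        self.shape = tuple(shape)

    def size(self, dim):
        return self.shape[dim]

# q stacks one query row per scheduled token: [num_tokens, num_heads, head_dim]
q = FakeTensor((7, 16, 64))  # 7 tokens in the batch, hypothetical head sizes

is_sparse_impl = True
if is_sparse_impl:
    # Sparse MLA routes every token through the MQA ("decode") pathway,
    # so the whole batch is labeled decode and q.size(0) is the total count.
    has_decode, has_prefill = True, False
    num_decode_tokens = q.size(0)

print(num_decode_tokens)  # 7
```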
I’m new to this area, so I have a possibly naive question. Why is q.size(0) equal to num_decode_tokens?
Where should I start tracing this?
This is because the sparse MLA implementation currently uses purely the MQA pathway for both prefill and decode, so every scheduled token, i.e. `q.size(0)` of them, is counted as a decode token (the MQA pathway is memory-bandwidth optimized and is only used for decodes in dense MLA).
Sorry, the naming here is a bit confusing; basically the MQA pathway is more memory-bandwidth efficient and the MHA pathway is more compute efficient. So it makes sense to use MQA for decode, where attention is memory bound, and MHA for prefill, which is more compute bound. However, sparsity changes this calculus, and currently we only have an MQA pathway, partly due to kernel support and partly because with sparsity and longer contexts it can make sense to use MQA for prefill too. Hence the renaming from `forward_decode` -> `forward_mqa` and `forward_prefill` -> `forward_mha`, to relax the associations with prefill/decode. There is still some legacy naming here that will likely need to be refactored in future PRs.
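The trade-off described above can be sketched as a small dispatch function. This is illustrative only — the real vLLM dispatch lives in the MLA layer and is more involved; `forward_mqa`/`forward_mha` are the method names from this thread, and the boolean flags are assumptions.

```python
def pick_pathway(is_sparse: bool, is_decode: bool) -> str:
    """Choose an attention pathway for a batch of tokens.

    MQA pathway: memory-bandwidth efficient -> suits memory-bound decode.
    MHA pathway: compute efficient -> suits compute-bound prefill.
    Sparse MLA currently only has an MQA kernel, so it always uses MQA.
    """
    if is_sparse:
        return "forward_mqa"
    return "forward_mqa" if is_decode else "forward_mha"

# Dense MLA splits by phase; sparse MLA does not.
print(pick_pathway(is_sparse=False, is_decode=True))   # forward_mqa
print(pick_pathway(is_sparse=False, is_decode=False))  # forward_mha
print(pick_pathway(is_sparse=True, is_decode=False))   # forward_mqa
```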
@MatthewBonanni to avoid confusion, can we get rid of `has_decode`, `has_prefill`, and `num_decode_tokens` and instead use `num_mqa_tokens` and `num_mha_tokens`, then do:

```python
if num_mha_tokens > 0:
    ...
if num_mqa_tokens > 0:
    ...
```

I think this may cause less confusion.
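A minimal sketch of what the suggested renaming could look like. The `build_token_counts` helper and its arguments are hypothetical, not vLLM's actual API; it only captures the split described in this thread.

```python
def build_token_counts(num_tokens: int, is_sparse_impl: bool,
                       num_decode_tokens: int) -> tuple[int, int]:
    """Split tokens by pathway rather than by prefill/decode labels.

    Sparse backends route everything through the MQA pathway; dense MLA
    sends decode tokens to MQA and the remaining prefill tokens to MHA.
    Returns (num_mqa_tokens, num_mha_tokens).
    """
    if is_sparse_impl:
        return num_tokens, 0
    return num_decode_tokens, num_tokens - num_decode_tokens

num_mqa_tokens, num_mha_tokens = build_token_counts(10, False, 4)
if num_mha_tokens > 0:
    pass  # would call forward_mha on the prefill-style tokens
if num_mqa_tokens > 0:
    pass  # would call forward_mqa on the decode-style tokens
```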
LucasWilkinson left a comment
Thanks for fixing this; overall it makes sense to me, but I think we should consider: https://github.com/vllm-project/vllm/pull/33579/changes#r2757360444
LucasWilkinson left a comment
LGTM thanks for the cleanups!
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: felix01.yu <felix01.yu@vipshop.com>
Purpose
Fix #33546
#33284 broke sparse MLA by moving logic from the backend to the layer without properly accounting for sparse backends.
Test Plan
Test Result
Main: crashes during startup
PR: