
[Bugfix] Fix sparse MLA metadata building #33579

Merged
vllm-bot merged 5 commits into vllm-project:main from MatthewBonanni:fix_sparse
Feb 3, 2026

Conversation

@MatthewBonanni (Collaborator) commented Feb 2, 2026

Purpose

Fix #33546
#33284 broke sparse MLA by moving logic from the backend to the layer without properly accounting for sparse backends.

Test Plan

vllm serve deepseek-ai/DeepSeek-V3.2 -tp 8 -ep

Test Result

Main: crashes during startup

AttributeError: 'FlashMLASparseMetadata' object has no attribute 'num_decodes'

PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9553|±  |0.0057|
|     |       |strict-match    |     5|exact_match|↑  |0.9545|±  |0.0057|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@mergify bot added the bug label Feb 2, 2026
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request addresses a bug in sparse Multi-Head Latent Attention (MLA) by correctly handling metadata for sparse backends. The changes introduce a separate logic path for sparse implementations, bypassing the prefill/decode split that caused issues. For sparse backends, all tokens are now correctly routed through the forward_mqa path, which aligns with their design. The refactoring is clear and effectively resolves the bug. The changes look good.
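The separate logic path can be sketched as follows. This is an illustrative, hedged sketch only, not the actual vLLM code: the function name and signature are hypothetical, but the flag and variable names follow the diff quoted later in this thread.

```python
def plan_tokens(num_query_tokens: int, is_sparse_impl: bool,
                num_prefill_tokens: int) -> tuple[bool, bool, int]:
    """Illustrative sketch: sparse MLA backends have no prefill/decode
    split in their metadata, so every token is routed through
    forward_mqa; dense MLA keeps the usual split.

    Returns (has_decode, has_prefill, num_decode_tokens)."""
    if is_sparse_impl:
        # All tokens take the decode (forward_mqa) path.
        return True, False, num_query_tokens
    num_decode_tokens = num_query_tokens - num_prefill_tokens
    return num_decode_tokens > 0, num_prefill_tokens > 0, num_decode_tokens
```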

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@mergify bot commented Feb 2, 2026

Hi @MatthewBonanni, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@pavanimajety added the ready label Feb 2, 2026
@huydhn (Contributor) commented Feb 3, 2026

cc @zou3519, as this failure will probably show up in vLLM benchmark runs on PyTorch CI until this PR is merged. Here is an example failure: https://github.com/pytorch/pytorch-integration-testing/actions/runs/21584614705/job/62189623833#step:19:28492

if is_sparse_impl:
    has_decode = True
    has_prefill = False
    num_decode_tokens = q.size(0)
Collaborator

I’m new to this area, so I have a possibly naive question. Why is q.size(0) equal to num_decode_tokens?

Where should I start tracing this?

Collaborator

This is because the sparse MLA implementation currently uses the MQA pathway exclusively for both prefill and decode, so num_decode_tokens covers all q.size(0) tokens. (The MQA pathway is memory-bandwidth optimized; for dense MLA, it is only used for decodes.)

Sorry, the naming here is a bit confusing; basically, the MQA pathway is more memory-bandwidth efficient and the MHA pathway is more compute efficient. So it makes sense to use MQA for decode, where attention is memory bound, and MHA for prefill, which is more compute bound. However, sparsity changes this calculus, and currently we only have an MQA pathway, partly due to kernel support and partly because, with sparsity and longer contexts, it can make sense to use MQA for the prefill too. Hence the renaming from forward_decode to forward_mqa and forward_prefill to forward_mha, to relax the association with prefill/decode. There is still some legacy naming here that will likely need to be refactored in future PRs.
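As a minimal runnable sketch of the per-token-phase dispatch described here (hypothetical function, not the actual vLLM code):

```python
def select_pathway(is_sparse: bool, is_decode: bool) -> str:
    """Illustrative: dense MLA routes prefill tokens to the
    compute-efficient MHA path (forward_mha) and decode tokens to the
    memory-bandwidth-efficient MQA path (forward_mqa); sparse MLA
    currently routes everything through forward_mqa, since only an
    MQA kernel path exists for sparse attention."""
    if is_sparse:
        return "forward_mqa"
    return "forward_mqa" if is_decode else "forward_mha"
```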

Collaborator

@MatthewBonanni to avoid confusion, can we get rid of has_decode, has_prefill, and num_decode_tokens and instead use num_mqa_tokens and num_mha_tokens, then do:

if num_mha_tokens > 0:
    ...

if num_mqa_tokens > 0:
    ...

I think this may cause less confusion
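A hedged, runnable sketch of what that rename could look like (illustrative only; the helper name and signature are hypothetical):

```python
def split_token_counts(num_tokens: int, is_sparse_impl: bool,
                       num_prefill_tokens: int) -> tuple[int, int]:
    """Illustrative sketch of the suggested rename: return
    (num_mqa_tokens, num_mha_tokens) instead of has_decode /
    has_prefill / num_decode_tokens. Sparse impls send every token
    down the MQA path; dense MLA sends prefill tokens down MHA."""
    if is_sparse_impl:
        return num_tokens, 0
    return num_tokens - num_prefill_tokens, num_prefill_tokens
```

Callers would then simply guard each pathway with `if num_mha_tokens > 0:` and `if num_mqa_tokens > 0:`, as suggested.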

Collaborator Author

Done in eda02c7

@LucasWilkinson (Collaborator) left a comment

thanks for fixing this; overall makes sense to me, but I think we should consider: https://github.com/vllm-project/vllm/pull/33579/changes#r2757360444

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@LucasWilkinson (Collaborator) left a comment

LGTM thanks for the cleanups!

@LucasWilkinson enabled auto-merge (squash) February 3, 2026 15:35
@LucasWilkinson removed this from the v0.15.1 Hotfix milestone Feb 3, 2026
@vllm-bot merged commit bd8da29 into vllm-project:main Feb 3, 2026
46 of 47 checks passed
@MatthewBonanni deleted the fix_sparse branch February 4, 2026 00:15
gameofdimension pushed a commit to gameofdimension/vllm that referenced this pull request Feb 5, 2026
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: felix01.yu <felix01.yu@vipshop.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

Labels

bug: Something isn't working
ready: ONLY add when PR is ready to merge/full CI is needed


Development

Successfully merging this pull request may close these issues.

[Bug][DeepSeekV32]: AttributeError: 'FlashMLASparseMetadata' object has no attribute 'num_decodes'

8 participants