
[Attention] Move MLA forward from backend to layer#33284

Merged
vllm-bot merged 16 commits into vllm-project:main from MatthewBonanni:mla_refactor
Jan 31, 2026

Conversation

@MatthewBonanni
Collaborator

@MatthewBonanni MatthewBonanni commented Jan 28, 2026

Purpose

Refactor MLA by moving the forward pass from the backend to the layer, to facilitate prefill-decode splitting.

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@MatthewBonanni MatthewBonanni changed the title [Attention] Move MLA forward from backend to layer [WIP][Attention] Move MLA forward from backend to layer Jan 28, 2026
@mergify mergify bot added the `nvidia`, `rocm` (Related to AMD ROCm), and `v1` labels Jan 28, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Jan 28, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the Multi-Head Latent Attention (MLA) implementation by moving the main forward logic from the backend-specific MLACommonImpl to the MLAAttention layer. This is a good architectural change that improves modularity and clarity. The backend implementations now provide more granular forward_mha and forward_mqa methods for prefill and decode paths, respectively. My review focuses on ensuring the correctness and safety of the refactored code.
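The split described above can be sketched in miniature. This is a hedged illustration, not vLLM's actual code: only the method names `forward_mha` and `forward_mqa` come from the PR; the class names, signatures, and token representation are invented for the sketch. The layer owns the prefill/decode routing, while the backend supplies only the two granular paths:

```python
# Hedged sketch of the layer-level prefill/decode split this PR describes.
# Only forward_mha / forward_mqa are names from the PR; everything else
# (MockBackendImpl, MLAAttentionLayer, integer "tokens") is illustrative.


class MockBackendImpl:
    """Stand-in for a backend-specific MLA implementation."""

    def forward_mha(self, tokens):
        # Prefill path: MHA-style attention over the prompt tokens.
        return [("mha", t) for t in tokens]

    def forward_mqa(self, tokens):
        # Decode path: MQA-style attention against the latent KV cache.
        return [("mqa", t) for t in tokens]


class MLAAttentionLayer:
    """The layer, not the backend, decides which path each token takes."""

    def __init__(self, impl):
        self.impl = impl

    def forward(self, prefill_tokens, decode_tokens):
        out = []
        if prefill_tokens:
            out += self.impl.forward_mha(prefill_tokens)
        if decode_tokens:
            out += self.impl.forward_mqa(decode_tokens)
        return out


layer = MLAAttentionLayer(MockBackendImpl())
out = layer.forward([0, 1], [2])  # two prefill tokens, one decode token
```

Moving the routing into the layer means each backend only implements the two narrow methods, which is what makes the follow-up prefill-decode splitting tractable.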

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@MatthewBonanni MatthewBonanni marked this pull request as ready for review January 29, 2026 14:49
@MatthewBonanni MatthewBonanni changed the title [WIP][Attention] Move MLA forward from backend to layer [Attention] Move MLA forward from backend to layer Jan 29, 2026
Collaborator

@LucasWilkinson LucasWilkinson left a comment


LGTM

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 29, 2026
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Collaborator

@ProExpertProg ProExpertProg left a comment


Not sure it's wise to replicate huge chunks of the layer in the mock layer from a maintenance perspective? What are we trying to avoid?

@MatthewBonanni
Collaborator Author

MatthewBonanni commented Jan 30, 2026

Per offline discussion with @ProExpertProg, the test will be refactored in a follow-up. The current intent of the test is to focus on backend logic, so this change requires mocking the layer logic.

A follow-up will add a separate test for the layer and make the backend tests more atomic.

Collaborator

@gshtras gshtras left a comment


Tested on ROCm for DS FP8 and FP4; GPT-OSS

@ProExpertProg ProExpertProg enabled auto-merge (squash) January 30, 2026 18:43
@vllm-bot vllm-bot merged commit aaa901a into vllm-project:main Jan 31, 2026
49 of 51 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 31, 2026
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Jan 31, 2026
@MatthewBonanni MatthewBonanni deleted the mla_refactor branch January 31, 2026 04:19
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Pai <416932041@qq.com>
```python
if fp8_attention:
    assert mqa_ql_nope.shape[0] == mqa_q_pe.shape[0]
    assert mqa_ql_nope.shape[1] == mqa_q_pe.shape[1]
    mqa_q = self._decode_concat_quant_fp8_op(
```
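The quoted hunk asserts that the no-position (nope) and rotary (pe) query parts agree on token and head counts before the fused concat-and-quantize call. A minimal sketch of that invariant, using plain Python lists in place of tensors: `concat_quant_fp8` is a hypothetical stand-in for the fused op, and the FP8 quantization step is elided (identity), so only the shape contract is illustrated.

```python
def concat_quant_fp8(ql_nope, q_pe):
    """Hypothetical stand-in for the fused concat+quant op. Checks that the
    two query parts agree on token count (shape[0]) and head count (shape[1]),
    then concatenates along the head-dim axis. The real op would also
    quantize the result to FP8; that step is elided here."""
    assert len(ql_nope) == len(q_pe)        # shape[0]: same token count
    assert len(ql_nope[0]) == len(q_pe[0])  # shape[1]: same head count
    return [
        [nope_h + pe_h for nope_h, pe_h in zip(nope_t, pe_t)]
        for nope_t, pe_t in zip(ql_nope, q_pe)
    ]


# 2 tokens, 1 head; nope dim 4 and rope dim 2 per head.
mqa_ql_nope = [[[1.0, 2.0, 3.0, 4.0]], [[5.0, 6.0, 7.0, 8.0]]]
mqa_q_pe = [[[0.1, 0.2]], [[0.3, 0.4]]]
mqa_q = concat_quant_fp8(mqa_ql_nope, mqa_q_pe)
# mqa_q keeps 2 tokens and 1 head, with head dim 4 + 2 = 6
```

The asserts fail fast on a token-count or head-count mismatch, which is exactly the kind of shape bug the follow-up issue in this thread is about.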
Contributor

@kebe7jun kebe7jun Feb 5, 2026


Hi @MatthewBonanni This may cause the issue: #33859

Collaborator Author


Thanks for catching this! Here's a fix: #33932

wangxiyuan added a commit to vllm-project/vllm-ascend that referenced this pull request Feb 5, 2026
### What this PR does / why we need it?
1. Fix `TypeError: FusedMoEParallelConfig.__init__() missing 1 required positional argument: 'is_sequence_parallel'` due to vllm-project/vllm#32567
2. Fix `TypeError: '>' not supported between instances of 'MagicMock' and 'int'` due to vllm-project/vllm#33035
3. Fix `TypeError: Can't instantiate abstract class AscendMLAImpl with abstract methods forward_mha, forward_mqa` and `AttributeError: 'bool' object has no attribute 'process_weights_after_loading'` due to vllm-project/vllm#33284
4. Fix `'AscendSharedFusedMoE' object has no attribute '_routed_input_transform'` due to vllm-project/vllm#32790
5. Fix `NPUModelRunner._dummy_run() got an unexpected keyword argument 'num_active_loras'` due to vllm-project/vllm#32005
6. Fix `'tuple' object has no attribute 'job_id'` due to vllm-project/vllm#27492
7. Fix all_moe_layers not matching vllm.moe_forward / vllm.moe_forward_shared due to vllm-project/vllm#33184
8. Add a patch to fix "got multiple values for keyword argument 'add_special_tokens'" due to vllm-project/vllm#32863
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Feb 12, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels

- nvidia
- ready (ONLY add when PR is ready to merge/full CI is needed)
- rocm (Related to AMD ROCm)
- v1

Projects

Status: Done (NVIDIA)
Status: Done (AMD)

Development

Successfully merging this pull request may close these issues.

7 participants