
[Attention] Move MLA forward from backend to layer#33284

Merged
vllm-bot merged 16 commits into vllm-project:main from MatthewBonanni:mla_refactor
Jan 31, 2026

Conversation

@MatthewBonanni
Collaborator

@MatthewBonanni MatthewBonanni commented Jan 28, 2026

Purpose

Refactor MLA by moving the forward pass from the backend to the layer, to facilitate prefill-decode splitting.

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@MatthewBonanni MatthewBonanni changed the title [Attention] Move MLA forward from backend to layer [WIP][Attention] Move MLA forward from backend to layer Jan 28, 2026
@mergify mergify bot added the `nvidia`, `rocm` (Related to AMD ROCm), and `v1` labels Jan 28, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Jan 28, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the Multi-Head Latent Attention (MLA) implementation by moving the main forward logic from the backend-specific MLACommonImpl to the MLAAttention layer. This is a good architectural change that improves modularity and clarity. The backend implementations now provide more granular forward_mha and forward_mqa methods for prefill and decode paths, respectively. My review focuses on ensuring the correctness and safety of the refactored code.
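The split described above can be sketched in miniature. This is a hedged illustration, not vLLM's actual code: only the method names `forward_mha` and `forward_mqa` come from the PR; the class names, signatures, and token representation are invented for the sketch. The layer owns the prefill/decode routing, while the backend supplies only the two granular paths:

```python
# Hedged sketch of the layer-level prefill/decode split this PR describes.
# Only forward_mha / forward_mqa are names from the PR; everything else
# (MockBackendImpl, MLAAttentionLayer, integer "tokens") is illustrative.


class MockBackendImpl:
    """Stand-in for a backend-specific MLA implementation."""

    def forward_mha(self, tokens):
        # Prefill path: MHA-style attention over the prompt tokens.
        return [("mha", t) for t in tokens]

    def forward_mqa(self, tokens):
        # Decode path: MQA-style attention against the latent KV cache.
        return [("mqa", t) for t in tokens]


class MLAAttentionLayer:
    """The layer, not the backend, decides which path each token takes."""

    def __init__(self, impl):
        self.impl = impl

    def forward(self, prefill_tokens, decode_tokens):
        out = []
        if prefill_tokens:
            out += self.impl.forward_mha(prefill_tokens)
        if decode_tokens:
            out += self.impl.forward_mqa(decode_tokens)
        return out


layer = MLAAttentionLayer(MockBackendImpl())
out = layer.forward([0, 1], [2])  # two prefill tokens, one decode token
```

Moving the routing into the layer means each backend only implements the two narrow methods, which is what makes the follow-up prefill-decode splitting tractable.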

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@MatthewBonanni MatthewBonanni marked this pull request as ready for review January 29, 2026 14:49
@MatthewBonanni MatthewBonanni changed the title [WIP][Attention] Move MLA forward from backend to layer [Attention] Move MLA forward from backend to layer Jan 29, 2026
Collaborator

@LucasWilkinson LucasWilkinson left a comment


LGTM

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 29, 2026
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Collaborator

@ProExpertProg ProExpertProg left a comment


Not sure it's wise to replicate huge chunks of the layer in the mock layer from a maintenance perspective? What are we trying to avoid?

@MatthewBonanni
Collaborator Author

MatthewBonanni commented Jan 30, 2026

Per offline discussion with @ProExpertProg, the test will be refactored in a follow-up. The current intent of the test is to focus on backend logic, so this change requires mocking the layer logic.

A follow-up will add a separate test for the layer and make the backend tests more atomic.

Collaborator

@gshtras gshtras left a comment


Tested on ROCm for DS FP8 and FP4; GPT-OSS

@ProExpertProg ProExpertProg enabled auto-merge (squash) January 30, 2026 18:43
@vllm-bot vllm-bot merged commit aaa901a into vllm-project:main Jan 31, 2026
49 of 51 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 31, 2026
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Jan 31, 2026
@MatthewBonanni MatthewBonanni deleted the mla_refactor branch January 31, 2026 04:19
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Pai <416932041@qq.com>
```python
if fp8_attention:
    assert mqa_ql_nope.shape[0] == mqa_q_pe.shape[0]
    assert mqa_ql_nope.shape[1] == mqa_q_pe.shape[1]
    mqa_q = self._decode_concat_quant_fp8_op(
```
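The quoted hunk asserts that the no-position (nope) and rotary (pe) query parts agree on token and head counts before the fused concat-and-quantize call. A minimal sketch of that invariant, using plain Python lists in place of tensors: `concat_quant_fp8` is a hypothetical stand-in for the fused op, and the FP8 quantization step is elided (identity), so only the shape contract is illustrated.

```python
def concat_quant_fp8(ql_nope, q_pe):
    """Hypothetical stand-in for the fused concat+quant op. Checks that the
    two query parts agree on token count (shape[0]) and head count (shape[1]),
    then concatenates along the head-dim axis. The real op would also
    quantize the result to FP8; that step is elided here."""
    assert len(ql_nope) == len(q_pe)        # shape[0]: same token count
    assert len(ql_nope[0]) == len(q_pe[0])  # shape[1]: same head count
    return [
        [nope_h + pe_h for nope_h, pe_h in zip(nope_t, pe_t)]
        for nope_t, pe_t in zip(ql_nope, q_pe)
    ]


# 2 tokens, 1 head; nope dim 4 and rope dim 2 per head.
mqa_ql_nope = [[[1.0, 2.0, 3.0, 4.0]], [[5.0, 6.0, 7.0, 8.0]]]
mqa_q_pe = [[[0.1, 0.2]], [[0.3, 0.4]]]
mqa_q = concat_quant_fp8(mqa_ql_nope, mqa_q_pe)
# mqa_q keeps 2 tokens and 1 head, with head dim 4 + 2 = 6
```

The asserts fail fast on a token-count or head-count mismatch, which is exactly the kind of shape bug the follow-up issue in this thread is about.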
Contributor

@kebe7jun kebe7jun Feb 5, 2026


Hi @MatthewBonanni This may cause the issue: #33859

Collaborator Author


Thanks for catching this! Here's a fix: #33932

wangxiyuan added a commit to vllm-project/vllm-ascend that referenced this pull request Feb 5, 2026
### What this PR does / why we need it?
1. Fix `TypeError: FusedMoEParallelConfig.__init__() missing 1 required positional argument: 'is_sequence_parallel'` due to vllm-project/vllm#32567
2. Fix `TypeError: '>' not supported between instances of 'MagicMock' and 'int'` due to vllm-project/vllm#33035
3. Fix `TypeError: Can't instantiate abstract class AscendMLAImpl with abstract methods forward_mha, forward_mqa` and `AttributeError: 'bool' object has no attribute 'process_weights_after_loading'` due to vllm-project/vllm#33284
4. Fix `'AscendSharedFusedMoE' object has no attribute '_routed_input_transform'` due to vllm-project/vllm#32790
5. Fix `NPUModelRunner._dummy_run() got an unexpected keyword argument 'num_active_loras'` due to vllm-project/vllm#32005
6. Fix `'tuple' object has no attribute 'job_id'` due to vllm-project/vllm#27492
7. Fix all_moe_layers not matching vllm.moe_forward / vllm.moe_forward_shared due to vllm-project/vllm#33184
8. Add a patch to fix "got multiple values for keyword argument 'add_special_tokens'" due to vllm-project/vllm#32863
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Feb 12, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels

- nvidia
- ready (ONLY add when PR is ready to merge/full CI is needed)
- rocm (Related to AMD ROCm)
- v1

Projects

Status: Done (NVIDIA)
Status: Done (AMD)

Development

Successfully merging this pull request may close these issues.

7 participants