[Attention] Mamba attention module refactor - LINEAR by wangxiyuan · Pull Request #43556 · vllm-project/vllm

wangxiyuan · 2026-05-25T03:02:01Z

Purpose

Follow-up to #41126.

This is the second PR in the Mamba attention module refactor.

This PR merges BailingMoELinearAttention and MiniMaxText01LinearAttention into model_executor/layers/mamba/linear.

After this PR:

| Model | mamba type | pluggable | location | Used by |
|-------|------------|-----------|----------|---------|
| BailingMoELinearAttention | linear_attention | Yes | model_executor/layers/mamba/gdn/bailing_linear_attn.py | BailingMoeV25ForCausalLM |
| MiniMaxText01LinearAttention | linear_attention | Yes | model_executor/layers/mamba/gdn/minimax_linear_attn.py | MiniMaxText01ForCausalLM |

Test Plan

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

gemini-code-assist

Code Review

This pull request refactors the Mamba linear attention implementation by introducing a LinearAttention base class and reorganizing the module structure. It updates the BailingMoELinearAttention and MiniMaxText01LinearAttention layers to use a unified vllm_config for initialization and consolidates shared logic. A review comment highlights that BailingMoELinearAttention accesses self.kv_cache without explicit initialization in its constructor, which may cause AttributeError in standalone contexts or unit tests where the engine's injection mechanism is not present.

gemini-code-assist · 2026-05-25T03:07:38Z

+
+        # Get KV cache and state indices
+        if attn_metadata is not None:
+            kv_cache = self.kv_cache[0]


The attribute self.kv_cache is accessed here but it is not explicitly initialized in the __init__ method of BailingMoELinearAttention or its base class LinearAttention. While vLLM's engine typically injects this attribute for MambaBase layers, relying on implicit injection without a fallback or a clear definition in the class hierarchy can lead to AttributeError during initialization or in environments where the engine's injection mechanism is bypassed (e.g., unit tests or standalone usage).

@gemini-code-assist this code is copied unchanged from bailing_moe_linear.py to bailing_linear_attn.py, so I don't think your review suggestion relates to the changes in this PR.

The self.kv_cache attribute is indeed expected to be managed by the vLLM engine for layers inheriting from MambaBase or using the LinearAttention interface. Since this PR refactors existing code from bailing_moe_linear.py to bailing_linear_attn.py, the assumption is that the engine's injection mechanism remains consistent with the previous implementation. If you are encountering issues in specific environments (like unit tests), ensure that the forward_context is properly initialized or that the layer is being invoked within the expected vLLM execution pipeline.
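The concern in this thread can be illustrated with a minimal sketch. The classes below are hypothetical stand-ins and are not vLLM's `LinearAttention` or `MambaBase`; they only show how an attribute that is bound from outside behaves when nothing binds it, and one possible guard (an explicit placeholder plus a clearer error), assuming the owner of the layer assigns `kv_cache` before the first forward pass.

```python
class InjectedCacheLayer:
    """Hypothetical layer whose cache is attached by an external engine."""

    def __init__(self) -> None:
        # Deliberately no self.kv_cache here, mirroring the pattern under review.
        self.name = "linear_attn"

    def forward(self, attn_metadata):
        if attn_metadata is not None:
            # Raises AttributeError if nothing has assigned kv_cache yet.
            return self.kv_cache[0]
        return None


class GuardedCacheLayer(InjectedCacheLayer):
    """Same layer with an explicit placeholder and a descriptive failure."""

    def __init__(self) -> None:
        super().__init__()
        self.kv_cache = None  # placeholder until something binds the real cache

    def forward(self, attn_metadata):
        if attn_metadata is not None and self.kv_cache is None:
            raise RuntimeError("kv_cache was not bound before forward()")
        return super().forward(attn_metadata)


def try_forward(layer):
    """Run forward with non-None metadata; return the result or the error name."""
    try:
        return layer.forward(attn_metadata={})
    except (AttributeError, RuntimeError) as exc:
        return type(exc).__name__


if __name__ == "__main__":
    print(try_forward(InjectedCacheLayer()))  # unbound attribute
    print(try_forward(GuardedCacheLayer()))   # guarded, still unbound

    bound = GuardedCacheLayer()
    bound.kv_cache = ["state"]  # stands in for the external injection step
    print(try_forward(bound))
```

Whether such a guard belongs in this PR is a separate question; as the author notes, the code was moved rather than rewritten.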

ZJY0516

LGTM. Could you also provide GPU test results, just in case?

wangxiyuan · 2026-05-26T08:05:42Z

Sure, I'll post the results later.

tjtanaa · 2026-05-26T14:46:42Z

There seems to be an accuracy regression on this branch compared with main (445ded18c1184a5a44d0f41010d614adbd107ca7).

This branch

local-completions ({'model': 'MiniMaxAI/MiniMax-M2.5', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 256, 'max_retries': 10, 'max_gen_toks': 2048, 'max_length': 1048576, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|_  |0.8976|_  |0.0083|
|     |       |strict-match    |     5|exact_match|_  |0.8916|_  |0.0086|

main 445ded18c1184a5a44d0f41010d614adbd107ca7

local-completions ({'model': 'MiniMaxAI/MiniMax-M2.5', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 256, 'max_retries': 10, 'max_gen_toks': 2048, 'max_length': 1048576, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|_  |0.9249|_  |0.0073|
|     |       |strict-match    |     5|exact_match|_  |0.9166|_  |0.0076|

tjtanaa · 2026-05-26T14:56:06Z

The accuracy is fine after syncing your branch with main.

local-completions ({'model': 'MiniMaxAI/MiniMax-M2.5', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 256, 'max_retries': 10, 'max_gen_toks': 2048, 'max_length': 1048576, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|_  |0.9204|_  |0.0075|
|     |       |strict-match    |     5|exact_match|_  |0.9151|_  |0.0077|

@wangxiyuan please rebase your PR onto main before starting any testing, then provide the accuracy scores.
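To put the three flexible-extract scores above in perspective, here is a rough sketch that expresses each gap in units of the combined standard error. It assumes the two runs are independent samples, which is only an approximation because both runs score the same GSM8K questions; the function name `z_score` is illustrative and not part of lm-eval. Under that assumption the stale branch sits roughly 2.5 combined standard errors below main, while the synced branch is within about half of one.

```python
import math


def z_score(acc_a: float, se_a: float, acc_b: float, se_b: float) -> float:
    """Difference of two accuracies divided by their combined standard error.

    Assumes independent runs; paired runs on the same questions would
    generally call for a paired test instead.
    """
    return (acc_a - acc_b) / math.sqrt(se_a**2 + se_b**2)


# (accuracy, stderr) for gsm8k flexible-extract, copied from the tables above.
main = (0.9249, 0.0073)
stale_branch = (0.8976, 0.0083)   # PR branch before syncing with main
synced_branch = (0.9204, 0.0075)  # PR branch after syncing with main

z_stale = z_score(*main, *stale_branch)
z_synced = z_score(*main, *synced_branch)

if __name__ == "__main__":
    print(f"main vs stale branch:  z = {z_stale:.2f}")
    print(f"main vs synced branch: z = {z_synced:.2f}")
```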

mergify · 2026-05-27T00:26:50Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wangxiyuan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-05-27T09:18:46Z

Hi @wangxiyuan, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

wangxiyuan · 2026-05-29T01:02:19Z

@tjtanaa @ZJY0516 Tested on A100 for accuracy with GSM8K (5-shot).

Ling-2.6-flash

| Filter | PR #43556 | main | Δ (main − PR) |
|--------|----------:|-----:|--------------:|
| flexible-extract | 0.7991 | 0.8044 | +0.0053 |
| strict-match | 0.7703 | 0.7771 | +0.0068 |

MiniMax-M2.5

| Filter | PR #43556 | main | Δ (main − PR) |
|--------|----------:|-----:|--------------:|
| flexible-extract | 0.9249 | 0.9249 | 0.0000 |
| strict-match | 0.9227 | 0.9219 | −0.0008 |
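The tables above do not report a standard error, so the sketch below estimates one from the binomial formula sqrt(p(1 − p)/n). It assumes the runs covered the full GSM8K test split of 1,319 questions, which this comment does not state; the helper name `binomial_stderr` is illustrative. With that assumption, each reported delta is smaller than one estimated standard error.

```python
import math

GSM8K_TEST_SIZE = 1319  # assumption: the full GSM8K test split was evaluated


def binomial_stderr(acc: float, n: int = GSM8K_TEST_SIZE) -> float:
    """Estimated standard error of an accuracy measured on n questions."""
    return math.sqrt(acc * (1.0 - acc) / n)


# (PR #43556, main) accuracies copied from the tables above.
results = {
    ("Ling-2.6-flash", "flexible-extract"): (0.7991, 0.8044),
    ("Ling-2.6-flash", "strict-match"): (0.7703, 0.7771),
    ("MiniMax-M2.5", "flexible-extract"): (0.9249, 0.9249),
    ("MiniMax-M2.5", "strict-match"): (0.9227, 0.9219),
}

if __name__ == "__main__":
    for (model, filt), (pr_acc, main_acc) in results.items():
        delta = main_acc - pr_acc
        stderr = binomial_stderr(main_acc)
        print(f"{model:15s} {filt:17s} delta={delta:+.4f} est. stderr={stderr:.4f}")
```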

tjtanaa

LGTM. Let's get @ZJY0516 final approval.

viiccwen · 2026-06-05T05:47:46Z

Hello @wangxiyuan @ZJY0516, I found an issue with an API return-type docstring and have already opened an issue and a PR.
Please take a look when you have time, thanks! :)

wangxiyuan requested review from WoosukKwon, mgoin, tdoublep, tlrmchlsmth, tomeras91, yewentao256 and zyongye as code owners May 25, 2026 03:02

mergify Bot added the v1 label May 25, 2026

gemini-code-assist Bot reviewed May 25, 2026


ZJY0516 approved these changes May 26, 2026


mergify Bot added the needs-rebase label May 27, 2026

wangxiyuan force-pushed the mamba_refactor_2 branch from e845176 to 6556d9a Compare May 27, 2026 08:56

mergify Bot removed the needs-rebase label May 27, 2026

[Attention] Mamba attention module refactor - LINEAR

44eb3f9

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

wangxiyuan force-pushed the mamba_refactor_2 branch from 6556d9a to 44eb3f9 Compare May 27, 2026 09:30

tjtanaa approved these changes May 29, 2026


ZJY0516 approved these changes May 29, 2026


ZJY0516 added the ready label (ONLY add when PR is ready to merge/full CI is needed) May 29, 2026

mergify Bot and others added 6 commits May 29, 2026 02:31

Merge branch 'main' into mamba_refactor_2

9e23d9b

Merge branch 'main' into mamba_refactor_2

13cc714

Merge branch 'main' into mamba_refactor_2

ab99efb

Merge branch 'main' into mamba_refactor_2

1656d09

Merge branch 'main' into mamba_refactor_2

6429d11

Merge branch 'main' into mamba_refactor_2

89a85ee

ZJY0516 merged commit 9061935 into vllm-project:main Jun 4, 2026
63 checks passed
