[Attention] Mamba attention module refactor - LINEAR#43556

Merged
ZJY0516 merged 7 commits into vllm-project:main from wangxiyuan:mamba_refactor_2
Jun 4, 2026

Contributor

@wangxiyuan wangxiyuan commented May 25, 2026

Purpose

Follows #41126.

This is the second PR in the Mamba attention module refactor.

This PR merges BailingMoELinearAttention and MiniMaxText01LinearAttention into model_executor/layers/mamba/linear.

After this PR:

| Model | mamba type | pluggable | location | Used by |
|-------|------------|-----------|----------|---------|
| BailingMoELinearAttention | linear_attention | Yes | model_executor/layers/mamba/gdn/bailing_linear_attn.py | BailingMoeV25ForCausalLM |
| MiniMaxText01LinearAttention | linear_attention | Yes | model_executor/layers/mamba/gdn/minimax_linear_attn.py | MiniMaxText01ForCausalLM |

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the Mamba linear attention implementation by introducing a LinearAttention base class and reorganizing the module structure. It updates the BailingMoELinearAttention and MiniMaxText01LinearAttention layers to use a unified vllm_config for initialization and consolidates shared logic. A review comment highlights that BailingMoELinearAttention accesses self.kv_cache without explicit initialization in its constructor, which may cause AttributeError in standalone contexts or unit tests where the engine's injection mechanism is not present.


# Get KV cache and state indices
if attn_metadata is not None:
    kv_cache = self.kv_cache[0]
high

The attribute self.kv_cache is accessed here but it is not explicitly initialized in the __init__ method of BailingMoELinearAttention or its base class LinearAttention. While vLLM's engine typically injects this attribute for MambaBase layers, relying on implicit injection without a fallback or a clear definition in the class hierarchy can lead to AttributeError during initialization or in environments where the engine's injection mechanism is bypassed (e.g., unit tests or standalone usage).
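The failure mode described in this comment can be reproduced without vLLM. The sketch below is only an illustration: `EngineManagedLayer` and `GuardedLayer` are hypothetical stand-ins, not vLLM classes, and the guard is one possible way to surface a clearer error, not the approach taken in this PR.

```python
from typing import Optional


class EngineManagedLayer:
    """Hypothetical layer whose kv_cache is attached later by an engine."""

    def __init__(self) -> None:
        # Deliberately does NOT define self.kv_cache, mirroring the concern.
        self.name = "toy_linear_attn"

    def forward(self, attn_metadata: Optional[object]) -> str:
        if attn_metadata is not None:
            # Raises AttributeError if nothing injected kv_cache beforehand.
            kv_cache = self.kv_cache[0]
            return f"using cache {kv_cache}"
        return "no metadata, cache not touched"


class GuardedLayer(EngineManagedLayer):
    """Same layer, but fails with an explicit message when uninjected."""

    def forward(self, attn_metadata: Optional[object]) -> str:
        if attn_metadata is not None and not hasattr(self, "kv_cache"):
            raise RuntimeError(
                f"{self.name}: kv_cache was never injected; "
                "run the layer inside the engine or set it in the test."
            )
        return super().forward(attn_metadata)


def call_outcome(layer: EngineManagedLayer) -> str:
    """Return the result of forward(), or the name of the exception raised."""
    try:
        return layer.forward(attn_metadata=object())
    except (AttributeError, RuntimeError) as exc:
        return type(exc).__name__
```

In a unit test, assigning `layer.kv_cache = ["state0"]` before calling `forward` plays the role of the engine's injection.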

Contributor Author

@gemini-code-assist this is just code copied from bailing_moe_linear.py to bailing_linear_attn.py. I don't think your review suggestion relates to the changes in this PR.

Contributor

The self.kv_cache attribute is indeed expected to be managed by the vLLM engine for layers inheriting from MambaBase or using the LinearAttention interface. Since this PR refactors existing code from bailing_moe_linear.py to bailing_linear_attn.py, the assumption is that the engine's injection mechanism remains consistent with the previous implementation. If you are encountering issues in specific environments (like unit tests), ensure that the forward_context is properly initialized or that the layer is being invoked within the expected vLLM execution pipeline.

Member

@ZJY0516 ZJY0516 left a comment

LGTM. Could you also provide GPU test results, just in case?

@wangxiyuan
Contributor Author

Sure, I'll post the results later.

@tjtanaa
Member

tjtanaa commented May 26, 2026

There seems to be an accuracy regression on this branch compared with main 445ded18c1184a5a44d0f41010d614adbd107ca7

This branch

local-completions ({'model': 'MiniMaxAI/MiniMax-M2.5', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 256, 'max_retries': 10, 'max_gen_toks': 2048, 'max_length': 1048576, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|_  |0.8976|_  |0.0083|
|     |       |strict-match    |     5|exact_match|_  |0.8916|_  |0.0086|

main 445ded18c1184a5a44d0f41010d614adbd107ca7

local-completions ({'model': 'MiniMaxAI/MiniMax-M2.5', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 256, 'max_retries': 10, 'max_gen_toks': 2048, 'max_length': 1048576, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|_  |0.9249|_  |0.0073|
|     |       |strict-match    |     5|exact_match|_  |0.9166|_  |0.0076|
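One rough way to judge whether the gap between the two runs exceeds evaluation noise is a z statistic built from the reported values and standard errors. The sketch below assumes the two estimates are independent, which is only an approximation here because both runs score the same GSM8K test set; the helper name `z_score` is made up for illustration.

```python
import math


def z_score(value_a: float, stderr_a: float,
            value_b: float, stderr_b: float) -> float:
    """z statistic for the difference of two (assumed independent) estimates."""
    combined_stderr = math.sqrt(stderr_a ** 2 + stderr_b ** 2)
    return (value_a - value_b) / combined_stderr


# exact_match values and stderrs copied from the two tables: main first, branch second.
z_flexible = z_score(0.9249, 0.0073, 0.8976, 0.0083)
z_strict = z_score(0.9166, 0.0076, 0.8916, 0.0086)

print(f"flexible-extract z = {z_flexible:.2f}")
print(f"strict-match     z = {z_strict:.2f}")
```

Both statistics come out above 2, so under this approximation the drop is larger than run-to-run noise alone would usually explain, which is consistent with treating it as worth investigating.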

@tjtanaa
Member

tjtanaa commented May 26, 2026

The accuracy is fine after syncing your branch with main

local-completions ({'model': 'MiniMaxAI/MiniMax-M2.5', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 256, 'max_retries': 10, 'max_gen_toks': 2048, 'max_length': 1048576, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|_  |0.9204|_  |0.0075|
|     |       |strict-match    |     5|exact_match|_  |0.9151|_  |0.0077|

@wangxiyuan please rebase your PR onto main before starting any testing, then provide the accuracy scores.

@mergify
Contributor

mergify Bot commented May 27, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wangxiyuan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify
Contributor

mergify Bot commented May 27, 2026

Hi @wangxiyuan, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
@wangxiyuan
Contributor Author

wangxiyuan commented May 29, 2026

@tjtanaa @ZJY0516 I tested accuracy on A100 with GSM8K (5-shot):

Ling-2.6-flash

| Filter | PR #43556 | main | Δ (main − PR) |
|--------|----------:|-----:|--------------:|
| flexible-extract | 0.7991 | 0.8044 | +0.0053 |
| strict-match | 0.7703 | 0.7771 | +0.0068 |

MiniMax-M2.5

| Filter | PR #43556 | main | Δ (main − PR) |
|--------|----------:|-----:|--------------:|
| flexible-extract | 0.9249 | 0.9249 | 0.0000 |
| strict-match | 0.9227 | 0.9219 | −0.0008 |
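The Δ column can be recomputed and checked against a tolerance with a few lines of Python. This is a sketch only: the scores are copied from the tables above, while the `regressions` helper and the 0.01 tolerance are hypothetical and not a threshold used by the project.

```python
# (PR score, main score) per model and filter, copied from the tables above.
results = {
    "Ling-2.6-flash": {
        "flexible-extract": (0.7991, 0.8044),
        "strict-match": (0.7703, 0.7771),
    },
    "MiniMax-M2.5": {
        "flexible-extract": (0.9249, 0.9249),
        "strict-match": (0.9227, 0.9219),
    },
}


def regressions(scores: dict, tolerance: float) -> list:
    """List (model, filter, delta) where main beats the PR by more than tolerance."""
    flagged = []
    for model, filters in scores.items():
        for filter_name, (pr_score, main_score) in filters.items():
            delta = round(main_score - pr_score, 4)  # same sign convention as the table
            if delta > tolerance:
                flagged.append((model, filter_name, delta))
    return flagged


# Hypothetical tolerance of one accuracy point.
flagged = regressions(results, tolerance=0.01)
print(flagged if flagged else "no drop beyond tolerance")
```

With this tolerance none of the four rows is flagged; tightening it to 0.005 would flag both Ling-2.6-flash rows.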

Member

@tjtanaa tjtanaa left a comment

LGTM. Let's get @ZJY0516's final approval.

@ZJY0516 ZJY0516 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) May 29, 2026
@ZJY0516 ZJY0516 merged commit 9061935 into vllm-project:main Jun 4, 2026
63 checks passed
@viiccwen
Contributor

viiccwen commented Jun 5, 2026

Hello @wangxiyuan @ZJY0516, I found an API return-type docstring issue and have already opened an issue and a PR.
Please take a look when you have time, thanks! :)

Labels

ready, v1


4 participants