[Attention] Mamba attention module refactor - LINEAR#43556
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the Mamba linear attention implementation by introducing a LinearAttention base class and reorganizing the module structure. It updates the BailingMoELinearAttention and MiniMaxText01LinearAttention layers to use a unified vllm_config for initialization and consolidates shared logic. A review comment highlights that BailingMoELinearAttention accesses self.kv_cache without explicit initialization in its constructor, which may cause AttributeError in standalone contexts or unit tests where the engine's injection mechanism is not present.
|
|
||
| # Get KV cache and state indices | ||
| if attn_metadata is not None: | ||
| kv_cache = self.kv_cache[0] |
There was a problem hiding this comment.
The attribute self.kv_cache is accessed here but it is not explicitly initialized in the __init__ method of BailingMoELinearAttention or its base class LinearAttention. While vLLM's engine typically injects this attribute for MambaBase layers, relying on implicit injection without a fallback or a clear definition in the class hierarchy can lead to AttributeError during initialization or in environments where the engine's injection mechanism is bypassed (e.g., unit tests or standalone usage).
There was a problem hiding this comment.
@gemini-code-assist this is just a code copy from bailing_moe_linear.py to bailing_linear_attn.py. I think your review suggestion doesn't relate to this PR change.
There was a problem hiding this comment.
The self.kv_cache attribute is indeed expected to be managed by the vLLM engine for layers inheriting from MambaBase or using the LinearAttention interface. Since this PR refactors existing code from bailing_moe_linear.py to bailing_linear_attn.py, the assumption is that the engine's injection mechanism remains consistent with the previous implementation. If you are encountering issues in specific environments (like unit tests), ensure that the forward_context is properly initialized or that the layer is being invoked within the expected vLLM execution pipeline.
ZJY0516
left a comment
There was a problem hiding this comment.
LGTM. Could you also provide GPU test result, just in case
|
sure,I'll post the result later |
|
There seems to have accuracy regression on this branch compared with main This branch main |
|
The accuracy is fine after sync your branch with main @wangxiyuan please rebase your PR with main then only start any testing. Please provide the accuracy scores. |
|
This pull request has merge conflicts that must be resolved before it can be |
e845176 to
6556d9a
Compare
|
Hi @wangxiyuan, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
6556d9a to
44eb3f9
Compare
|
@tjtanaa @ZJY0516 test on A100 for accuracy with GSM8K(5-shot) Ling-2.6-flash
MiniMax-M2.5
|
|
Hello @wangxiyuan @ZJY0516, I found API return-type docstring issue, and already opened issue and PR! |
Purpose
following #41126
This is the 2nd PR for mamba attention module refactor.
This PR merge
BailingMoELinearAttentionandMiniMaxText01LinearAttentionintomodel_executor/layers/mamba/linear.After this PR:
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.