[Mamba2] Fix caching, slow path, and multi-gpu
#35154
Conversation
vasqu left a comment
Just some comments for clarification
# Only left padding is valid
attention_mask = torch.ones(size=(self.batch_size, self.seq_length), device=input_ids.device, dtype=torch.long)
attention_mask[0, :1] = 0
Added a mask, maybe for some other tests as well.
Alright, is it intended that it is only masked out for the first element of the batch?
Tbh, that was pretty willy-nilly on my part; it could definitely be changed, I just wanted to debug and see if stuff works
…generate (gives total ids + mask at each step)
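As an aside, a minimal sketch of the alternative being discussed above, i.e. masking a left-padded position for every row of the batch instead of only the first one (shapes and sizes here are illustrative, not taken from the actual test):

```python
import torch

# Illustrative only: one left-padded position per sequence for the whole batch,
# rather than masking a position only in the first batch element.
batch_size, seq_length = 4, 8
attention_mask = torch.ones(batch_size, seq_length, dtype=torch.long)
attention_mask[:, :1] = 0  # zero out the first (left) position for all rows
```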
vasqu left a comment
Some more comments for the cache
Integration tests will probably need an update but I don't have a GPU for the 7B atm. Edit: If you could update these integration tests, then gladly :D especially since I'm on vacay very soon
Hey @vasqu thanks! Taking a look in a min
molbap left a comment
Hey @vasqu, thanks a bunch! Left a couple of questions/comments, but looks good.
(batch_size, self.num_heads, self.head_dim, self.ssm_state_size),
device=hidden_states.device, dtype=dtype
# 2. Convolution sequence transformation
if cache_params is not None and cache_position is not None and cache_position[0] > 0:
currently will break torch compile, FWIW
I think the triton kernels themselves are not easy to compile atm either way, but this should definitely be handled properly in the future. FYI, you would need to register fake ops for torch to make it work properly, which would entail some separate mamba2 utils for the kernel - see https://github.com/facebookresearch/lingua/tree/main/apps/mamba/component
Yeah, not actionable immediately, but would be nice to have in the near future! Thanks
Definitely! Would love to see it :)
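For context, here is a minimal sketch of the fake-op registration mentioned in this thread, assuming the torch.library API available in PyTorch ≥ 2.4; the op name, signature, and placeholder body are illustrative and not the actual mamba2 kernels:

```python
import torch
from torch.library import custom_op

# Illustrative custom op wrapping an opaque (e.g. Triton-backed) kernel.
# The real implementation would call the Triton kernel inside the body.
@custom_op("mamba2_utils::chunk_scan", mutates_args=())
def mamba2_chunk_scan(hidden_states: torch.Tensor, dt: torch.Tensor) -> torch.Tensor:
    return hidden_states * dt.unsqueeze(-1)  # placeholder computation

# Fake ("meta") implementation: only describes the output shape/dtype so that
# torch.compile can trace through the op without running the real kernel.
@mamba2_chunk_scan.register_fake
def _(hidden_states, dt):
    return torch.empty_like(hidden_states)
```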
batch_size, seq_len, _ = input_states.shape
dtype = input_states.dtype
# Gated MLP's linear projection
projected_states = self.in_proj(input_states.squeeze(1))
I'm not sure about this - it seems improvable, yes, but the squeeze is a no-op unless seq_len == 1, so indeed only in the caching situation. In that case we end up with a [batch_size, H] tensor instead of a [batch_size, seq_len, H] tensor. Then, since we split on the last dimension, it should be fine.
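To illustrate the point about the squeeze (a throwaway example with made-up shapes):

```python
import torch

# squeeze(1) only drops the sequence dimension when seq_len == 1,
# i.e. during cached single-step decoding.
prefill = torch.randn(2, 5, 16)  # (batch_size, seq_len, hidden_size)
decode = torch.randn(2, 1, 16)   # cached decoding step

print(prefill.squeeze(1).shape)  # torch.Size([2, 5, 16]) -- no-op
print(decode.squeeze(1).shape)   # torch.Size([2, 16])    -- dim removed
```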
input_states = remove_padding_influence(input_states, attention_mask)
projected_states = self.in_proj(input_states)
d_mlp = (projected_states.shape[-1] - 2 * self.intermediate_size - 2 * self.n_groups * self.ssm_state_size - self.num_heads) // 2
_, _, gate, hidden_states_B_C, dt = projected_states.split(
Nice, now it's aligned with the cuda kernel forward in naming. TBH, the whole split is the same for cuda and torch, so it could be factored out?
Sounds like a refactor :D I'd leave this to a separate PR and focus on making things work first.
yeah for sure!
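For illustration, a hypothetical sketch of what such a shared split helper could look like; the function name is invented and the split sizes are inferred from the diff above rather than copied from the module:

```python
def split_projected_states(projected_states, intermediate_size, n_groups, ssm_state_size, num_heads):
    # conv_dim and d_mlp mirror the dimension arithmetic shown in the diff above.
    conv_dim = intermediate_size + 2 * n_groups * ssm_state_size
    d_mlp = (projected_states.shape[-1] - 2 * intermediate_size
             - 2 * n_groups * ssm_state_size - num_heads) // 2
    _, _, gate, hidden_states_B_C, dt = projected_states.split(
        [d_mlp, d_mlp, intermediate_size, conv_dim, num_heads], dim=-1
    )
    return gate, hidden_states_B_C, dt
```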
Also, I added the slow label - feel free to launch a commit with the message "[run-slow] mamba2" so we can trigger the slow CI! That way we make sure multi-gpu is indeed fixed.
@molbap Yup, added an empty commit - I'll get to the comments/review a bit later 🫡 (I'd expect some failures on the integration tests, not sure, let's see)
Attempt 2 at multi gpu, at least a different error :p
Things that remain:
Otherwise, ready to go @molbap. Edit: the Hub seems to have some unrelated issues.
Title changed: [Mamba2] Fix Cache and several other small issues → [Mamba2] Fix caching, slow path, and multi-gpu
Hey 👋 I don't need direct credit, I just think that the list given in the docstring is misleading: transformers/src/transformers/models/bamba/modular_bamba.py, lines 214 to 218 at 667ed56.
The changes are mainly because of the cache + dropping some attributes.
ArthurZucker left a comment
d_mlp = (
    projected_states.shape[-1]
    - 2 * self.intermediate_size
    - 2 * self.n_groups * self.ssm_state_size
    - self.num_heads
) // 2
this is less readable, but I mean mamba in general is hard to read 😄
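For what it's worth, one hypothetical way to make that dimension arithmetic easier to follow is with named intermediates (the helper and its names are illustrative, not part of the PR):

```python
def compute_d_mlp(projection_dim, intermediate_size, n_groups, ssm_state_size, num_heads):
    # Everything in the projection that does not belong to the two MLP halves.
    groups_time_state_size = n_groups * ssm_state_size
    non_mlp_dim = 2 * intermediate_size + 2 * groups_time_state_size + num_heads
    return (projection_dim - non_mlp_dim) // 2
```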
What does this PR do?
Kind of a follow-up to [Mamba2] Fix slow path (#34901), as there are some issues in the current code.

Fixes #33567
Fixes #34817
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@molbap @ArthurZucker