[Model][BugFix] Mamba/Jamba exceed mamba cache slots by mzusman · Pull Request #11414 · vllm-project/vllm

mzusman · 2024-12-22T13:50:02Z

In recent vLLM versions (since v0.6.4), the number of running requests is no longer capped by scheduler_config.max_num_seqs, which causes an issue for Jamba/Mamba models under high load.
Mamba models keep their state inside the modeling file and define the maximum number of running sequences (slots) as max_num_seqs. Exceeding this number of slots raises an error that crashes vLLM.
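
To make the failure mode concrete, here is a minimal sketch of a fixed-size slot cache. The class `FixedSlotStateCache` and its methods are hypothetical and are not vLLM's Mamba cache manager; they only illustrate how a pool sized for max_num_seqs fails once more sequences than slots are running at the same time.

```python
class FixedSlotStateCache:
    """Toy stand-in for a per-sequence state cache with a fixed slot count.

    Hypothetical illustration only; this is not vLLM's implementation.
    """

    def __init__(self, num_slots: int) -> None:
        self.free_slots = list(range(num_slots))
        self.slot_of_request: dict[str, int] = {}

    def assign(self, request_id: str) -> int:
        """Return the slot for a request, claiming a free one if needed."""
        if request_id in self.slot_of_request:
            return self.slot_of_request[request_id]
        if not self.free_slots:
            # More running requests than slots: nothing left to hand out.
            raise RuntimeError("no free state slots left")
        slot = self.free_slots.pop()
        self.slot_of_request[request_id] = slot
        return slot

    def release(self, request_id: str) -> None:
        """Give a finished request's slot back to the pool."""
        slot = self.slot_of_request.pop(request_id)
        self.free_slots.append(slot)
```

With two slots, a third concurrent request raises, and it succeeds again only after an earlier request releases its slot.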

To solve this, I've introduced an environment variable that multiplies the number of Mamba cache slots by a configurable factor (x1.5 by default).
I chose x1.5 as the default because it was sufficient in my tests.
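
As a rough sketch of the idea described above, the slot count could be derived from max_num_seqs and a multiplier read from the environment. The variable name `HYPOTHETICAL_MAMBA_SLOTS_MULTIPLIER` and the function `mamba_cache_slots` are assumptions made for illustration; the PR description does not name the actual environment variable.

```python
import math
import os
from collections.abc import Mapping

# Hypothetical name; the real environment variable is not given in the PR text.
ENV_NAME = "HYPOTHETICAL_MAMBA_SLOTS_MULTIPLIER"


def mamba_cache_slots(max_num_seqs: int,
                      env: Mapping[str, str] = os.environ) -> int:
    """Return the number of cache slots to allocate.

    The multiplier defaults to 1.5, matching the default in the PR description.
    """
    multiplier = float(env.get(ENV_NAME, "1.5"))
    if multiplier < 1.0:
        # Fewer slots than max_num_seqs would defeat the purpose of the headroom.
        raise ValueError(f"{ENV_NAME} must be >= 1.0, got {multiplier}")
    # Round up so a fractional result never loses a slot.
    return math.ceil(max_num_seqs * multiplier)
```

For example, `mamba_cache_slots(256, {})` yields 384 slots with the default multiplier. Note that this only adds headroom; it does not bound how many requests can actually be running.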

CC @tlrmchlsmth

causing an error inside the mamba cache manager; setting the max num seqs to twice the max batch size ensures that new requests will have spare space in the mamba cache manager

Signed-off-by: mzusman <mor.zusmann@gmail.com>

github-actions · 2024-12-22T13:50:13Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

- Add ready label to the PR
- Enable auto-merge.

🚀

tlrmchlsmth

I think we should find a robust solution to this issue. Could you describe the situation that causes this in a bit more detail (e.g. does it only happen with multistep, or when running chunked prefill)?

Do you have a command to repro this problem? Knowing the commit where this started happening would help nail down exactly what's going on.

mergify · 2024-12-29T19:15:11Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mzusman.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

sfc-gh-aqiao · 2025-02-13T18:11:55Z

> In recent vLLM versions (since v0.6.4) running requests are not capped anymore by scheduler_config.max_num_seqs

Dumb question, does this mean vLLM no longer respects --max-num-seqs?

tlrmchlsmth · 2025-02-13T22:34:00Z

I just noticed that the code to determine the max batch size (used for allocating the Mamba state) is a little sketchy when CUDA graphs are used:

vllm/vllm/model_executor/models/mamba.py

Lines 205 to 213 in bffddd9

    
    if self.scheduler_config is not None and \
        not self.model_config.enforce_eager:
        if self.scheduler_config.max_num_seqs > \
            vllm_config.compilation_config.max_capture_size:
            self.max_batch_size = \
                vllm_config.compilation_config.max_capture_size
        else:
            self.max_batch_size = vllm_config.pad_for_cudagraph(
                self.scheduler_config.max_num_seqs)

Could this be the problem? (I've simplified it in my Mamba2 support PR, #9292.)
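
One way to read this concern is sketched below as a standalone function that mirrors only the CUDA-graph branch of the quoted snippet. `pad_to_multiple` is a hypothetical stand-in for `vllm_config.pad_for_cudagraph` (its real padding rule is not shown in this thread), and the eager-mode path is deliberately left out because the snippet does not show it.

```python
def pad_to_multiple(batch_size: int, multiple: int = 8) -> int:
    """Hypothetical stand-in for vllm_config.pad_for_cudagraph.

    Rounds up to the next multiple; the real padding rule may differ.
    """
    return -(-batch_size // multiple) * multiple


def cudagraph_max_batch_size(max_num_seqs: int, max_capture_size: int) -> int:
    """Mirror the quoted branch taken when enforce_eager is False."""
    if max_num_seqs > max_capture_size:
        # State is sized by the capture limit, not by max_num_seqs.
        return max_capture_size
    return pad_to_multiple(max_num_seqs)


# With max_num_seqs above the capture size, fewer slots than
# max_num_seqs are allocated.
slots = cudagraph_max_batch_size(max_num_seqs=512, max_capture_size=256)
print(slots < 512)
```

Under these assumptions, a configuration with max_num_seqs=512 and a capture size of 256 would allocate state for only 256 sequences, so the allocation can fall below max_num_seqs. Whether this is the actual cause of the reported crash is not settled in the thread.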

mzusman added 3 commits December 22, 2024 14:50

Format

48efd2f

Signed-off-by: mzusman <mor.zusmann@gmail.com>

Introduce this variable through envar

7d2fef5

Signed-off-by: mzusman <mor.zusmann@gmail.com>

DarkLight1337 requested a review from tlrmchlsmth December 23, 2024 05:58

tlrmchlsmth reviewed Dec 29, 2024


mergify bot added the needs-rebase label Dec 29, 2024

mzusman closed this Feb 11, 2025

mzusman wants to merge 3 commits into vllm-project:main from mzusman:max_batch_size_jamba_fix
