[Model][BugFix] Mamba/Jamba exceed mamba cache slots#11414
mzusman wants to merge 3 commits into vllm-project:main
causing an error inside the mamba cache manager. Setting the max num seqs to twice the max batch size ensures that new requests will have spare space in the mamba cache manager.
Signed-off-by: mzusman <mor.zusmann@gmail.com>
tlrmchlsmth left a comment:
I think we should find a robust solution to this issue. Could you describe the situation that causes this in a bit more detail (e.g. does it only happen with multistep, or when running chunked prefill)?
Do you have a command to repro this problem? The commit where this started happening would help nail down exactly what's going on.
Dumb question, does this mean vLLM no longer respects |
I just noticed that the code to determine the max batch size (used for allocating the Mamba state) is a little sketchy when CUDA graphs are used: vllm/vllm/model_executor/models/mamba.py, lines 205 to 213 in bffddd9. Could this be the problem? (I've simplified it in my Mamba2 support PR, #9292.)
In recent vLLM versions (since v0.6.4), the number of running requests is no longer capped by scheduler_config.max_num_seqs, which causes an issue with Jamba/Mamba models under high load.
Mamba models keep a state inside the modeling file and define the maximum number of running sequences/slots as max_num_seqs. Exceeding this number of slots causes an error that crashes vLLM.
To solve that, I've introduced an env var that multiplies the number of mamba cache slots by a configurable factor (x1.5 by default).
I chose x1.5 as the default because it was sufficient in my tests.
CC @tlrmchlsmth
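For illustration, the slot computation described above could look roughly like the sketch below. This is not the code from the PR: the env var name `MAMBA_CACHE_SLOTS_MULTIPLIER` and the helper `mamba_cache_slots` are hypothetical, and rounding up is an assumption on my side. It only shows the idea of giving the mamba cache more slots than max_num_seqs.

```python
import math
import os


def mamba_cache_slots(max_num_seqs: int, env=os.environ) -> int:
    """Return the number of mamba cache slots to allocate.

    Hypothetical helper: the env var name below is an assumption,
    not the name used in the actual PR.
    """
    # Default multiplier of 1.5, matching the default described in the PR text.
    multiplier = float(env.get("MAMBA_CACHE_SLOTS_MULTIPLIER", "1.5"))
    if multiplier < 1.0:
        # Fewer slots than max_num_seqs would bring the original crash back.
        raise ValueError("multiplier must be >= 1.0")
    # Round up so a fractional result never loses a slot (assumed behavior).
    return math.ceil(max_num_seqs * multiplier)


if __name__ == "__main__":
    print(mamba_cache_slots(256))
```

With max_num_seqs=256 and no override, this reserves 384 slots, leaving headroom for requests that are scheduled beyond max_num_seqs.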