[Hybrid] Add mamba_block_size to Engine Args#27289
tdoublep merged 15 commits into vllm-project:main
Conversation
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>
cc: @tdoublep
Code Review
This pull request introduces the mamba_block_size argument, allowing users to customize this setting for prefix caching. The implementation correctly adds the new argument to EngineArgs and integrates it into the configuration logic. However, I've identified a critical issue that could lead to a server crash if the Mamba chunk size cannot be determined, and a high-severity issue where the validation for mamba_block_size is overly restrictive. I have provided code suggestions to address both of these points.
vllm/config/cache.py (outdated diff)
      models to ensure exact alignment with attention page size."""
-     mamba_block_size: int | None = None
      """Size of a contiguous cache block in number of tokens for mamba cache."""
+     mamba_block_size: int | None = Field(default=None, gt=0, multiple_of=256)
do we have to force it to be multiple of 256? I think this limitation should be verified by each mamba attention backend.
Fair point. I can add the validation as part of the __init__ of each mamba attention backend.
For Mamba1 we could, for now, allow only multiples of 8 to align with the conv kernel, e.g.

    if kv_cache_spec.block_size % BLOCK_M != 0:
        raise ValueError(...)

For Mamba2 we could keep the same constraint until @tdoublep relaxes it as his comment mentions.
Is that what you had in mind? @heheda12345
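The per-backend check proposed above could look roughly like this. A hedged sketch only: `MambaAttentionBackendBase` and `BLOCK_M` are illustrative names from this thread, not vLLM's actual backend classes, and the real constraint may differ per backend.

```python
# Hypothetical per-backend validation, as proposed in the thread above.
class MambaAttentionBackendBase:
    # BLOCK_M = 8 aligns with the Mamba1 conv kernel, per the discussion;
    # subclasses (e.g. a Mamba2 backend) could tighten this further.
    BLOCK_M = 8

    def __init__(self, block_size: int):
        if block_size % self.BLOCK_M != 0:
            raise ValueError(
                f"mamba block_size ({block_size}) must be a "
                f"multiple of {self.BLOCK_M}")
        self.block_size = block_size
```

Doing the check in each backend's `__init__` keeps the constraint next to the kernel that imposes it, at the cost of surfacing the error later than an engine-arg-level check would.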
Shall we change this to multiple_of=8 as the lowest common denominator? Then in Mamba2 we add the additional constraint you mentioned?
@heheda12345 In this PR I’m not adding support for Mamba1, since the mamba_block_size engine arg feature is only enabled when APC is on.
For now, I suggest we set multiple_of=8 as @hmellor suggested, as the general restriction. I think raising this error early is more user-friendly than waiting until the Mamba attention backend is called.
We can keep the Mamba2 constraint where it currently lives in config.py, since it’s still tightly coupled to that function until we manage to relax it and reduce dependencies.
Once we merge this PR I could add the restriction for Mamba1 as part of my Mamba1 APC PR.
What do you think?
I still think multiple_of=8 isn't a general constraint. For UX, I think the only change is that people need to wait a bit longer to get the error. We are also working on a refactor to make these block_size-related checks happen earlier.
Got it. So for now there is no restriction on the value of the block size. If the user decides to pass it when APC is enabled, we will use it. Otherwise, we fall back to the original path of calculating the block size.
heheda12345 left a comment
LGTM! Thank you very much!
Head branch was pushed to by a user without write access
…fg/vllm into mamba_block_size_as_engine_arg
Signed-off-by: lizhiyuan <lizhiyuan@moonshot.cn> Signed-off-by: Zhiyuan Li <uniartisan2017@gmail.com>
Purpose
Now that prefix caching is enabled, we want users to be able to control mamba_block_size to better customize it to their use case. At the moment, this flag can only be used with --enable-prefix-caching, as without it we allocate the cache in blocks of max_model_len for backwards compatibility.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.