
[Hybrid] Add mamba_block_size to Engine Args#27289

Merged

tdoublep merged 15 commits into vllm-project:main from Josephasafg:mamba_block_size_as_engine_arg on Oct 28, 2025

Conversation


@Josephasafg Josephasafg commented Oct 21, 2025

Purpose

Now that prefix caching is enabled, we want users to be able to control mamba_block_size so they can better tailor it to their use case.

At the moment, this flag can only be used together with --enable-prefix-caching; without it, the block size is set to max_model_len for backwards compatibility.
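For illustration, the flag would be passed alongside prefix caching roughly like this (the `--mamba-block-size` spelling is assumed from the engine-arg name, and `<model>` is a placeholder):

```shell
# Hypothetical invocation: mamba_block_size only takes effect with APC enabled.
vllm serve <model> \
  --enable-prefix-caching \
  --mamba-block-size 1024
```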

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>
Contributor Author

cc: @tdoublep


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the mamba_block_size argument, allowing users to customize this setting for prefix caching. The implementation correctly adds the new argument to EngineArgs and integrates it into the configuration logic. However, I've identified a critical issue that could lead to a server crash if the Mamba chunk size cannot be determined, and a high-severity issue where the validation for mamba_block_size is overly restrictive. I have provided code suggestions to address both of these points.

@Josephasafg Josephasafg requested a review from hmellor October 22, 2025 10:18
```diff
     models to ensure exact alignment with attention page size."""
-    mamba_block_size: int | None = None
+    mamba_block_size: int | None = Field(default=None, gt=0, multiple_of=256)
     """Size of a contiguous cache block in number of tokens for mamba cache."""
```
Collaborator


Do we have to force it to be a multiple of 256? I think this limitation should be verified by each mamba attention backend.

Contributor Author


Fair point. I can add the validation as part of the __init__ of each mamba attention backend.

For Mamba1 we could, for now, only allow multiples of 8 to align with the conv kernel, e.g.:

```python
if kv_cache_spec.block_size % BLOCK_M != 0:
    raise ValueError(...)
```

For Mamba2 we could keep the same constraint until @tdoublep relaxes it as his comment mentions.

Is that what you had in mind? @heheda12345
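A minimal sketch of what such a per-backend check could look like (the `KVCacheSpec` dataclass here is a stand-in for illustration, not vLLM's actual class, and `block_m=8` follows the conv-kernel alignment mentioned above):

```python
from dataclasses import dataclass


# Stand-in for the backend's kv-cache spec; vLLM's real class differs.
@dataclass
class KVCacheSpec:
    block_size: int


def validate_mamba1_block_size(kv_cache_spec: KVCacheSpec, block_m: int = 8) -> None:
    """Reject block sizes that would misalign with the conv kernel tile."""
    if kv_cache_spec.block_size % block_m != 0:
        raise ValueError(
            f"mamba block_size ({kv_cache_spec.block_size}) must be a "
            f"multiple of {block_m}"
        )


validate_mamba1_block_size(KVCacheSpec(block_size=1024))  # ok: 1024 % 8 == 0
```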

Collaborator


Yes exactly!

Member


Shall we change this to multiple_of=8 as the lowest common denominator? Then in Mamba2 we add the additional constraint you mentioned?

Contributor Author

@Josephasafg Josephasafg Oct 23, 2025


@heheda12345 In this PR I’m not adding support for Mamba1, since the mamba_block_size engine arg feature is only enabled when APC is on.

For now, I suggest we set multiple_of=8 as @hmellor suggested, as the general restriction. I think raising this error early is more user-friendly than waiting until the Mamba attention backend is called.

We can keep the Mamba2 constraint where it currently lives in config.py, since it’s still tightly coupled to that function until we manage to relax it and reduce dependencies.

Once we merge this PR I could add the restriction for Mamba1 as part of my Mamba1 APC PR.

What do you think?

Collaborator


I still think multiple_of=8 isn't a general constraint. For UX, the only change is that people need to wait a bit longer to get the error. We are also working on some refactoring to make these block_size-related checks happen earlier.

Contributor Author


Got it. So for now there is no restriction on the value of the block size. If the user passes it when APC is enabled, we will use it; otherwise, we fall back to the original path of calculating the block size.

Collaborator

@heheda12345 heheda12345 left a comment


LGTM! Thank you very much!

@heheda12345 heheda12345 enabled auto-merge (squash) October 27, 2025 22:22
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 27, 2025
auto-merge was automatically disabled October 28, 2025 06:26

Head branch was pushed to by a user without write access

@tdoublep tdoublep self-requested a review October 28, 2025 12:19
Member

@tdoublep tdoublep left a comment


LGTM

@tdoublep tdoublep enabled auto-merge (squash) October 28, 2025 12:19
@tdoublep tdoublep merged commit 05181cc into vllm-project:main Oct 28, 2025
55 checks passed
tdoublep referenced this pull request Oct 30, 2025
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

4 participants