[Hybrid] Add mamba_block_size to Engine Args#27289
tdoublep merged 15 commits into vllm-project:main
Conversation
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>
cc: @tdoublep
Code Review
This pull request introduces the mamba_block_size argument, allowing users to customize this setting for prefix caching. The implementation correctly adds the new argument to EngineArgs and integrates it into the configuration logic. However, I've identified a critical issue that could lead to a server crash if the Mamba chunk size cannot be determined, and a high-severity issue where the validation for mamba_block_size is overly restrictive. I have provided code suggestions to address both of these points.
vllm/config/cache.py (outdated diff)
      models to ensure exact alignment with attention page size."""
-     mamba_block_size: int | None = None
      """Size of a contiguous cache block in number of tokens for mamba cache."""
+     mamba_block_size: int | None = Field(default=None, gt=0, multiple_of=256)
do we have to force it to be multiple of 256? I think this limitation should be verified by each mamba attention backend.
Fair point. I can add the validation as part of the __init__ of each mamba attention backend.
For Mamba1 we could, for now, allow only multiples of 8 to align with the conv kernel, e.g.

    if kv_cache_spec.block_size % BLOCK_M != 0:
        raise ValueError(...)

For Mamba2 we could keep the same constraint until @tdoublep relaxes it as his comment mentions.
Is that what you had in mind? @heheda12345
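The per-backend check proposed above could look roughly like this. A hedged sketch only: `MambaAttentionBackendBase` and `BLOCK_M` are illustrative names from this thread, not vLLM's actual backend classes, and the real constraint may differ per backend.

```python
# Hypothetical per-backend validation, as proposed in the thread above.
class MambaAttentionBackendBase:
    # BLOCK_M = 8 aligns with the Mamba1 conv kernel, per the discussion;
    # subclasses (e.g. a Mamba2 backend) could tighten this further.
    BLOCK_M = 8

    def __init__(self, block_size: int):
        if block_size % self.BLOCK_M != 0:
            raise ValueError(
                f"mamba block_size ({block_size}) must be a "
                f"multiple of {self.BLOCK_M}")
        self.block_size = block_size
```

Doing the check in each backend's `__init__` keeps the constraint next to the kernel that imposes it, at the cost of surfacing the error later than an engine-arg-level check would.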
Shall we change this to multiple_of=8 as the lowest common denominator? Then in Mamba2 we add the additional constraint you mentioned?
@heheda12345 In this PR I’m not adding support for Mamba1, since the mamba_block_size engine arg feature is only enabled when APC is on.
For now, I suggest we set multiple_of=8 as @hmellor suggested, as the general restriction. I think raising this error early is more user-friendly than waiting until the Mamba attention backend is called.
We can keep the Mamba2 constraint where it currently lives in config.py, since it’s still tightly coupled to that function until we manage to relax it and reduce dependencies.
Once we merge this PR I could add the restriction for Mamba1 as part of my Mamba1 APC PR.
What do you think?
I still think multiple_of=8 isn't a general constraint. For UX, I think the only change is that people need to wait a bit longer to get the error. We are also working on a refactor to make these block_size-related checks happen earlier.
Got it. So for now there is no restriction on the value of the block size. If the user decides to pass it when APC is enabled, we will use it. Otherwise, we fall back to the original path of calculating the block size.
heheda12345 left a comment
LGTM! Thank you very much!
Head branch was pushed to by a user without write access
…fg/vllm into mamba_block_size_as_engine_arg
Signed-off-by: lizhiyuan <lizhiyuan@moonshot.cn> Signed-off-by: Zhiyuan Li <uniartisan2017@gmail.com>
Purpose
Now that prefix caching is enabled, we want users to be able to control mamba_block_size to better customize it to their use case. At the moment, this flag can only be used with --enable-prefix-caching, as without it we allocate the cache in blocks of max_model_len for backwards compatibility.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.