Default to 'align' mamba cache mode for Mamba-based models when speculative decoding is enabled #40454

Merged: benchislett merged 4 commits into vllm-project:main from roikoren755:feat/nemotron-mtp-align-default on Apr 21, 2026
Conversation

@roikoren755 roikoren755 commented Apr 21, 2026

Contributor

Purpose

The 'all' mamba cache mode currently appears to be buggy when combined with speculative decoding, at least for Nemotron models (for example, #39809). This PR defaults the mode to 'align' in that case. 'align' may be less efficient for prefix caching, but it works consistently, so it serves as the default until 'all' mode is fixed in combination with speculative decoding.

Test Plan

All current tests pass.

Test Result

All current tests pass.



Signed-off-by: Roi Koren <roik@nvidia.com>

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request updates the configuration logic in vllm/model_executor/models/config.py to default the Mamba cache mode to 'align' when both prefix caching and speculative decoding are enabled. A critical issue was identified where defaulting to 'align' mode without also enabling chunked prefill results in a server crash due to internal assertions. The reviewer recommends automatically enabling chunked prefill whenever 'align' mode is selected as the default to prevent this regression.

Comment on lines +328 to +347
if (
    model_config.supports_mamba_prefix_caching
    and vllm_config.speculative_config is not None
):
    cache_config.mamba_cache_mode = "align"
    logger.warning(
        "Mamba cache mode is set to 'align' for %s by default "
        "when prefix caching and speculative decoding are enabled",
        model_config.architecture,
    )
else:
    cache_config.mamba_cache_mode = (
        "all" if model_config.supports_mamba_prefix_caching else "align"
    )
    logger.warning(
        "Mamba cache mode is set to '%s' for %s by default "
        "when prefix caching is enabled",
        cache_config.mamba_cache_mode,
        model_config.architecture,
    )
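The branch under review reduces to a small decision table. A minimal standalone sketch, assuming the two conditions can be modelled as plain booleans: the function name `default_mamba_cache_mode` and its parameters are hypothetical stand-ins for `model_config.supports_mamba_prefix_caching` and `vllm_config.speculative_config is not None`, not part of vLLM.

```python
def default_mamba_cache_mode(
    supports_mamba_prefix_caching: bool,
    speculative_decoding_enabled: bool,
) -> str:
    """Mirror the default selection shown in the diff (hypothetical helper)."""
    if supports_mamba_prefix_caching and speculative_decoding_enabled:
        # Added by this PR: avoid 'all' while speculative decoding is enabled.
        return "align"
    # Fallback branch from the diff: 'all' only if the model supports it.
    return "all" if supports_mamba_prefix_caching else "align"


if __name__ == "__main__":
    # Print the full decision table for the four input combinations.
    for supports in (False, True):
        for spec in (False, True):
            print(supports, spec, default_mamba_cache_mode(supports, spec))
```

Under these assumptions, 'all' is chosen only when the model supports Mamba prefix caching and speculative decoding is off. Every other combination resolves to 'align'.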

Contributor


high

Defaulting to align mode for Mamba cache will cause a server crash if chunked prefill is not enabled, due to the strict assertion at line 359. Since this PR increases the cases where align is used as a default (specifically when speculative decoding is enabled), we should ensure that enable_chunked_prefill is automatically enabled to avoid this regression in usability.

Note that this requirement applies whenever mamba_cache_mode is set to align. It would be ideal to handle this enablement consistently for all paths that lead to align mode.

                if (
                    model_config.supports_mamba_prefix_caching
                    and vllm_config.speculative_config is not None
                ):
                    cache_config.mamba_cache_mode = "align"
                    vllm_config.scheduler_config.enable_chunked_prefill = True
                    logger.warning(
                        "Mamba cache mode is set to 'align' for %s by default "
                        "when prefix caching and speculative decoding are enabled. "
                        "Chunked prefill has been enabled as it is required for 'align' mode.",
                        model_config.architecture,
                    )
                else:
                    cache_config.mamba_cache_mode = (
                        "all" if model_config.supports_mamba_prefix_caching else "align"
                    )
                    if cache_config.mamba_cache_mode == "align":
                        vllm_config.scheduler_config.enable_chunked_prefill = True
                    logger.warning(
                        "Mamba cache mode is set to '%s' for %s by default "
                        "when prefix caching is enabled",
                        cache_config.mamba_cache_mode,
                        model_config.architecture,
                    )
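The reviewer's note about handling the requirement "consistently for all paths that lead to align mode" could be read as routing every mode assignment through one helper. The following is only a sketch of that idea under stated assumptions: `CacheSettings`, `SchedulerSettings`, and `apply_mamba_cache_mode` are hypothetical stand-ins written for illustration, not vLLM classes or functions, and whether such a change is wanted in vLLM is left open by the thread.

```python
from dataclasses import dataclass


@dataclass
class SchedulerSettings:
    """Hypothetical stand-in for the scheduler config in the diff."""
    enable_chunked_prefill: bool = False


@dataclass
class CacheSettings:
    """Hypothetical stand-in for the cache config in the diff."""
    mamba_cache_mode: str = "none"


def apply_mamba_cache_mode(
    cache: CacheSettings, scheduler: SchedulerSettings, mode: str
) -> None:
    """Set the cache mode; for 'align', also enable chunked prefill."""
    cache.mamba_cache_mode = mode
    if mode == "align":
        # The review comment states that 'align' requires chunked prefill.
        scheduler.enable_chunked_prefill = True


if __name__ == "__main__":
    cache, scheduler = CacheSettings(), SchedulerSettings()
    apply_mamba_cache_mode(cache, scheduler, "align")
    print(cache.mamba_cache_mode, scheduler.enable_chunked_prefill)
```

Keeping the coupling in one function means a new code path that selects 'align' cannot forget the chunked-prefill step, at the cost of silently overriding a user's explicit scheduler setting.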

Member


This probably isn't needed

@benchislett benchislett left a comment

Member


LGTM, Thanks!

@benchislett benchislett added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 21, 2026
@benchislett benchislett enabled auto-merge (squash) April 21, 2026 13:16
@benchislett benchislett merged commit f819265 into vllm-project:main Apr 21, 2026
57 checks passed
Copilot AI pushed a commit to hongbolv/vllm that referenced this pull request Apr 22, 2026
…lative decoding is enabled (vllm-project#40454)

Signed-off-by: Roi Koren <roik@nvidia.com>
Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
@roikoren755 roikoren755 deleted the feat/nemotron-mtp-align-default branch April 27, 2026 18:52
roikoren755 added a commit to roikoren755/vllm that referenced this pull request Apr 29, 2026
…en speculative decoding is enabled (vllm-project#40454)"

This reverts commit f819265.

Signed-off-by: Roi Koren <roik@nvidia.com>