Default to 'align' mamba cache mode for Mamba-based models when speculative decoding is enabled by roikoren755 · Pull Request #40454 · vllm-project/vllm

roikoren755 · 2026-04-21T10:30:13Z

Purpose

The 'all' Mamba cache mode currently appears to be buggy when combined with speculative decoding, at least for Nemotron models (see, for example, #39809). This PR changes the default to 'align' when speculative decoding is enabled. 'align' may be less efficient for prefix caching, but it works consistently, so it is the safer default until 'all' mode is fixed for use with speculative decoding (SpecDec).
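As a rough illustration of the resulting default, the decision can be written as a small pure function. This is a minimal sketch that mirrors the diff quoted later in this thread, not the actual vLLM code: the function name and its boolean parameters are hypothetical, and it assumes the caller only consults it when prefix caching is enabled and the user has not chosen a mode explicitly.

```python
from typing import Literal

MambaCacheMode = Literal["all", "align"]


def default_mamba_cache_mode(
    supports_mamba_prefix_caching: bool,
    speculative_decoding_enabled: bool,
) -> MambaCacheMode:
    """Pick a default Mamba cache mode (hypothetical helper, for illustration).

    'all' is only chosen when the model supports Mamba prefix caching and
    speculative decoding is off; every other combination falls back to 'align'.
    """
    if supports_mamba_prefix_caching and not speculative_decoding_enabled:
        return "all"
    return "align"


if __name__ == "__main__":
    # Print the full decision table for the two inputs.
    for supports in (True, False):
        for spec_dec in (True, False):
            mode = default_mamba_cache_mode(supports, spec_dec)
            print(f"supports={supports} spec_dec={spec_dec} -> {mode}")
```

The only behavior change relative to the previous default is the first row of the table: a model that supports Mamba prefix caching now gets 'align' instead of 'all' once speculative decoding is configured.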

Test Plan

All current tests pass.

Test Result

All current tests pass.

Signed-off-by: Roi Koren <roik@nvidia.com>

gemini-code-assist

Code Review

This pull request updates the configuration logic in vllm/model_executor/models/config.py to default the Mamba cache mode to 'align' when both prefix caching and speculative decoding are enabled. A critical issue was identified where defaulting to 'align' mode without also enabling chunked prefill results in a server crash due to internal assertions. The reviewer recommends automatically enabling chunked prefill whenever 'align' mode is selected as the default to prevent this regression.

gemini-code-assist · 2026-04-21T10:35:04Z

+                if (
+                    model_config.supports_mamba_prefix_caching
+                    and vllm_config.speculative_config is not None
+                ):
+                    cache_config.mamba_cache_mode = "align"
+                    logger.warning(
+                        "Mamba cache mode is set to 'align' for %s by default "
+                        "when prefix caching and speculative decoding are enabled",
+                        model_config.architecture,
+                    )
+                else:
+                    cache_config.mamba_cache_mode = (
+                        "all" if model_config.supports_mamba_prefix_caching else "align"
+                    )
+                    logger.warning(
+                        "Mamba cache mode is set to '%s' for %s by default "
+                        "when prefix caching is enabled",
+                        cache_config.mamba_cache_mode,
+                        model_config.architecture,
+                    )


Defaulting to align mode for Mamba cache will cause a server crash if chunked prefill is not enabled, due to the strict assertion at line 359. Since this PR increases the cases where align is used as a default (specifically when speculative decoding is enabled), we should ensure that enable_chunked_prefill is automatically enabled to avoid this regression in usability.

Note that this requirement applies whenever mamba_cache_mode is set to align. It would be ideal to handle this enablement consistently for all paths that lead to align mode.

                if (
                    model_config.supports_mamba_prefix_caching
                    and vllm_config.speculative_config is not None
                ):
                    cache_config.mamba_cache_mode = "align"
                    vllm_config.scheduler_config.enable_chunked_prefill = True
                    logger.warning(
                        "Mamba cache mode is set to 'align' for %s by default "
                        "when prefix caching and speculative decoding are enabled. "
                        "Chunked prefill has been enabled as it is required for 'align' mode.",
                        model_config.architecture,
                    )
                else:
                    cache_config.mamba_cache_mode = (
                        "all" if model_config.supports_mamba_prefix_caching else "align"
                    )
                    if cache_config.mamba_cache_mode == "align":
                        vllm_config.scheduler_config.enable_chunked_prefill = True
                    logger.warning(
                        "Mamba cache mode is set to '%s' for %s by default "
                        "when prefix caching is enabled",
                        cache_config.mamba_cache_mode,
                        model_config.architecture,
                    )

This probably isn't needed
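For readers following the suggestion above, the "handle it consistently for all paths" idea could be factored into one helper rather than repeated in each branch. The sketch below is only an illustration under stated assumptions: `CacheConfig` and `SchedulerConfig` are simplified stand-in dataclasses, not vLLM's real config classes, and `apply_default_mamba_cache_mode` is a hypothetical name. As the reply above notes, this change may not be needed at all.

```python
from dataclasses import dataclass


@dataclass
class CacheConfig:
    # Stand-in for the real cache config; only the field used here.
    mamba_cache_mode: str = "all"


@dataclass
class SchedulerConfig:
    # Stand-in for the real scheduler config; only the field used here.
    enable_chunked_prefill: bool = False


def apply_default_mamba_cache_mode(
    cache_config: CacheConfig,
    scheduler_config: SchedulerConfig,
    mode: str,
) -> bool:
    """Set the cache mode and, for 'align', make sure chunked prefill is on.

    Returns True if chunked prefill had to be switched on by this call, so
    the caller can mention it in its log message.
    """
    cache_config.mamba_cache_mode = mode
    if mode == "align" and not scheduler_config.enable_chunked_prefill:
        scheduler_config.enable_chunked_prefill = True
        return True
    return False


if __name__ == "__main__":
    cache, sched = CacheConfig(), SchedulerConfig()
    changed = apply_default_mamba_cache_mode(cache, sched, "align")
    print(cache.mamba_cache_mode, sched.enable_chunked_prefill, changed)
```

Routing every default assignment through one function keeps the 'align' requirement in a single place, at the cost of silently overriding a scheduler setting, which is one reason a reviewer might prefer to leave the existing assertion as the only guard.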

benchislett

LGTM, Thanks!

…lative decoding is enabled (vllm-project#40454) Signed-off-by: Roi Koren <roik@nvidia.com> Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>

…lative decoding is enabled (vllm-project#40454) Signed-off-by: Roi Koren <roik@nvidia.com>

…lative decoding is enabled (vllm-project#40454) Signed-off-by: Roi Koren <roik@nvidia.com> Signed-off-by: Yifan <yzong@redhat.com>

…lative decoding is enabled (vllm-project#40454) Signed-off-by: Roi Koren <roik@nvidia.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>