Support mamba prefix eviction#231

Merged
guoqingbao merged 2 commits into main from codex/github-mention-cache-pollution
Feb 17, 2026

Conversation

@guoqingbao
Owner

Motivation

  • The prior global-disable approach for mamba prefix snapshots was too blunt and risked losing useful mamba state when only part of the KV prefix cache was evicted.
  • Partial KV evictions can leave remaining KV blocks reusable while corresponding mamba snapshots for evicted blocks become stale, potentially causing cache mismatch and degraded prefill accuracy.

Description

  • Track mamba snapshot hashes per KV block in BlockManager by adding mamba_prefix_hashes_by_block: HashMap<usize, HashSet<u64>> and valid_mamba_prefix_hashes: HashSet<u64>, initializing both in the constructor.
  • Register captured mamba prefix snapshot hashes in capture_mamba_prefix_state and associate each hash with the terminal KV block for that prefix.
  • In all KV eviction/clear paths, remove only the mamba hashes tied to the evicted KV block(s) via handle_mamba_prefix_evicted_blocks instead of disabling all mamba prefix reuse.
  • Require candidate prefix hashes to be present in valid_mamba_prefix_hashes in resolve_mamba_matched_blocks before querying/using mamba snapshot state to prevent stale reuse.
  • Revert the previously introduced cross-runner DisableMambaPrefixCache RPC and related model-level disable_mamba_prefix_cache APIs, and add an effective mamba-prefix-cap computation in the runner to bound snapshot sizing with a warning log when capped.
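
The per-block bookkeeping described above can be sketched roughly as follows. This is a simplified illustration, not the actual BlockManager: the two field names come from the PR description, but the method signatures and the surrounding struct are assumptions.

```rust
use std::collections::{HashMap, HashSet};

/// Minimal sketch of the per-block mamba hash bookkeeping.
struct BlockManager {
    /// Mamba snapshot hashes keyed by the terminal KV block of their prefix.
    mamba_prefix_hashes_by_block: HashMap<usize, HashSet<u64>>,
    /// Hashes whose snapshots are still safe to reuse.
    valid_mamba_prefix_hashes: HashSet<u64>,
}

impl BlockManager {
    fn new() -> Self {
        Self {
            mamba_prefix_hashes_by_block: HashMap::new(),
            valid_mamba_prefix_hashes: HashSet::new(),
        }
    }

    /// Register a captured snapshot hash against its terminal KV block.
    fn register_mamba_prefix(&mut self, terminal_block: usize, hash: u64) {
        self.mamba_prefix_hashes_by_block
            .entry(terminal_block)
            .or_default()
            .insert(hash);
        self.valid_mamba_prefix_hashes.insert(hash);
    }

    /// On KV eviction, invalidate only the hashes tied to the evicted blocks,
    /// leaving snapshots for still-resident blocks reusable.
    fn handle_mamba_prefix_evicted_blocks(&mut self, evicted: &[usize]) {
        for block in evicted {
            if let Some(hashes) = self.mamba_prefix_hashes_by_block.remove(block) {
                for h in hashes {
                    self.valid_mamba_prefix_hashes.remove(&h);
                }
            }
        }
    }

    /// A candidate prefix hash may only be used if it is still valid.
    fn is_reusable(&self, hash: u64) -> bool {
        self.valid_mamba_prefix_hashes.contains(&hash)
    }
}

fn main() {
    let mut bm = BlockManager::new();
    bm.register_mamba_prefix(0, 0xA1);
    bm.register_mamba_prefix(1, 0xB2);
    // Evicting KV block 0 invalidates only the hash registered against it.
    bm.handle_mamba_prefix_evicted_blocks(&[0]);
    println!("0xA1 reusable: {}", bm.is_reusable(0xA1)); // false
    println!("0xB2 reusable: {}", bm.is_reusable(0xB2)); // true
}
```

This is the key difference from the prior global-disable approach: a partial KV eviction no longer throws away every mamba snapshot, only those anchored to the evicted blocks.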

Testing

  • Ran cargo fmt on the modified files; it completed successfully.
  • Ran the targeted unit test cargo test --lib core::prefix_cache::tests::prefix_cache_evicts_leaf_blocks, which compiled and passed (1 passed; 0 failed).
  • Built and ran library tests for the touched workspace targets; the workspace compiled and all tests passed.

Codex Task


@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 68dbae5677

Comment thread: src/core/block_manager.rs
@guoqingbao
Owner Author

@sempervictus Here is the mamba state eviction PR for qwen3 next and qwen3.5. I don't think we should allocate 512 mamba slots (unless you need up to 512 parallel requests); the suggested slot count is 64. Now that we have partial mamba cache eviction, extra slots are useless.

@guoqingbao changed the title from "Follow-up: partial mamba prefix eviction aligned with KV prefix eviction" to "Support mamba prefix eviction" on Feb 15, 2026
@sempervictus
Contributor

Hasn't caught fire yet... this will take some time to test

@sempervictus
Contributor

This seems to not do any damage at least - with the attn-rs PR it's handling >500K in a session and still formatting output code correctly

@guoqingbao
Owner Author

This seems to not do any damage at least - with the attn-rs PR it's handling >500K in a session and still formatting output code correctly

Is this PR alone able to solve issue #229 (without hacking the number of mamba slots)?

@guoqingbao
Owner Author

This seems to not do any damage at least - with the attn-rs PR it's handling >500K in a session and still formatting output code correctly

The fp8 model should be more accurate than the nvfp4 one.

@sempervictus
Contributor

Yes, this + the one I have up on attn-rs and #232 are stellar together at YARN-scaled context lengths. If you could please make #232 pass the --feature python tests, it should be good to go as well.

@guoqingbao
Owner Author

guoqingbao commented Feb 16, 2026

Yes, this + the one I have up on attn-rs and #232 are stellar together at YARN-scaled context lengths. If you could please make #232 pass the --feature python tests, it should be good to go as well.

That's great! They released Qwen3.5 today https://huggingface.co/Qwen/Qwen3.5-397B-A17B and you may try that.

@sempervictus
Contributor

Hopefully they release an FP8/NVFP4 of that ... a bit much for my little GPUs to run unquantized 😄

@sempervictus
Contributor

17b active, hmm, definitely need to look at that fp8 kernel padding bit

@sempervictus
Contributor

@guoqingbao - this can be merged - Seq 202 - chunk prefill finished (500704 tokens) and like the Energizer Bunny commercials of old... "it just keeps going and going"

@guoqingbao
Owner Author

@guoqingbao - this can be merged - Seq 202 - chunk prefill finished (500704 tokens) and like the Energizer Bunny commercials of old... "it just keeps going and going"

Is the precision acceptable on 500k+ content?

@sempervictus
Contributor

sempervictus commented Feb 17, 2026

Is the precision acceptable on 500k+ content?

Bumping into a million; need to increase the YARN scale to 8.0, I guess ... still going strong

    "rope_scaling": {
      "rope_type": "yarn",
      "factor": 8.0,
      "original_max_position_embeddings": 262144
    },

😁

@sempervictus
Contributor

Deltanet offload to CPU memory and PD would be nice. At these sequence lengths, second queries cause evictions, and they're not swapping in/out as one might desire

@guoqingbao
Owner Author

Deltanet offload to CPU memory and PD would be nice. At these sequence lengths, second queries cause evictions, and they're not swapping in/out as one might desire

Does the eviction of mamba state cause prefill/decode speed degradation?

@sempervictus
Copy link
Copy Markdown
Contributor

sempervictus commented Feb 17, 2026

Once it reaches the limit, it forces a re-fill of the prefix cache after a certain point. Bumping the constant to 256 "fixes" it, but IMO we should scale that dynamically with the size of the instantiated model.
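
Scaling the slot cap dynamically, as suggested above, could look roughly like this. The function name, the memory-budget heuristic, and all parameters are hypothetical; the PR itself only adds an effective mamba-prefix-cap computation with a warning log when capped.

```rust
/// Hypothetical sketch: bound the mamba slot count by a memory budget
/// instead of a fixed constant. All names and the 1/4 budget are assumptions.
fn effective_mamba_slot_cap(
    requested_slots: usize,
    free_bytes: usize,
    bytes_per_snapshot: usize,
) -> usize {
    // Never let snapshots consume more than a quarter of free memory.
    let budget = free_bytes / 4;
    let fit = (budget / bytes_per_snapshot).max(1);
    let cap = requested_slots.min(fit);
    if cap < requested_slots {
        eprintln!("warning: mamba slot cap reduced from {requested_slots} to {cap}");
    }
    cap
}

fn main() {
    // 1 GiB free, 16 MiB per snapshot: budget is 256 MiB, so 16 slots fit.
    let cap = effective_mamba_slot_cap(512, 1 << 30, 1 << 24);
    println!("effective cap: {cap}"); // 16
}
```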

@guoqingbao guoqingbao merged commit d3ebc10 into main Feb 17, 2026
1 check passed
@guoqingbao guoqingbao deleted the codex/github-mention-cache-pollution branch April 18, 2026 06:35