Support mamba prefix eviction#231

Merged
guoqingbao merged 2 commits into main from codex/github-mention-cache-pollution
Feb 17, 2026

Conversation

@guoqingbao
Owner

Motivation

  • The prior global-disable approach for mamba prefix snapshots was too blunt and risked losing useful mamba state when only part of the KV prefix cache was evicted.
  • Partial KV evictions can leave remaining KV blocks reusable while corresponding mamba snapshots for evicted blocks become stale, potentially causing cache mismatch and degraded prefill accuracy.

Description

  • Track mamba snapshot hashes per KV block in BlockManager by adding mamba_prefix_hashes_by_block: HashMap<usize, HashSet<u64>> and valid_mamba_prefix_hashes: HashSet<u64>, initializing both in the constructor.
  • Register captured mamba prefix snapshot hashes in capture_mamba_prefix_state and associate each hash with the terminal KV block for that prefix.
  • In all KV eviction/clear paths, remove only the mamba hashes tied to the evicted KV block(s) via handle_mamba_prefix_evicted_blocks instead of disabling all mamba prefix reuse.
  • Require candidate prefix hashes to be present in valid_mamba_prefix_hashes in resolve_mamba_matched_blocks before querying/using mamba snapshot state to prevent stale reuse.
  • Revert the previously introduced cross-runner DisableMambaPrefixCache RPC and related model-level disable_mamba_prefix_cache APIs, and add an effective mamba-prefix-cap computation in the runner to bound snapshot sizing with a warning log when capped.
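
The per-block bookkeeping described above can be sketched roughly as follows. This is a simplified illustration, not the actual BlockManager: the two field names come from the PR description, but the method signatures and the surrounding struct are assumptions.

```rust
use std::collections::{HashMap, HashSet};

/// Minimal sketch of the per-block mamba hash bookkeeping.
struct BlockManager {
    /// Mamba snapshot hashes keyed by the terminal KV block of their prefix.
    mamba_prefix_hashes_by_block: HashMap<usize, HashSet<u64>>,
    /// Hashes whose snapshots are still safe to reuse.
    valid_mamba_prefix_hashes: HashSet<u64>,
}

impl BlockManager {
    fn new() -> Self {
        Self {
            mamba_prefix_hashes_by_block: HashMap::new(),
            valid_mamba_prefix_hashes: HashSet::new(),
        }
    }

    /// Register a captured snapshot hash against its terminal KV block.
    fn register_mamba_prefix(&mut self, terminal_block: usize, hash: u64) {
        self.mamba_prefix_hashes_by_block
            .entry(terminal_block)
            .or_default()
            .insert(hash);
        self.valid_mamba_prefix_hashes.insert(hash);
    }

    /// On KV eviction, invalidate only the hashes tied to the evicted blocks,
    /// leaving snapshots for still-resident blocks reusable.
    fn handle_mamba_prefix_evicted_blocks(&mut self, evicted: &[usize]) {
        for block in evicted {
            if let Some(hashes) = self.mamba_prefix_hashes_by_block.remove(block) {
                for h in hashes {
                    self.valid_mamba_prefix_hashes.remove(&h);
                }
            }
        }
    }

    /// A candidate prefix hash may only be used if it is still valid.
    fn is_reusable(&self, hash: u64) -> bool {
        self.valid_mamba_prefix_hashes.contains(&hash)
    }
}

fn main() {
    let mut bm = BlockManager::new();
    bm.register_mamba_prefix(0, 0xA1);
    bm.register_mamba_prefix(1, 0xB2);
    // Evicting KV block 0 invalidates only the hash registered against it.
    bm.handle_mamba_prefix_evicted_blocks(&[0]);
    println!("0xA1 reusable: {}", bm.is_reusable(0xA1)); // false
    println!("0xB2 reusable: {}", bm.is_reusable(0xB2)); // true
}
```

This is the key difference from the prior global-disable approach: a partial KV eviction no longer throws away every mamba snapshot, only those anchored to the evicted blocks.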

Testing

  • Ran cargo fmt on the modified files; it completed successfully.
  • Ran the targeted unit test cargo test --lib core::prefix_cache::tests::prefix_cache_evicts_leaf_blocks, which compiled and passed (1 passed; 0 failed).
  • Built and ran library tests for the touched workspace targets; the workspace compiled and all tests passed.

Codex Task


@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 68dbae5677

Comment thread: src/core/block_manager.rs
@guoqingbao
Owner Author

@sempervictus Here is the mamba state eviction PR for qwen3 next and qwen3.5. I don't think we should allocate 512 mamba slots (unless you need up to 512 parallel requests); the suggested slot count is 64. Now that we have partial mamba cache eviction, extra slots are useless.

@guoqingbao changed the title from "Follow-up: partial mamba prefix eviction aligned with KV prefix eviction" to "Support mamba prefix eviction" on Feb 15, 2026
@sempervictus
Contributor

Hasn't caught fire yet... this will take some time to test

@sempervictus
Contributor

This seems to not do any damage at least - with the attn-rs PR it's handling >500K in a session and still formatting output code correctly

@guoqingbao
Owner Author

This seems to not do any damage at least - with the attn-rs PR it's handling >500K in a session and still formatting output code correctly

Is this PR alone able to solve issue #229 (without hacking the number of mamba slots)?

@guoqingbao
Owner Author

This seems to not do any damage at least - with the attn-rs PR it's handling >500K in a session and still formatting output code correctly

The fp8 model should be more accurate than the nvfp4 one.

@sempervictus
Contributor

Yes, this + the one I have up on attn-rs and #232 are stellar together at YARN-scaled context lengths. If you could please make #232 pass the --feature python tests, it should be good to go as well.

@guoqingbao
Owner Author

guoqingbao commented Feb 16, 2026

Yes, this + the one I have up on attn-rs and #232 are stellar together at YARN-scaled context lengths. If you could please make #232 pass the --feature python tests, it should be good to go as well.

That's great! They released Qwen3.5 today https://huggingface.co/Qwen/Qwen3.5-397B-A17B and you may try that.

@sempervictus
Contributor

Hopefully they release an FP8/NVFP4 of that ... a bit much for my little GPUs to run unquantized 😄

@sempervictus
Contributor

17b active, hmm, definitely need to look at that fp8 kernel padding bit

@sempervictus
Contributor

@guoqingbao - this can be merged - Seq 202 - chunk prefill finished (500704 tokens) and like the Energizer Bunny commercials of old... "it just keeps going and going"

@guoqingbao
Owner Author

@guoqingbao - this can be merged - Seq 202 - chunk prefill finished (500704 tokens) and like the Energizer Bunny commercials of old... "it just keeps going and going"

Is the precision acceptable on 500k+ content?

@sempervictus
Contributor

sempervictus commented Feb 17, 2026

Is the precision acceptable on 500k+ content?

Bumping into a million; need to increase the YARN scale to 8.0, I guess ... still going strong

    "rope_scaling": {
      "rope_type": "yarn",
      "factor": 8.0,
      "original_max_position_embeddings": 262144
    },

😁

@sempervictus
Contributor

Deltanet offload to CPU memory and PD would be nice. At these sequence lengths, second queries cause evictions, and they're not swapping in/out as one might desire

@guoqingbao
Owner Author

Deltanet offload to CPU memory and PD would be nice. At these sequence lengths, second queries cause evictions, and they're not swapping in/out as one might desire

Does the eviction of mamba state cause prefill/decode speed degradation?

@sempervictus
Copy link
Copy Markdown
Contributor

sempervictus commented Feb 17, 2026

Once it reaches the limit, it forces a re-fill of the prefix cache after a certain point. Bumping the constant to 256 "fixes" it, but IMO we should scale that dynamically with the size of the instantiated model.
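
Scaling the slot cap dynamically, as suggested above, could look roughly like this. The function name, the memory-budget heuristic, and all parameters are hypothetical; the PR itself only adds an effective mamba-prefix-cap computation with a warning log when capped.

```rust
/// Hypothetical sketch: bound the mamba slot count by a memory budget
/// instead of a fixed constant. All names and the 1/4 budget are assumptions.
fn effective_mamba_slot_cap(
    requested_slots: usize,
    free_bytes: usize,
    bytes_per_snapshot: usize,
) -> usize {
    // Never let snapshots consume more than a quarter of free memory.
    let budget = free_bytes / 4;
    let fit = (budget / bytes_per_snapshot).max(1);
    let cap = requested_slots.min(fit);
    if cap < requested_slots {
        eprintln!("warning: mamba slot cap reduced from {requested_slots} to {cap}");
    }
    cap
}

fn main() {
    // 1 GiB free, 16 MiB per snapshot: budget is 256 MiB, so 16 slots fit.
    let cap = effective_mamba_slot_cap(512, 1 << 30, 1 << 24);
    println!("effective cap: {cap}"); // 16
}
```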

@guoqingbao guoqingbao merged commit d3ebc10 into main Feb 17, 2026
1 check passed
@guoqingbao guoqingbao deleted the codex/github-mention-cache-pollution branch April 18, 2026 06:35