Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 68dbae5677
@sempervictus Here is the mamba state eviction PR for Qwen3 Next and Qwen3.5. I don't think we should allocate 512 mamba slots (unless you need a maximum of 512 parallel requests); the suggested slot count is 64. Now that we have partial mamba cache eviction, extra slots provide no benefit.
Hasn't caught fire yet... testing this will take some time.
This seems to do no damage, at least: with the attn-rs PR, it's handling >500K in a session and still formatting output code correctly.
Is this PR alone able to solve issue #229 (without hacking the number of mamba slots)?
The fp8 model should be more accurate than the nvfp4 one. |
That's great! They released Qwen3.5 today https://huggingface.co/Qwen/Qwen3.5-397B-A17B and you may try that. |
Hopefully they release an FP8/NVFP4 quant of that ... a bit much for my little GPUs to run unquantized 😄
17B active, hmm; definitely need to look at that FP8 kernel padding bit.
@guoqingbao - this can be merged.
Is the precision acceptable on 500k+ content? |
Bumping into a million; need to increase the YARN scale to 8.0, I guess ... still going strong 😁
DeltaNet offload to CPU memory and PD would be nice. At these sequence lengths, second queries cause evictions, and states aren't swapping in/out as one might desire.
Does the eviction of mamba state cause prefill/decode speed degradation?
Once it reaches the limit, it forces a re-fill of the prefix cache past a certain point. Bumping the constant to 256 "fixes" it, but IMO we should scale that limit dynamically with the instantiated model's maximum concurrency.
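A minimal sketch of that idea, assuming a hypothetical `effective_mamba_prefix_cap` helper driven by an engine-level `max_num_seqs` setting (both names are illustrative, not from this PR):

```rust
// Hypothetical sketch: derive the mamba snapshot cap from the engine's
// configured maximum concurrent sequences instead of a fixed constant.
const MIN_MAMBA_SLOTS: usize = 64;
const MAX_MAMBA_SLOTS: usize = 512;

fn effective_mamba_prefix_cap(max_num_seqs: usize) -> usize {
    // Scale with the number of sequences the engine can actually run,
    // clamped so that tiny or huge configs stay within a sane range.
    max_num_seqs.clamp(MIN_MAMBA_SLOTS, MAX_MAMBA_SLOTS)
}

fn main() {
    assert_eq!(effective_mamba_prefix_cap(8), 64); // small config: floor applies
    assert_eq!(effective_mamba_prefix_cap(256), 256); // mid-range: scales linearly
    assert_eq!(effective_mamba_prefix_cap(4096), 512); // huge config: ceiling applies
    println!("ok");
}
```

The clamp keeps the default behavior of the 64-slot suggestion for small deployments while letting high-concurrency setups grow without hand-editing a constant.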
Motivation
Description
- Extend `BlockManager` by adding `mamba_prefix_hashes_by_block: HashMap<usize, HashSet<u64>>` and `valid_mamba_prefix_hashes: HashSet<u64>`, and initialize them in the constructor.
- Record prefix hashes in `capture_mamba_prefix_state` and associate each hash with the terminal KV block for that prefix.
- Handle block eviction via `handle_mamba_prefix_evicted_blocks` instead of disabling all mamba prefix reuse.
- Check `valid_mamba_prefix_hashes` in `resolve_mamba_matched_blocks` before querying/using mamba snapshot state, to prevent stale reuse.
- Remove the `DisableMambaPrefixCache` RPC and the related model-level `disable_mamba_prefix_cache` APIs, and add an effective mamba-prefix-cap computation in the runner to bound snapshot sizing, with a warning log when capped.
Testing
- Ran `cargo fmt` on the modified files, which completed successfully.
- Ran `cargo test --lib core::prefix_cache::tests::prefix_cache_evicts_leaf_blocks`, which compiled and passed (1 passed; 0 failed).
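The bookkeeping described above can be sketched roughly as follows. Only the field and method names come from the PR description; the method bodies and the `is_mamba_prefix_valid` helper are assumptions about how such tracking could work:

```rust
use std::collections::{HashMap, HashSet};

// Sketch: each KV block that terminates a cached prefix remembers which
// mamba prefix hashes depend on it, and a global set tracks which of
// those hashes are still valid for reuse.
struct BlockManager {
    mamba_prefix_hashes_by_block: HashMap<usize, HashSet<u64>>,
    valid_mamba_prefix_hashes: HashSet<u64>,
}

impl BlockManager {
    fn new() -> Self {
        Self {
            mamba_prefix_hashes_by_block: HashMap::new(),
            valid_mamba_prefix_hashes: HashSet::new(),
        }
    }

    // On snapshot capture: associate the prefix hash with the terminal
    // KV block of that prefix and mark it valid.
    fn capture_mamba_prefix_state(&mut self, block_id: usize, prefix_hash: u64) {
        self.mamba_prefix_hashes_by_block
            .entry(block_id)
            .or_default()
            .insert(prefix_hash);
        self.valid_mamba_prefix_hashes.insert(prefix_hash);
    }

    // On KV block eviction: invalidate only the hashes anchored to the
    // evicted blocks, instead of disabling all mamba prefix reuse.
    fn handle_mamba_prefix_evicted_blocks(&mut self, evicted: &[usize]) {
        for block_id in evicted {
            if let Some(hashes) = self.mamba_prefix_hashes_by_block.remove(block_id) {
                for h in hashes {
                    self.valid_mamba_prefix_hashes.remove(&h);
                }
            }
        }
    }

    // Checked before reusing a mamba snapshot, preventing stale reuse.
    fn is_mamba_prefix_valid(&self, prefix_hash: u64) -> bool {
        self.valid_mamba_prefix_hashes.contains(&prefix_hash)
    }
}

fn main() {
    let mut bm = BlockManager::new();
    bm.capture_mamba_prefix_state(3, 0xAB);
    bm.capture_mamba_prefix_state(7, 0xCD);
    assert!(bm.is_mamba_prefix_valid(0xAB));
    bm.handle_mamba_prefix_evicted_blocks(&[3]);
    assert!(!bm.is_mamba_prefix_valid(0xAB)); // hash on evicted block is now stale
    assert!(bm.is_mamba_prefix_valid(0xCD)); // unrelated prefix still reusable
    println!("ok");
}
```

The point of the two-map design is that eviction is scoped: dropping one KV block only invalidates the snapshots whose prefixes end at that block, which is what allows partial mamba cache eviction rather than an all-or-nothing disable.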