fix: accept RotatingKVCache in MLLM prompt merge #273

Closed
Thump604 wants to merge 2 commits into waybarrios:main from Thump604:codex/pr256-rotating-cache-guard

Conversation

@Thump604
Collaborator

Follow-up to the RotatingKVCache report in #256.

Context:

  • #256 adds trimming for RotatingKVCache before the merge in MLLMBatchGenerator._process_prompts
  • but the existing guard immediately above it still rejects anything that is not a KVCache instance
  • on Gemma 4 MLLM continuous batching, that makes the new trim path unreachable, because RotatingKVCache is a sibling of KVCache, not a subclass

This patch does two things:

  1. Relaxes the guard to accept either KVCache or RotatingKVCache (see the sketch below)
  2. Adds a regression test that drives _process_prompts() with a real RotatingKVCache and asserts the merged cache is BatchRotatingKVCache
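
A minimal sketch of the relaxed guard, assuming the mlx-lm cache classes; the helper name and error message here are illustrative, not the actual repo code:

    from mlx_lm.models.cache import KVCache, RotatingKVCache

    def _assert_mergeable(layer_cache):
        # Before this patch the check was effectively
        # isinstance(layer_cache, KVCache), which rejects RotatingKVCache:
        # it is a sibling of KVCache (same base class), not a subclass,
        # so the trim path added in #256 was never reached.
        if not isinstance(layer_cache, (KVCache, RotatingKVCache)):
            raise TypeError(
                f"cannot merge per-request cache of type {type(layer_cache).__name__}"
            )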

Validation:

  • python -m pytest tests/test_mllm_continuous_batching.py -q -> 24 passed, 3 deselected
  • python -m black --check --target-version py312 vllm_mlx/mllm_batch_generator.py tests/test_mllm_continuous_batching.py

If you prefer, this can be cherry-picked directly onto #256 rather than merged as a standalone PR.

janhilgard and others added 2 commits April 6, 2026 14:55
- Patch gemma4 Attention to snapshot cache.offset before mutation
  (mx.array.__iadd__ mutates in place and causes wrong RoPE positions;
  see the sketch after this commit list)
- Add Gemma 4 reasoning parser with channel name stripping
  (strips "thought"/"response" prefixes, supports both <channel|>
  and <|channel>response transition formats)
- Configure Gemma 4 EOS/stop tokens to prevent uncontrolled generation
- Add 16 Gemma 4 parser tests (non-streaming + streaming)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
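
A minimal sketch of the offset-snapshot fix from the first commit. The attribute and call names (cache.offset, cache.update_and_fetch, rope(x, offset=...)) follow mlx-lm conventions; the actual gemma4 Attention code is assumed, not reproduced:

    def attention_step(cache, queries, keys, values, rope):
        # Snapshot the offset as a plain int BEFORE the cache mutates it.
        # If cache.offset is backed by an mx.array, the cache's
        # `offset += n` goes through mx.array.__iadd__, which updates the
        # array in place; with MLX's lazy evaluation, a still-referenced
        # array would yield post-update (wrong) RoPE positions.
        offset = int(cache.offset)
        queries = rope(queries, offset=offset)
        keys = rope(keys, offset=offset)
        keys, values = cache.update_and_fetch(keys, values)
        return queries, keys, values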
@waybarrios
Owner

Hey, thanks for pulling this out as a focused fix after keegoid flagged it on #256. The RotatingKVCache guard was a real blocker for Gemma 4 batching. It landed as part of #268, which we just merged. Between this, #250, and your review on #268, you've been all over the Gemma 4 effort; really appreciate it.

waybarrios closed this Apr 10, 2026
@bludiesel

Confirming this PR fixes a production-breaking bug on Apple Silicon with Gemma 4 26B A4B MLLM. Ran a full benchmark on M3 Ultra 512GB (2026-04-12) against mlx-community/gemma-4-26B-A4B-it-heretic-4bit.

Reproduction on main

With vllm-mlx serve <model> --mllm --use-paged-cache --continuous-batching, any batch containing prompts of different sequence lengths crashes with:

ERROR:vllm_mlx.mllm_batch_generator:Failed to merge per-request caches (RotatingKVCache): ValueError: [broadcast_shapes] Shapes (1,8,48,256) and (1,8,177,256) cannot be broadcast.
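
A minimal, hypothetical reproduction of the shape clash (dimensions taken from the error above; any elementwise op between the two mismatched per-request caches triggers the same broadcast check):

    import mlx.core as mx

    a = mx.zeros((1, 8, 48, 256))   # request 1: 48 cached positions
    b = mx.zeros((1, 8, 177, 256))  # request 2: 177 cached positions
    try:
        _ = a + b  # broadcasting rejects the mismatched sequence axis
    except ValueError as e:
        print(e)  # [broadcast_shapes] Shapes ... cannot be broadcast.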

On an 8-prompt benchmark suite (short, medium, long-gen, math, code, classification, Arabic, uncensored), main crashes on 2/8 single requests and the scheduler enters a broken state; all subsequent requests return finish_reason: error with empty content. --kv-cache-quantization --kv-cache-quantization-bits 4 does NOT bypass the crash; it hits the same _process_prompts guard.

After cherry-picking this PR

Same benchmark, patched branch:

=== SEQUENTIAL (8/8 success) ===
  short:          62w    1.3s
  medium:         391w   6.2s
  math:           266w   6.3s
  code:           380w   9.7s
  classification: 27w    1.3s
  arabic:         437w  11.6s
  uncensored:     477w   7.0s
  long_gen:       685w  11.6s

=== PARALLEL (8 concurrent, mixed-length prompts) ===
  8/8 successful
  Total wallclock:       20.0s
  Aggregate throughput:  237.7 tok/s

For context, on the same hardware and prompts:

  • Patched vllm-mlx + continuous batching: 20.0s / 8 parallel / 237.7 tok/s
  • mlx-vlm server (official): 24.7s / 8 parallel / ~130 tok/s
  • Ollama (gguf backend): 40.7s / 8 parallel / 151 tok/s
  • vllm-mlx no --continuous-batching: 74.6s / 8 parallel / 82 tok/s

This patch is the difference between a 25% failure rate and a fully working continuous-batching MLLM path. Would love to see this land; happy to re-run any additional test config if useful.
