fix: accept RotatingKVCache in MLLM prompt merge #273

Closed
Thump604 wants to merge 2 commits into waybarrios:main from Thump604:codex/pr256-rotating-cache-guard

Conversation

@Thump604
Collaborator

Follow-up to the RotatingKVCache report in #256.

Context:

  • #256 adds trimming for RotatingKVCache before the merge in MLLMBatchGenerator._process_prompts
  • but the existing guard immediately above it still rejects anything that is not a KVCache instance
  • on Gemma 4 MLLM continuous batching, that makes the new trim path unreachable, because RotatingKVCache is a sibling of KVCache, not a subclass

This patch does two things:

  1. Relaxes the guard to accept either KVCache or RotatingKVCache (see the sketch below)
  2. Adds a regression test that drives _process_prompts() with a real RotatingKVCache and asserts the merged cache is BatchRotatingKVCache
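
A minimal sketch of the relaxed guard, assuming the mlx-lm cache classes; the helper name and error message here are illustrative, not the actual repo code:

    from mlx_lm.models.cache import KVCache, RotatingKVCache

    def _assert_mergeable(layer_cache):
        # Before this patch the check was effectively
        # isinstance(layer_cache, KVCache), which rejects RotatingKVCache:
        # it is a sibling of KVCache (same base class), not a subclass,
        # so the trim path added in #256 was never reached.
        if not isinstance(layer_cache, (KVCache, RotatingKVCache)):
            raise TypeError(
                f"cannot merge per-request cache of type {type(layer_cache).__name__}"
            )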

Validation:

  • python -m pytest tests/test_mllm_continuous_batching.py -q -> 24 passed, 3 deselected
  • python -m black --check --target-version py312 vllm_mlx/mllm_batch_generator.py tests/test_mllm_continuous_batching.py

If you prefer, this can be cherry-picked directly onto #256 rather than merged as a standalone PR.

janhilgard and others added 2 commits April 6, 2026 14:55
- Patch gemma4 Attention to snapshot cache.offset before mutation
  (mx.array.__iadd__ mutates in place and causes wrong RoPE positions;
  see the sketch after this commit list)
- Add Gemma 4 reasoning parser with channel name stripping
  (strips "thought"/"response" prefixes, supports both <channel|>
  and <|channel>response transition formats)
- Configure Gemma 4 EOS/stop tokens to prevent uncontrolled generation
- Add 16 Gemma 4 parser tests (non-streaming + streaming)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
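
A minimal sketch of the offset-snapshot fix from the first commit. The attribute and call names (cache.offset, cache.update_and_fetch, rope(x, offset=...)) follow mlx-lm conventions; the actual gemma4 Attention code is assumed, not reproduced:

    def attention_step(cache, queries, keys, values, rope):
        # Snapshot the offset as a plain int BEFORE the cache mutates it.
        # If cache.offset is backed by an mx.array, the cache's
        # `offset += n` goes through mx.array.__iadd__, which updates the
        # array in place; with MLX's lazy evaluation, a still-referenced
        # array would yield post-update (wrong) RoPE positions.
        offset = int(cache.offset)
        queries = rope(queries, offset=offset)
        keys = rope(keys, offset=offset)
        keys, values = cache.update_and_fetch(keys, values)
        return queries, keys, values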
@waybarrios
Owner

Hey, thanks for pulling this out as a focused fix after keegoid flagged it on #256. The RotatingKVCache guard was a real blocker for Gemma 4 batching. It landed as part of #268, which we just merged. Between this, #250, and your review on #268, you've been all over the Gemma 4 effort; really appreciate it.

waybarrios closed this Apr 10, 2026
@bludiesel

Confirming this PR fixes a production-breaking bug on Apple Silicon with Gemma 4 26B A4B MLLM. Ran a full benchmark on M3 Ultra 512GB (2026-04-12) against mlx-community/gemma-4-26B-A4B-it-heretic-4bit.

Reproduction on main

With vllm-mlx serve <model> --mllm --use-paged-cache --continuous-batching, any batch containing prompts of different sequence lengths crashes with:

ERROR:vllm_mlx.mllm_batch_generator:Failed to merge per-request caches (RotatingKVCache): ValueError: [broadcast_shapes] Shapes (1,8,48,256) and (1,8,177,256) cannot be broadcast.
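
A minimal, hypothetical reproduction of the shape clash (dimensions taken from the error above; any elementwise op between the two mismatched per-request caches triggers the same broadcast check):

    import mlx.core as mx

    a = mx.zeros((1, 8, 48, 256))   # request 1: 48 cached positions
    b = mx.zeros((1, 8, 177, 256))  # request 2: 177 cached positions
    try:
        _ = a + b  # broadcasting rejects the mismatched sequence axis
    except ValueError as e:
        print(e)  # [broadcast_shapes] Shapes ... cannot be broadcast.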

On an 8-prompt benchmark suite (short, medium, long-gen, math, code, classification, Arabic, uncensored), main crashes on 2/8 single requests and the scheduler enters a broken state; all subsequent requests return finish_reason: error with empty content. --kv-cache-quantization --kv-cache-quantization-bits 4 does NOT bypass the crash; it hits the same _process_prompts guard.

After cherry-picking this PR

Same benchmark, patched branch:

=== SEQUENTIAL (8/8 success) ===
  short:          62w    1.3s
  medium:         391w   6.2s
  math:           266w   6.3s
  code:           380w   9.7s
  classification: 27w    1.3s
  arabic:         437w  11.6s
  uncensored:     477w   7.0s
  long_gen:       685w  11.6s

=== PARALLEL (8 concurrent, mixed-length prompts) ===
  8/8 successful
  Total wallclock:       20.0s
  Aggregate throughput:  237.7 tok/s

For context, on the same hardware and prompts:

  • Patched vllm-mlx + continuous batching: 20.0s / 8 parallel / 237.7 tok/s
  • mlx-vlm server (official): 24.7s / 8 parallel / ~130 tok/s
  • Ollama (gguf backend): 40.7s / 8 parallel / 151 tok/s
  • vllm-mlx no --continuous-batching: 74.6s / 8 parallel / 82 tok/s

This patch is the difference between a 25% failure rate and a fully working continuous-batching MLLM path. Would love to see this land; happy to re-run any additional test config if useful.
