fix: accept RotatingKVCache in MLLM prompt merge #273
Thump604 wants to merge 2 commits into waybarrios:main from
Conversation
- Patch gemma4 Attention to snapshot cache.offset before mutation (mx.array.__iadd__ is in-place, causes wrong RoPE positions)
- Add Gemma 4 reasoning parser with channel name stripping (strips "thought"/"response" prefixes, supports both <channel|> and <|channel>response transition formats)
- Configure Gemma 4 EOS/stop tokens to prevent uncontrolled generation
- Add 16 Gemma 4 parser tests (non-streaming + streaming)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
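The first bullet is the core fix. A minimal sketch of the aliasing bug, using NumPy arrays to stand in for mx.array (both mutate in place on `+=`); `SimpleCache` and the two functions are hypothetical illustrations, not the repo's actual code:

```python
import numpy as np

class SimpleCache:
    """Hypothetical stand-in for an attention KV cache with an offset array."""
    def __init__(self):
        self.offset = np.array([0])

def buggy_positions(cache, n_tokens):
    start = cache.offset          # alias, NOT a copy
    cache.offset += n_tokens      # in-place __iadd__: mutates `start` too
    return int(start[0])          # wrong: offset has already advanced

def fixed_positions(cache, n_tokens):
    start = int(cache.offset[0])  # snapshot the scalar BEFORE mutation
    cache.offset += n_tokens
    return start                  # correct RoPE start position

cache = SimpleCache()
print(buggy_positions(cache, 4))  # returns 4, should be 0

cache = SimpleCache()
print(fixed_positions(cache, 4))  # returns 0, as intended
```

The wrong start value then feeds RoPE, which is why generations drift once the cache offset is shared by reference.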
Hey, thanks for pulling this out as a focused fix after keegoid flagged it on #256. The RotatingKVCache guard was a real blocker for Gemma 4 batching; it landed as part of #268, which we just merged. Between this, #250, and your review on #268, you've been all over the Gemma 4 effort. Really appreciate it.
Confirming this PR fixes a production-breaking bug on Apple Silicon with the Gemma 4 26B A4B MLLM. Ran a full benchmark on an M3 Ultra 512GB (2026-04-12).

**Reproduction on main**

On an 8-prompt benchmark suite (short, medium, long-gen, math, code, classification, Arabic, uncensored), main crashes on 2/8 single requests and the scheduler enters a broken state: all subsequent requests fail as well.

**After cherry-picking this PR**

Same benchmark, same hardware and prompts, on the patched branch: all 8 prompts complete.

This patch is the difference between a 25% failure rate and a fully working continuous-batching MLLM path. Would love to see this land; happy to re-run any additional test config if useful.
Follow-up to the RotatingKVCache report in #256.
Context:
- #256 adds trimming for `RotatingKVCache` before merge in `MLLMBatchGenerator._process_prompts`
- the merge path currently accepts `KVCache` only
- `RotatingKVCache` is a sibling of `KVCache`, not a subclass

This patch does two things:
- accepts either `KVCache` or `RotatingKVCache` in the prompt merge
- adds a test that runs `_process_prompts()` with a real `RotatingKVCache` and asserts the merged cache is a `BatchRotatingKVCache`

Validation:
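Since `RotatingKVCache` is a sibling rather than a subclass, a plain `isinstance(cache, KVCache)` check rejects it. A minimal sketch of the widened guard, using empty stand-in classes instead of the real mlx-lm types:

```python
# Stand-ins: in mlx-lm, RotatingKVCache does NOT inherit from KVCache,
# which is exactly why the single-class isinstance check fails.
class KVCache:
    pass

class RotatingKVCache:
    pass

def can_merge(cache) -> bool:
    # Before the patch: isinstance(cache, KVCache) -> False for a
    # RotatingKVCache. Passing a tuple of types accepts either sibling.
    return isinstance(cache, (KVCache, RotatingKVCache))

print(can_merge(KVCache()))          # True
print(can_merge(RotatingKVCache()))  # True, previously rejected
```

The tuple form of `isinstance` keeps the guard explicit without forcing an artificial common base class on the two cache types.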
- `python -m pytest tests/test_mllm_continuous_batching.py -q` -> 24 passed, 3 deselected
- `python -m black --check --target-version py312 vllm_mlx/mllm_batch_generator.py tests/test_mllm_continuous_batching.py`

If you prefer, this can be cherry-picked directly onto #256 rather than merged as a standalone PR.