fix: patch Gemma 4 attention and RotatingKVCache for BatchKVCache #256
Conversation
Benchmark: vllm-mlx vs LM Studio — Gemma 4 26B-A4B (4-bit)

Tested on Apple M3 Ultra 256GB, same model. vllm-mlx is 10.4% faster than LM Studio on single-request throughput.
Batch throughput benchmark: vllm-mlx vs LM Studio

Same setup (M3 Ultra 256GB, Gemma 4 26B-A4B 4-bit), 300 max tokens per request, temperature=0. LM Studio saturates around ~134 tok/s at 4+ concurrent requests. vllm-mlx with continuous batching scales linearly — 2x faster at 8 concurrent requests.
Gemma 4 31B dense model tested — patch works correctly. Deployed; test results: 7/7 passed.
Throughput
Config:

```shell
vllm-mlx serve mlx-community/gemma-4-31b-it-4bit \
  --port 1239 --host 0.0.0.0 \
  --continuous-batching --max-num-seqs 8 \
  --cache-memory-mb 8192 --max-tokens 131072 \
  --enable-auto-tool-choice --tool-call-parser gemma4 \
  --mllm
```

This confirms the patch handles both MoE (26B-A4B) and dense (31B) Gemma 4 architectures correctly.
@waybarrios, @janhilgard: cross-reference note. The BatchKVCache offset bug you describe in section 1 has the same root cause as a bug I fixed at the mlx_vlm layer. The two fixes are complementary: yours patches the BatchKVCache layer in vllm_mlx, mine patches the per-attention-call usage in mlx_vlm. Both layers can hit the same root cause depending on which path is exercised. Mergeable on current main per the PR JSON. Sound fix on the BatchKVCache side.
@Thump604 Thanks for the cross-reference and confirmation — really helpful to have independent validation of the root cause. Agreed the two fixes are complementary. Since vllm-mlx doesn't control the upstream mlx_vlm code, the runtime monkey-patch is the reliable way to guarantee the fix on this side; ideally the defensive snapshot also lands upstream in mlx_vlm. Worth noting this bug isn't Gemma 4-specific — any mlx_vlm model that captures cache.offset into a local variable before update_and_fetch() is exposed to it.
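To make the hazard concrete, here is a toy sketch of why capturing `cache.offset` before `update_and_fetch()` reads the post-mutation value. `FakeArrayOffset` and `FakeCache` are illustrative stand-ins, not mlx or mlx_vlm code:

```python
class FakeArrayOffset:
    """Toy stand-in for an mx.array scalar: += mutates the object in place."""
    def __init__(self, value):
        self.value = value
    def __iadd__(self, other):
        self.value += other       # in-place mutation, like mx.array.__iadd__
        return self
    def __int__(self):
        return self.value

class FakeCache:
    """Toy cache mimicking KVCache-style offset bookkeeping."""
    def __init__(self):
        self.offset = FakeArrayOffset(0)
    def update_and_fetch(self, n_tokens):
        self.offset += n_tokens   # advances the shared offset object
        return int(self.offset)

# Buggy pattern: capture the offset *object*, then call update_and_fetch().
cache = FakeCache()
alias = cache.offset              # alias, not a copy
cache.update_and_fetch(1)
buggy_rope_position = int(alias)  # reads 1; RoPE wanted the pre-update 0

# Fixed pattern: snapshot the scalar value before the cache mutation.
cache = FakeCache()
snapshot = int(cache.offset)      # defensive copy of the value
cache.update_and_fetch(1)
assert snapshot == 0 and buggy_rope_position == 1
```

The defensive `int(...)` snapshot is the whole fix: once the scalar is copied out, the cache advancing its own offset mid-call can no longer shift the RoPE position.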
Thump604
left a comment
Both fixes are the right call together. Landing the runtime monkey-patch here guarantees correct behavior against any mlx_vlm version a user might have installed, including older releases that don't yet carry the upstream defensive copy. The mlx-vlm#966 fix (now merged) handles the forward path for fresh installs.
Good catch that this isn't Gemma 4-specific — any mlx_vlm attention implementation that captures cache.offset into a local variable before update_and_fetch() has the same race against BatchKVCache.offset being advanced mid-call. I'd been treating it as a Gemma 4 bug, but your framing as a general mlx_vlm pattern-level hazard is more accurate. A follow-up audit of the other mlx_vlm model attention implementations for the same pattern would be worth filing as a separate issue on Blaizzy/mlx-vlm, tagged for coordination.
Approving.
…for BatchKVCache
PR waybarrios#256 added a trim-before-merge fix for RotatingKVCache in the MLLM continuous-batching path, but left the isinstance guard above it unchanged to require KVCache specifically. RotatingKVCache is a sibling of KVCache under _BaseCache in mlx_lm, not a subclass, so the guard always fires before the trim logic can run. Gemma 4 uses sliding-window attention and returns RotatingKVCache natively, so --mllm --continuous-batching fails at first inference with: ValueError: MLLM continuous batching requires standard KVCache but got RotatingKVCache.

Relax the guard to accept both KVCache and RotatingKVCache. QuantizedKVCache (the original rejection target of the guard) is still rejected correctly because it's not in the isinstance tuple. The downstream .merge() call works for both cache types — RotatingKVCache produces a BatchRotatingKVCache, KVCache produces a BatchKVCache — both of which the BatchedEngine handles.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thanks for this @janhilgard — the BatchKVCache offset snapshot and the extra EOS tokens from generation_config.json both work as described. While integrating this on top of #254 for a production pm_peng workload I hit one issue I wanted to flag: the isinstance guard above the trim loop still requires `KVCache` specifically:

```python
sample_cache = per_request_caches[0][0]
if not isinstance(sample_cache, KVCache):
    raise ValueError(
        f"MLLM continuous batching requires standard KVCache but got "
        f"{type(sample_cache).__name__}. Disable --kv-cache-quantization "
        f"when using multimodal models with --continuous-batching."
    )

# Fix: RotatingKVCache._update_concat does NOT trim on first call —
# ...
for rc in per_request_caches:
    for layer_cache in rc:
        if isinstance(layer_cache, RotatingKVCache):  # ← never reached
            ...
```

In mlx_lm, `RotatingKVCache` is a sibling of `KVCache` under `_BaseCache`, not a subclass, so the guard raises before the trim loop ever runs. Proposed one-line fix — relax the guard to accept both cache types:

```python
if not isinstance(sample_cache, (KVCache, RotatingKVCache)):
    raise ValueError(
        f"MLLM continuous batching requires standard KVCache or "
        f"RotatingKVCache but got {type(sample_cache).__name__}. "
        f"Disable --kv-cache-quantization when using multimodal "
        f"models with --continuous-batching."
    )
```

After this one-line change on top of your PR + #254, I'm getting clean end-to-end behavior.
The rest of your patch works exactly as described — the attention snapshot fix, the channel reasoning parser, and the extra EOS tokens all do their jobs. Just the guard above the trim loop needs relaxing. Happy to open a separate PR for the one-liner if that's easier, or you can fold it into this PR — whichever you prefer. Reference commit on my fork: keegoid@42b85db
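For completeness, the sibling relationship is easy to see with stub classes. These mirror the mlx_lm cache class names but are empty stand-ins, not the real implementations:

```python
# Stub hierarchy mirroring mlx_lm cache class names (illustrative only).
class _BaseCache: ...
class KVCache(_BaseCache): ...
class RotatingKVCache(_BaseCache): ...   # sibling of KVCache, not a subclass
class QuantizedKVCache(_BaseCache): ...

rot = RotatingKVCache()

# Original guard fires: sibling classes don't satisfy isinstance.
assert not isinstance(rot, KVCache)

# Relaxed guard accepts both siblings and still rejects QuantizedKVCache.
assert isinstance(rot, (KVCache, RotatingKVCache))
assert not isinstance(QuantizedKVCache(), (KVCache, RotatingKVCache))
```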
I took @keegoid's report and packaged it as a minimal follow-up PR: #273. What it does:
Validation on the follow-up branch:
If you'd rather keep the fix inside this PR, #273 should be easy to cherry-pick.
@keegoid Great catch — you're absolutely right: the guard requires KVCache specifically, while RotatingKVCache is a sibling of KVCache under _BaseCache, so it fires before the trim loop can run. I've folded the fix into this PR: the guard now accepts both KVCache and RotatingKVCache, and QuantizedKVCache is still rejected.

@Thump604 Thanks for packaging the fix as #273 — since it's now included here, #273 can be closed if you prefer.
- Fix BatchKVCache offset bug: mx.array.__iadd__ mutates in-place, causing incorrect RoPE positions and token repetition
- Fix RotatingKVCache.max_size returning mx.array instead of int
- Add Gemma 4 reasoning parser (--reasoning-parser gemma4)
- Read additional EOS tokens from generation_config.json
- Fix RotatingKVCache prefix cache extraction (negative left_padding)
- Relax isinstance guard to accept RotatingKVCache for sliding-window models like Gemma 4 (fixes ValueError on continuous batching)
- Remove unused _make_batch_cache() dead code
- Fix Anthropic endpoint JSON parsing for clients sending invalid escape sequences (e.g. \s, \d in regex patterns within tool defs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
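The Anthropic JSON-parsing fix in the last bullet can be sketched roughly like this. `loads_lenient` and its sanitizing regex are assumptions about the approach, not the PR's actual code:

```python
import json
import re

# Hypothetical lenient loader: on an invalid-escape failure, double any
# backslash that does not begin a valid JSON escape, then retry.
def loads_lenient(text):
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        sanitized = re.sub(r'\\(?!["\\/bfnrtu])', r'\\\\', text)
        return json.loads(sanitized)

# A tool definition carrying raw regex escapes that strict JSON rejects.
raw = '{"pattern": "\\d+\\s*"}'
assert loads_lenient(raw) == {"pattern": "\\d+\\s*"}

# Valid escapes are untouched.
assert loads_lenient('{"msg": "a\\nb"}') == {"msg": "a\nb"}
```

The negative lookahead keeps legitimate escapes (`\n`, `\"`, `\uXXXX`, ...) intact and only doubles lone backslashes, which is the behavior json.loads requires per RFC 8259.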
Hey @janhilgard I rebased this on main after merging #268. The overlapping files (reasoning parser, mllm_scheduler, patches, etc.) resolved cleanly since #268 already covered those. What's left in this PR after the rebase is the stuff #268 didn't touch: the Anthropic thinking blocks, RotatingKVCache support in memory_cache, and the BatchRotatingKVCache extract fix.
Tests pass, black is clean. This looks good to merge as the Anthropic API complement to #268.
waybarrios
left a comment
Rebased on main, tests pass, black clean. The remaining changes (Anthropic thinking blocks, RotatingKVCache in memory_cache, BatchRotatingKVCache extract fix) are solid and complement #268 well.
Summary
Fixes for Gemma 4 models running with `--mllm` and continuous batching, plus Anthropic endpoint improvements:

1. BatchKVCache offset bug fix

- `mx.array.__iadd__` (`+=`) mutates in place, so reading `cache.offset` before `update_and_fetch()` gives the wrong value (offset+1 instead of offset)
- Snapshot `cache.offset` before the cache mutation
- Fix `RotatingKVCache.max_size` returning `mx.array` instead of `int`

2. Gemma 4 reasoning parser (`--reasoning-parser gemma4`)

- Parses `<|channel>` (token 100) / `<channel|>` (token 101) into the `reasoning_content` field

3. MLLM stop tokens from generation_config.json

- `tokenizer.eos_token_id` only returns the primary EOS (token 1, `<eos>`)
- `generation_config.json` defines additional EOS tokens: `<turn|>` (106), `<|tool_response>` (50)
- `_get_stop_tokens()` now also reads `generation_config.json`
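A minimal sketch of the stop-token merge described in section 3. `get_stop_tokens` is a hypothetical helper; the config layout follows the standard Hugging Face generation_config.json shape:

```python
import json

# Hypothetical helper: merge the tokenizer's primary EOS with any extra
# EOS ids declared in generation_config.json.
def get_stop_tokens(primary_eos, config_text):
    stop = {primary_eos}
    eos = json.loads(config_text).get("eos_token_id", [])
    if isinstance(eos, int):      # HF allows a bare int or a list here
        eos = [eos]
    stop.update(eos)
    return stop

# Gemma-style config: primary <eos> (1) plus <turn|> (106) and <|tool_response> (50)
stops = get_stop_tokens(1, '{"eos_token_id": [1, 106, 50]}')
assert stops == {1, 50, 106}
```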
4. RotatingKVCache prefix cache fix (sliding window models)

Problem: models with `RotatingKVCache` layers (Gemma 4 uses sliding window attention with `max_size=1024` on 25/30 layers) crash with `ValueError: [full] Negative dimensions not allowed` when the prefix cache attempts exact-hit or LCP-match restoration.

Root cause: during generation with a full rotating buffer, `_update_in_place` decrements `left_padding` by `S` on each step. After N generation tokens, `left_padding = -N`. `extract()` then uses this as a slice start, truncating the 1024-entry buffer. The truncated cache violates RotatingKVCache's invariant: when `offset >= max_size`, the buffer MUST be full.

Fix:

- `MLLMBatch.extract_cache()`: custom extraction for `BatchRotatingKVCache` that clamps `left_padding` to `>= 0`
- `_trim_cache_offset()`: handle RotatingKVCache circular-buffer trimming
- `_QuantizedCacheWrapper`: type-preserving quantization wrapper
- `_quantize_cache()`: skip RotatingKVCache (rotation state can't survive quantize/dequantize)
- `_dequantize_cache()`: deep-copy all cache layers to prevent model mutations from corrupting stored entries

5. Anthropic endpoint JSON escape handling

- Some clients send invalid escape sequences (e.g. `\s`, `\d` in regex patterns within tool definitions)
- `json.loads` is strict per RFC 8259 and rejects these
- Catch `JSONDecodeError` with "Invalid \escape" and sanitize lone backslashes

6. Strip billing header for prefix cache (13x speedup!)

- An `x-anthropic-billing-header: ...cch=HASH...` line gets injected into the system prompt
- The `cch=` hash changes with every request, causing token sequences to diverge at position ~40
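As a minimal illustration of the `left_padding` clamp from section 4 above, with a plain Python list standing in for the rotating KV buffer:

```python
# Toy stand-in for a full rotating KV buffer of max_size=1024 entries.
buffer = list(range(1024))

# Root cause: after 3 generation steps left_padding has gone negative,
# and using it as a slice start keeps only the last 3 entries.
left_padding = -3
truncated = buffer[left_padding:]
assert len(truncated) == 3        # buffer silently truncated

# Fix: clamp left_padding to >= 0 before slicing, preserving the buffer.
preserved = buffer[max(left_padding, 0):]
assert len(preserved) == 1024
```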
Files changed

- `vllm_mlx/patches/gemma4_mllm.py` — runtime monkey-patch for Gemma 4 attention + RotatingKVCache
- `vllm_mlx/mllm_batch_generator.py` — register the patch, RotatingKVCache prefix cache extraction
- `vllm_mlx/memory_cache.py` — RotatingKVCache support in trim/quantize/dequantize
- `vllm_mlx/reasoning/gemma4_parser.py` — new reasoning parser for `<|channel>...<channel|>` tags
- `vllm_mlx/reasoning/__init__.py` — register gemma4 parser
- `vllm_mlx/mllm_scheduler.py` — read additional EOS tokens from generation_config.json
- `vllm_mlx/server.py` — Anthropic JSON escape fix
- `vllm_mlx/api/anthropic_adapter.py` — strip billing header from system prompt
Test plan

- `--reasoning-parser gemma4` correctly separates thinking from content
- Responses stop at `<turn|>` (token 106) — no garbage after the response
- Anthropic endpoint handles invalid escapes (e.g. `\s` in tool schemas)

🤖 Generated with Claude Code