fix(specprefill): avoid dense tail expansion within cache window#291
Merged
janhilgard merged 1 commit into waybarrios:main on Apr 12, 2026
Conversation
janhilgard (Collaborator) approved these changes on Apr 12, 2026
LGTM. Clean one-line fix with clear regression tests.
The logic is correct: when M <= max_rotating_size, no eviction can happen yet, so forcing the full tail back in just collapses sparse prefill into dense work. The M > max_rotating_size guard is the right condition.
The Nemotron 64K benchmarks speak for themselves — 2x speedup over dense after the fix, vs being slower than dense before.
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request on Apr 16, 2026:
…hunks

Add mx.synchronize() after mx.eval() in the target model sparse prefill loop. This drains the Metal GPU command buffer backlog between chunks, preventing the macOS GPU watchdog (kIOGPUCommandBufferCallbackError ImpactingInteractivity) from firing on long-context sparse prefill.

Root cause: models using manual RoPE paths (e.g. Gemma 4 ProportionalRoPE) generate ~20 MLX kernel dispatches per attention layer per chunk. At 256K tokens with chunk_size=256 (~500 chunks × 35 layers), cumulative dispatch pressure exceeds the Metal watchdog threshold. mx.synchronize() forces a GPU-CPU sync that drains the backlog.

Also restores Gemma 4 SpecPrefill support (ProportionalRoPE query extraction and position-mapped sparse prefill) that was inadvertently dropped in upstream PR waybarrios#291.

Tested on M2 Ultra 128GB across all model families:
- Qwen 3.5 27B: 164K tokens, no crash (standard RoPE)
- Nemotron 120B: 184K tokens, no crash (no RoPE)
- Gemma 4 26B: 260K tokens, no crash (ProportionalRoPE — previously crashed at 256K without synchronize)

The blanket specprefill_max_input=131072 cap is no longer needed and has been removed.
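A minimal sketch of the drain described in that commit message, assuming a hypothetical chunked prefill loop: only `mx.eval()` and `mx.synchronize()` are real MLX calls; the surrounding names (`run_target_prefill`, `model`, `chunks`, `cache`) are illustrative, not the actual vllm_mlx code.

```python
import mlx.core as mx

def run_target_prefill(model, chunks, cache):
    """Hypothetical chunked target-model prefill loop (names illustrative).

    Evaluating each chunk and then synchronizing keeps the Metal
    command-buffer backlog from accumulating across hundreds of chunks,
    which is what trips the macOS GPU watchdog on long-context prompts.
    """
    logits = None
    for chunk in chunks:
        logits = model(chunk, cache=cache)
        mx.eval(logits)       # materialize this chunk's computation
        mx.synchronize()      # drain queued Metal command buffers (GPU-CPU sync)
    return logits
```

The per-chunk sync adds a small stall, but it bounds the number of in-flight command buffers, which matches the dispatch-pressure explanation in the commit message above.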
Summary
Skip the dense tail expansion for a `RotatingKVCache` when the prompt still fits inside the cache window.

Why
On the Nemotron 120B live runtime path we cap unbounded attention caches with `max_kv_size=65536`, which means the SpecPrefill target cache is a `RotatingKVCache` even for prompts that still fit entirely inside the window. The previous sparse-prefill logic treated any `RotatingKVCache` as if eviction were already active and unioned in the full `max_size` tail. At 64K this effectively collapsed sparse prefill back into dense target work.
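As a rough illustration of the guard (not the actual code in `vllm_mlx/specprefill.py`; the function and variable names below are assumed):

```python
def union_rotating_tail(selected_positions, prompt_len, max_rotating_size):
    """Pull the dense RotatingKVCache tail into the sparse selection only
    once the prompt has actually outgrown the cache window.

    For prompt_len <= max_rotating_size nothing has been evicted yet, so
    unioning the full tail would just collapse sparse prefill back into
    dense target work. (Illustrative sketch, not the real implementation.)
    """
    if prompt_len > max_rotating_size:
        tail_start = prompt_len - max_rotating_size
        selected_positions |= set(range(tail_start, prompt_len))
    return selected_positions
```

With `max_kv_size=65536`, a 64K prompt skips the union entirely and keeps only the sparsely selected tokens, which is where the speedup in the numbers below comes from.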
Validation

- `pytest tests/test_specprefill_rotating_cache.py -q`
- `black --check --fast vllm_mlx/specprefill.py tests/test_specprefill_rotating_cache.py`
- `python -m py_compile vllm_mlx/specprefill.py tests/test_specprefill_rotating_cache.py`

Live runtime evidence from the promoted stack:
- dense 271.98s, sparse 395.37s (before fix)
- dense 277.89s, sparse 143.90s (after fix)
- dense 68.48s, sparse 36.26s