
fix(specprefill): avoid dense tail expansion within cache window#291

Merged
janhilgard merged 1 commit into waybarrios:main from Thump604:codex/specprefill-rotating-window on Apr 12, 2026

Conversation

@Thump604
Collaborator

Summary

  • avoid forcing full-tail inclusion for RotatingKVCache when the prompt still fits inside the cache window
  • add a regression test that pins the no-eviction vs eviction boundary

Why

On the Nemotron 120B live runtime path we cap unbounded attention caches with max_kv_size=65536, which means the SpecPrefill target cache is a RotatingKVCache even for prompts that still fit entirely inside the window. The previous sparse-prefill logic treated any RotatingKVCache as if eviction were already active and unioned in the full max_size tail. At 64K this effectively collapsed sparse prefill back into dense target work.
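The fixed guard can be sketched as follows. This is a minimal illustration, not the actual vllm_mlx code; the function and parameter names are hypothetical.

```python
def positions_to_prefill(selected, prompt_len, max_rotating_size):
    """Sketch of the fixed guard for a RotatingKVCache target cache.

    `selected` holds the sparse-prefill token positions. Only when the
    prompt is longer than the rotating window (i.e. eviction is already
    active) do we union in the dense tail of the last `max_rotating_size`
    positions; otherwise the sparse selection is returned unchanged.
    All names here are hypothetical, for illustration only.
    """
    if prompt_len > max_rotating_size:
        # Eviction active: keep the rotating window's tail dense.
        tail = range(prompt_len - max_rotating_size, prompt_len)
        return sorted(set(selected) | set(tail))
    # Prompt still fits in the window: no forced tail expansion.
    return sorted(set(selected))
```

Before the fix, the tail union happened for every RotatingKVCache, so a 64K prompt inside a 64K window degenerated back to dense prefill.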

Validation

  • pytest tests/test_specprefill_rotating_cache.py -q
  • black --check --fast vllm_mlx/specprefill.py tests/test_specprefill_rotating_cache.py
  • python -m py_compile vllm_mlx/specprefill.py tests/test_specprefill_rotating_cache.py

Live runtime evidence from the promoted stack:

  • Nemotron 64K clean cold rerun before fix: dense 271.98s, sparse 395.37s
  • Nemotron 64K clean cold rerun after fix: dense 277.89s, sparse 143.90s
  • Nemotron 16K clean cold rerun after fix: dense 68.48s, sparse 36.26s
  • recall stayed correct in all fixed runs


@janhilgard left a comment


LGTM. Clean one-line fix with clear regression tests.

The logic is correct: when M <= max_rotating_size, no eviction can happen yet, so forcing the full tail back in just collapses sparse prefill into dense work. The M > max_rotating_size guard is the right condition.

The Nemotron 64K benchmarks speak for themselves — 2x speedup over dense after the fix, vs being slower than dense before.
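The boundary the regression test pins could look something like this. This is a hypothetical pytest-style sketch; `eviction_active` is an illustrative stand-in for the real check, not the actual test file.

```python
def eviction_active(prompt_len, max_rotating_size):
    """Illustrative stand-in for the guard: eviction has begun only
    once the prompt has overflowed the rotating window."""
    return prompt_len > max_rotating_size

def test_no_eviction_vs_eviction_boundary():
    window = 65536
    # At exactly the window size nothing has been evicted yet,
    # so sparse prefill must not be forced back to dense work.
    assert not eviction_active(window, window)
    # One token past the window, the dense tail must be retained.
    assert eviction_active(window + 1, window)
```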

@janhilgard merged commit 9d69467 into waybarrios:main on Apr 12, 2026
7 checks passed
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Apr 16, 2026
…hunks

Add mx.synchronize() after mx.eval() in the target model sparse prefill
loop. This drains the Metal GPU command buffer backlog between chunks,
preventing the macOS GPU watchdog (kIOGPUCommandBufferCallbackErrorImpactingInteractivity)
from firing on long-context sparse prefill.

Root cause: models using manual RoPE paths (e.g. Gemma 4
ProportionalRoPE) generate ~20 MLX kernel dispatches per attention layer
per chunk. At 256K tokens with chunk_size=256 (~500 chunks × 35 layers),
cumulative dispatch pressure exceeds the Metal watchdog threshold.
mx.synchronize() forces a GPU-CPU sync that drains the backlog.

Also restores Gemma 4 SpecPrefill support (ProportionalRoPE query
extraction and position-mapped sparse prefill) that was inadvertently
dropped in upstream PR waybarrios#291.

Tested on M2 Ultra 128GB across all model families:
  - Qwen 3.5 27B: 164K tokens, no crash (standard RoPE)
  - Nemotron 120B: 184K tokens, no crash (no RoPE)
  - Gemma 4 26B:  260K tokens, no crash (ProportionalRoPE — previously
    crashed at 256K without synchronize)

The blanket specprefill_max_input=131072 cap is no longer needed and
has been removed.
