
fix(specprefill): avoid dense tail expansion within cache window#291

Merged
janhilgard merged 1 commit into waybarrios:main from Thump604:codex/specprefill-rotating-window on Apr 12, 2026

Conversation

@Thump604
Collaborator

Summary

  • avoid forcing full-tail inclusion for RotatingKVCache when the prompt still fits inside the cache window
  • add a regression test that pins the no-eviction vs eviction boundary

Why

On the Nemotron 120B live runtime path we cap unbounded attention caches with max_kv_size=65536, which means the SpecPrefill target cache is a RotatingKVCache even for prompts that still fit entirely inside the window. The previous sparse-prefill logic treated any RotatingKVCache as if eviction were already active and unioned in the full max_size tail. At 64K this effectively collapsed sparse prefill back into dense target work.
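The fixed guard can be sketched as follows. This is a minimal illustration, not the actual vllm_mlx code; the function and parameter names are hypothetical.

```python
def positions_to_prefill(selected, prompt_len, max_rotating_size):
    """Sketch of the fixed guard for a RotatingKVCache target cache.

    `selected` holds the sparse-prefill token positions. Only when the
    prompt is longer than the rotating window (i.e. eviction is already
    active) do we union in the dense tail of the last `max_rotating_size`
    positions; otherwise the sparse selection is returned unchanged.
    All names here are hypothetical, for illustration only.
    """
    if prompt_len > max_rotating_size:
        # Eviction active: keep the rotating window's tail dense.
        tail = range(prompt_len - max_rotating_size, prompt_len)
        return sorted(set(selected) | set(tail))
    # Prompt still fits in the window: no forced tail expansion.
    return sorted(set(selected))
```

Before the fix, the tail union happened for every RotatingKVCache, so a 64K prompt inside a 64K window degenerated back to dense prefill.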

Validation

  • pytest tests/test_specprefill_rotating_cache.py -q
  • black --check --fast vllm_mlx/specprefill.py tests/test_specprefill_rotating_cache.py
  • python -m py_compile vllm_mlx/specprefill.py tests/test_specprefill_rotating_cache.py

Live runtime evidence from the promoted stack:

  • Nemotron 64K clean cold rerun before fix: dense 271.98s, sparse 395.37s
  • Nemotron 64K clean cold rerun after fix: dense 277.89s, sparse 143.90s
  • Nemotron 16K clean cold rerun after fix: dense 68.48s, sparse 36.26s
  • recall stayed correct in all fixed runs


@janhilgard left a comment


LGTM. Clean one-line fix with clear regression tests.

The logic is correct: when M <= max_rotating_size, no eviction can happen yet, so forcing the full tail back in just collapses sparse prefill into dense work. The M > max_rotating_size guard is the right condition.

The Nemotron 64K benchmarks speak for themselves — 2x speedup over dense after the fix, vs being slower than dense before.
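The boundary the regression test pins could look something like this. This is a hypothetical pytest-style sketch; `eviction_active` is an illustrative stand-in for the real check, not the actual test file.

```python
def eviction_active(prompt_len, max_rotating_size):
    """Illustrative stand-in for the guard: eviction has begun only
    once the prompt has overflowed the rotating window."""
    return prompt_len > max_rotating_size

def test_no_eviction_vs_eviction_boundary():
    window = 65536
    # At exactly the window size nothing has been evicted yet,
    # so sparse prefill must not be forced back to dense work.
    assert not eviction_active(window, window)
    # One token past the window, the dense tail must be retained.
    assert eviction_active(window + 1, window)
```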

@janhilgard merged commit 9d69467 into waybarrios:main on Apr 12, 2026
7 checks passed
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Apr 16, 2026
…hunks

Add mx.synchronize() after mx.eval() in the target model sparse prefill
loop. This drains the Metal GPU command buffer backlog between chunks,
preventing the macOS GPU watchdog (kIOGPUCommandBufferCallbackErrorImpactingInteractivity)
from firing on long-context sparse prefill.

Root cause: models using manual RoPE paths (e.g. Gemma 4
ProportionalRoPE) generate ~20 MLX kernel dispatches per attention layer
per chunk. At 256K tokens with chunk_size=256 (~500 chunks × 35 layers),
cumulative dispatch pressure exceeds the Metal watchdog threshold.
mx.synchronize() forces a GPU-CPU sync that drains the backlog.

Also restores Gemma 4 SpecPrefill support (ProportionalRoPE query
extraction and position-mapped sparse prefill) that was inadvertently
dropped in upstream PR waybarrios#291.

Tested on M2 Ultra 128GB across all model families:
  - Qwen 3.5 27B: 164K tokens, no crash (standard RoPE)
  - Nemotron 120B: 184K tokens, no crash (no RoPE)
  - Gemma 4 26B:  260K tokens, no crash (ProportionalRoPE — previously
    crashed at 256K without synchronize)

The blanket specprefill_max_input=131072 cap is no longer needed and
has been removed.
