
scheduler: preserve prompt checkpoints in chunked prefill resume path #221

Merged
Thump604 merged 5 commits into waybarrios:main from computor-org:fix/chunked-prefill-prompt-checkpoints-upstream on Apr 10, 2026

Conversation

@krystophny (Contributor) commented Mar 24, 2026

Summary

Preserve prompt_checkpoints across chunked-prefill partial resume and finalization for mlx-lm prompt tuples, and honor the upstream prompt_checkpoint_callback contract.

Why

Recent mlx-lm prompt tuples carry a seventh prompt_checkpoints field. Upstream one-line fixes cover the immediate unpack crash in _chunked_next, but the chunked-prefill monkeypatch also needs to retain that field when it stores partial progress and resumes generation later.
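
For context, a minimal sketch of a length-tolerant unpack, assuming only what the paragraph above states (recent tuples gain a seventh prompt_checkpoints field); the other field contents are placeholders, not the actual mlx-lm tuple layout:

```python
# Sketch only: of the tuple layout, just the seventh prompt_checkpoints
# field is taken from this PR's description; everything else is a
# placeholder.
def unpack_prompt_entry(entry):
    if len(entry) == 7:
        # Recent mlx-lm: the seventh field carries prompt_checkpoints.
        *head, prompt_checkpoints = entry
    else:
        # Older mlx-lm: six fields, no checkpoint information.
        head, prompt_checkpoints = list(entry), None
    return head, prompt_checkpoints
```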

Additionally, the upstream BatchGenerator invokes prompt_checkpoint_callback after cache finalization. The chunked-prefill monkeypatch was missing this callback invocation, breaking the checkpoint contract.

What changed

  • update the chunked-prefill scheduler monkeypatch to accept 7-field prompt tuples
  • preserve prompt-checkpoint state through partial-prefill resume and finalization
  • import _lazy_extract_cache from mlx-lm and invoke prompt_checkpoint_callback at the correct semantic boundary: after c.finalize(), before the checkpoint tail model call (see the sketch after this list)
  • add regression coverage for partial chunked-prefill with prompt checkpoints, callback firing, and prompt_checkpoint > 1 tail replay
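
A minimal sketch of that ordering, with assumed argument lists and attribute names (the real vllm_mlx/scheduler.py code will differ, and where _lazy_extract_cache fits into the call is omitted here):

```python
# Ordering sketch only; every name except prompt_checkpoint_callback
# and c.finalize() is assumed, not taken from the source.
def finish_chunked_prefill(c, model, tokens, checkpoint, uid,
                           prompt_checkpoint_callback):
    c.finalize()  # 1. finalize the chunked-prefill cache first
    if prompt_checkpoint_callback is not None and checkpoint > 0:
        # 2. upstream contract: fire after finalization, before the
        #    checkpoint tail is fed back through the model
        prompt_checkpoint_callback(uid, checkpoint)
    # 3. checkpoint tail: replay tokens 0..checkpoint-1 so the cache is
    #    positioned correctly before generation resumes
    return model(tokens[None, :checkpoint], cache=c.cache)
```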

Status

  • refreshed onto current upstream main (b4fa030) on 2026-04-09
  • added inline clarification for the upstream prompt_checkpoint normalization and the post-finalize() checkpoint-tail replay
  • added a regression that exercises the prompt_checkpoint > 1 tail replay path the review called out

Files to review

  • vllm_mlx/scheduler.py
  • tests/test_batching.py

Validation

  • python -m pytest tests/test_batching.py::TestSchedulerBasic -q -> 10 passed

@krystophny marked this pull request as draft on March 24, 2026 19:08
@krystophny (Contributor, Author) commented:

Critical review found a blocking scheduler issue: the chunked-prefill path still does not invoke prompt_checkpoint_callback, so it does not preserve the upstream checkpoint contract yet. I am leaving this PR in draft until that callback path is fixed.

@Thump604 (Collaborator) commented:

This is the right approach. We had #194 open for the same crash (one-line unpack fix) but just closed it -- your PR is the only one of the four submissions that actually preserves checkpoint semantics through the chunked prefill lifecycle instead of silently discarding them.

The checkpoint-aware finalization step (feeding remaining checkpoint tokens through the model before generation) is what the others all miss.

Happy to test once the callback gap is resolved. We run chunked prefill on Qwen3.5-122B (M2 Ultra 128GB) with 2048-token prefill steps.

@Thump604 (Collaborator) left a review:

The PR correctly addresses the callback gap that was flagged. The fix preserves prompt_checkpoints through partial resume and invokes the callback after finalization, which the prior submissions missed. The semantic changes to checkpoint-aware bounds checking are sound.

However, I have three concrete concerns before this merges:

  1. Test coverage is incomplete. test_chunked_prefill_accepts_prompt_checkpoints constructs a 7-tuple but never exercises the actual zip(*batch_prompts) unpack on line 414. test_chunked_prefill_invokes_checkpoint_callback doesn't validate the "checkpoint tail" model execution (lines 313-318), which is the most complex part of the fix. A test with checkpoint > 1 initially (not just in resume) would strengthen confidence.

  2. Line 425 has a suspicious defensive formula: (ln - pc if pc > 0 else -pc). Checkpoint offsets should always be positive, so why the negation when pc <= 0? This deserves a comment explaining the intent: is this dead code, or does mlx-lm ever produce zero/negative checkpoints? (The two branches are restated in the sketch after this list.)

  3. The checkpoint tail replay (lines 313-318) feeds tokens 0..checkpoint-1 through the model after finalize(). Is this the intended semantics? The comment on line 303 says "once only the checkpoint tail remains" but then we immediately replay it. If the cache already saw these tokens during chunked processing, feeding them again seems redundant. I need clarification: does finalize() clear cache state, or just mark it finalized? If it clears, the replay is necessary; if not, it may be redundant computation.
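
To make the two branches in point 2 concrete, a restatement of the expression with the names used above (this is not the scheduler source):

```python
def remaining_after_checkpoint(ln: int, pc: int) -> int:
    # Restatement of the flagged expression from the review,
    # not the actual vllm_mlx/scheduler.py code.
    return ln - pc if pc > 0 else -pc

# pc > 0: tokens remaining past the checkpoint, the expected case.
assert remaining_after_checkpoint(100, 8) == 92
# pc <= 0: evaluates to -pc, which is only meaningful if mlx-lm can
# actually emit zero or negative checkpoint offsets.
assert remaining_after_checkpoint(100, 0) == 0
assert remaining_after_checkpoint(100, -4) == 4
```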

The implementation is not blocking — all three can be addressed post-merge. But they should be addressed before production use on 122B at 2048-token prefill steps. Otherwise this is good work fixing a real checkpoint contract violation.


@Thump604 (Collaborator) commented Apr 8, 2026

@waybarrios, @krystophny: status note.

The PR currently shows a CONFLICTING merge status and likely needs a rebase onto current main, since the chunked-prefill code path has had other changes since this branch was created. The fix itself addresses a real upstream contract: recent mlx-lm prompt tuples carry a seventh prompt_checkpoints field that the chunked-prefill monkeypatch needs to retain across partial resume.

This PR is likely the right fix for issue #155 (omprashantjain, "Chunked prefill crashes with mlx-lm >= 0.31.0"), which describes the symptom of the same underlying mlx-lm format change.

The upstream BatchGenerator contract requires prompt_checkpoint_callback to fire after cache finalization, before the checkpoint tail model call. The chunked-prefill monkeypatch preserved the checkpoint field but never invoked the callback, breaking the upstream checkpoint contract.

Wire _lazy_extract_cache from mlx-lm and invoke the callback at the correct semantic boundary. Add a regression test verifying the callback fires with the correct uid and checkpoint offset.
@krystophny force-pushed the fix/chunked-prefill-prompt-checkpoints-upstream branch from f0a3b47 to a1d566b on April 9, 2026 06:35
@krystophny (Contributor, Author) commented:

Force-pushed a refresh onto current upstream main (b4fa030). I addressed the three concrete review points in the refresh: added inline clarification for the upstream checkpoint normalization, added an inline comment on the post-finalize() checkpoint-tail replay, and added a regression that exercises the prompt_checkpoint > 1 tail replay path. Validation: python -m pytest tests/test_batching.py::TestSchedulerBasic -q -> 10 passed.
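
For anyone skimming the diff, a hypothetical outline of what the new callback regression asserts; make_scheduler and run_until_prefill_complete are invented stand-ins, not the real fixtures in tests/test_batching.py:

```python
# Hypothetical test shape; the helper names are invented stand-ins for
# the project's real test fixtures.
def test_checkpoint_callback_fires_after_finalize():
    fired = []

    def on_checkpoint(uid, checkpoint):
        fired.append((uid, checkpoint))

    scheduler = make_scheduler(prompt_checkpoint_callback=on_checkpoint)
    scheduler.submit(uid="req-1", tokens=list(range(64)),
                     prompt_checkpoints=[8])
    scheduler.run_until_prefill_complete()

    # Exactly one callback, with the request uid and checkpoint offset.
    assert fired == [("req-1", 8)]
```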

@krystophny force-pushed the fix/chunked-prefill-prompt-checkpoints-upstream branch from 2ce37be to b80073a on April 9, 2026 08:27
@Thump604 (Collaborator) left a review:

Refresh confirmed on head b80073a against upstream main b4fa030. The three concrete review points from the prior COMMENTED review are addressed:

  1. Inline clarification for the upstream checkpoint normalization is visible in vllm_mlx/scheduler.py.
  2. Inline comment on the post-finalize() checkpoint-tail replay path is visible.
  3. tests/test_batching.py gains +358 lines of regression coverage, including the prompt_checkpoint > 1 tail replay path that the prior test construction never exercised.

The chunked prefill path now invokes prompt_checkpoint_callback after finalization, which was the blocking issue flagged on the earlier revision. Checkpoint semantics through partial resume match the upstream contract: recent mlx-lm prompt tuples carry a seventh prompt_checkpoints field that the chunked-prefill monkeypatch now retains across partial resume.

CI green across lint, type-check, test-matrix 3.10-3.12, test-apple-silicon, tests. Approved. This is the right fix for issue #155 ("Chunked prefill crashes").

@Thump604 merged commit 8767fe6 into waybarrios:main on Apr 10, 2026
7 checks passed