
fix: compatibility with mlx_lm 0.31.0 prompt checkpoints#126

Merged
jundot merged 1 commit into jundot:main from rsnow:mlx-lm-0.31.0-compat-pr
Mar 10, 2026

Conversation

Contributor

@rsnow rsnow commented Mar 10, 2026

Closes #110

mlx_lm 0.31.0 added prompt checkpoint support to BatchGenerator, changing the insert() tuple arity from 6 to 7 fields and replacing hardcoded prefill boundaries with a variable prompt_checkpoint.

Changes to _BoundarySnapshotBatchGenerator._process_prompts:

- Accept the 7th tuple field (prompt_checkpoints) from insert()
- Compute the effective prompt_checkpoint matching upstream semantics
- Replace the hardcoded prefill split (1) with prompt_checkpoint in both the left-pad and right-pad paths
- Add prompt_checkpoint_callback support for upstream parity
- Add a defensive clamp in cache prepare to prevent negative lengths

Backward compatible: when prompt_checkpoints are not supplied (the default), prompt_checkpoint computes to 1 and all code paths behave identically to pre-patch.

Tested on an M3 Ultra 256GB with mlx_lm 0.31.0: chat completions, benchmarks, and continuous batching all working. Soaking since March 8.
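For illustration, the backward-compatible default described above might look like the following sketch. The field position and the choice of the last checkpoint are assumptions, not mlx_lm's actual internals; only the 6-vs-7 tuple arity and the default-to-1 behavior come from this PR.

```python
def effective_prompt_checkpoint(entry):
    """Sketch: derive the effective prompt checkpoint from an insert() tuple.

    mlx_lm 0.31.0 grew the insert() tuple from 6 to 7 fields; the 7th is
    assumed here to hold the prompt_checkpoints. Taking the last checkpoint
    is a hypothetical choice for illustration.
    """
    prompt_checkpoints = entry[6] if len(entry) == 7 else None
    # Default matches pre-patch behavior: a checkpoint of 1 reproduces the
    # old hardcoded prefill split, so legacy 6-field callers are unchanged.
    if not prompt_checkpoints:
        return 1
    return prompt_checkpoints[-1]
```

A pre-0.31.0 6-field entry, or a 7-field entry with no checkpoints supplied, falls back to the old split of 1.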

mlx_lm 0.31.0 added prompt checkpoint support to BatchGenerator,
changing insert() tuple arity from 6 to 7 fields and replacing
hardcoded prefill boundaries with a variable prompt_checkpoint.

Changes to _BoundarySnapshotBatchGenerator._process_prompts:

- Accept 7th tuple field (prompt_checkpoints) from insert()
- Compute effective prompt_checkpoint matching upstream semantics
- Replace hardcoded prefill split (1) with prompt_checkpoint in both
  left-pad and right-pad paths (loop bounds, last_inputs slice,
  cache prepare lengths)
- Add prompt_checkpoint_callback support for upstream parity
- Process remaining prompt_checkpoint-1 tokens before _step when
  checkpoint > 1, with VLM embed slicing
- Defensive clamp in cache prepare to prevent negative lengths
- Materialize checkpoint callback cache extracts (tuple vs generator)

When prompt_checkpoints are not supplied (default), prompt_checkpoint
computes to 1 and all code paths behave identically to pre-patch.
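The defensive clamp mentioned in the list above is small but worth spelling out; this is a hedged sketch with a hypothetical function name and signature, not the patch's actual code:

```python
def cache_prepare_length(prompt_len, prompt_checkpoint):
    """Sketch of the defensive clamp in cache prepare.

    Ensures the length handed to cache preparation is never negative,
    e.g. if a supplied checkpoint exceeds the remaining prompt length.
    """
    return max(0, prompt_len - prompt_checkpoint)
```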

Tested with mlx_lm 0.31.0 on M3 Ultra 256GB (bolo):
- Chat completions: working
- Benchmarks: working, ~15% gen speed improvement (Python 3.14)
- Continuous batching: 1.4x at 2x batch, stable
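The "materialize checkpoint callback cache extracts (tuple vs generator)" item can be sketched as follows. The helper name is hypothetical; the tuple-vs-generator concern is taken from the commit message:

```python
def materialize_extracts(extracts):
    """Sketch: cache extracts may arrive as a one-shot generator rather
    than a tuple. Materializing them into a tuple lets the checkpoint
    callback hold and re-read them safely."""
    if isinstance(extracts, tuple):
        return extracts
    return tuple(extracts)
```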
jundot merged commit dfe3bda into jundot:main on Mar 10, 2026
Owner

jundot commented Mar 10, 2026

Thanks for this, really appreciate the contribution! I verified the diff against upstream's _process_prompts and everything lines up correctly. The defensive max(0, l - prompt_checkpoint) clamp and the VLM embedding handling during checkpoint processing are nice touches that upstream doesn't need but omlx definitely does.

Merged and working well on my end. Also pushed a follow-up commit (6a9b264) on top of this to finish the rest of the 0.31.1 migration (presence/frequency penalty integration, dependency refs update).



Development

Successfully merging this pull request may close these issues.

Support for mlx_lm 0.31
