batch generator crashes with prompt_progress_callback on mlx-lm 0.31.x #294
waybarrios merged 2 commits into main
Conversation
The backport in f61d34e assumed internal BatchGenerator APIs that were refactored in mlx-lm 0.31.x. This breaks bench and serve for all users on v0.2.7.

Changes:
- Set prompt_progress_callback as an instance attribute instead of passing it to the BatchGenerator constructor (not a valid parameter)
- Guard _install_chunked_prefill with a hasattr check and log a warning when skipped (it relies on the removed _process_prompts and active_batch)
- Handle next() returning a (prompt_responses, generation_responses) tuple instead of a flat list
- Add a hasattr guard for active_batch in the periodic cache eval

Benchmark (Llama-3.2-1B-Instruct-4bit, mlx-lm 0.31.2):
- Total time: 2.38s
- Prompts: 10
- Prompts/second: 4.19
- Total prompt tokens: 80
- Total completion tokens: 960
- Total tokens: 1040
- Tokens/second: 402.52
- Throughput: 436.06 tok/s

Closes #293
Thump604 left a comment
I reviewed this against the current mlx-lm 0.31.x crash surface and I do not see a blocking issue. The constructor-argument fix, tuple-return handling, and active_batch guard address the reported failures, and degrading chunked prefill to a warning is a reasonable compatibility fallback for now.
Review
Overall a good fix — minimal, defensive, and backwards-compatible. A few notes:
Backwards compatibility verification
Checked against `mlx-lm==0.31.1` (our production):
- `BatchGenerator.__init__` in 0.31.1 still accepts `prompt_progress_callback` — moving it to an attribute post-construction is fine, but the constructor in 0.31.1 internally consumes it (sets `self.prompt_progress_callback`). Moving it to an attribute after construction in 0.31.1 therefore overwrites the value set by the constructor. Works, but worth noting.
- `_process_prompts`, `active_batch`, `_next()` — all exist in 0.31.1 → `chunked_compatible = True` → no behavior change.
- `next()` in 0.31.1 returns a flat list → `isinstance(result, tuple)` is False → no behavior change.
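A tiny runnable illustration of that overwrite interaction, using a stand-in class rather than the real mlx-lm BatchGenerator (all names here are illustrative only):

```python
# Stand-in for the 0.31.1 constructor behavior described above; this is not
# the real mlx-lm class, just a sketch of the interaction.
class FakeBatchGenerator:
    def __init__(self, prompt_progress_callback=None):
        # 0.31.1-style __init__ stores the callback itself
        self.prompt_progress_callback = prompt_progress_callback

def progress(done, total):
    print(f"prefill {done}/{total}")

bg = FakeBatchGenerator()               # the fix no longer passes the callback here
bg.prompt_progress_callback = progress  # overwrites the None set by __init__
```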
Conclusion: backwards-compatible with 0.31.1. ✅
Question about prompt_responses
```python
if isinstance(result, tuple):
    responses = result[1]  # generation_responses only
```

If `result[0]` contains `prompt_responses` — are they safe to ignore? In the original flat-list format, prompt responses and generation responses were mixed together and `_process_batch_responses()` processed them uniformly (`response.uid` + `response.token`).
If 0.31.2 returns `prompt_responses` in `result[0]` and those contain data that previously went through `_process_batch_responses()`, silently dropping them could cause missing tokens or incomplete requests.

The smoke test (10 prompts, 960 tokens) works — but it would be worth confirming that `prompt_responses` in 0.31.2 don't carry generated tokens, only metadata/progress info.
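If that turns out to be a risk, a defensive variant could route both halves through the existing processing path. A sketch only: `drain_batch` and `process` are hypothetical stand-ins, and the `.token` attribute and `_process_batch_responses`-style hook are taken from this review's description of the scheduler, not verified against 0.31.2:

```python
def drain_batch(batch_generator, process):
    """Sketch: process is the existing _process_batch_responses-style hook."""
    result = batch_generator.next()
    if isinstance(result, tuple):
        prompt_responses, generation_responses = result
        # Forward any prompt response that carries a token instead of
        # silently dropping result[0].
        token_bearing = [r for r in prompt_responses
                         if getattr(r, "token", None) is not None]
        if token_bearing:
            process(token_bearing)
        process(generation_responses)
    else:
        # Pre-0.31.x flat list: prompt and generation responses arrive mixed.
        process(result)
```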
MTP note
The PR correctly notes that MTP (`_install_mtp`) has the same issue with `_step` but is disabled by default. That's fine for this fix, but it will need to be addressed separately for `--enable-mtp` users.
Verdict
LGTM for merge. Defensive guards (`hasattr`, `isinstance`) are the right approach for cross-version mlx-lm compatibility. Graceful degradation of chunked prefill with a warning is better than a crash.
Problem
Reported in #293 — after the backport in f61d34e, running `vllm-mlx bench` or `vllm-mlx serve` with batching enabled crashes immediately on mlx-lm 0.31.x (the latest release). This affects all users on v0.2.7 installed via `pip` or `uv`.

Error 1: `prompt_progress_callback` not a constructor parameter

Error 2: `_process_prompts` removed in mlx-lm 0.31.x

The `_install_chunked_prefill` monkey-patch relies on internal BatchGenerator APIs (`_process_prompts`, `active_batch`, `_step`) that were refactored in mlx-lm 0.31.x.

Error 3: `next()` return format changed

`BatchGenerator.next()` now returns a `(prompt_responses, generation_responses)` tuple instead of a flat list. The scheduler was iterating over the tuple and trying to access `.uid` on a list.

Error 4: `active_batch` no longer exists

The periodic cache eval in `step()` references `self.batch_generator.active_batch` without checking if it exists.

Root cause
The backport commit f61d34e was written against a version of `mlx-lm` with different internal APIs than the released 0.31.x series. No released version of `mlx-lm` (up to 0.31.2) has these internal methods/attributes.

Fix
Four changes in `scheduler.py`:

1. Set `prompt_progress_callback` as an instance attribute instead of a constructor argument

The callback is only consumed by the `_install_chunked_prefill` monkey-patch (lines 343, 566), not by BatchGenerator itself.

```diff
 bg = BatchGenerator(
     ...
-    prompt_progress_callback=_prefill_progress,
 )
+bg.prompt_progress_callback = _prefill_progress
```
2. Guard `_install_chunked_prefill` with a compatibility check

Skip the monkey-patch gracefully when the required BatchGenerator internals are absent, and log a warning:
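A sketch of what that guard can look like, checking for the internals named in the review above (`_process_prompts`, `active_batch`, `_next`); the wrapper name is hypothetical and the actual code in `scheduler.py` may differ:

```python
import logging

logger = logging.getLogger(__name__)

def _install_chunked_prefill_guarded(batch_generator):
    # Sketch of the compatibility guard only, not the real patch body.
    chunked_compatible = all(
        hasattr(batch_generator, name)
        for name in ("_process_prompts", "active_batch", "_next")
    )
    if not chunked_compatible:
        logger.warning(
            "BatchGenerator internals changed in this mlx-lm version; "
            "skipping chunked prefill monkey-patch"
        )
        return  # leave BatchGenerator unpatched rather than crash
    # ... proceed with the original monkey-patching logic ...
```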
3. Handle `next()` return format change
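The shape of the change, as a sketch mirroring the snippet quoted in the review above (`batch_generator` stands in for the scheduler's instance):

```python
result = batch_generator.next()

if isinstance(result, tuple):
    # mlx-lm 0.31.x: (prompt_responses, generation_responses)
    responses = result[1]
else:
    # Older releases: flat list of responses
    responses = result
```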
4. Add `hasattr` guard for `active_batch`

```diff
 if (
     self.batch_generator is not None
+    and hasattr(self.batch_generator, "active_batch")
     and self.batch_generator.active_batch is not None
```

Test results
Smoke test: `vllm-mlx bench` (Llama-3.2-1B-Instruct-4bit)

Fresh conda environment with `mlx-lm==0.31.2`:

Test plan
- `pytest tests/test_batching.py` — all chunked prefill tests pass
- `pytest tests/test_memory_stability.py` — all BatchGenerator lifecycle tests pass
- Verified `BatchGenerator.__init__` signature in a fresh `mlx-lm==0.31.2` install (conda env)
- `vllm-mlx bench mlx-community/Llama-3.2-1B-Instruct-4bit` — 10 prompts, 960 tokens, 0 errors

Notes
The MTP monkey-patch (`_install_mtp`) also references removed internals (`_step`) but is disabled by default (`enable_mtp: bool = False`), so it's not part of this fix.

Closes #293