
batch generator crashes with prompt_progress_callback on mlx-lm 0.31.x #294

Merged

waybarrios merged 2 commits into main from fix-prompt-progress-callback on Apr 12, 2026

Conversation

Owner

waybarrios commented on Apr 12, 2026

Problem

Reported in #293 — after the backport in f61d34e, running vllm-mlx bench or vllm-mlx serve with batching enabled crashes immediately on mlx-lm 0.31.x (the latest release).

This affects all users on v0.2.7 installed via pip or uv.

Error 1: prompt_progress_callback not a constructor parameter

  File ".../vllm_mlx/scheduler.py", line 1252, in _create_batch_generator
    bg = BatchGenerator(
         ^^^^^^^^^^^^^^^
TypeError: BatchGenerator.__init__() got an unexpected keyword argument 'prompt_progress_callback'

Error 2: _process_prompts removed in mlx-lm 0.31.x

AttributeError: 'BatchGenerator' object has no attribute '_process_prompts'

The _install_chunked_prefill monkey-patch relies on internal BatchGenerator APIs (_process_prompts, active_batch, _step) that were refactored in mlx-lm 0.31.x.

Error 3: next() return format changed

AttributeError: 'list' object has no attribute 'uid'

BatchGenerator.next() now returns a (prompt_responses, generation_responses) tuple instead of a flat list. The scheduler was iterating over the tuple and trying to access .uid on a list.

Error 4: active_batch no longer exists

AttributeError: 'BatchGenerator' object has no attribute 'active_batch'

The periodic cache eval in step() references self.batch_generator.active_batch without checking if it exists.

Root cause

The backport commit f61d34e was written against a version of mlx-lm with different internal APIs than the released 0.31.x series. No released version of mlx-lm (up to 0.31.2) has these internal methods/attributes.
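For anyone reproducing this, a quick diagnostic (a sketch, not part of the patch) can confirm which internals an installed mlx-lm still exposes; bg is assumed to be a BatchGenerator instance constructed the same way scheduler.py constructs one:

# Diagnostic sketch: print the installed mlx-lm version and whether a
# constructed BatchGenerator still exposes the internals the backport relied on.
from importlib.metadata import version

def report_batch_generator_compat(bg):
    print("mlx-lm version:", version("mlx-lm"))
    for name in ("_process_prompts", "active_batch", "_step"):
        print(f"  has {name}: {hasattr(bg, name)}")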

Fix

Four changes in scheduler.py:

1. Set prompt_progress_callback as an instance attribute instead of constructor argument

The callback is only consumed by the _install_chunked_prefill monkey-patch (lines 343, 566), not by BatchGenerator itself.

         bg = BatchGenerator(
             ...
-            prompt_progress_callback=_prefill_progress,
         )
+        bg.prompt_progress_callback = _prefill_progress
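For reference, a version-agnostic variant could inspect the constructor signature and only pass the callback when the installed mlx-lm accepts it. This is a sketch with illustrative names, not part of the patch; the patch takes the simpler attribute-only route shown above:

import inspect

def construct_batch_generator(batch_generator_cls, callback, **kwargs):
    # Pass prompt_progress_callback through the constructor only when the
    # installed BatchGenerator accepts it; otherwise attach it afterwards.
    params = inspect.signature(batch_generator_cls.__init__).parameters
    if "prompt_progress_callback" in params:
        return batch_generator_cls(**kwargs, prompt_progress_callback=callback)
    bg = batch_generator_cls(**kwargs)
    bg.prompt_progress_callback = callback
    return bg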

2. Guard _install_chunked_prefill with a compatibility check

Skip the monkey-patch gracefully when the required BatchGenerator internals are absent, and log a warning:

chunked_compatible = hasattr(bg, "_process_prompts") and hasattr(bg, "active_batch")

if need_chunked and chunked_compatible:
    _install_chunked_prefill(...)
elif need_chunked and not chunked_compatible:
    logger.warning("Chunked prefill disabled: mlx-lm BatchGenerator lacks ...")

3. Handle next() return format change

-                    responses = self.batch_generator.next()
+                    result = self.batch_generator.next()
+                    # mlx-lm >=0.31.x returns (prompt_responses, generation_responses)
+                    if isinstance(result, tuple):
+                        responses = result[1]
+                    else:
+                        responses = result
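Equivalently, the normalization could live in a small helper (a sketch; the helper name is illustrative and not part of the patch):

def _generation_responses(result):
    # Return only the generation responses from BatchGenerator.next(),
    # whichever return format the installed mlx-lm uses.
    if isinstance(result, tuple):
        # mlx-lm >= 0.31.x: (prompt_responses, generation_responses)
        return result[1]
    # Older releases return a flat list of responses.
    return result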

4. Add hasattr guard for active_batch

             if (
                 self.batch_generator is not None
+                and hasattr(self.batch_generator, "active_batch")
                 and self.batch_generator.active_batch is not None

Test results

tests/test_batching.py         — 22 passed
tests/test_memory_stability.py — 15 passed
======================= 37 passed, 2 deselected in 3.29s =======================

Smoke test: vllm-mlx bench (Llama-3.2-1B-Instruct-4bit)

Fresh conda environment with mlx-lm==0.31.2:

$ vllm-mlx bench mlx-community/Llama-3.2-1B-Instruct-4bit --max-tokens 100 --num-prompts 10

Chunked prefill disabled: mlx-lm BatchGenerator lacks required internals
(_process_prompts, active_batch). Upgrade mlx-lm or check compatibility.

Loading model: mlx-community/Llama-3.2-1B-Instruct-4bit

Running benchmark with 10 prompts, max_tokens=100
--------------------------------------------------

Results:
  Total time: 2.38s
  Prompts: 10
  Prompts/second: 4.19
  Total prompt tokens: 80
  Total completion tokens: 960
  Total tokens: 1040
  Tokens/second: 402.52
  Throughput: 436.06 tok/s

Test plan

  • pytest tests/test_batching.py — all chunked prefill tests pass
  • pytest tests/test_memory_stability.py — all BatchGenerator lifecycle tests pass
  • Verified BatchGenerator.__init__ signature in fresh mlx-lm==0.31.2 install (conda env)
  • vllm-mlx bench mlx-community/Llama-3.2-1B-Instruct-4bit — 10 prompts, 960 tokens, 0 errors

Notes

  • On mlx-lm 0.31.x, chunked prefill now degrades to a logged warning instead of crashing. If a future mlx-lm release restores the internals the guard checks for, chunked prefill re-enables automatically; supporting a new public chunked prefill API would need a follow-up change.
  • MTP (_install_mtp) also references removed internals (_step) but is disabled by default (enable_mtp: bool = False), so it's not part of this fix.

Closes #293

The backport in f61d34e assumed internal BatchGenerator APIs that were
refactored in mlx-lm 0.31.x. This breaks bench and serve for all
users on v0.2.7.

Changes:
- Set prompt_progress_callback as instance attribute instead of
  passing it to BatchGenerator constructor (not a valid parameter)
- Guard _install_chunked_prefill with hasattr check and log warning
  when skipped (relies on removed _process_prompts, active_batch)
- Handle next() returning (prompt_responses, generation_responses)
  tuple instead of flat list
- Add hasattr guard for active_batch in periodic cache eval

Benchmark (Llama-3.2-1B-Instruct-4bit, mlx-lm 0.31.2):

  Total time: 2.38s
  Prompts: 10
  Prompts/second: 4.19
  Total prompt tokens: 80
  Total completion tokens: 960
  Total tokens: 1040
  Tokens/second: 402.52
  Throughput: 436.06 tok/s

Closes #293
waybarrios force-pushed the fix-prompt-progress-callback branch from ef2e904 to 980d092 on April 12, 2026 at 14:48
waybarrios added the bug label on Apr 12, 2026
Collaborator

Thump604 left a comment


I reviewed this against the current mlx-lm 0.31.x crash surface and I do not see a blocking issue. The constructor-argument fix, tuple-return handling, and active_batch guard address the reported failures, and degrading chunked prefill to a warning is a reasonable compatibility fallback for now.

waybarrios merged commit d2e7f88 into main on Apr 12, 2026
7 checks passed
Collaborator

janhilgard left a comment


Review

Overall a good fix — minimal, defensive, and backwards-compatible. A few notes:

Backwards compatibility verification

Checked against mlx-lm==0.31.1 (our production):

  • BatchGenerator.__init__ in 0.31.1 still accepts prompt_progress_callback and stores it as self.prompt_progress_callback, so setting the attribute after construction simply overwrites whatever the constructor already set. Works, but worth noting.
  • _process_prompts, active_batch, _next() — all exist in 0.31.1 → chunked_compatible = True → no behavior change.
  • next() in 0.31.1 returns a flat list → isinstance(result, tuple) = False → no behavior change.

Conclusion: backwards-compatible with 0.31.1.

Question about prompt_responses

if isinstance(result, tuple):
    responses = result[1]  # generation_responses only

If result[0] contains prompt_responses — are they safe to ignore? In the original flat-list format, prompt responses and generation responses were mixed together and _process_batch_responses() processed them uniformly (response.uid + response.token).

If 0.31.2 returns prompt_responses in result[0] and those contain data that previously went through _process_batch_responses(), silently dropping them could cause missing tokens or incomplete requests.

The smoke test (10 prompts, 960 tokens) works — but it would be worth confirming that prompt_responses in 0.31.2 don't carry generated tokens, only metadata/progress info.
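If they can, a defensive variant (hypothetical, not in this PR; it assumes prompt responses expose the same uid/token interface as generation responses) could fold them in instead of dropping them:

result = self.batch_generator.next()
if isinstance(result, tuple):
    prompt_responses, generation_responses = result
    responses = list(generation_responses)
    # Hypothetical: only fold in prompt responses that look token-bearing,
    # so pure progress/metadata entries are still ignored.
    responses += [
        r for r in prompt_responses
        if hasattr(r, "uid") and hasattr(r, "token")
    ]
else:
    responses = result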

MTP note

The PR correctly notes that MTP (_install_mtp) has the same issue with _step but is disabled by default. That's fine for this fix, but will need to be addressed separately for --enable-mtp users.

Verdict

LGTM for merge. Defensive guards (hasattr, isinstance) are the right approach for cross-version mlx-lm compatibility. Graceful degradation of chunked prefill with a warning is better than a crash.
