
Fix batch generation and adopt mlx-lm batch improvements#911

Open
Blaizzy wants to merge 5 commits into `main` from `pc/batch-improvements`

Conversation

Blaizzy (Owner) commented Apr 4, 2026

Summary

  • Per-sequence samplers & logits processors: BatchGenerator.insert() now accepts per-sequence samplers and logits_processors lists, enabling mixed-temperature/top-p serving in batch mode. Falls back to the shared sampler when none are provided — fully backward compatible.
  • Token tracking: Batch tracks generated tokens per sequence in the new tokens field.
  • Cache interface fix: Added missing nbytes property and empty() method to SlidingWindowCache and StaticKVCache to satisfy _BaseCache abstract interface from mlx-lm, preventing breakage on future mlx-lm updates.
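The per-sequence fallback described above can be sketched in plain Python. This is an illustrative helper, not the actual mlx-vlm code: `resolve_samplers` is a hypothetical name, and plain callables stand in for mlx sampler functions; only the fallback behavior follows the PR description.

```python
from typing import Callable, List, Optional

Logits = List[float]          # stand-in for an mx.array of logits
Sampler = Callable[[Logits], int]

def resolve_samplers(
    batch_size: int,
    shared_sampler: Sampler,
    samplers: Optional[List[Optional[Sampler]]] = None,
) -> List[Sampler]:
    """Pick a sampler per sequence, falling back to the shared one.

    Mirrors the backward-compatible behavior described in the PR:
    when `samplers` is omitted (or an entry is None), the shared
    sampler is used for that sequence.
    """
    if samplers is None:
        return [shared_sampler] * batch_size
    return [s if s is not None else shared_sampler for s in samplers]

# Example: sequence 0 uses greedy argmax, sequence 1 falls back.
greedy: Sampler = lambda logits: max(range(len(logits)), key=logits.__getitem__)
shared: Sampler = lambda logits: 0  # degenerate shared sampler for the demo

per_seq = resolve_samplers(2, shared, [greedy, None])
tokens = [per_seq[i]([0.1, 0.9, 0.3]) for i in range(2)]
# tokens == [1, 0]: greedy picks the argmax, sequence 1 uses the shared sampler
```

This keeps the single-sampler call path unchanged, which is what makes the feature backward compatible.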

Inspired by ml-explore/mlx-lm#1072.

Test plan

  • batch_generate() with Qwen2.5-VL-3B produces correct output
  • Single generate() unaffected
  • SlidingWindowCache and StaticKVCache pass empty() and nbytes checks

🤖 Generated with Claude Code

Blaizzy and others added 4 commits April 4, 2026 03:01
- Add `tokens`, `samplers`, and `logits_processors` fields to Batch class
  with proper filter/extend support
- BatchGenerator.insert() now accepts per-sequence samplers and
  logits_processors for fine-grained control (e.g. mixed temperature)
- _step() applies per-sequence logits processors and samplers during
  generation, falling back to shared sampler when not provided
- Add missing `nbytes` and `empty()` to SlidingWindowCache and
  StaticKVCache to satisfy _BaseCache interface from mlx-lm
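The cache-interface fix amounts to implementing two abstract members. The sketch below is illustrative only: the `_BaseCache` stand-in and the byte-buffer fields are assumptions for the demo, not mlx-lm's real classes; only the `nbytes` property and `empty()` method names come from the PR.

```python
from abc import ABC, abstractmethod

class _BaseCache(ABC):
    """Stand-in for mlx-lm's abstract cache interface (illustrative)."""

    @property
    @abstractmethod
    def nbytes(self) -> int: ...

    @abstractmethod
    def empty(self) -> bool: ...

class SlidingWindowCache(_BaseCache):
    """Minimal sketch: tracks keys/values as byte buffers for the demo."""

    def __init__(self) -> None:
        self.keys: bytes = b""
        self.values: bytes = b""

    @property
    def nbytes(self) -> int:
        # Total bytes held by the cache's key/value buffers.
        return len(self.keys) + len(self.values)

    def empty(self) -> bool:
        return self.nbytes == 0
```

Without these members, instantiating the subclass (or any mlx-lm code that iterates caches and sums `nbytes`) would fail, which is the future-breakage the PR guards against.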

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…oken handling

- Introduced a new `_right_pad_prompts` function for right padding of prompts.
- Integrated `SequenceStateMachine` to manage stop token detection, allowing for multi-token sequences.
- Updated `Batch` class to support state machine states, ensuring proper handling during filtering and merging.
- Modified `BatchGenerator` to utilize the state machine for improved stop detection logic.
- Ensured backward compatibility with legacy stopping criteria while enhancing functionality.
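The multi-token stop detection above can be sketched as a small state machine. This is a hypothetical sketch of the idea, not mlx-lm's `SequenceStateMachine`: it keeps a sliding window of recent tokens and reports a match when any stop sequence appears at the end of it.

```python
from typing import List, Sequence

class StopSequenceMatcher:
    """Illustrative sketch of multi-token stop detection: feed tokens
    one at a time and report when any stop sequence is fully matched."""

    def __init__(self, stop_sequences: Sequence[Sequence[int]]) -> None:
        self.stop_sequences = [list(s) for s in stop_sequences]
        # Only the last max_len tokens can ever matter for a match.
        self.max_len = max((len(s) for s in self.stop_sequences), default=0)
        self.window: List[int] = []

    def step(self, token: int) -> bool:
        """Return True once some stop sequence has just been completed."""
        self.window.append(token)
        if len(self.window) > self.max_len:
            self.window.pop(0)
        return any(
            self.window[-len(s):] == s
            for s in self.stop_sequences
            if 0 < len(s) <= len(self.window)
        )

sm = StopSequenceMatcher([[13, 199]])    # a two-token stop sequence
hits = [sm.step(t) for t in [7, 13, 199, 5]]
# hits == [False, False, True, False]: matched only when 13, 199 completed
```

A per-sequence matcher like this is what lets a batch stop each sequence independently, even when the stop marker spans several tokens.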
…offset

Three bugs that caused garbage output when batch_size > 1:

1. Vision tower flattened all batch image tokens into [1, total, dim],
   losing the batch dimension. Now preserves [B, tokens_per_image, dim].

2. masked_scatter flattened all batches causing cross-batch index
   contamination via modulo wrapping. Now processes per-batch for B>1.

3. BatchRotatingKVCache.offset is a mutable mx.array that gets modified
   by update_and_fetch. The attention code captured this reference before
   the update but it mutated, causing queries to get wrong RoPE positions.
   Fixed by snapshotting the offset.
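Bug 3 is a classic aliasing error, which a pure-Python stand-in can demonstrate. The class below is a mock (a mutable list plays the role of the mutable `mx.array` offset), not the real `BatchRotatingKVCache`; only the capture-then-mutate pattern matches the bug described above.

```python
class FakeRotatingCache:
    """Mock of BatchRotatingKVCache: `offset` is a mutable container
    (like an mx.array) that update_and_fetch mutates in place."""

    def __init__(self) -> None:
        self.offset = [0]  # mutable, shared by reference

    def update_and_fetch(self, n_new_tokens: int) -> None:
        self.offset[0] += n_new_tokens  # in-place mutation

# Buggy pattern: capture the reference, then read it after the update.
cache = FakeRotatingCache()
offset_ref = cache.offset
cache.update_and_fetch(4)
buggy_rope_position = offset_ref[0]      # sees 4 — wrong RoPE position

# Fixed pattern: snapshot the scalar value before the update.
cache = FakeRotatingCache()
offset_snapshot = cache.offset[0]        # copy the value, not the reference
cache.update_and_fetch(4)
correct_rope_position = offset_snapshot  # still 0, as the queries expect
```

The fix in the PR is the second pattern: take a value snapshot of the offset before `update_and_fetch` runs, so the queries' RoPE positions are computed from the pre-update offset.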

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Blaizzy changed the title from "Add per-sequence samplers and fix cache interface" to "Fix Gemma 4 batch generation and add SequenceStateMachine support" on Apr 4, 2026
Blaizzy changed the title from "Fix Gemma 4 batch generation and add SequenceStateMachine support" to "Fix batch generation and adopt mlx-lm batch improvements" on Apr 4, 2026
CI uses the released mlx-lm, which doesn't include these interfaces yet. The code gracefully falls back to the legacy stopping_criteria when they are unavailable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>