
merge conflict resolution with main #36

Merged

krystophny merged 62 commits into computor-org:fix/chat-template-kwargs-forwarding from
waybarrios:fix/chat-template-kwargs-forwarding
Apr 14, 2026

Conversation

@waybarrios

Summary

Resolves the 7 merge conflicts between fix/chat-template-kwargs-forwarding and main on waybarrios/vllm-mlx.

  • Integrates enable_thinking support (from main) alongside chat_template_kwargs forwarding (from this branch) in batched.py
  • Adopts main's _run_blocking_serialized refactor in simple.py while preserving chat_template_kwargs forwarding
  • Forwards chat_template_kwargs through the new tool-stall workaround path in simple.py

See full details: waybarrios#218 (comment)

janhilgard and others added 30 commits February 15, 2026 18:14
Large model downloads via huggingface_hub often hang or fail around 10GB.
This adds a pre-download step with configurable retry/timeout before
load_model() is called, so interrupted downloads can be resumed.

New CLI flags for `serve`: --download-timeout, --download-retries, --offline
New subcommand: `vllm-mlx download <model>` for pre-warming caches

Closes #75

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
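A minimal sketch of such a pre-download helper, assuming huggingface_hub's snapshot_download and its HF_HUB_DOWNLOAD_TIMEOUT environment variable; the helper name and backoff policy here are illustrative, not the shipped implementation:

```python
import os

# Must be set before huggingface_hub is imported; the --download-timeout
# flag would be translated into this variable.
os.environ.setdefault("HF_HUB_DOWNLOAD_TIMEOUT", "300")

import time

from huggingface_hub import snapshot_download


def ensure_model_downloaded(model: str, retries: int = 3) -> str:
    """Pre-warm the HF cache with resume + retry before load_model() runs."""
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            # Interrupted downloads resume from partial files in the cache
            return snapshot_download(repo_id=model)
        except KeyboardInterrupt:
            raise  # let Ctrl-C propagate instead of retrying
        except Exception as err:
            last_err = err
            time.sleep(min(2**attempt, 30))  # capped exponential backoff
    raise RuntimeError(f"download of {model} failed after {retries} attempts") from last_err
```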
The output_token_ids from AsyncEngineCore were tracked internally but
never forwarded to GenerationOutput, leaving tokens always []. Also
adds tests for the generate() output fields.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parse MiniMax-M2.5's XML tool call format:
<minimax:tool_call>
  <invoke name="function">
    <parameter name="arg">value</parameter>
  </invoke>
</minimax:tool_call>

Handles single/multiple tool calls, JSON parameter values,
no-parameter calls, and preserves <think> blocks.

9 unit tests included.
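For illustration, a regex-based sketch of extracting calls from that format (the real parser in tool_parsers/minimax_tool_parser.py also handles streaming and <think> preservation):

```python
import json
import re

_BLOCK = re.compile(r"<minimax:tool_call>(.*?)</minimax:tool_call>", re.DOTALL)
_INVOKE = re.compile(r'<invoke name="([^"]+)">(.*?)</invoke>', re.DOTALL)
_PARAM = re.compile(r'<parameter name="([^"]+)">(.*?)</parameter>', re.DOTALL)


def parse_minimax_tool_calls(text: str) -> list[dict]:
    """Extract tool calls; JSON parameter values stay typed, others stay strings."""
    calls = []
    for block in _BLOCK.findall(text):
        for name, body in _INVOKE.findall(block):
            args = {}
            for key, raw in _PARAM.findall(body):
                raw = raw.strip()
                try:
                    args[key] = json.loads(raw)  # e.g. numbers, lists, objects
                except ValueError:
                    args[key] = raw  # plain string parameter
            calls.append({"name": name, "arguments": args})
    return calls
```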
…n streaming parser

The streaming reasoning parser (BaseThinkingReasoningParser) scans the full
accumulated output text for <think>/</think> on every token via `in` checks on
previous_text and current_text. This is O(N) per token and O(N²) over a full
generation, becoming measurable at longer outputs (5ms+ at 2k tokens, 141ms
at 10k tokens).

Replace with a three-phase state machine (pre_think → thinking → content) that
tracks transitions using only the delta text. Each token is now O(1) regardless
of output length.

Benchmark results (streaming parser overhead, simulated server loop):

  Tokens   Old (scan)   New (state)   Speedup
  ------   ----------   -----------   -------
     500     0.37ms       0.04ms       8.6x
    1000     1.38ms       0.10ms      13.5x
    2000     5.28ms       0.28ms      19.1x
    5000    34.03ms       2.05ms      16.6x
   10000   141.26ms      10.16ms      13.9x

At 50 tok/s decode on Apple Silicon, each token has a 20ms budget. The old parser
consumed 0.3ms/tok at 2k tokens and 1.4ms/tok at 10k — up to 7% of the budget
on overhead alone. The new parser is <0.01ms/tok at any length.

Changes:
- think_parser.py: Rewrote extract_reasoning_streaming() as a state machine with
  _phase tracking. reset_state() initializes the phase. All three scenarios
  preserved (explicit tags, implicit mode, no tags). Method signature unchanged
  for backward compatibility.
- benchmarks/bench_reasoning_parser.py: Added streaming parser benchmark.

No changes to extract_reasoning() (non-streaming path) — it only runs once per
request and is not on the hot path.
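A simplified sketch of such a state machine (explicit-tag scenario only, assuming each tag arrives within a single delta; the real parser in think_parser.py also buffers tags split across tokens and handles implicit mode):

```python
class ThinkStreamParser:
    """O(1)-per-token reasoning/content splitter: pre_think -> thinking -> content."""

    def __init__(self) -> None:
        self.reset_state()

    def reset_state(self) -> None:
        self._phase = "pre_think"

    def feed(self, delta: str) -> tuple[str, str]:
        """Return (reasoning_delta, content_delta) using only the new delta."""
        if self._phase == "pre_think":
            if "<think>" in delta:
                self._phase = "thinking"
                delta = delta.split("<think>", 1)[1]  # drop the opening tag
            else:
                self._phase = "content"  # no tags: everything is content
                return "", delta
        if self._phase == "thinking":
            if "</think>" in delta:
                reasoning, rest = delta.split("</think>", 1)
                self._phase = "content"
                return reasoning, rest
            return delta, ""  # still inside the think block
        return "", delta  # content phase: pass everything through
```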
Add _normalize_messages() to server.py and call it in all request paths
before apply_chat_template. Maps non-standard roles (developer -> system,
per OpenAI Responses API) and merges consecutive same-role messages.

Fixes agent crashes from:
- OpenAI Responses API sending role="developer" (unrecognized by Qwen3.5 template)
- OpenCode sending [system, system, user, user] (rejected by alternating-role templates)

Applied in create_chat_completion (both MLLM and LLM paths),
create_anthropic_message, and _stream_anthropic_messages.
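Roughly, the normalization could look like this (a sketch; the real _normalize_messages() also leaves multimodal and null content unmerged, per the tests added in this PR):

```python
_ROLE_MAP = {"developer": "system"}  # OpenAI Responses API role -> template role


def normalize_messages(messages: list[dict]) -> list[dict]:
    normalized: list[dict] = []
    for msg in messages:
        msg = {**msg, "role": _ROLE_MAP.get(msg["role"], msg["role"])}
        prev = normalized[-1] if normalized else None
        # Merge consecutive same-role messages with plain string content so
        # alternating-role chat templates accept the sequence.
        if (
            prev is not None
            and prev["role"] == msg["role"]
            and isinstance(prev.get("content"), str)
            and isinstance(msg.get("content"), str)
        ):
            prev["content"] = prev["content"] + "\n\n" + msg["content"]
        else:
            normalized.append(msg)
    return normalized
```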
Add detection and inference support for Google's Gemma 4 models
(e.g. mlx-community/gemma-4-e2b-it-mxfp4) which include vision
and audio capabilities via mlx-vlm >= 0.4.3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Patch gemma4 Attention to snapshot cache.offset before mutation
  (mx.array.__iadd__ is in-place, causes wrong RoPE positions)
- Add Gemma 4 reasoning parser with channel name stripping
  (strips "thought"/"response" prefixes, supports both <channel|>
  and <|channel>response transition formats)
- Configure Gemma 4 EOS/stop tokens to prevent uncontrolled generation
- Add 16 Gemma 4 parser tests (non-streaming + streaming)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…okenizer

- Accept RotatingKVCache (used by Gemma 4) in batch cache validation
- Add missing return statement in load_model_with_fallback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This depends on PR 215 or PR 243 being applied first.
Error responses with token=0 were falling through to the detokenizer
and decoding garbage text. Now they skip decoding and set the request
status to FINISHED_ABORTED. Added a test for this case.
Also ran black on batched.py to fix CI.
feat: add Gemma 4 multimodal model support
- Fix BatchKVCache offset bug: mx.array.__iadd__ mutates in-place,
  causing incorrect RoPE positions and token repetition
- Fix RotatingKVCache.max_size returning mx.array instead of int
- Add Gemma 4 reasoning parser (--reasoning-parser gemma4)
- Read additional EOS tokens from generation_config.json
- Fix RotatingKVCache prefix cache extraction (negative left_padding)
- Relax isinstance guard to accept RotatingKVCache for sliding window
  models like Gemma 4 (fixes ValueError on continuous batching)
- Remove unused _make_batch_cache() dead code
- Fix Anthropic endpoint JSON parsing for clients sending invalid
  escape sequences (e.g. \s, \d in regex patterns within tool defs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: patch Gemma 4 attention and RotatingKVCache for BatchKVCache
* test: add Gemma 4 tool parser tests (red)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add Gemma 4 tool call parser

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: register Gemma 4 parser, add streaming tests and wiring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add edge case tests for DC review findings

- Unclosed tool call block (server fallback path)
- String containing colon (step-ordering guard)
- String with real newline and double quote (JSON escaping)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: verify Gemma 4 tool calls produce exact OpenAI format for Claude Code

Integration tests that verify the full pipeline (parser → server models →
JSON serialization) matches what Claude Code expects: tool_calls structure,
null content, function.arguments as JSON string, correct finish_reason.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* add Gemma 4 auto-detection to AutoToolParser

integrates Gemma 4 format as the first format tried in auto-detection,
adds streaming markers for tool call start/end. based on keegoid's
approach in #254.

* remove unused pytest imports

* run black on tool parser, tests, and server

---------

Co-authored-by: Jack Neil <jackneil@Jacks-Mac-Studio.local>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Wayner Barrios <waybarrios@gmail.com>
…258)

Extends MLLM batch generator to support top_k, min_p, and
presence_penalty alongside the existing repetition_penalty.
This gives the MLLM path full parity with the LLM/SimpleEngine
sampling parameter coverage.

Changes:
- MLLMBatchRequest: add top_k, min_p, presence_penalty fields
- MLLMBatch: add per-request samplers list (filter/extend support)
- _process_prompts: build per-request logits processors for
  presence_penalty and per-request samplers for top_k/min_p
- _step: accept and apply per-request samplers
- SamplingParams: add presence_penalty field
- MLLMScheduler: propagate new params from kwargs to batch requests
- BatchedEngine: pass new params through generate/stream_generate

When a request uses default values (top_k=0, min_p=0.0,
presence_penalty=0.0), no extra processors or samplers are created
— zero overhead for standard requests.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Fix Qwen3.5 hybrid paged cache reconstruction

* fix: add deduplication safety test and remove duplicate tokenizer hunk

Add test confirming deduplicated terminal blocks correctly isolate
recurrent state per sequence. Remove the duplicate tokenizer return fix
that already ships in PR #215.

* style: format hybrid cache follow-up
* fix: keep simple engine serialized across cancellation (#8)

* fix: avoid nested simple engine generation locks

* fix: catch BaseException in cancellation handler, fix async test markers

_run_blocking_serialized catches CancelledError (a BaseException subclass)
from the outer scope, but the inner try/except used Exception, which would
let a second CancelledError during `await task` escape unhandled. Changed to
BaseException to suppress any exception from the draining await (a sketch
follows this list).

Also fix test_simple_engine.py to use pytest.mark.anyio instead of
pytest.mark.asyncio (pytest-asyncio is not configured), and add the
anyio_backend fixture to conftest.py restricting to asyncio only since
trio is not installed.

* fix: preserve prompt token accounting after upstream refresh

* fix: restore specprefill fallback helper scope
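A sketch of the cancellation-safe pattern from the BaseException fix above (simplified; attribute names such as _generation_lock follow the surrounding descriptions, not the exact source):

```python
import asyncio


async def _run_blocking_serialized(self, fn, *args, **kwargs):
    """Run a blocking MLX call in a worker thread, serialized by the engine lock."""
    async with self._generation_lock:
        task = asyncio.ensure_future(asyncio.to_thread(fn, *args, **kwargs))
        try:
            return await task
        except asyncio.CancelledError:
            try:
                await task  # drain: the worker thread cannot be interrupted mid-call
            except BaseException:
                # "except Exception" would miss a second CancelledError raised
                # while draining; BaseException suppresses anything from the drain.
                pass
            raise
```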
…#221)

* Fix chunked prefill for mlx-lm prompt checkpoints

* fix: invoke prompt_checkpoint_callback in chunked-prefill path

The upstream BatchGenerator contract requires prompt_checkpoint_callback
to fire after cache finalization, before the checkpoint tail model call.
The chunked-prefill monkeypatch preserved the checkpoint field but never
invoked the callback, breaking the upstream checkpoint contract.

Wire _lazy_extract_cache from mlx-lm and invoke the callback at the
correct semantic boundary. Add regression test verifying the callback
fires with the correct uid and checkpoint offset.

* test: cover checkpoint tail replay on upstream refresh

* style: format prompt checkpoint refresh

* fix: tolerate mlx-lm Batch export drift in chunked prefill
fix: populate tokens field in BatchedEngine.generate()
Upgrade mlx-vlm and torchvision so Qwen3.5 multimodal will run
* fix(server): integrate tool call parser into reasoning parser streaming path

* use _model_name instead of request.model in reasoning tool chunk

---------

Co-authored-by: Wayner Barrios <waybarrios@gmail.com>
When tool_choice='none', models should never return tool calls. Two fixes:

1. Strip tools from chat template context — prevents templates from
   activating tool-call token generation.
2. Suppress tool call parsing — _parse_tool_calls_with_parser() returns
   early with no tools, streaming parser skips initialization.

Applied across all server paths: chat completions (streaming + non-streaming),
Anthropic adapter (streaming + non-streaming).

Fixes #162
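Conceptually, both fixes reduce to this (a sketch with illustrative names; the real changes touch the chat template kwargs and _parse_tool_calls_with_parser() as described above):

```python
def effective_tools(request) -> list | None:
    """Tools visible to the chat template and the tool-call parser."""
    if not request.tools or request.tool_choice == "none":
        return None  # fix 1: template never activates tool-call token generation
    return request.tools


def parse_tool_calls(parser, text: str, tools: list | None):
    if not tools:
        return text, []  # fix 2: parsing suppressed, raw text passes through
    return parser.extract_tool_calls(text, tools)
```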
)

Claude Code injects `x-anthropic-billing-header: cc_version=...; cch=HASH;`
into the system prompt. The `cch=` hash changes with every request, causing
token sequences to diverge at position ~40 and completely defeating prefix
cache reuse across turn boundaries.

Strip this header before tokenization so consecutive requests from the same
conversation share 99%+ of their token prefix.

Result: 50s → 3.65s per request (13.7x speedup) on Gemma 4 26B-A4B with
60K-token prompts.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
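A sketch of the strip (the exact header regex is an assumption; only the cch= hash is known to vary per request):

```python
import re

# Matches the injected line, e.g. "x-anthropic-billing-header: cc_version=...; cch=HASH;"
_BILLING_HEADER = re.compile(r"^x-anthropic-billing-header:[^\n]*\n?", re.MULTILINE)


def strip_billing_header(system_prompt: str) -> str:
    """Remove the per-request hash line so token prefixes stay stable."""
    return _BILLING_HEADER.sub("", system_prompt)
```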
…fill

New files:
- patches/qwen3_5_mllm.py: BatchKVCache offset fix for Qwen3.5
- patches/qwen3_5_mtp.py: Runtime MTP injection for Qwen3.5
- tool_parsers/minimax_tool_parser.py: MiniMax-M2 tool parser
- scripts/add_mtp_weights_qwen35.py: Extract MTP weights from BF16

Key changes:
- mllm_batch_generator: chunked prefill, mid-batch extend, MTP hooks,
  patch registration, repetition penalty, prefill abort, think-suffix
  stripping for prefix cache
- mllm_scheduler: request status, cache config, prefill abort
- server: enable_thinking, tool_choice=none, tool argument coercion
- engines: MTP injection, enable_thinking, gpu_memory_utilization
- memory_cache: block LCP for hybrid models (SSM can't be rewound)

Prefix cache fix: enable_thinking=True adds <think>\n to generation
prompt, breaking PREFIX match across conversation turns.  Strip these
tokens from cache keys in both store and fetch paths so stored entries
match as clean prefixes.  Tested: 3.12s → 0.39s (8x) for 1400-token
prompts on Qwen3.5-122B hybrid model.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
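The cache-key part of that fix reduces to trimming a known token suffix on both paths (a sketch; the actual suffix tokens come from tokenizing `<think>\n`):

```python
def strip_think_suffix(token_ids: list[int], think_suffix: list[int]) -> list[int]:
    """Drop generation-prompt think tokens so stored entries match as clean prefixes."""
    n = len(think_suffix)
    if n and token_ids[-n:] == list(think_suffix):
        return token_ids[:-n]
    return token_ids


# Applied symmetrically: store(strip_think_suffix(key, sfx), cache) and
# fetch(strip_think_suffix(key, sfx)), so both sides agree on the key.
```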
Thump604 and others added 26 commits April 11, 2026 10:17
…-format

Add <function=name> format support to Qwen tool parser
…-leak

fix: skip RNN snapshots in MTP optimistic mode to prevent memory leak
…e-machine

perf(reasoning): O(1) state-machine streaming parser (13-19x faster at 2k+ tokens)
feat: add MiniMax tool call parsing support
…r-restack

cli: expose harmony and gpt-oss tool parsers
* fix: unify tool-enabled simple chat on streaming path

* fix: preserve simple chat contracts on streaming path

* fix: keep tool chat on the streaming execution path

* fix: preserve streamed completion token counts
The try/except block computing `tokens` via tokenizer.encode() was
unused -- the return statement already reads from final_output.tokens.
…stack

simple-engine: keep tool chat on the streaming execution path
…, repetition_penalty) (#213)

Pass all OpenAI-compatible sampling parameters through to mlx-lm's
make_sampler and make_logits_processors. Previously only temperature,
top_p and max_tokens reached the engine — top_k, min_p,
presence_penalty and repetition_penalty were silently dropped.

Changes:
- api/models.py: Add fields to ChatCompletionRequest and CompletionRequest
- request.py: Add presence_penalty to SamplingParams dataclass
- server.py: Extract and pass all params in every code path (6 locations),
  log all params on request
- models/llm.py: Build sampler with top_k/min_p, build logits_processors
  for presence_penalty/repetition_penalty
- engine/simple.py: Fix enable_thinking to read VLLM_MLX_ENABLE_THINKING
  env var instead of hardcoding based on model name

Tested with all 4 Unsloth Qwen 3.5 sampling profiles on 122B model.
* compatibility with mlx-lm 0.31.x BatchGenerator API

The backport in f61d34e assumed internal BatchGenerator APIs that were
refactored in mlx-lm 0.31.x. This breaks bench and serve for all
users on v0.2.7.

Changes:
- Set prompt_progress_callback as instance attribute instead of
  passing it to BatchGenerator constructor (not a valid parameter)
- Guard _install_chunked_prefill with hasattr check and log warning
  when skipped (relies on removed _process_prompts, active_batch)
- Handle next() returning (prompt_responses, generation_responses)
  tuple instead of flat list
- Add hasattr guard for active_batch in periodic cache eval

Benchmark (Llama-3.2-1B-Instruct-4bit, mlx-lm 0.31.2):

  Total time: 2.38s
  Prompts: 10
  Prompts/second: 4.19
  Total prompt tokens: 80
  Total completion tokens: 960
  Total tokens: 1040
  Tokens/second: 402.52
  Throughput: 436.06 tok/s

Closes #293

* bump to 0.2.8
* feat: add --prefill-step-size CLI flag

Expose prefill_step_size as a CLI argument for both serve and bench
commands. Default of 0 means "use engine default" (2048 for LLM, 1024
for MLLM), preserving existing behavior.

Vision models routinely exceed 1024 tokens per prompt (images alone
contribute 1400+), hitting the MLLM batch generator's safe limit.
This flag lets users raise the limit without patching source code.

* Clarify MLLM prefill step override behavior

* refactor: clarify MLLM prefill CLI flag and validate override
…er stream_generate (#266)

stream_generate() is the only code path that consumes per-request
SpecPrefill overrides (`specprefill`, `specprefill_keep_pct`) and
routes through _stream_generate_specprefill() when engaged. The prior
direct self._model.generate() path silently dropped those overrides:
server.py's create_completion() extracts them from extra_body and
forwards to engine.generate(), engine.generate() forwards via **kwargs
to _model.generate(), but _model.generate() (mlx_lm.generate) does not
consume them. Non-streaming /v1/completions clients that sent
`{"extra_body": {"specprefill": true}}` had their overrides silently
no-op'd.

Fix: make SimpleEngine.generate() a thin accumulator that iterates
self.stream_generate() and returns the last GenerationOutput. Matches
the pattern PR #222 established for tool-enabled chat(). Non-streaming
clients now get:

- SpecPrefill engagement when `specprefill=true` is set (top-level or
  extra_body fallback via whatever helper server.py uses)
- Accurate `prompt_tokens` reporting (the old path returned 0 because
  mlx_lm.generate never populates it)
- Chat-template and reasoning-parser behavior consistent with the
  streaming path
- Same thread-safety (stream_generate holds self._generation_lock
  around the MLX call)

Scope: only generate() changes. chat() stays on its current path;
extending chat() to the full accumulator pattern is a separate
follow-up on top of PR #222.

Tests:
- New test_generate_accumulates_over_stream_generate stubs
  stream_generate with an async generator, calls generate() with
  per-request specprefill kwargs, and asserts:
  * final output fields (text, tokens, prompt_tokens,
    completion_tokens, finish_reason, finished) match the last yielded
    chunk
  * specprefill / specprefill_keep_pct were forwarded through to
    stream_generate
- New test_generate_empty_stream_returns_safe_default covers the
  empty-stream edge case (returns GenerationOutput(text="",
  finish_reason="stop") rather than raising)
- Existing mock_model fixture extended with stream_generate tracking
  so test_lock_prevents_concurrent_generate still observes
  serialization through the new accumulator path

Verified live against Qwen3.5-4B SimpleEngine + SpecPrefill on M2
Ultra with a ~6K token prompt and extra_body.specprefill=true forcing
SpecPrefill below the 8192 threshold:

  SpecPrefill: scored 6007 tokens in 5.3s, sparse prefill 1815/6007 (keep=30%) in 1.1s

prompt_tokens reporting is now 6007 (was always 0 before).

Related: companion PR #265 (CompletionRequest schema + server-side
extra_body -> gen_kwargs threading) which opens the wire from
/v1/completions to engine.generate(). This PR closes the wire on the
engine side.
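The accumulator itself is small; a sketch consistent with the description and tests above:

```python
async def generate(self, prompt, **kwargs):
    """Thin accumulator: drain stream_generate() and return the last chunk."""
    final = None
    async for chunk in self.stream_generate(prompt, **kwargs):
        final = chunk  # kwargs (e.g. specprefill) were already forwarded
    if final is None:
        # Empty-stream edge case: return a safe default instead of raising
        return GenerationOutput(text="", finish_reason="stop")
    return final
```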
* feat(api): per-request SpecPrefill overrides on /v1/completions

ChatCompletionRequest already accepts per-request `specprefill` and
`specprefill_keep_pct` overrides, and /v1/chat/completions threads
them into engine.chat(). CompletionRequest does not, so /v1/completions
clients cannot opt a single request into (or out of) SpecPrefill, nor
tune the keep percentage per request.

Changes:

- vllm_mlx/api/models.py: add specprefill and specprefill_keep_pct to
  CompletionRequest, matching the existing ChatCompletionRequest fields.
- vllm_mlx/server.py::create_completion: extract both and thread into
  engine.generate(**gen_kwargs), mirroring the pattern used at
  server.py:1421 in create_chat_completion.
- vllm_mlx/server.py::stream_completion: apply the same extraction so
  streaming /v1/completions clients get the same control.

Both new fields default to None, so existing behavior is unchanged for
clients that do not set them. No schema changes to ChatCompletionRequest.
No engine-side changes needed: SimpleEngine.stream_generate already
consumes these kwargs (see simple.py:307-308).

* style(server): align completions kwargs handling
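The schema side is just two optional fields, sketched here in Pydantic style (the API models are assumed to be Pydantic, as is typical for FastAPI-based servers; defaults of None preserve existing behavior):

```python
from typing import Optional

from pydantic import BaseModel


class CompletionRequest(BaseModel):
    model: str
    prompt: str
    # Per-request SpecPrefill overrides, mirroring ChatCompletionRequest;
    # None means "use the server-wide default".
    specprefill: Optional[bool] = None
    specprefill_keep_pct: Optional[float] = None
```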
@qodo-code-review

Review Summary by Qodo

Prefix cache, MTP support, tool improvements, and streaming enhancements with comprehensive testing

✨ Enhancement 🧪 Tests


Walkthroughs

Description
• **Prefix cache and KV cache optimization**: Added MemoryAwarePrefixCache support with chat
  template normalization, cache extraction/merging with RotatingKVCache buffer trimming, and hybrid
  model support (ArraysCache)
• **Multi-Token Prediction (MTP) support**: Implemented MTP injection for Qwen3.5 models via
  inject_mtp_support(), added MTP weight extraction script for Dense/MoE architectures, and
  integrated MTP into scheduler with always-advance verification
• **Chunked prefill with prompt checkpoints**: Enhanced chunked prefill to support prompt
  checkpoints (positive values for token positions, non-positive for offsets) with checkpoint tail
  replay and callback support
• **Client disconnect detection**: Added PrefillAbortedError exception and prefill abort tracking
  with heartbeat SSE comments in _disconnect_guard() to detect disconnects during long prefill
  operations
• **Tool call improvements**: Implemented tool argument coercion via _coerce_tool_arguments() to
  fix LLM tool failures, added Gemma 4 and Qwen function format parsers with streaming buffering
  support, and improved tool call filtering
• **Message normalization**: Added _normalize_messages() to map non-standard roles (e.g.,
  "developer" → "system") and merge consecutive same-role messages for chat template compatibility
• **Reasoning parser enhancements**: Refactored streaming parser to state-machine approach with
  three phases (pre_think, thinking, content), added Gemma 4 reasoning parser support
• **Per-request sampling parameters**: Added forwarding of top_k, min_p, presence_penalty,
  repetition_penalty through all generation paths (completion, chat, streaming variants)
• **GPU memory utilization configuration**: Added --gpu-memory-utilization flag and dynamic memory
  pressure threshold calculation for Metal allocation limits
• **Model download utilities**: Implemented download_command() with retry logic, timeout, and
  offline mode support; optimized VLM loading with up-front detection to avoid double-loading penalty
• **Blocking operation refactoring**: Added _run_blocking_serialized() method for safe MLX
  operations under generation lock with proper cancellation handling
• **Comprehensive test coverage**: Added tests for chunked prefill checkpoints, Gemma 4
  tool/reasoning parsers, streaming chat completion, message normalization, download utilities, and
  streaming aggregation
Diagram

```mermaid
flowchart LR
  A["Prefix Cache<br/>KV Optimization"] --> B["Batch Generator<br/>Chunked Prefill"]
  C["MTP Injection<br/>Qwen3.5"] --> B
  D["Tool Parsers<br/>Gemma4/Qwen"] --> E["Server<br/>Tool Coercion"]
  F["Message<br/>Normalization"] --> E
  G["Reasoning<br/>State Machine"] --> E
  B --> H["Scheduler<br/>Request Tracking"]
  H --> I["Engine<br/>Sampling Parameters"]
  J["GPU Memory<br/>Utilization"] --> I
  K["Download<br/>Utilities"] --> L["CLI<br/>Model Pre-download"]
  L --> I
  M["Blocking Ops<br/>Serialization"] --> I
```


File Changes

1. vllm_mlx/mllm_batch_generator.py ✨ Enhancement +1177/-90

Prefix cache, chunked prefill, abort handling, and MTP support

• Added PrefillAbortedError exception and prefill abort tracking (_aborted_request_ids) to
 support client disconnect handling during long prefill operations
• Implemented _run_chunked_text_prefill() for text-only requests with real-time progress tracking
 via _prefill_progress dictionary
• Integrated KV prefix cache support with MemoryAwarePrefixCache, including chat template
 normalization for Qwen3.5 and think-suffix computation
• Enhanced _process_prompts() with per-request error handling, prefix cache lookup/hit logic, and
 per-request logits processors/samplers for sampling parameters (top_k, min_p, presence_penalty,
 repetition_penalty)
• Added _maybe_store_prefix_cache() to persist finished request caches and install_mtp_mllm()
 for multi-token prediction support with always-advance verification strategy
• Improved cache extraction and merging with RotatingKVCache buffer trimming and hybrid model
 support (ArraysCache)

vllm_mlx/mllm_batch_generator.py


2. vllm_mlx/server.py ✨ Enhancement +614/-165

Tool argument coercion, message normalization, disconnect detection

• Added _coerce_tool_arguments() to fix LLM tool call failures by JSON-stringifying object/array
 values when schema expects strings
• Implemented _normalize_messages() to map non-standard roles (e.g. "developer" → "system") and
 merge consecutive same-role messages for chat template compatibility
• Enhanced _disconnect_guard() with heartbeat SSE comments to detect client disconnects during
 long prefill by forcing ASGI writes
• Refactored Anthropic streaming (_stream_anthropic_messages()) to use reasoning parser for
 thinking blocks and improved tool call filtering with _TOOL_MARKUP_PATTERN
• Added per-request sampling parameters (top_k, min_p, presence_penalty, repetition_penalty) to
 completion and chat endpoints
• Improved tool call parsing with tool_choice="none" support and schema-aware argument coercion in
 both streaming and non-streaming paths

vllm_mlx/server.py


3. vllm_mlx/engine/batched.py ✨ Enhancement +146/-8

MTP injection, memory utilization config, sampling parameters

• Added gpu_memory_utilization parameter to control Metal memory allocation limits (default 0.90)
• Implemented _inject_mtp_mllm() to inject MTP weights into MLLM language models for multi-token
 prediction support
• Enhanced MLLM scheduler config with cache memory, MTP, and KV quantization settings; added prefill
 step size override
• Extended _apply_chat_template() to support per-request enable_thinking parameter with coder
 model detection
• Added sampling parameters (top_k, min_p, presence_penalty, repetition_penalty) forwarding to
 generate/stream_generate/chat/stream_chat methods
• Improved stats collection to promote MLLM metrics (running, num_requests, cache stats) to
 top-level for monitoring

vllm_mlx/engine/batched.py


4. vllm_mlx/models/mllm.py ✨ Enhancement +7/-3

SHA256 image hashing and enable_thinking parameter

• Changed base64 image hashing from MD5 to SHA256 to prevent collisions between images with
 identical headers
• Added enable_thinking parameter support in chat() and stream_chat() methods, forwarded to
 chat template application

vllm_mlx/models/mllm.py


5. vllm_mlx/engine/simple.py ✨ Enhancement +434/-384

Refactor blocking operations with cancellation-safe serialization

• Added _run_blocking_serialized() method to safely run blocking MLX operations under generation
 lock with proper cancellation handling
• Refactored generate() to use stream_generate() internally, enabling per-request SpecPrefill
 overrides for non-streaming clients
• Refactored chat() to route tool-stall workaround through streaming path and use
 _run_blocking_serialized() for both MLLM and LLM paths
• Added per-request enable_thinking override support in stream_chat() and
 _stream_generate_text(), falling back to environment variable
• Refactored _stream_generate_specprefill() and _stream_generate_text() to use
 _run_blocking_serialized() instead of direct asyncio.to_thread() calls

vllm_mlx/engine/simple.py


6. vllm_mlx/mllm_scheduler.py ✨ Enhancement +258/-33

Add MTP support, prefix cache, and detailed request monitoring

• Added configuration fields for KV cache memory limits, MTP speculative decoding, and prefix cache
 quantization
• Removed vision cache manager in favor of batch generator's built-in vision embedding cache
• Enhanced _get_stop_tokens() to read additional EOS tokens from generation_config.json
• Added request timing tracking (first_token_time) and per-request sampling parameters (top_k,
 min_p, presence_penalty, repetition_penalty)
• Implemented get_running_requests_info() for detailed per-request status endpoint with phase,
 progress, and throughput metrics
• Added memory management with periodic mx.clear_cache() and improved error handling for failed
 preprocessing
• Refactored _process_loop() to use thread pool executor for prefill-heavy steps while keeping
 decode-only steps inline

vllm_mlx/mllm_scheduler.py


7. scripts/add_mtp_weights_qwen35.py ✨ Enhancement +470/-0

Add MTP weight extraction and quantization script for Qwen3.5

• New script to add MTP (Multi-Token Prediction) weights to MLX Qwen3.5 models from HuggingFace BF16
 checkpoints
• Fetches shard index, downloads only MTP-containing shards with resume support, and extracts
 weights
• Handles both Dense (27B) and MoE (122B-A10B, 35B-A3B) architectures with expert weight stacking
• Applies norm shift (+1.0) for RMSNorm weights and optional quantization matching base model scheme
• Saves MTP weights to mtp/weights.safetensors subdirectory to avoid mlx_vlm glob loading
 conflicts

scripts/add_mtp_weights_qwen35.py


8. vllm_mlx/scheduler.py ✨ Enhancement +180/-31

Add prompt checkpoint support and repetition penalty handling

• Added mllm_prefill_step_size configuration override for MLLM prefill guard with validation
• Enhanced chunked prefill to support prompt checkpoints (positive values for token positions,
 non-positive for offsets)
• Added prompt_checkpoint_callback support and checkpoint tail replay before generation step
• Implemented fallback for missing mlx-lm private exports (Batch, _lazy_extract_cache) with
 compatibility checks
• Added make_logits_processors() import and per-request repetition penalty support via logits
 processors
• Fixed MTP RNN snapshot handling to only snapshot when not in optimistic mode
• Improved detokenizer pool management and added Metal buffer evaluation to prevent buildup

vllm_mlx/scheduler.py


9. vllm_mlx/patches/qwen3_5_mtp.py ✨ Enhancement +399/-0

Add runtime MTP injection for Qwen3.5 models

• New module providing runtime MTP injection for Qwen3.5 models without modifying mlx_lm source
• inject_mtp_support() creates MTP module, loads BF16 weights (no quantization for accuracy), and
 monkey-patches model class
• _fixup_moe_mtp() handles missing MoE weights by copying gates from main model and zeroing
 attention projections
• Adds return_hidden, mtp_forward(), and make_mtp_cache() methods to model class
• validate_mtp_support() checks for working MTP implementation with detailed logging

vllm_mlx/patches/qwen3_5_mtp.py


10. tests/test_batching.py 🧪 Tests +359/-1

Add chunked prefill checkpoint tests and async marker update

• Added comprehensive tests for chunked prefill with prompt checkpoints:
 test_chunked_prefill_accepts_prompt_checkpoints(),
 test_chunked_prefill_invokes_checkpoint_callback(),
 test_chunked_prefill_replays_checkpoint_tail_before_step()
• Added test for graceful handling of missing mlx-lm private exports:
 test_chunked_prefill_works_without_private_mlx_generate_exports()
• Changed async test marker from @pytest.mark.asyncio to @pytest.mark.anyio for broader async
 framework support

tests/test_batching.py


11. vllm_mlx/cli.py ✨ Enhancement +115/-3

Add GPU memory utilization and model download CLI options

• Added --gpu-memory-utilization flag (0.0-1.0, default 0.90) to control Metal allocation limits
 and emergency cache clear thresholds
• Added --mllm-prefill-step-size override flag for MLLM prefill guard configuration
• Added download options: --download-timeout, --download-retries, --offline for model
 pre-download with retry logic
• Implemented download_command() for standalone model downloading without starting server
• Refactored parser creation into create_parser() function for reusability
• Expanded tool-call parser options with new models: harmony, gpt-oss, gemma4, minimax
• Added pre-download step in serve_command() with timeout and retry configuration
• Added gpu_memory_utilization parameter to engine config and validation for valid range

vllm_mlx/cli.py


12. vllm_mlx/engine_core.py ✨ Enhancement +6/-11

Add dynamic GPU memory utilization configuration

• Added gpu_memory_utilization field to EngineConfig (default 0.90) for dynamic memory pressure
 threshold calculation
• Refactored memory pressure threshold calculation to use gpu_memory_utilization instead of fixed
 85% of max recommended working set
• Improved fallback memory threshold handling with better device memory detection

vllm_mlx/engine_core.py


13. vllm_mlx/memory_cache.py ✨ Enhancement +165/-21

Support RotatingKVCache trimming and quantization wrapper

• Refactored _trim_cache_offset() to handle RotatingKVCache (circular buffer) in addition to
 plain KVCache, with proper trimming logic that reorders temporal order and pads with zeros when
 needed
• Introduced _QuantizedCacheWrapper class to preserve original cache type metadata during
 quantization/dequantization roundtrips
• Updated _quantize_cache() and _dequantize_cache() to use the new wrapper and support multiple
 cache types (KVCache, RotatingKVCache, etc.)
• Fixed LCP (Longest Common Prefix) cache fetch logic by inverting the has_non_trimmable condition
 and adding debug logging for hybrid models

vllm_mlx/memory_cache.py


14. vllm_mlx/reasoning/think_parser.py ✨ Enhancement +99/-110

Implement state-machine streaming reasoning parser

• Refactored streaming parser from text-based detection to state-machine approach with three phases:
 pre_think, thinking, content
• Added reset_state() method and _phase tracking to avoid rescanning full accumulated text on
 every token
• Simplified extract_reasoning_streaming() logic by replacing helper methods with inline phase
 transitions
• Improved documentation with performance notes and clearer phase transition descriptions

vllm_mlx/reasoning/think_parser.py


15. tests/test_gemma4_tool_parser.py 🧪 Tests +240/-0

Add Gemma 4 tool parser test suite

• Added comprehensive test suite for Gemma4ToolParser with 25+ test cases covering extraction and
 streaming
• Tests cover single/multiple tool calls, nested objects, arrays, special characters, and edge cases
 like missing delimiters
• Includes streaming tests for buffering behavior and structured tool_calls emission on close
 delimiter
• Tests parser registration and SUPPORTS_NATIVE_TOOL_FORMAT flag

tests/test_gemma4_tool_parser.py


16. tests/test_server.py 🧪 Tests +258/-7

Add CLI and streaming chat completion tests

• Added TestServeCli class to test CLI argument parsing for tool call parser selection (harmony,
 gpt-oss aliases)
• Added TestStreamChatCompletion class with two async tests for reasoning stream with tool calls
 and plain content
• Tests verify tool_calls chunks are emitted after </think> and tool parser is skipped for
 non-markup content
• Minor formatting fix: changed f"Request {i+1}" to f"Request {i + 1}" and replaced
 asyncio.get_event_loop().run_until_complete() with asyncio.run()

tests/test_server.py


17. vllm_mlx/prefix_cache.py ✨ Enhancement +139/-77

Support multiple cache types in prefix cache reconstruction

• Refactored _extract_block_tensor_slice() to return per-layer metadata dicts instead of tuples,
 supporting both sequence-backed and recurrent cache types
• Added _can_concatenate_cache_state(), _slice_concat_cache_state(), and
 _concat_cache_states() helper methods for flexible cache handling
• Updated reconstruct_cache() to handle both concat (sequence-backed) and latest (recurrent)
 storage modes with proper type reconstruction
• Improved docstrings to clarify support for RotatingKVCache and other cache types beyond plain
 KVCache

vllm_mlx/prefix_cache.py


18. tests/test_simple_engine.py 🧪 Tests +164/-5

Add streaming aggregation and tool-enabled chat tests

• Added pytestmark = pytest.mark.anyio and anyio_backend fixture for async test compatibility
• Added stream_generate_side_effect to mock model to track concurrency alongside generate
• Added three new async tests: test_chat_with_tools_aggregates_streaming_path(),
 test_generate_accumulates_over_stream_generate(), and
 test_generate_empty_stream_returns_safe_default()
• Changed @pytest.mark.asyncio to @pytest.mark.anyio for consistency

tests/test_simple_engine.py


19. tests/test_reasoning_parser.py 🧪 Tests +266/-0

Add Gemma 4 reasoning parser tests

• Added TestGemma4Parser class with 20+ test cases for non-streaming and streaming reasoning
 extraction
• Tests cover standard format (<|channel>thought...<channel|>), alternative format, channel name
 stripping, and edge cases
• Streaming tests verify state transitions and proper handling of character-by-character token
 boundaries
• Updated docstring to mention Gemma 4 parser alongside Qwen3 and DeepSeek-R1

tests/test_reasoning_parser.py


20. vllm_mlx/utils/tokenizer.py ✨ Enhancement +107/-26

Optimize VLM loading and MTP injection for Qwen3.5

• Added _needs_strict_false() function to detect VLM models (e.g., Qwen3.5) up-front and avoid
 double-loading penalty
• Enhanced load_model_with_fallback() to call _needs_strict_false() before first load attempt
 and skip to _load_strict_false() for VLM models
• Improved _load_strict_false() with weight verification logging and proper memory cleanup
 (clearing traceback references, garbage collection)
• Updated _try_inject_mtp() to detect Qwen3.5 vs Qwen3-Next by checking model_type and load
 appropriate MTP patch
• Enhanced _try_inject_mtp_post_load() to check both flat and nested config paths and support new
 MTP weights directory structure
• Refactored _load_with_tokenizer_fallback() to use ensure_model_downloaded() helper with
 retry/timeout support

vllm_mlx/utils/tokenizer.py


21. vllm_mlx/tool_parsers/gemma4_tool_parser.py ✨ Enhancement +237/-0

Add Gemma 4 tool call parser implementation

• Implemented new Gemma4ToolParser class handling Gemma 4's native tool call format with
 <|tool_call> delimiters and <|"|> string tokens
• Added _find_balanced_brace() to handle brace matching while skipping over <|"|>-delimited
 strings
• Implemented _gemma4_args_to_json() three-step converter: extract strings to placeholders, quote
 bare keys, restore as JSON-escaped strings (a sketch follows this entry)
• Supports both complete extraction and streaming with proper buffering during incomplete tool call
 blocks

vllm_mlx/tool_parsers/gemma4_tool_parser.py
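To make the three-step converter concrete, a hypothetical sketch (delimiter handling simplified; the placeholder scheme and regexes are illustrative, not the shipped code):

```python
import json
import re


def gemma4_args_to_json(raw: str) -> str:
    """Convert Gemma 4's brace/<|"|> argument syntax into a JSON object string."""
    strings: list[str] = []

    def _stash(m: re.Match) -> str:
        strings.append(m.group(1))
        return f"@@{len(strings) - 1}@@"

    # Step 1: extract <|"|>-delimited strings into placeholders
    body = re.sub(r'<\|"\|>(.*?)<\|"\|>', _stash, raw, flags=re.DOTALL)
    # Step 2: quote bare keys (identifier before a colon)
    body = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', body)
    # Step 3: restore strings as JSON-escaped literals
    return re.sub(r"@@(\d+)@@", lambda m: json.dumps(strings[int(m.group(1))]), body)
```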


22. tests/test_tool_parsers.py 🧪 Tests +177/-0

Add Qwen function format and streaming buffering tests

• Added Gemma4ToolParser to imports and registration test
• Added TestQwenFunctionFormat class with 3 tests for Qwen's <function=name> format support
• Added TestQwenStreamingBuffering class with 5 tests for partial marker buffering and false
 positive recovery
• Tests verify that partial markers like <function are buffered and content before them is emitted
 immediately

tests/test_tool_parsers.py


23. vllm_mlx/tool_parsers/qwen_tool_parser.py ✨ Enhancement +141/-2

Add Qwen function format and streaming buffering support

• Added _parse_param_value() helper to parse parameter values as JSON, Python literals, or plain
 strings
• Added support for Qwen's <function=name><parameter=key>value</parameter></function> format
 (Qwen3.5 native)
• Implemented partial marker buffering with _PARTIAL_MARKERS, _has_partial_marker(),
 _get_partial_marker_len(), and _was_buffering() methods
• Enhanced extract_tool_calls_streaming() to handle function-style format and buffer incomplete
 markers while emitting safe content

vllm_mlx/tool_parsers/qwen_tool_parser.py


24. tests/test_download.py 🧪 Tests +196/-0

Add download utility test suite

• Added comprehensive test suite for ensure_model_downloaded() with 6 test classes covering local
 paths, retry logic, offline mode, timeout, and allow patterns
• Tests verify retry behavior with exponential backoff, KeyboardInterrupt propagation, offline mode
 caching, and HF_HUB_DOWNLOAD_TIMEOUT environment variable handling
• Includes tests for LLM vs MLLM allow patterns and CLI download command argument parsing

tests/test_download.py


25. tests/test_normalize_messages.py 🧪 Tests +174/-0

Add message normalization test suite

• Added test suite for _normalize_messages() function with 13 test cases
• Tests cover merging consecutive same-role messages, developer role mapping to system, OpenCode
 format normalization, and edge cases
• Verifies that multimodal content and null content are not merged, and non-content fields are
 preserved

tests/test_normalize_messages.py


26. vllm_mlx/api/utils.py ✨ Enhancement +8/-1

Add Gemma 4 and Qwen3.5 multimodal support

• Extended SPECIAL_TOKENS_PATTERN regex to include Gemma 4 tool call tokens (</?tool_call>,
 </?tool_call_reasoning>) and other special markers
• Added <|tool_call> and <tool_call|> delimiters to TOOL_CALL_MARKERS tuple for Gemma 4
 support
• Added "gemma-4", "gemma4", "Qwen3.5-", and "qwen3_5" to MLLM_MODELS list for proper multimodal
 detection

vllm_mlx/api/utils.py


27. vllm_mlx/reasoning/__init__.py ✨ Enhancement +2/-0

Register Gemma 4 reasoning parser

vllm_mlx/reasoning/__init__.py


28. tests/test_native_tool_format.py 🧪 Tests +2/-0

Update native tool format test for Gemma 4

• Added Gemma4ToolParser to imports
• Added Gemma4ToolParser to non_native_parsers list in test_parsers_without_native_support()
 test

tests/test_native_tool_format.py


29. LICENSE Additional files +176/-0
30. README.md Additional files +1/-0
31. benchmarks/bench_reasoning_parser.py Additional files +55/-0
32. docs/reference/models.md Additional files +3/-2
33. pyproject.toml Additional files +4/-4
34. tests/conftest.py Additional files +6/-0
35. tests/test_batched_engine.py Additional files +94/-0
36. tests/test_batching_deterministic.py Additional files +11/-11
37. tests/test_continuous_batching.py Additional files +1/-1
38. tests/test_gemma4_openai_format.py Additional files +160/-0
39. tests/test_minimax_tool_calling.py Additional files +130/-0
40. tests/test_mllm_continuous_batching.py Additional files +47/-0
41. tests/test_paged_cache.py Additional files +149/-0
42. tests/test_simple_engine_cancel_serialization.py Additional files +143/-0
43. tests/test_specprefill_rotating_cache.py Additional files +84/-0
44. tests/test_streaming_latency.py Additional files +1/-1
45. tests/test_tokenizer_utils.py Additional files +54/-0
46. tests/test_tool_choice_none.py Additional files +65/-0
47. vllm_mlx/api/anthropic_adapter.py Additional files +5/-0
48. vllm_mlx/api/anthropic_models.py Additional files +3/-1
49. vllm_mlx/api/models.py Additional files +18/-0
50. vllm_mlx/api/tool_calling.py Additional files +42/-0
51. vllm_mlx/models/llm.py Additional files +53/-6
52. vllm_mlx/patches/gemma4_mllm.py Additional files +121/-0
53. vllm_mlx/patches/qwen3_5_mllm.py Additional files +120/-0
54. vllm_mlx/reasoning/gemma4_parser.py Additional files +170/-0
55. vllm_mlx/request.py Additional files +1/-0
56. vllm_mlx/specprefill.py Additional files +5/-5
57. vllm_mlx/text_model_from_vlm.py Additional files +17/-5
58. vllm_mlx/tool_parsers/__init__.py Additional files +6/-0
59. vllm_mlx/tool_parsers/auto_tool_parser.py Additional files +23/-13
60. vllm_mlx/tool_parsers/minimax_tool_parser.py Additional files +172/-0
61. vllm_mlx/utils/__init__.py Additional files +2/-1
62. vllm_mlx/utils/download.py Additional files +144/-0


@qodo-code-review

qodo-code-review Bot commented Apr 14, 2026

Code Review by Qodo

🐞 Bugs (3)   📘 Rule violations (0)   📎 Requirement gaps (0)
≡ Correctness (1)   ☼ Reliability (1)   ➹ Performance (1)



Action required

1. Memory threshold uses RAM 🐞
Description
EngineCore computes its emergency cache-clear threshold from mx.device_info()['memory_size'] instead
of Metal’s max_recommended_working_set_size, so cache clearing can trigger far too late and lead to
OOM/Metal instability. This diverges from other memory sizing in the codebase that consistently uses
max_recommended_working_set_size for Metal limits.
Code

vllm_mlx/engine_core.py[R154-160]

```diff
+        # Emergency memory pressure threshold — dynamic based on gpu_memory_utilization
+        _gpu_mem_util = self.config.gpu_memory_utilization
         try:
-            _device_info = mx.device_info()
-            _max_recommended = _device_info.get(
-                "max_recommended_working_set_size",
-                _device_info.get("memory_size", 0),
-            )
-            _memory_pressure_threshold = (
-                int(_max_recommended * 0.85)
-                if _max_recommended > 0
-                else 200 * 1024 * 1024 * 1024
+            _device_mem = mx.device_info().get("memory_size", 200 * 1024 * 1024 * 1024)
+            _memory_pressure_threshold = int(
+                _device_mem * min(_gpu_mem_util + 0.05, 0.99)
             )
```
Evidence
EngineCore now derives the pressure threshold from total device memory, while other components use
Metal’s max recommended working set sizing, which is the safer bound for preventing command-buffer
failures under memory pressure.

vllm_mlx/engine_core.py[154-162]
vllm_mlx/engine/batched.py[358-378]
vllm_mlx/mllm_batch_generator.py[443-448]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`EngineCore`’s emergency memory pressure threshold is computed from `mx.device_info()['memory_size']` (physical memory), not `max_recommended_working_set_size`. On Metal, the recommended working set is the relevant ceiling; using physical memory can delay cache clearing until it’s too late, increasing risk of OOM / Metal command-buffer failures.

### Issue Context
Other parts of this repo already use `max_recommended_working_set_size` for memory sizing and limits (e.g., BatchedEngine’s `mx.set_memory_limit` and MLLM wired limit), so `EngineCore` should align with that source of truth.

### Fix Focus Areas
- vllm_mlx/engine_core.py[154-162]
- vllm_mlx/engine/batched.py[358-378]

### Implementation notes
- Prefer `max_recommended_working_set_size` when present; fall back to `memory_size` only if it’s missing/zero.
- Keep the `gpu_memory_utilization` scaling behavior, but scale the recommended working set, not physical memory.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
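A sketch of the suggested remediation (keeping the utilization scaling, but over the recommended working set; `mx` and `_gpu_mem_util` as in the diff above):

```python
info = mx.device_info()
# Prefer Metal's recommended working set; fall back to physical memory only
# when the key is missing or zero.
limit = info.get("max_recommended_working_set_size", 0) or info.get("memory_size", 0)
if limit > 0:
    _memory_pressure_threshold = int(limit * min(_gpu_mem_util + 0.05, 0.99))
else:
    _memory_pressure_threshold = 200 * 1024 * 1024 * 1024  # conservative default
```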



Remediation recommended

2. Strict-false scans all weights 🐞
Description
_load_strict_false() counts all-zero tensors by running mx.all(v == 0) over every parameter, which
is O(total tensor elements) work and can severely slow startup for large models. This runs on every
strict=False load (the exact path intended for huge VLM/extra-weight models), undermining the goal
of faster/more reliable loading.
Code

vllm_mlx/utils/tokenizer.py[R129-134]

```diff
+    # Verify weights loaded correctly
+    from mlx.utils import tree_flatten
+
+    params = tree_flatten(model.parameters())
+    total_params = len(params)
+    zero_params = sum(1 for _, v in params if mx.all(v == 0).item())
```
Evidence
The strict=False loader now flattens all model parameters and performs a full-tensor reduction on
each to detect all-zero weights; this is expensive for multi-GB models and is in the default
strict=False path used by core engine/model loaders.

vllm_mlx/utils/tokenizer.py[114-145]
vllm_mlx/engine/batched.py[327-343]
vllm_mlx/models/llm.py[74-96]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`_load_strict_false()` performs a full scan of every parameter tensor:
- `tree_flatten(model.parameters())`
- `mx.all(v == 0)` for each tensor

For large models (especially the VLM/extra-weights models that trigger strict=False), this adds significant startup latency and additional MLX/Metal work.

### Issue Context
This code runs inside the main model loading path (`load_model_with_fallback()`), which is called by both `BatchedEngine` and `vllm_mlx.models.llm.LLM.load()`.

### Fix Focus Areas
- vllm_mlx/utils/tokenizer.py[129-145]
- vllm_mlx/engine/batched.py[327-343]
- vllm_mlx/models/llm.py[74-96]

### Suggested fix
- Remove the full-parameter `mx.all(v == 0)` scan, or gate it behind a debug flag (e.g., env var) and/or `logger.isEnabledFor(logging.DEBUG)`.
- If you still want a sanity check, do a constant-time spot check (e.g., a couple known tensors like embeddings / lm_head) rather than iterating all tensors.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
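One way to gate the check, per the suggested fix (the env var name is illustrative; this fragment would sit inside _load_strict_false() where `model` is in scope):

```python
import logging
import os

import mlx.core as mx
from mlx.utils import tree_flatten

logger = logging.getLogger(__name__)

if os.environ.get("VLLM_MLX_VERIFY_WEIGHTS") and logger.isEnabledFor(logging.DEBUG):
    # Opt-in debug path only: full-tensor reductions over every parameter
    params = tree_flatten(model.parameters())
    zero_params = sum(1 for _, v in params if mx.all(v == 0).item())
    logger.debug("weight check: %d/%d all-zero tensors", zero_params, len(params))
```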


3. presence_penalty ignored in Scheduler 🐞
Description
presence_penalty is added to SamplingParams and is forwarded from the server, but the LLM
continuous-batching Scheduler only constructs logits processors from repetition_penalty. As a
result, presence_penalty has no effect for the EngineCore/Scheduler path even though it’s accepted
by the API.
Code

vllm_mlx/request.py[R57-61]

```diff
     top_p: float = 0.9
     top_k: int = 0  # 0 means disabled
     min_p: float = 0.0
+    presence_penalty: float = 0.0
     repetition_penalty: float = 1.0
```
Evidence
SamplingParams now includes presence_penalty and server forwards it into generation kwargs, but
Scheduler’s per-request logits processor construction only passes repetition_penalty to
make_logits_processors; therefore presence_penalty is never applied on the LLM batching path. The
MLLM batching path in this repo already passes presence_penalty into make_logits_processors,
demonstrating intended support exists but is inconsistent.

vllm_mlx/request.py[51-63]
vllm_mlx/server.py[2063-2071]
vllm_mlx/scheduler.py[1915-1923]
vllm_mlx/mllm_batch_generator.py[1224-1238]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The server forwards `presence_penalty`, and `SamplingParams` stores it, but the LLM continuous batching `Scheduler` only builds logits processors for `repetition_penalty`. This means `presence_penalty` is silently ignored for the EngineCore/Scheduler path.

### Issue Context
The MLLM batching code (`MLLMBatchGenerator`) already supports both `repetition_penalty` and `presence_penalty` via `make_logits_processors(**lp_kwargs)`, so the LLM Scheduler should match that behavior.

### Fix Focus Areas
- vllm_mlx/scheduler.py[1915-1935]
- vllm_mlx/request.py[51-63]
- vllm_mlx/mllm_batch_generator.py[1224-1238]

### Suggested fix
- In `Scheduler._schedule_waiting()`, read `presence_penalty = request.sampling_params.presence_penalty`.
- If either penalty is non-default, call `make_logits_processors(repetition_penalty=..., presence_penalty=...)` with the non-default values.
- Keep existing behavior when values are defaults (1.0 for repetition, 0.0 for presence).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
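A sketch of that change in Scheduler._schedule_waiting(), assuming make_logits_processors accepts presence_penalty as the MLLM path in this repo already does (the import is already present per the scheduler walkthrough above):

```python
sp = request.sampling_params
lp_kwargs = {}
if sp.repetition_penalty != 1.0:
    lp_kwargs["repetition_penalty"] = sp.repetition_penalty
if sp.presence_penalty != 0.0:
    lp_kwargs["presence_penalty"] = sp.presence_penalty
# Defaults (1.0 / 0.0) keep the existing no-processor fast path
logits_processors = make_logits_processors(**lp_kwargs) if lp_kwargs else None
```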


@krystophny krystophny merged commit 3f9cb1e into computor-org:fix/chat-template-kwargs-forwarding Apr 14, 2026