fix: route text-only requests through MLLM scheduler for vision models #52

Closed

janhilgard wants to merge 32 commits into waybarrios:main from janhilgard:fix/mllm-text-only-crash


Conversation

@janhilgard
Collaborator

Summary

  • Fix `AttributeError: 'NoneType' object has no attribute 'generate'` crash when sending text-only requests to vision (MLLM) models
  • Fix false "Aborting orphaned MLLM request" log spam in stream_outputs

Problem

When a vision model is loaded via BatchedEngine, only the MLLM scheduler is initialized (`_start_mllm()`), not the LLM engine (AsyncEngineCore). However, the condition in `generate()` and `stream_generate()` required `(images or videos)` to be truthy before routing through the MLLM scheduler — so text-only requests fell through to the LLM engine path, where `self._engine` is None, causing a crash.

Fix

Remove the `(images or videos)` guard from the MLLM condition. The MLLM scheduler's `generate()` already accepts `images=None`, so text-only requests work correctly through the MLLM path.
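The before/after routing can be sketched as follows. The class body is a minimal stand-in, not the real vllm-mlx BatchedEngine: `route()` is a hypothetical helper, and only the condition mirrors the actual fix.

```python
class BatchedEngine:
    """Toy model of the engine's request routing (illustrative only)."""

    def __init__(self, is_mllm: bool):
        self._is_mllm = is_mllm
        # For vision models only the MLLM scheduler exists; the LLM
        # engine (AsyncEngineCore) is never created.
        self._mllm_scheduler = object() if is_mllm else None
        self._engine = None if is_mllm else object()

    def route(self, images=None, videos=None) -> str:
        # Before the fix the condition also required (images or videos),
        # so text-only requests on MLLM models fell through to the LLM
        # path below, where self._engine is None:
        #   if self._is_mllm and self._mllm_scheduler and (images or videos): ...
        if self._is_mllm and self._mllm_scheduler:
            return "mllm"  # MLLM scheduler accepts images=None
        if self._engine is None:
            raise AttributeError("'NoneType' object has no attribute 'generate'")
        return "llm"
```

With the guard removed, a text-only request on a vision model routes to `"mllm"` instead of raising.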

Additionally, set `finished_normally = True` before the `yield` in `stream_outputs()` so that when the consumer breaks on `output.finished`, the generator's `finally` block doesn't incorrectly call `abort_request()`.
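The yield-ordering issue is a general Python generator pitfall. This self-contained sketch (with illustrative names, not the real `stream_outputs`) shows why setting the flag after the `yield` fires a false abort whenever the consumer breaks on the final output:

```python
aborted = []

def stream_outputs(outputs, flag_before_yield):
    finished_normally = False
    try:
        for out in outputs:
            if flag_before_yield and out["finished"]:
                finished_normally = True      # fixed ordering
            yield out                         # consumer may break here
            if not flag_before_yield and out["finished"]:
                finished_normally = True      # buggy: never reached after a break
    finally:
        if not finished_normally:
            aborted.append("abort_request()")  # stands in for the orphan-abort log

def consume(gen):
    try:
        for out in gen:
            if out["finished"]:
                break                         # typical consumer behavior
    finally:
        gen.close()                           # deterministic finalization

outs = [{"finished": False}, {"finished": True}]
consume(stream_outputs(outs, flag_before_yield=False))  # triggers a false abort
consume(stream_outputs(outs, flag_before_yield=True))   # shuts down cleanly
```

When the consumer breaks, `close()` raises GeneratorExit at the suspended `yield`, so any code after it never runs — which is why the flag must be set before yielding.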

Test plan

  • Text-only completion on MLLM model returns valid response (previously crashed with 500)
  • Vision completion with base64 image works (model correctly identifies "red" square)
  • No more "Aborting orphaned MLLM request" log spam
  • Non-MLLM models unaffected (LLM engine path unchanged)

🤖 Generated with Claude Code

janhilgard and others added 30 commits February 6, 2026 17:45
Add /v1/messages and /v1/messages/count_tokens endpoints that translate
Anthropic Messages API requests to OpenAI format, enabling clients like
Claude Code to communicate with vllm-mlx directly.

Also fix tool call parsing: when the configured parser (e.g. hermes)
doesn't recognize a format (e.g. Nemotron XML), fall back to the generic
parser which handles more formats.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace .nbytes access (which triggers lazy MLX evaluation) with
shape+dtype-based memory estimation in estimate_kv_cache_memory().
Remove eager mx.eval(mx.array(0)) from cache extraction path that
forced full graph evaluation. Add incremental per-layer mx.eval()
in _cleanup_finished() to spread evaluation cost. Add BatchGenerator
close(), periodic mx.clear_cache(), and Metal memory stats reporting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…safety

- Save KV cache incrementally during chunked prefill (every 8192 tokens)
  so cancelled/disconnected long-context requests preserve partial work
- Add supersequence cache matching (hit when cached key is longer than query)
- Add _wait_with_disconnect() for non-streaming endpoints to detect client
  disconnect during prefill and abort orphaned requests
- Add try/finally cancellation safety to EngineCore.generate() to clean up
  scheduler state on CancelledError
- Support hybrid Mamba+Attention models via generic class_ref.from_state()
  reconstruction instead of hardcoded KVCache tuple unpacking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BatchKVCache doesn't inherit from KVCache, so _merge_caches
couldn't handle restored cache objects. Convert to KVCache during
reconstruction since mid-prefill save is always batch_size=1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Supersequence match covers all requested tokens (remaining=[])
so it should always be preferred over a prefix match which only
covers a subset. Previously prefix match was checked first, causing
full cache entries to be ignored in favor of partial mid-prefill ones.
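The corrected priority can be sketched with a toy lookup (illustrative names; the real matcher works over stored cache entries, not a plain dict):

```python
def best_match(query, cache):
    """Prefer a supersequence hit (covers all tokens, remaining=[]) over a
    prefix hit (covers only a subset). Sketch of the described fix."""
    super_hit, prefix_hit = None, None
    q = tuple(query)
    for key in cache:
        if key[:len(q)] == q:
            # Cached sequence extends the query: everything is covered.
            super_hit = (key, [])
        elif q[:len(key)] == key:
            # Cached sequence is a prefix: the suffix must still be prefilled.
            cand = (key, list(q[len(key):]))
            if prefix_hit is None or len(key) > len(prefix_hit[0]):
                prefix_hit = cand
    return super_hit or prefix_hit  # supersequence always wins when present
```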

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep mid-prefill cache entries after request completion instead of
deleting them, and store prompt-only entries alongside prompt+output.
This allows future requests sharing the same prefix (e.g. same system
prompt + tools but different user message) to get a prefix cache hit.

Tested: 4.4x speedup for requests with shared 30K-token prefix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Defer abort_request() to executor thread via thread-safe deque to
prevent race condition between main asyncio thread and mlx-step_0
thread that causes Metal assertion failure (uncommitted encoder).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Check _pending_abort_ids inside _chunked_next() before each chunk
so partial prefill stops within 1 chunk instead of running to
completion. Also fix _do_abort_request to clean up BatchGenerator
even when request was already removed from self.requests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Streaming mode was sending tool calls as plain text in delta.content
instead of structured delta.tool_calls objects. This broke tool call
detection in clients (Claude Code, Cursor, etc.) during streaming.

- Add tool parser integration to stream_chat_completion() with lazy
  initialization of _tool_parser_instance for streaming-first requests
- Extend HermesToolParser to handle Nemotron XML format
  (<function=name><parameter=p>v</parameter></function>) used by
  Qwen3-Coder-Next in addition to JSON format
- Add fallback for incomplete tool calls at end of stream
- Set finish_reason to "tool_calls" when tool calls are detected

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Parameter values from <parameter=name>value</parameter> were always
treated as strings, causing nested arrays and objects to be
double-serialized (e.g. "books": "[{...}]" instead of "books": [{...}]).

Now try json.loads() on each parameter value first — if it parses as
valid JSON (array, object, number, boolean), use the parsed value;
otherwise keep the raw string.
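A minimal sketch of that coercion — plain string in, Python value out (`coerce_param` is a hypothetical name, not the parser's actual API):

```python
import json

def coerce_param(raw):
    """Try to parse a <parameter=...> value as JSON; fall back to the
    raw string when it isn't valid JSON."""
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return raw
```

Nested structures stop being double-serialized: `'[{"title": "Dune"}]'` becomes a real list of dicts, `'42'` becomes an int, while free text like `'hello world'` stays a string.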

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Qwen3-Coder-Next is a hybrid model with 36 MambaCache + 12 KVCache
layers. MambaCache stores cumulative SSM state that cannot be trimmed
back to "prompt only" after output generation. The previous approach
of storing prompt-only cache entries by trimming the post-generation
cache caused immediate EOS on cache hit (1-token responses).

Fix: capture prompt-only cache state by monkey-patching
_process_prompts in BatchGenerator. At the point where it returns,
the batch cache contains the exact prompt-only state (all prompt
tokens processed, no output token fed back yet). This state is
correct for both KVCache and MambaCache layers.

Also disable supersequence match trimming for hybrid models
(detected via non-trimmable cache layers) to prevent the same
corruption. Pure KVCache models still use supersequence trimming.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In multi-turn agentic workloads (e.g. tool calling), each request
extends the previous conversation. The cache was storing complete
KV state for every request independently — 10 sequential tool calls
on a 30K-token conversation stored ~190K tokens worth of KV data
(~34GB), even though 95%+ was duplicated prefix data.

Now, when storing a new cache entry, any existing entries whose
token sequence is a strict prefix of the new entry are evicted
first. Since fetch() always prefers the longest match, these
shorter entries would never be selected anyway.

Expected impact: ~6x memory reduction for typical agentic sessions
(190K → 30K stored tokens, ~34GB → ~5-8GB cache).
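The eviction rule can be sketched over a dict keyed by token tuples (illustrative; the real store manages memory-aware entries, not raw dicts):

```python
def store(cache, key, value):
    """Before inserting, evict entries whose key is a strict prefix of the
    new key: longest-match fetch would never select them anyway."""
    stale = [k for k in cache
             if len(k) < len(key) and key[:len(k)] == k]
    for k in stale:
        del cache[k]  # frees the duplicated prefix KV data
    cache[key] = value
```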

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After 2-3 rounds of tool use, the model would start generating tool
calls as text ("[Calling tool: read(...)]") instead of structured
<tool_call> XML.  Root cause: the Hermes parser had
SUPPORTS_NATIVE_TOOL_FORMAT = False (default), so assistant tool_calls
were converted to "[Calling tool: ...]" text and tool results to
"[Tool Result (...)]" in the conversation history.  The model then
mimicked this text format instead of producing proper tool call XML.

Two fixes:
1. Set SUPPORTS_NATIVE_TOOL_FORMAT = True on HermesToolParser, so
   role="tool" and tool_calls are preserved in their native format
   when building the chat template input.

2. Parse tool_call arguments from JSON string to dict before passing
   to the chat template.  Qwen3's template iterates
   tool_call.arguments|items which requires a dict, but the OpenAI
   API sends arguments as a JSON string.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Anthropic /v1/messages streaming path filtered special tokens
(im_end, im_start, etc.) but the OpenAI /v1/chat/completions
streaming path did not. This caused raw <|im_end|> to appear in
client output (e.g. "Dev server is running at http://localhost:3000<|im_end|>").

Use the shared SPECIAL_TOKENS_PATTERN regex in both streaming paths
for consistent filtering.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the model generates a stop token (e.g. <|im_end|>, token 151645),
the scheduler correctly detects it and sets finish_reason="stop", but
still decoded it into new_text which was then sent to the client as
content. This caused "<|im_end|>" to appear at the end of responses.

Now, when finish_reason is "stop", new_text is set to "" so the stop
token is never sent as content. The server-side SPECIAL_TOKENS_PATTERN
filter remains as a safety net for edge cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e hits)

Two bugs prevented prefix cache hits on the Anthropic endpoint:

1. prompt_cache_save was never called for chunked prefill. Large prompts
   (>4096 tokens) bypass _process_prompts and go through _chunked_next,
   which had no prompt_cache_save hook. Only prompt+output entries were
   stored, whose keys include generated tokens that don't match the next
   request's prompt — causing permanent cache misses.

2. Prefix-subset eviction removed prompt-only entries. When cache_store
   saved a prompt+output entry, the eviction logic deleted the shorter
   prompt-only entry (from prompt_cache_save) because it was technically
   a prefix. But the prompt-only entry was the one future requests needed.

Fixes:
- Add prompt_cache_save call after chunked prefill completion
- Add evict_prefixes parameter to store(); set False for cache_store
- Log cache clears on BatchGenerator recreation for diagnostics

Before: Anthropic endpoint always MISS, TTFT 30-90s for 50-100K prompts
After: Cache HIT on consecutive requests, TTFT 0.3-1.7s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
KV cache entries depend only on input tokens, not sampling parameters
(temperature, top_p, min_p). When different clients send requests with
different sampler configs, the BatchGenerator is recreated but the
prefix cache was unnecessarily cleared — causing full re-prefill of
50-140K token prompts (30-90s TTFT) on every model/client switch.

Now the cache is preserved across BatchGenerator recreations, since
the server runs a single model and all cache entries remain valid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When cache entries survived a BatchGenerator recreation (sampler param
change), stale entries with batch_size > 1 could cause a concatenation
crash in _merge_caches: shapes like (2,3,8192) vs (5,2048,8192) are
incompatible. The error looped indefinitely, blocking all requests.

Two fixes:
1. Add batch dimension validation to _validate_cache(): reject entries
   where KVCache keys.shape[0] != 1 or MambaCache cache[i].shape[0] != 1.
   These are caught at fetch time and treated as cache MISS.

2. Wrap batch_generator.insert() in try/except: if an unexpected shape
   mismatch slips through validation, discard the cache and retry the
   insert without it, instead of entering an infinite error loop.
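Both guards can be sketched independently of MLX — shapes as plain tuples, and `insert_with_fallback` as a hypothetical wrapper standing in for the try/except around `batch_generator.insert()`:

```python
def validate_cache(shapes):
    """Reject restored entries whose leading (batch) dimension is not 1;
    callers treat a False result as a cache MISS."""
    return all(s[0] == 1 for s in shapes)

def insert_with_fallback(insert, cache):
    """If inserting with the cache raises on a shape mismatch, discard
    the cache and retry once without it instead of looping on the error."""
    try:
        return insert(cache)
    except ValueError:
        return insert(None)
```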

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…odels

The prefix cache was returning MISS for every turn in agentic workloads
(multi-turn conversations with shared context) on hybrid models like
Qwen3-Next-80B which use MambaCache layers that cannot be trimmed.

Root causes and fixes:
- Add two-phase prefill: save cache at prefix boundary during chunked
  prefill so future requests with the same prefix but different suffix
  get a HIT (mid_prefill_save bypasses throttle at prefix_boundary)
- Fix prefix boundary computation: use two-tokenization LCP approach
  instead of separate prefix tokenization, avoiding Jinja template
  discrepancies (e.g. Qwen3 <think> markers on last assistant message)
- Always install chunked prefill when memory-aware cache is active,
  even without explicit --chunked-prefill-tokens flag
- Prevent prompt_cache_save from evicting mid-prefill boundary entries
  (evict_prefixes=False)
- Add LCP matching in memory_cache.py fetch() for divergent sequences
  (works for pure KVCache models; safely skipped for MambaCache)
- Pass requests dict to _install_chunked_prefill for boundary detection

Verified: Turn 1 MISS (expected), Turn 2-5 HIT with 96-97% cached
tokens and 7x TTFT improvement (1.0s → 0.13s).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r_cache, fast-path tool parsing, hybrid executor

- Replace O(N) linear scan in MemoryAwarePrefixCache.fetch() with O(log N)
  bisect-based lookup for prefix, supersequence, and LCP matches
- Remove unnecessary copy.deepcopy() from PrefixCacheManager (MLX arrays
  are immutable)
- Increase _clear_cache_interval from 16 to 32 and remove redundant
  per-layer mx.clear_cache() in _cleanup_finished
- Add fast-path in streaming tool parsing: skip extract_tool_calls_streaming()
  until '<' is seen in the token stream
- Use hybrid executor in engine loop: inline for generation-only steps (~1-3ms),
  ThreadPoolExecutor only for prefill steps that may block for seconds
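The bisect idea can be sketched over a sorted list of token-tuple keys (an illustrative helper, not the actual `fetch()`); the walk-back loop covers prefix candidates that don't sit immediately before the insertion point:

```python
import bisect

def find_match(sorted_keys, query):
    """Locate a cache match in O(log N) plus a short neighbor walk.
    sorted_keys must be a lexicographically sorted list of token tuples."""
    i = bisect.bisect_right(sorted_keys, query)
    # Entry at the insertion point: supersequence candidate (extends query).
    if i < len(sorted_keys) and sorted_keys[i][:len(query)] == query:
        return ("supersequence", sorted_keys[i])
    # Walk back over entries sharing the first token: prefix candidates.
    j = i - 1
    while j >= 0 and sorted_keys[j][:1] == query[:1]:
        k = sorted_keys[j]
        if query[:len(k)] == k:
            return ("prefix", k)
        j -= 1
    return ("miss", None)
```

Keeping the keys sorted makes the insertion point land right next to both candidates, which is what replaces the O(N) linear scan.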

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a new status endpoint that exposes per-request details for debugging
and production monitoring: phase (queued/prefill/generation), tokens/s,
TTFT, progress, and cache hit type (exact/prefix/supersequence/lcp/miss).

- Add first_token_time and cache_hit_type fields to Request dataclass
- Track _last_match_type in MemoryAwarePrefixCache.fetch() for all paths
- Set cache_hit_type after fetch() for all cache backends (memory-aware,
  paged, legacy)
- Record first_token_time in _process_batch_responses() on first output
  token
- Add Scheduler.get_running_requests_info() for per-request status data
- Extend EngineCore.get_stats() with requests info
- Add GET /v1/status endpoint in server.py
- Fix pre-existing test failures: update abort tests to match deferred
  abort pattern (abort_request() enqueues, _process_pending_aborts()
  executes)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts in:
- vllm_mlx/server.py: keep _wait_with_disconnect(), adopt upstream
  _resolve_temperature()/_resolve_top_p() helpers
- vllm_mlx/tool_parsers/hermes_tool_parser.py: combine our
  SUPPORTS_NATIVE_TOOL_FORMAT + Nemotron XML with upstream's lenient
  pattern + raw JSON fallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
HermesToolParser has SUPPORTS_NATIVE_TOOL_FORMAT = True to enable
proper multi-turn tool calling with Qwen3/Hermes models. Move it
from the "without native support" list to "with native support" in
the upstream test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Upstream merge added serve_command() usage of these args but the
argparse definitions were missing, causing AttributeError on startup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The request parameter was available in _parse_tool_calls_with_parser but
never forwarded to extract_tool_calls or the generic parse_tool_calls
fallbacks. This meant parsers like Hermes and GLM47 could never validate
tool names against the actual tools provided in the request, allowing
hallucinated tool names to pass through unfiltered.

Convert the Pydantic ChatCompletionRequest to a dict once at the top of
the function and pass it to all four call sites.
When a vision model (MLLM) is loaded, text-only requests crashed with
`AttributeError: 'NoneType' object has no attribute 'generate'` because
the LLM engine (AsyncEngineCore) is not initialized for MLLM models.

The condition `if self._is_mllm and self._mllm_scheduler and (images or videos)`
incorrectly routed text-only requests to the LLM engine path. Since
`_start_mllm()` never creates an AsyncEngineCore, `self._engine` is None.

Fix: Remove the `(images or videos)` guard so all requests on MLLM models
use the MLLM scheduler, which handles text-only prompts correctly
(images=None is valid).

Also fix a cosmetic bug in `stream_outputs` where `finished_normally` was
set after `yield`, causing false "Aborting orphaned MLLM request" log spam
when the consumer broke out of the generator on `output.finished`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
janhilgard and others added 2 commits February 9, 2026 22:01
Resolve merge conflict in mllm_scheduler.py: take upstream version
of finished_normally flag placement (yield before setting flag).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@janhilgard
Collaborator Author

Closing — all changes in this PR are already covered by #76 (973b695), which was merged to main. Both batched.py routing fix and mllm_scheduler.py yield-order fix are identical on main now.

@janhilgard closed this Feb 13, 2026