feat: MiniMax-M2 support, Tier 0 optimizations & pinned prefix cache #1
Merged
raullenchai merged 6 commits into main on Feb 25, 2026
Conversation
feat: Add MiniMax-M2 tool call parser with streaming support

Add full tool call parsing for MiniMax-M2 models' native XML format, including streaming integration with reasoning parsers.

Changes:
- minimax_tool_parser.py: Parser for the <minimax:tool_call>/<invoke> XML format with streaming support; handles a bare <invoke> without the wrapper (the model sometimes emits it inside <think> blocks); filters hallucinated <invoke> tags that carry no parameters
- cli.py: Add "minimax" to --tool-call-parser choices
- tool_parsers/__init__.py: Register MiniMaxToolParser
- api/utils.py: Strip MiniMax special tokens ([e~[, ]~b]role, ]~!b[)
- server.py: Integrate the tool parser within the reasoning parser streaming path: detect tool call markers in the reasoning stream and redirect them to content for parsing; suppress whitespace-only content before tool calls to avoid confusing clients

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
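A minimal sketch of the extraction logic described above. The `<minimax:tool_call>`/`<invoke>` tag names come from the commit message; the `<parameter name="...">` attribute syntax, the regexes, and the `extract_tool_calls` helper are assumptions for illustration, not the actual parser API.

```python
# Hypothetical, non-streaming sketch of MiniMax-M2 tool call extraction.
import json
import re

# Assumed tag shapes; the real parser also handles streaming and bare <invoke>.
INVOKE_RE = re.compile(r'<invoke name="(?P<name>[^"]+)">(?P<body>.*?)</invoke>', re.DOTALL)
PARAM_RE = re.compile(r'<parameter name="(?P<key>[^"]+)">(?P<value>.*?)</parameter>', re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    calls = []
    for m in INVOKE_RE.finditer(text):
        params = {pm["key"]: pm["value"].strip()
                  for pm in PARAM_RE.finditer(m["body"])}
        # Filter hallucinated <invoke> tags without parameters, as the
        # commit message describes.
        if not params:
            continue
        calls.append({"name": m["name"], "arguments": json.dumps(params)})
    return calls
```

Note that the filter drops an `<invoke>` with an empty body entirely, which matches the "hallucinated tag" behavior described above.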
feat: Add --gpu-memory-utilization for configurable memory limits

Add a single CLI flag that controls both the Metal soft allocation limit (mx.set_memory_limit) and the emergency cache-clear threshold in the engine loop. The default of 0.90 preserves existing behavior.

For large models (200GB+), the previous hardcoded 200GB emergency threshold and fixed 90% soft limit caused excessive cache clearing, resulting in a ~3.5x slowdown. With --gpu-memory-utilization 0.95, both limits scale to the actual device memory, eliminating the thrashing. The emergency threshold is always 5% above the soft limit (capped at 99%) to give MLX headroom for temporary allocations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
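The two limits can be derived from the single flag as the commit describes; a sketch of that arithmetic, with an illustrative helper name (the actual wiring into mx.set_memory_limit and the engine loop is not shown):

```python
# Sketch: derive both limits from one utilization fraction.
def memory_limits(device_memory_bytes: int, gpu_memory_utilization: float = 0.90):
    # Soft limit passed to mx.set_memory_limit.
    soft = int(device_memory_bytes * gpu_memory_utilization)
    # Emergency cache-clear threshold: 5% above the soft limit, capped at 99%,
    # so MLX keeps headroom for temporary allocations.
    emergency_frac = min(gpu_memory_utilization + 0.05, 0.99)
    emergency = int(device_memory_bytes * emergency_frac)
    return soft, emergency
```

With --gpu-memory-utilization 0.95 the emergency threshold hits the 99% cap, so both limits track the actual device memory instead of a fixed 200GB constant.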
feat: Add speculative decoding support with draft models

- Speculative decoding with mlx-lm draft models (1.2-1.4x speedup)
- HybridEngine: shared model between speculative and batched modes
- JSON schema enforcement with guided generation support
- Fix false-positive tool call detection for regular JSON
- Strip <think> tags from API responses to prevent JSON parse errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
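The key files list below includes vllm_mlx/speculative/prompt_lookup.py, so alongside draft models the branch appears to use prompt-lookup drafting. A sketch of that n-gram trick, with illustrative names and defaults (the real implementation and its parameters may differ):

```python
# Prompt-lookup drafting sketch: propose draft tokens by matching the
# trailing n-gram earlier in the sequence and copying what followed it.
def propose_draft(tokens: list[int], ngram: int = 3, k: int = 4) -> list[int]:
    if len(tokens) < ngram:
        return []
    tail = tokens[-ngram:]
    # Search right-to-left so the most recent earlier match wins.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            continuation = tokens[start + ngram:start + ngram + k]
            if continuation:
                return continuation  # up to k speculative tokens
    return []
```

The target model then verifies the proposed tokens in one batched forward pass, which is where the 1.2-1.4x decode speedup comes from when acceptance rates are high.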
fix: Handle unexpected exceptions in streaming disconnect guard

Previously, _disconnect_guard() only caught StopAsyncIteration from the inner generator. Any other exception (e.g. from tool call parsing, the reasoning parser, or serialization) would propagate unhandled, dropping the HTTP connection abruptly: the client sees "peer closed connection without sending complete message body (incomplete chunked read)" with no server-side error logged.

Now the guard catches all exceptions, logs the full traceback server-side, sends an SSE error event to the client, and closes the stream gracefully with [DONE].

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: Add comprehensive benchmark suite with baseline results

Six-dimension benchmark for MiniMax-M2.5 on M3 Ultra:
1. TTFT across prompt sizes (0.33-1.4s)
2. Decode throughput (49-54 tok/s)
3. Prefix cache multi-turn effectiveness (2.09x ratio)
4. Tool calling correctness (4/4, avg 2.89s)
5. Reasoning separation (1/3 fully separated)
6. Long generation stability (8192 tok, no crash)

Baseline results added to the README for tracking improvements as we implement upstream vLLM optimizations (pinned prefix cache, GC control, structural tags, etc.).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: Add Tier 0 optimizations and pinned prefix cache

GC control: disable Python GC during generation to eliminate latency spikes with large models (120GB+). Controlled via the --gc-control flag (enabled by default). At startup, GC thresholds are raised to (100000, 50, 50); during generation, GC is disabled, and a gc.collect() runs after completion.

Pinned prefix cache: prevent system prompt eviction under memory pressure. CacheBlock gains an is_pinned flag; FreeKVCacheBlockQueue.popleft() and popleft_n() skip pinned blocks; _maybe_evict_cached_block() refuses to evict pinned blocks. pin_blocks()/unpin_blocks() API on PagedCacheManager; pin_prefix()/unpin_prefix() on both PrefixCacheManager and BlockAwarePrefixCache. --pin-system-prompt auto-pins the system prompt after the first request.

Schema error hardening: guided generation now falls back to standard generation on failure instead of returning 500. Problematic schemas are logged at DEBUG level.

Optimization roadmap added to the README (Tier 0-2) with upstream vLLM PR references, plus a detailed implementation plan in docs/plans/.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
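The GC-control behavior above can be sketched as a context manager. The thresholds match the commit message; the context-manager packaging and name are illustrative rather than the actual engine code:

```python
# Sketch of --gc-control: raise thresholds at startup, pause GC during
# generation, and collect once after the request completes.
import gc
from contextlib import contextmanager

# Startup: make generation-time collections far less likely even when
# GC control is later re-enabled.
gc.set_threshold(100000, 50, 50)

@contextmanager
def gc_paused():
    was_enabled = gc.isenabled()
    gc.disable()  # no GC pauses (latency spikes) mid-generation
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()  # reclaim everything in one sweep after completion
```

Disabling the collector does not disable reference counting, so short-lived objects are still freed immediately; only cycle collection is deferred to the post-generation gc.collect().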
raullenchai pushed a commit that referenced this pull request on Mar 14, 2026
Each model now ranks engines by combined decode+TTFT bar length: the longest bar (best overall performance) appears first, and "not supported" engines sink to the bottom. Result: Rapid-MLX is #1 on 13 of 15 benchmarked models; the two exceptions (Qwen3.5-4B, Gemma 3 12B) are clearly visible.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
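The ordering rule above amounts to a two-part sort key; a sketch with illustrative row shapes (the real chart script's data model is not shown here):

```python
# Sketch of the chart ordering: longest combined decode+TTFT bar first,
# unsupported engines last. Rows are (engine, decode_bar, ttft_bar),
# with None bars meaning "not supported".
def rank(rows):
    def key(row):
        _, decode, ttft = row
        if decode is None or ttft is None:
            return (1, 0.0)           # unsupported sinks to the bottom
        return (0, -(decode + ttft))  # longest combined bar first
    return sorted(rows, key=key)
```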
raullenchai pushed a commit that referenced this pull request on Mar 15, 2026
- Bar chart now has 17 benchmarked models (Rapid-MLX #1 on 14/17)
- New Tool Calling table: 9/16 models produce structured tool calls
- Qwen family 100% across all sizes; GLM and MiniMax also 100%

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai pushed a commit that referenced this pull request on Mar 15, 2026
Merge tool calling into the bar chart as a third visual metric. Total bar length now reflects the complete user experience: speed + responsiveness + agentic capability. Rapid-MLX is #1 on 15/17 models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai added a commit that referenced this pull request on Mar 15, 2026
…ing edge case

Addresses review findings #1, #3, #7:

1. Thread safety (high): Split step() into _step_no_queue() (GPU work, safe for an executor thread) and _distribute_outputs() (queue writes, event-loop thread only). _process_loop now calls _step_no_queue in the executor and distributes outputs on the event-loop thread, preventing races on asyncio.Queue, which is not thread-safe.

2. Stop-sequence streaming (high): When a stop string appears mid-token, the valid prefix before the marker is now emitted in new_text instead of being silently dropped. Streaming clients no longer lose content.

3. Empty-string truthiness (medium): Stop-string finalization now uses an explicit `stop_trimmed` flag instead of `if not request.output_text`, which is falsy for the empty string. A stop match at position 0 no longer re-decodes the full token sequence and leaks the stop text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
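The thread-safety split in point 1 can be sketched as follows. The method names mirror the commit message, but the engine body is a stand-in, not the real class:

```python
# Sketch of the executor/event-loop split: GPU-heavy work off-thread,
# asyncio.Queue writes only on the event-loop thread.
import asyncio

class EngineLoopSketch:
    def __init__(self):
        self.queues: dict[str, asyncio.Queue] = {}

    def _step_no_queue(self):
        # Safe for an executor thread: pure compute, touches no asyncio objects.
        return [("req-1", "tok")]  # stand-in for real model outputs

    def _distribute_outputs(self, outputs):
        # Event-loop thread only: asyncio.Queue is not thread-safe.
        for req_id, text in outputs:
            self.queues.setdefault(req_id, asyncio.Queue()).put_nowait(text)

    async def _process_loop_once(self):
        loop = asyncio.get_running_loop()
        # run_in_executor returns control to the loop while the step runs;
        # the subsequent queue writes happen back on the loop thread.
        outputs = await loop.run_in_executor(None, self._step_no_queue)
        self._distribute_outputs(outputs)
```

The invariant is that `put_nowait` (and any other queue mutation) only ever runs on the thread that owns the event loop, which is what removes the race.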
raullenchai pushed a commit that referenced this pull request on Mar 21, 2026
P0 — install.sh: if no Python 3.10+ and no Homebrew is found, automatically downloads a standalone Python from python-build-standalone (no sudo needed). Eliminates the #1 install blocker for users without Homebrew.

P0 — first-request hang: adds a warmup step after model load that runs one forward pass to trigger Metal shader compilation. Prints "Warming up (compiling Metal shaders)..." so users know what's happening. Prevents the first real request from hanging for 5+ minutes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai added a commit that referenced this pull request on Mar 21, 2026
* fix: auto-install Python + Metal shader warmup on startup

P0 — install.sh: if no Python 3.10+ and no Homebrew is found, automatically downloads a standalone Python from python-build-standalone (no sudo needed). Eliminates the #1 install blocker for users without Homebrew.

P0 — first-request hang: adds a warmup step after model load that runs one forward pass to trigger Metal shader compilation. Prints "Warming up (compiling Metal shaders)..." so users know what's happening. Prevents the first real request from hanging for 5+ minutes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: strip think tags from Anthropic endpoint + disk space check

P2: Think tags leaked through the Anthropic /v1/messages endpoint because it bypassed the reasoning parser entirely. Both streaming and non-streaming paths now use the reasoning parser to separate reasoning from content, emitting only content to Anthropic clients.

P1: Add a disk space check before model download — queries HuggingFace for the model repo size and warns if available disk is insufficient. Skips silently for local/cached models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: standalone Python URL + move warmup to lifespan hook

P0: The hardcoded python-build-standalone URL pointed at the old indygreg repo, which now 404s. Updated to astral-sh/python-build-standalone with cpython 3.12.13 (release 20260320), verified accessible.

P2: Metal shader warmup ran in the CLI before the batched/hybrid engines were started (they start in the FastAPI lifespan hook). Moved the warmup into the lifespan hook so it runs after engine.start() for all engine types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add generate_warmup() to BatchedEngine and HybridEngine

Both engines inherited the no-op base generate_warmup(), so the Metal shader warmup in the lifespan hook was silently skipped for --continuous-batching and hybrid modes. Now both engines override it with a real forward pass, matching SimpleEngine's implementation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Summary
MiniMax-M2.5 model support with full tool calling, speculative decoding, configurable GPU memory management, streaming error resilience, Tier 0 performance optimizations, and a comprehensive benchmark suite.
MiniMax-M2 Support
- Tool call parser for the native XML format (<minimax:tool_call>/<invoke>) with streaming support, reasoning parser integration, and hallucinated tag filtering
- Speculative decoding via --draft-model for 1.2-1.4x decode speedup, with HybridEngine for shared model state

Infrastructure
- --gpu-memory-utilization flag to control the Metal soft allocation limit and emergency cache-clear threshold (fixes a 3.5x slowdown on 200GB+ models)
- _disconnect_guard() now catches all exceptions, logs server-side, and sends an SSE error event to the client

Tier 0 Optimizations
- GC control via --gc-control (default: enabled), --no-gc-control to disable
- Pinned prefix cache: is_pinned flag on CacheBlock, skip-pinned eviction in the LRU queue, pin_blocks()/unpin_blocks() API, pin_prefix()/unpin_prefix() on both PrefixCacheManager and BlockAwarePrefixCache; --pin-system-prompt auto-pins after the first request

Benchmark Suite
Commits
- feat: Add MiniMax-M2 tool call parser with streaming support
- feat: Add --gpu-memory-utilization for configurable memory limits
- feat: Add speculative decoding support with draft models
- fix: Handle unexpected exceptions in streaming disconnect guard
- feat: Add comprehensive benchmark suite with baseline results
- feat: Add Tier 0 optimizations and pinned prefix cache

Key files changed
- vllm_mlx/tool_parsers/minimax_tool_parser.py
- vllm_mlx/engine/hybrid.py
- vllm_mlx/speculative/prompt_lookup.py
- vllm_mlx/api/guided.py
- vllm_mlx/server.py: _disconnect_guard hardening
- vllm_mlx/cli.py: --gc-control, --pin-system-prompt, --draft-model, --gpu-memory-utilization, minimax parser
- vllm_mlx/paged_cache.py: is_pinned on CacheBlock, skip-pinned eviction, pin_blocks()/unpin_blocks() API
- vllm_mlx/prefix_cache.py: pin_prefix()/unpin_prefix() on PrefixCacheManager and BlockAwarePrefixCache
- benchmark_minmax.py
- docs/plans/tier0-pinned-prefix-cache.md
- README.md

Benchmark Results (MiniMax-M2.5, M3 Ultra 256GB)
Test plan
- _disconnect_guard error handling
- --gpu-memory-utilization at 0.90 and 0.95 on a 200GB+ model

🤖 Generated with Claude Code