feat: MiniMax-M2 support, Tier 0 optimizations & pinned prefix cache#1

Merged
raullenchai merged 6 commits into main from feat-minimax-parser on Feb 25, 2026

Conversation

@raullenchai
Owner

Summary

MiniMax-M2.5 model support with full tool calling, speculative decoding, configurable GPU memory management, streaming error resilience, Tier 0 performance optimizations, and a comprehensive benchmark suite.

MiniMax-M2 Support

  • Tool call parser — Full XML-based tool call parsing (<minimax:tool_call>/<invoke> format) with streaming support, reasoning parser integration, and hallucinated tag filtering
  • Speculative decoding — Draft model support via --draft-model for 1.2-1.4x decode speedup, with HybridEngine for shared model state
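The non-streaming half of the tool-call extraction described above can be sketched as follows. This is an illustrative stand-in, not the real parser in `vllm_mlx/tool_parsers/minimax_tool_parser.py` (which additionally handles streaming, bare `<invoke>` tags inside `<think>` blocks, and reasoning-parser integration); the exact `name=` attribute syntax for `<invoke>`/`<parameter>` is an assumption:

```python
import re

# Hypothetical sketch of <minimax:tool_call>/<invoke> extraction.
INVOKE_RE = re.compile(
    r'<invoke name="(?P<name>[^"]+)">(?P<body>.*?)</invoke>', re.DOTALL
)
PARAM_RE = re.compile(
    r'<parameter name="(?P<key>[^"]+)">(?P<value>.*?)</parameter>', re.DOTALL
)

def parse_tool_calls(text: str) -> list[dict]:
    """Extract structured tool calls from MiniMax-style XML output."""
    calls = []
    for m in INVOKE_RE.finditer(text):
        params = {p.group("key"): p.group("value").strip()
                  for p in PARAM_RE.finditer(m.group("body"))}
        # Filter hallucinated <invoke> tags that carry no parameters.
        if params:
            calls.append({"name": m.group("name"), "arguments": params})
    return calls
```

A parameter-less `<invoke>` is treated as a hallucination and dropped, matching the filtering behavior the PR describes.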

Infrastructure

  • Configurable GPU memory — new --gpu-memory-utilization flag to control the Metal soft allocation limit and the emergency cache-clear threshold (fixes a 3.5x slowdown on 200GB+ models)
  • Streaming error resilience — _disconnect_guard() now catches all exceptions, logs them server-side, and sends an SSE error event to the client
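The limit arithmetic (spelled out in the commit message further down: emergency threshold is always 5% above the soft limit, capped at 99%) can be sketched as pure arithmetic; the function name is an assumption and the real code applies the soft limit via MLX's memory API:

```python
def memory_limits(device_memory_bytes: int,
                  gpu_memory_utilization: float = 0.90) -> tuple[int, int]:
    """Sketch of the --gpu-memory-utilization arithmetic.

    The Metal soft allocation limit is utilization * device memory; the
    emergency cache-clear threshold sits 5 percentage points above the
    soft limit, capped at 99% of device memory.
    """
    soft_fraction = gpu_memory_utilization
    emergency_fraction = min(soft_fraction + 0.05, 0.99)
    return (int(device_memory_bytes * soft_fraction),
            int(device_memory_bytes * emergency_fraction))
```

At the default 0.90 this preserves the old 90% soft limit, while `--gpu-memory-utilization 0.95` scales both limits to actual device memory instead of the previous hardcoded 200GB emergency threshold.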

Tier 0 Optimizations

  • GC control — Disable Python GC during generation to eliminate latency spikes with large models (120GB+). --gc-control (default: enabled), --no-gc-control to disable
  • Pinned prefix cache — Prevent system prompt eviction under memory pressure. is_pinned flag on CacheBlock, skip-pinned eviction in LRU queue, pin_blocks()/unpin_blocks() API, pin_prefix()/unpin_prefix() on both PrefixCacheManager and BlockAwarePrefixCache. --pin-system-prompt auto-pins after first request
  • Schema error hardening — Guided generation falls back to standard generation on failure instead of 500 error
  • Optimization roadmap — Tier 0-2 roadmap added to README with upstream vLLM PR references
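The GC control behavior above can be sketched as a context manager; this is illustrative (the real implementation lives in `vllm_mlx/server.py` behind the `--gc-control` flag), but the threshold values and the collect-after-generation step match the commit message below:

```python
import gc
from contextlib import contextmanager

# At startup the commit raises GC thresholds so background collections are
# rare during serving: gc.set_threshold(100000, 50, 50)

@contextmanager
def gc_paused(enabled: bool = True):
    """Sketch of --gc-control: disable Python GC for the duration of a
    generation to avoid latency spikes, then run one explicit collection
    so garbage from the request is still reclaimed."""
    if not enabled or not gc.isenabled():
        yield
        return
    gc.disable()
    try:
        yield
    finally:
        gc.enable()
        gc.collect()
```

With very large models (120GB+), a cyclic-GC pass that touches many live objects mid-decode shows up directly as a token-latency spike, which is why the collection is deferred to after the generation completes.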

Benchmark Suite

  • Six-dimension benchmark for MiniMax-M2.5 on M3 Ultra 256GB with baseline results

Commits

  1. feat: Add MiniMax-M2 tool call parser with streaming support
  2. feat: Add --gpu-memory-utilization for configurable memory limits
  3. feat: Add speculative decoding support with draft models
  4. fix: Handle unexpected exceptions in streaming disconnect guard
  5. feat: Add comprehensive benchmark suite with baseline results
  6. feat: Add Tier 0 optimizations and pinned prefix cache

Key files changed

File Description
vllm_mlx/tool_parsers/minimax_tool_parser.py MiniMax XML tool call parser with streaming
vllm_mlx/engine/hybrid.py HybridEngine for speculative + batched mode
vllm_mlx/speculative/prompt_lookup.py Prompt lookup speculative decoding
vllm_mlx/api/guided.py JSON schema enforcement with outlines + schema error hardening
vllm_mlx/server.py GC control, guided gen fallback, auto-pin system prompt, streaming tool calls, _disconnect_guard hardening
vllm_mlx/cli.py --gc-control, --pin-system-prompt, --draft-model, --gpu-memory-utilization, minimax parser
vllm_mlx/paged_cache.py is_pinned on CacheBlock, skip-pinned eviction, pin_blocks()/unpin_blocks() API
vllm_mlx/prefix_cache.py pin_prefix()/unpin_prefix() on PrefixCacheManager and BlockAwarePrefixCache
benchmark_minmax.py Comprehensive benchmark script
docs/plans/tier0-pinned-prefix-cache.md Tier 0 implementation plan
README.md Baseline benchmark + Tier 0-2 optimization roadmap

Benchmark Results (MiniMax-M2.5, M3 Ultra 256GB)

Metric Result
TTFT (short prompt) 0.33s
TTFT (long prompt) 1.40s
Decode speed 49-54 tok/s
Prefix cache Turn4/Turn1 ratio 2.09x
Tool calling accuracy 4/4 (100%)
Long generation (8192 tok) Stable, no crash

Test plan

  • Paged cache tests (37 passed)
  • Prefix cache tests (21 passed)
  • API model tests (229 passed)
  • MiniMax tool calling: simple, multi-arg, code execution, multi-tool
  • Streaming with reasoning parser + tool call detection
  • _disconnect_guard error handling
  • --gpu-memory-utilization at 0.90 and 0.95 on 200GB+ model

🤖 Generated with Claude Code

janhilgard and others added 6 commits February 19, 2026 19:43
Add full tool call parsing for MiniMax-M2 models' native XML format,
including streaming integration with reasoning parsers.

Changes:
- minimax_tool_parser.py: Parser for <minimax:tool_call>/<invoke> XML
  format with streaming support; handles bare <invoke> without wrapper
  (model sometimes emits inside <think> blocks); filters hallucinated
  <invoke> tags without parameters
- cli.py: Add "minimax" to --tool-call-parser choices
- tool_parsers/__init__.py: Register MiniMaxToolParser
- api/utils.py: Strip MiniMax special tokens ([e~[, ]~b]role, ]~!b[)
- server.py: Integrate tool parser within reasoning parser streaming
  path — detect tool call markers in reasoning stream and redirect to
  content for parsing; suppress whitespace-only content before tool
  calls to avoid confusing clients

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a single CLI flag to control both the Metal soft allocation limit
(mx.set_memory_limit) and the emergency cache clear threshold in the
engine loop. Default 0.90 preserves existing behavior.

For large models (200GB+), the previous hardcoded 200GB emergency
threshold and fixed 90% soft limit caused excessive cache clearing,
resulting in ~3.5x slowdown. With --gpu-memory-utilization 0.95
both limits scale to the actual device memory, eliminating the
thrashing.

The emergency threshold is always 5% above the soft limit (capped
at 99%) to give MLX headroom for temporary allocations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Speculative decoding with mlx-lm draft models (1.2-1.4x speedup)
- HybridEngine: shared model between speculative + batched modes
- JSON schema enforcement with guided generation support
- Fix false positive tool call detection for regular JSON
- Strip <think> tags from API responses to prevent JSON parse errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, _disconnect_guard() only caught StopAsyncIteration from
the inner generator. Any other exception (e.g. from tool call parsing,
reasoning parser, or serialization) would propagate unhandled, causing
the HTTP connection to drop abruptly — the client sees "peer closed
connection without sending complete message body (incomplete chunked
read)" with no server-side error logged.

Now catches all exceptions, logs the full traceback server-side, sends
an SSE error event to the client, and closes the stream gracefully
with [DONE].

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
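The hardened guard described in this commit can be sketched as an async-generator wrapper; the name and SSE payload shape are assumptions, but the behavior (log server-side, emit an SSE error event, close gracefully with [DONE]) follows the commit message:

```python
import asyncio
import json
import logging

logger = logging.getLogger(__name__)

async def disconnect_guard(inner):
    """Sketch of the hardened _disconnect_guard: any exception from the
    inner SSE generator is logged with a full traceback and surfaced to
    the client as an SSE error event before a graceful [DONE]."""
    try:
        async for chunk in inner:
            yield chunk
    except Exception:
        # CancelledError (a real client disconnect) is a BaseException on
        # Python 3.8+, so it still propagates and cancels the stream.
        logger.exception("streaming generator failed")
        err = {"error": {"message": "internal server error"}}
        yield f"data: {json.dumps(err)}\n\n"
    yield "data: [DONE]\n\n"
```

Without the broad except, a failure in tool-call parsing or serialization drops the HTTP connection mid-chunk, which clients report as an incomplete chunked read with no server-side trace.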
Six-dimension benchmark for MiniMax-M2.5 on M3 Ultra:
1. TTFT across prompt sizes (0.33-1.4s)
2. Decode throughput (49-54 tok/s)
3. Prefix cache multi-turn effectiveness (2.09x ratio)
4. Tool calling correctness (4/4, avg 2.89s)
5. Reasoning separation (1/3 fully separated)
6. Long generation stability (8192 tok, no crash)

Baseline results added to README for tracking improvements
as we implement upstream vLLM optimizations (pinned prefix
cache, GC control, structural tags, etc.).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GC control: disable Python GC during generation to eliminate latency
spikes with large models (120GB+). Controlled via --gc-control flag
(enabled by default). At startup, GC thresholds raised to (100000, 50, 50).
During generation, GC is disabled and a gc.collect() runs after completion.

Pinned prefix cache: prevent system prompt eviction under memory
pressure. CacheBlock gets is_pinned flag; FreeKVCacheBlockQueue.popleft()
and popleft_n() skip pinned blocks; _maybe_evict_cached_block() refuses
to evict pinned blocks. pin_blocks()/unpin_blocks() API on
PagedCacheManager. pin_prefix()/unpin_prefix() on both PrefixCacheManager
and BlockAwarePrefixCache. --pin-system-prompt auto-pins system prompt
after first request.
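The skip-pinned eviction walk described above can be sketched against a plain deque; `CacheBlock` here is a simplified stand-in for the real class in `vllm_mlx/paged_cache.py`, and the rotation strategy is an illustrative assumption:

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class CacheBlock:            # simplified stand-in for the real CacheBlock
    block_id: int
    is_pinned: bool = False

def popleft_unpinned(free_queue: deque) -> Optional[CacheBlock]:
    """Sketch of skip-pinned eviction: walk the LRU free queue, rotate
    pinned blocks back, and evict the oldest unpinned block. Returns
    None (refuses to evict) when every remaining block is pinned."""
    for _ in range(len(free_queue)):
        block = free_queue.popleft()
        if not block.is_pinned:
            return block
        free_queue.append(block)  # keep the pinned block resident
    return None
```

This is what keeps a pinned system prompt resident under memory pressure: eviction simply never selects its blocks, no matter how cold they look to the LRU ordering.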

Schema error hardening: guided generation now falls back to standard
generation on failure instead of returning 500. Problematic schemas
logged at DEBUG level.
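The fallback shape is simple enough to sketch directly; the function and parameter names here are hypothetical, standing in for the guided-generation path in `vllm_mlx/api/guided.py`:

```python
import logging

logger = logging.getLogger(__name__)

def generate_with_schema_fallback(prompt, schema, guided_fn, standard_fn):
    """Sketch of the hardening: if guided (schema-constrained) generation
    fails to compile or apply the schema, fall back to unconstrained
    generation instead of surfacing a 500 to the client."""
    try:
        return guided_fn(prompt, schema)
    except Exception:
        # Problematic schemas are logged at DEBUG with the traceback.
        logger.debug("guided generation failed for schema %r; "
                     "falling back to standard generation",
                     schema, exc_info=True)
        return standard_fn(prompt)
```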

Optimization roadmap added to README (Tier 0-2) with upstream vLLM PR
references. Detailed implementation plan in docs/plans/.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@raullenchai raullenchai merged commit 095d9ac into main Feb 25, 2026
raullenchai pushed a commit that referenced this pull request Mar 14, 2026
Each model now ranks engines by combined decode+TTFT bar length.
Longest bar (best overall performance) appears first. "not supported"
engines sink to the bottom.

Result: Rapid-MLX is #1 on 13 of 15 benchmarked models. The two
exceptions (Qwen3.5-4B, Gemma 3 12B) are clearly visible.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai pushed a commit that referenced this pull request Mar 15, 2026
- Bar chart now has 17 benchmarked models (14/17 Rapid-MLX #1)
- New Tool Calling table: 9/16 models produce structured tool calls
- Qwen family 100% across all sizes, GLM and MiniMax also 100%

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai pushed a commit that referenced this pull request Mar 15, 2026
Merge tool calling into the bar chart as a third visual metric.
Total bar length now reflects complete user experience: speed +
responsiveness + agentic capability. Rapid-MLX #1 on 15/17 models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai added a commit that referenced this pull request Mar 15, 2026
…ing edge case

Addresses review findings #1, #3, #7:

1. Thread safety (high): Split step() into _step_no_queue() (GPU work,
   safe for executor thread) and _distribute_outputs() (queue writes,
   event loop thread only). _process_loop now calls _step_no_queue in
   the executor and distributes outputs on the event loop thread,
   preventing races on asyncio.Queue which is not thread-safe.

2. Stop-sequence streaming (high): When a stop string appears mid-token,
   the valid prefix before the marker is now emitted in new_text instead
   of being silently dropped. Streaming clients no longer lose content.

3. Empty-string truthiness (medium): Stop-string finalization now uses
   an explicit `stop_trimmed` flag instead of `if not request.output_text`,
   which is falsy for empty string. A stop match at position 0 no longer
   re-decodes the full token sequence and leaks the stop text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
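Findings 2 and 3 above can be sketched together; this is an illustrative helper, not the actual streaming code, but it shows both the prefix emission and why an explicit flag beats truthiness of the trimmed text:

```python
def split_at_stop(new_text: str, stop_strings: list[str]) -> tuple[str, bool]:
    """Sketch of stop-string handling in a streamed chunk: emit the valid
    prefix before the earliest stop marker instead of dropping the whole
    chunk, and report the match via an explicit flag (`stop_trimmed`)."""
    earliest = None
    for stop in stop_strings:
        idx = new_text.find(stop)
        if idx != -1 and (earliest is None or idx < earliest):
            earliest = idx
    if earliest is None:
        return new_text, False
    # A match at position 0 yields "" — falsy, which is exactly why the
    # fix checks the flag rather than the truthiness of the text.
    return new_text[:earliest], True
```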
raullenchai pushed a commit that referenced this pull request Mar 21, 2026
P0 — install.sh: if no Python 3.10+ and no Homebrew, automatically
downloads standalone Python from python-build-standalone (no sudo
needed). Eliminates the #1 install blocker for users without Homebrew.

P0 — first request hang: adds a warmup step after model load that
runs one forward pass to trigger Metal shader compilation. Prints
"Warming up (compiling Metal shaders)..." so users know what's
happening. Prevents the first real request from hanging 5+ minutes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai added a commit that referenced this pull request Mar 21, 2026
* fix: auto-install Python + Metal shader warmup on startup

P0 — install.sh: if no Python 3.10+ and no Homebrew, automatically
downloads standalone Python from python-build-standalone (no sudo
needed). Eliminates the #1 install blocker for users without Homebrew.

P0 — first request hang: adds a warmup step after model load that
runs one forward pass to trigger Metal shader compilation. Prints
"Warming up (compiling Metal shaders)..." so users know what's
happening. Prevents the first real request from hanging 5+ minutes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: strip think tags from Anthropic endpoint + disk space check

P2: Think tags leaked through Anthropic /v1/messages endpoint because it
bypassed the reasoning parser entirely. Both streaming and non-streaming
paths now use the reasoning parser to separate reasoning from content,
emitting only content to Anthropic clients.

P1: Add disk space check before model download — queries HuggingFace for
model repo size and warns if available disk is insufficient. Skips
silently for local/cached models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
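The disk-space check can be sketched with the standard library; the function name and the 10% headroom margin are assumptions, and the real code obtains the repo size from the HuggingFace Hub API (skipping the check for local/cached models):

```python
import shutil

def check_disk_space(repo_size_bytes: int, path: str = ".",
                     margin: float = 1.1) -> bool:
    """Sketch of the pre-download check: warn when free disk space is
    below the model repo size plus some headroom for temporary files."""
    free = shutil.disk_usage(path).free
    if free < repo_size_bytes * margin:
        print(f"Warning: model needs ~{repo_size_bytes / 1e9:.1f} GB "
              f"but only {free / 1e9:.1f} GB is free")
        return False
    return True
```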

* fix: standalone Python URL + move warmup to lifespan hook

P0: The hardcoded python-build-standalone URL pointed at the old
indygreg repo which now 404s. Updated to astral-sh/python-build-standalone
with cpython 3.12.13 (release 20260320), verified accessible.

P2: Metal shader warmup ran in CLI before batched/hybrid engines were
started (they start in the FastAPI lifespan hook). Moved warmup into
the lifespan hook so it runs after engine.start() for all engine types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add generate_warmup() to BatchedEngine and HybridEngine

Both engines inherited the no-op base generate_warmup(), so Metal shader
warmup in the lifespan hook was silently skipped for --continuous-batching
and hybrid modes. Now both engines override it with a real forward pass,
matching SimpleEngine's implementation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>