feat: MiniMax-M2 support, Tier 0 optimizations & pinned prefix cache#1

Merged
raullenchai merged 6 commits into main from feat-minimax-parser on Feb 25, 2026

Conversation

@raullenchai
Owner

Summary

MiniMax-M2.5 model support with full tool calling, speculative decoding, configurable GPU memory management, streaming error resilience, Tier 0 performance optimizations, and a comprehensive benchmark suite.

MiniMax-M2 Support

  • Tool call parser — Full XML-based tool call parsing (<minimax:tool_call>/<invoke> format) with streaming support, reasoning parser integration, and hallucinated tag filtering
  • Speculative decoding — Draft model support via --draft-model for 1.2-1.4x decode speedup, with HybridEngine for shared model state
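The non-streaming half of the tool-call extraction described above can be sketched as follows. This is an illustrative stand-in, not the real parser in `vllm_mlx/tool_parsers/minimax_tool_parser.py` (which additionally handles streaming, bare `<invoke>` tags inside `<think>` blocks, and reasoning-parser integration); the exact `name=` attribute syntax for `<invoke>`/`<parameter>` is an assumption:

```python
import re

# Hypothetical sketch of <minimax:tool_call>/<invoke> extraction.
INVOKE_RE = re.compile(
    r'<invoke name="(?P<name>[^"]+)">(?P<body>.*?)</invoke>', re.DOTALL
)
PARAM_RE = re.compile(
    r'<parameter name="(?P<key>[^"]+)">(?P<value>.*?)</parameter>', re.DOTALL
)

def parse_tool_calls(text: str) -> list[dict]:
    """Extract structured tool calls from MiniMax-style XML output."""
    calls = []
    for m in INVOKE_RE.finditer(text):
        params = {p.group("key"): p.group("value").strip()
                  for p in PARAM_RE.finditer(m.group("body"))}
        # Filter hallucinated <invoke> tags that carry no parameters.
        if params:
            calls.append({"name": m.group("name"), "arguments": params})
    return calls
```

A parameter-less `<invoke>` is treated as a hallucination and dropped, matching the filtering behavior the PR describes.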

Infrastructure

  • Configurable GPU memory — new --gpu-memory-utilization flag to control the Metal soft allocation limit and the emergency cache-clear threshold (fixes a 3.5x slowdown on 200GB+ models)
  • Streaming error resilience — _disconnect_guard() now catches all exceptions, logs them server-side, and sends an SSE error event to the client
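The limit arithmetic (spelled out in the commit message further down: emergency threshold is always 5% above the soft limit, capped at 99%) can be sketched as pure arithmetic; the function name is an assumption and the real code applies the soft limit via MLX's memory API:

```python
def memory_limits(device_memory_bytes: int,
                  gpu_memory_utilization: float = 0.90) -> tuple[int, int]:
    """Sketch of the --gpu-memory-utilization arithmetic.

    The Metal soft allocation limit is utilization * device memory; the
    emergency cache-clear threshold sits 5 percentage points above the
    soft limit, capped at 99% of device memory.
    """
    soft_fraction = gpu_memory_utilization
    emergency_fraction = min(soft_fraction + 0.05, 0.99)
    return (int(device_memory_bytes * soft_fraction),
            int(device_memory_bytes * emergency_fraction))
```

At the default 0.90 this preserves the old 90% soft limit, while `--gpu-memory-utilization 0.95` scales both limits to actual device memory instead of the previous hardcoded 200GB emergency threshold.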

Tier 0 Optimizations

  • GC control — Disable Python GC during generation to eliminate latency spikes with large models (120GB+). --gc-control (default: enabled), --no-gc-control to disable
  • Pinned prefix cache — Prevent system prompt eviction under memory pressure. is_pinned flag on CacheBlock, skip-pinned eviction in LRU queue, pin_blocks()/unpin_blocks() API, pin_prefix()/unpin_prefix() on both PrefixCacheManager and BlockAwarePrefixCache. --pin-system-prompt auto-pins after first request
  • Schema error hardening — Guided generation falls back to standard generation on failure instead of 500 error
  • Optimization roadmap — Tier 0-2 roadmap added to README with upstream vLLM PR references
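The GC control behavior above can be sketched as a context manager; this is illustrative (the real implementation lives in `vllm_mlx/server.py` behind the `--gc-control` flag), but the threshold values and the collect-after-generation step match the commit message below:

```python
import gc
from contextlib import contextmanager

# At startup the commit raises GC thresholds so background collections are
# rare during serving: gc.set_threshold(100000, 50, 50)

@contextmanager
def gc_paused(enabled: bool = True):
    """Sketch of --gc-control: disable Python GC for the duration of a
    generation to avoid latency spikes, then run one explicit collection
    so garbage from the request is still reclaimed."""
    if not enabled or not gc.isenabled():
        yield
        return
    gc.disable()
    try:
        yield
    finally:
        gc.enable()
        gc.collect()
```

With very large models (120GB+), a cyclic-GC pass that touches many live objects mid-decode shows up directly as a token-latency spike, which is why the collection is deferred to after the generation completes.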

Benchmark Suite

  • Six-dimension benchmark for MiniMax-M2.5 on M3 Ultra 256GB with baseline results

Commits

  1. feat: Add MiniMax-M2 tool call parser with streaming support
  2. feat: Add --gpu-memory-utilization for configurable memory limits
  3. feat: Add speculative decoding support with draft models
  4. fix: Handle unexpected exceptions in streaming disconnect guard
  5. feat: Add comprehensive benchmark suite with baseline results
  6. feat: Add Tier 0 optimizations and pinned prefix cache

Key files changed

File Description
vllm_mlx/tool_parsers/minimax_tool_parser.py MiniMax XML tool call parser with streaming
vllm_mlx/engine/hybrid.py HybridEngine for speculative + batched mode
vllm_mlx/speculative/prompt_lookup.py Prompt lookup speculative decoding
vllm_mlx/api/guided.py JSON schema enforcement with outlines + schema error hardening
vllm_mlx/server.py GC control, guided gen fallback, auto-pin system prompt, streaming tool calls, _disconnect_guard hardening
vllm_mlx/cli.py --gc-control, --pin-system-prompt, --draft-model, --gpu-memory-utilization, minimax parser
vllm_mlx/paged_cache.py is_pinned on CacheBlock, skip-pinned eviction, pin_blocks()/unpin_blocks() API
vllm_mlx/prefix_cache.py pin_prefix()/unpin_prefix() on PrefixCacheManager and BlockAwarePrefixCache
benchmark_minmax.py Comprehensive benchmark script
docs/plans/tier0-pinned-prefix-cache.md Tier 0 implementation plan
README.md Baseline benchmark + Tier 0-2 optimization roadmap

Benchmark Results (MiniMax-M2.5, M3 Ultra 256GB)

Metric Result
TTFT (short prompt) 0.33s
TTFT (long prompt) 1.40s
Decode speed 49-54 tok/s
Prefix cache Turn4/Turn1 ratio 2.09x
Tool calling accuracy 4/4 (100%)
Long generation (8192 tok) Stable, no crash

Test plan

  • Paged cache tests (37 passed)
  • Prefix cache tests (21 passed)
  • API model tests (229 passed)
  • MiniMax tool calling: simple, multi-arg, code execution, multi-tool
  • Streaming with reasoning parser + tool call detection
  • _disconnect_guard error handling
  • --gpu-memory-utilization at 0.90 and 0.95 on 200GB+ model

🤖 Generated with Claude Code

janhilgard and others added 6 commits February 19, 2026 19:43
Add full tool call parsing for MiniMax-M2 models' native XML format,
including streaming integration with reasoning parsers.

Changes:
- minimax_tool_parser.py: Parser for <minimax:tool_call>/<invoke> XML
  format with streaming support; handles bare <invoke> without wrapper
  (model sometimes emits inside <think> blocks); filters hallucinated
  <invoke> tags without parameters
- cli.py: Add "minimax" to --tool-call-parser choices
- tool_parsers/__init__.py: Register MiniMaxToolParser
- api/utils.py: Strip MiniMax special tokens ([e~[, ]~b]role, ]~!b[)
- server.py: Integrate tool parser within reasoning parser streaming
  path — detect tool call markers in reasoning stream and redirect to
  content for parsing; suppress whitespace-only content before tool
  calls to avoid confusing clients

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a single CLI flag to control both the Metal soft allocation limit
(mx.set_memory_limit) and the emergency cache clear threshold in the
engine loop. Default 0.90 preserves existing behavior.

For large models (200GB+), the previous hardcoded 200GB emergency
threshold and fixed 90% soft limit caused excessive cache clearing,
resulting in ~3.5x slowdown. With --gpu-memory-utilization 0.95
both limits scale to the actual device memory, eliminating the
thrashing.

The emergency threshold is always 5% above the soft limit (capped
at 99%) to give MLX headroom for temporary allocations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Speculative decoding with mlx-lm draft models (1.2-1.4x speedup)
- HybridEngine: shared model between speculative + batched modes
- JSON schema enforcement with guided generation support
- Fix false positive tool call detection for regular JSON
- Strip <think> tags from API responses to prevent JSON parse errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, _disconnect_guard() only caught StopAsyncIteration from
the inner generator. Any other exception (e.g. from tool call parsing,
reasoning parser, or serialization) would propagate unhandled, causing
the HTTP connection to drop abruptly — the client sees "peer closed
connection without sending complete message body (incomplete chunked
read)" with no server-side error logged.

Now catches all exceptions, logs the full traceback server-side, sends
an SSE error event to the client, and closes the stream gracefully
with [DONE].

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
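The hardened guard described in this commit can be sketched as an async-generator wrapper; the name and SSE payload shape are assumptions, but the behavior (log server-side, emit an SSE error event, close gracefully with [DONE]) follows the commit message:

```python
import asyncio
import json
import logging

logger = logging.getLogger(__name__)

async def disconnect_guard(inner):
    """Sketch of the hardened _disconnect_guard: any exception from the
    inner SSE generator is logged with a full traceback and surfaced to
    the client as an SSE error event before a graceful [DONE]."""
    try:
        async for chunk in inner:
            yield chunk
    except Exception:
        # CancelledError (a real client disconnect) is a BaseException on
        # Python 3.8+, so it still propagates and cancels the stream.
        logger.exception("streaming generator failed")
        err = {"error": {"message": "internal server error"}}
        yield f"data: {json.dumps(err)}\n\n"
    yield "data: [DONE]\n\n"
```

Without the broad except, a failure in tool-call parsing or serialization drops the HTTP connection mid-chunk, which clients report as an incomplete chunked read with no server-side trace.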
Six-dimension benchmark for MiniMax-M2.5 on M3 Ultra:
1. TTFT across prompt sizes (0.33-1.4s)
2. Decode throughput (49-54 tok/s)
3. Prefix cache multi-turn effectiveness (2.09x ratio)
4. Tool calling correctness (4/4, avg 2.89s)
5. Reasoning separation (1/3 fully separated)
6. Long generation stability (8192 tok, no crash)

Baseline results added to README for tracking improvements
as we implement upstream vLLM optimizations (pinned prefix
cache, GC control, structural tags, etc.).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GC control: disable Python GC during generation to eliminate latency
spikes with large models (120GB+). Controlled via --gc-control flag
(enabled by default). At startup, GC thresholds raised to (100000, 50, 50).
During generation, GC is disabled and a gc.collect() runs after completion.

Pinned prefix cache: prevent system prompt eviction under memory
pressure. CacheBlock gets is_pinned flag; FreeKVCacheBlockQueue.popleft()
and popleft_n() skip pinned blocks; _maybe_evict_cached_block() refuses
to evict pinned blocks. pin_blocks()/unpin_blocks() API on
PagedCacheManager. pin_prefix()/unpin_prefix() on both PrefixCacheManager
and BlockAwarePrefixCache. --pin-system-prompt auto-pins system prompt
after first request.
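The skip-pinned eviction walk described above can be sketched against a plain deque; `CacheBlock` here is a simplified stand-in for the real class in `vllm_mlx/paged_cache.py`, and the rotation strategy is an illustrative assumption:

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class CacheBlock:            # simplified stand-in for the real CacheBlock
    block_id: int
    is_pinned: bool = False

def popleft_unpinned(free_queue: deque) -> Optional[CacheBlock]:
    """Sketch of skip-pinned eviction: walk the LRU free queue, rotate
    pinned blocks back, and evict the oldest unpinned block. Returns
    None (refuses to evict) when every remaining block is pinned."""
    for _ in range(len(free_queue)):
        block = free_queue.popleft()
        if not block.is_pinned:
            return block
        free_queue.append(block)  # keep the pinned block resident
    return None
```

This is what keeps a pinned system prompt resident under memory pressure: eviction simply never selects its blocks, no matter how cold they look to the LRU ordering.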

Schema error hardening: guided generation now falls back to standard
generation on failure instead of returning 500. Problematic schemas
logged at DEBUG level.
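The fallback shape is simple enough to sketch directly; the function and parameter names here are hypothetical, standing in for the guided-generation path in `vllm_mlx/api/guided.py`:

```python
import logging

logger = logging.getLogger(__name__)

def generate_with_schema_fallback(prompt, schema, guided_fn, standard_fn):
    """Sketch of the hardening: if guided (schema-constrained) generation
    fails to compile or apply the schema, fall back to unconstrained
    generation instead of surfacing a 500 to the client."""
    try:
        return guided_fn(prompt, schema)
    except Exception:
        # Problematic schemas are logged at DEBUG with the traceback.
        logger.debug("guided generation failed for schema %r; "
                     "falling back to standard generation",
                     schema, exc_info=True)
        return standard_fn(prompt)
```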

Optimization roadmap added to README (Tier 0-2) with upstream vLLM PR
references. Detailed implementation plan in docs/plans/.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@raullenchai raullenchai merged commit 095d9ac into main Feb 25, 2026
raullenchai pushed a commit that referenced this pull request Mar 14, 2026
Each model now ranks engines by combined decode+TTFT bar length.
Longest bar (best overall performance) appears first. "not supported"
engines sink to the bottom.

Result: Rapid-MLX is #1 on 13 of 15 benchmarked models. The two
exceptions (Qwen3.5-4B, Gemma 3 12B) are clearly visible.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai pushed a commit that referenced this pull request Mar 15, 2026
- Bar chart now has 17 benchmarked models (14/17 Rapid-MLX #1)
- New Tool Calling table: 9/16 models produce structured tool calls
- Qwen family 100% across all sizes, GLM and MiniMax also 100%

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai pushed a commit that referenced this pull request Mar 15, 2026
Merge tool calling into the bar chart as a third visual metric.
Total bar length now reflects complete user experience: speed +
responsiveness + agentic capability. Rapid-MLX #1 on 15/17 models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai added a commit that referenced this pull request Mar 15, 2026
…ing edge case

Addresses review findings #1, #3, #7:

1. Thread safety (high): Split step() into _step_no_queue() (GPU work,
   safe for executor thread) and _distribute_outputs() (queue writes,
   event loop thread only). _process_loop now calls _step_no_queue in
   the executor and distributes outputs on the event loop thread,
   preventing races on asyncio.Queue which is not thread-safe.

2. Stop-sequence streaming (high): When a stop string appears mid-token,
   the valid prefix before the marker is now emitted in new_text instead
   of being silently dropped. Streaming clients no longer lose content.

3. Empty-string truthiness (medium): Stop-string finalization now uses
   an explicit `stop_trimmed` flag instead of `if not request.output_text`,
   which is falsy for empty string. A stop match at position 0 no longer
   re-decodes the full token sequence and leaks the stop text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
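Findings 2 and 3 above can be sketched together; this is an illustrative helper, not the actual streaming code, but it shows both the prefix emission and why an explicit flag beats truthiness of the trimmed text:

```python
def split_at_stop(new_text: str, stop_strings: list[str]) -> tuple[str, bool]:
    """Sketch of stop-string handling in a streamed chunk: emit the valid
    prefix before the earliest stop marker instead of dropping the whole
    chunk, and report the match via an explicit flag (`stop_trimmed`)."""
    earliest = None
    for stop in stop_strings:
        idx = new_text.find(stop)
        if idx != -1 and (earliest is None or idx < earliest):
            earliest = idx
    if earliest is None:
        return new_text, False
    # A match at position 0 yields "" — falsy, which is exactly why the
    # fix checks the flag rather than the truthiness of the text.
    return new_text[:earliest], True
```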
raullenchai pushed a commit that referenced this pull request Mar 21, 2026
P0 — install.sh: if no Python 3.10+ and no Homebrew, automatically
downloads standalone Python from python-build-standalone (no sudo
needed). Eliminates the #1 install blocker for users without Homebrew.

P0 — first request hang: adds a warmup step after model load that
runs one forward pass to trigger Metal shader compilation. Prints
"Warming up (compiling Metal shaders)..." so users know what's
happening. Prevents the first real request from hanging 5+ minutes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai added a commit that referenced this pull request Mar 21, 2026
* fix: auto-install Python + Metal shader warmup on startup

P0 — install.sh: if no Python 3.10+ and no Homebrew, automatically
downloads standalone Python from python-build-standalone (no sudo
needed). Eliminates the #1 install blocker for users without Homebrew.

P0 — first request hang: adds a warmup step after model load that
runs one forward pass to trigger Metal shader compilation. Prints
"Warming up (compiling Metal shaders)..." so users know what's
happening. Prevents the first real request from hanging 5+ minutes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: strip think tags from Anthropic endpoint + disk space check

P2: Think tags leaked through Anthropic /v1/messages endpoint because it
bypassed the reasoning parser entirely. Both streaming and non-streaming
paths now use the reasoning parser to separate reasoning from content,
emitting only content to Anthropic clients.

P1: Add disk space check before model download — queries HuggingFace for
model repo size and warns if available disk is insufficient. Skips
silently for local/cached models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
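The disk-space check can be sketched with the standard library; the function name and the 10% headroom margin are assumptions, and the real code obtains the repo size from the HuggingFace Hub API (skipping the check for local/cached models):

```python
import shutil

def check_disk_space(repo_size_bytes: int, path: str = ".",
                     margin: float = 1.1) -> bool:
    """Sketch of the pre-download check: warn when free disk space is
    below the model repo size plus some headroom for temporary files."""
    free = shutil.disk_usage(path).free
    if free < repo_size_bytes * margin:
        print(f"Warning: model needs ~{repo_size_bytes / 1e9:.1f} GB "
              f"but only {free / 1e9:.1f} GB is free")
        return False
    return True
```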

* fix: standalone Python URL + move warmup to lifespan hook

P0: The hardcoded python-build-standalone URL pointed at the old
indygreg repo which now 404s. Updated to astral-sh/python-build-standalone
with cpython 3.12.13 (release 20260320), verified accessible.

P2: Metal shader warmup ran in CLI before batched/hybrid engines were
started (they start in the FastAPI lifespan hook). Moved warmup into
the lifespan hook so it runs after engine.start() for all engine types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add generate_warmup() to BatchedEngine and HybridEngine

Both engines inherited the no-op base generate_warmup(), so Metal shader
warmup in the lifespan hook was silently skipped for --continuous-batching
and hybrid modes. Now both engines override it with a real forward pass,
matching SimpleEngine's implementation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>