feat: Tier 1 optimizations — streaming tool fix, frequency-aware cache, block reuse #2

Merged
raullenchai merged 2 commits into main from feat-minimax-parser
Feb 25, 2026
Conversation

@raullenchai
Owner

Summary

Implements the top 3 Tier 1 optimizations for MiniMax-M2.5 + OpenClaw on M3 Ultra, building on the merged Tier 0 work (GC control, pinned prefix cache, schema hardening).

1. Fix Streaming Tool Truncation

  • Problem: The fallback in stream_chat_completion() only checked for "<tool_call>" in accumulated text. MiniMax uses <minimax:tool_call>, Llama uses <function=, and Mistral uses [TOOL_CALLS] — so the fallback never triggered for these parsers, and incomplete tool calls were silently lost.
  • Fix: Added has_pending_tool_call() to ToolParser base class with parser-specific overrides. Server fallback now uses tool_parser.has_pending_tool_call() instead of hardcoded string check.
  • Bonus: Hardened MiniMax extract_tool_calls() with partial regex patterns (INVOKE_PARTIAL, PARAM_PARTIAL) to extract tool calls even when </parameter> or </invoke> closing tags are missing due to truncated streams.
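The parser-aware fallback can be sketched as follows. Only the names `ToolParser`, `has_pending_tool_call()`, and the three marker strings come from the PR; the class layout, method bodies, and the `needs_fallback()` helper are illustrative assumptions, and the partial-regex hardening of `extract_tool_calls()` is omitted here.

```python
class ToolParser:
    """Base class; the old server fallback hardcoded this check."""

    def has_pending_tool_call(self, text: str) -> bool:
        # Default marker — works for parsers that use <tool_call>.
        return "<tool_call>" in text


class MiniMaxToolParser(ToolParser):
    def has_pending_tool_call(self, text: str) -> bool:
        # Note: "<minimax:tool_call>" does NOT contain the substring
        # "<tool_call>", which is exactly why the old check never fired.
        return "<minimax:tool_call>" in text


class LlamaToolParser(ToolParser):
    def has_pending_tool_call(self, text: str) -> bool:
        return "<function=" in text


class MistralToolParser(ToolParser):
    def has_pending_tool_call(self, text: str) -> bool:
        return "[TOOL_CALLS]" in text


def needs_fallback(tool_parser: ToolParser, accumulated: str) -> bool:
    # Hypothetical server-side helper: delegate to the parser instead of
    # the hardcoded '"<tool_call>" in accumulated' string check.
    return tool_parser.has_pending_tool_call(accumulated)
```

With the base-class check alone, a truncated MiniMax stream would report no pending tool call and the partial output would be dropped; the override makes the fallback trigger.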

2. Frequency-Aware Cache Eviction (LRU-LFU Hybrid)

  • Problem: Pure LRU eviction doesn't account for block reuse frequency. A system prompt block accessed 100 times gets evicted just as easily as a one-shot block if it hasn't been touched recently.
  • Fix: Added access_count field to CacheBlock, incremented on every touch(). New popleft_lfu() method examines a window of 8 LRU candidates and picks the one with the lowest access_count. Used in allocate_block() and evict_lru_blocks() when prefix caching is enabled.
  • Result: High-frequency blocks (system prompts, common prefixes) survive much longer under memory pressure.
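A minimal sketch of the LRU-LFU hybrid described above. The `access_count` field, the `popleft_lfu()` name, and the window size of 8 follow the PR text; the queue structure and everything else are assumptions.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class CacheBlock:
    block_id: int
    access_count: int = 0  # incremented on every touch()

    def touch(self) -> None:
        self.access_count += 1


class FreeBlockQueue:
    """Hypothetical free-list: insertion order == LRU order."""

    LFU_WINDOW = 8  # number of LRU candidates to compare

    def __init__(self) -> None:
        self._queue: "OrderedDict[int, CacheBlock]" = OrderedDict()

    def append(self, block: CacheBlock) -> None:
        self._queue[block.block_id] = block

    def popleft_lfu(self) -> CacheBlock:
        # Examine the 8 least-recently-used blocks and evict the one
        # with the lowest access_count, so hot blocks (system prompts,
        # shared prefixes) survive longer under memory pressure.
        window = list(self._queue.values())[: self.LFU_WINDOW]
        victim = min(window, key=lambda b: b.access_count)
        del self._queue[victim.block_id]
        return victim
```

Because only a small fixed window is scanned, eviction stays O(1) per call while still preferring cold blocks over hot ones within that window.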

3. KV Block Reuse Ordering

  • Problem: When multiple blocks share the same hash (e.g., in hybrid models), get_block() returned an arbitrary block via next(iter(blocks.values())).
  • Fix: Added get_best_block() to BlockHashToBlockMap that returns the block with the highest access_count. Used in get_computed_blocks() for better cache locality.
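The lookup change can be sketched like this. Only `get_best_block()`, the `access_count` tiebreak, and the old `next(iter(blocks.values()))` behavior come from the PR; the map internals and the `add()` helper are assumptions.

```python
from dataclasses import dataclass


@dataclass
class CacheBlock:
    block_id: int
    access_count: int = 0


class BlockHashToBlockMap:
    """Hypothetical hash -> {block_id: block} mapping."""

    def __init__(self) -> None:
        self._map: dict = {}

    def add(self, block_hash, block: CacheBlock) -> None:
        self._map.setdefault(block_hash, {})[block.block_id] = block

    def get_best_block(self, block_hash):
        blocks = self._map.get(block_hash)
        if not blocks:
            return None
        # Previously: next(iter(blocks.values())) — an arbitrary block.
        # Prefer the most-reused block for better cache locality.
        return max(blocks.values(), key=lambda b: b.access_count)
```

When several physical blocks share a hash (as in hybrid models), this deterministically reuses the hottest copy instead of whichever happens to iterate first.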

Files Changed

  • vllm_mlx/server.py: Use tool_parser.has_pending_tool_call() in streaming fallback
  • vllm_mlx/tool_parsers/abstract_tool_parser.py: Add has_pending_tool_call() base method
  • vllm_mlx/tool_parsers/minimax_tool_parser.py: Override has_pending_tool_call(); add partial regex patterns for truncated input
  • vllm_mlx/tool_parsers/llama_tool_parser.py: Override has_pending_tool_call() for <function=
  • vllm_mlx/tool_parsers/mistral_tool_parser.py: Override has_pending_tool_call() for [TOOL_CALLS]
  • vllm_mlx/paged_cache.py: Add access_count to CacheBlock, popleft_lfu(), get_best_block(), frequency-aware eviction
  • README.md: Update optimization roadmap — Tier 0 marked merged, Tier 1 top 3 marked merged

Test plan

  • pytest tests/test_paged_cache.py tests/test_prefix_cache.py tests/test_tool_parsers.py — all 136 tests pass
  • Full test suite: 902 passed (20 pre-existing async infra failures, unrelated)
  • Manual: Start server with --tool-call-parser minimax --enable-auto-tool-choice, send streaming tool call requests, verify fallback triggers on truncated streams
  • Benchmark: Run benchmark_minmax.py to compare prefix cache Turn4/Turn1 ratio and tool call reliability vs Tier 0 baseline

🤖 Generated with Claude Code

Raullen and others added 2 commits February 24, 2026 21:35
…cache, block reuse ordering

Three high-impact improvements for MiniMax-M2.5 + OpenClaw on M3 Ultra:

1. Fix streaming tool truncation: The fallback in stream_chat_completion()
   only checked for "<tool_call>" — MiniMax uses <minimax:tool_call>, so
   incomplete tool calls were silently lost. Now uses parser-aware
   has_pending_tool_call() with overrides for MiniMax, Llama, and Mistral.
   Also hardens MiniMax extract_tool_calls() to handle missing closing tags.

2. Frequency-aware cache eviction: Replace pure LRU with LRU-LFU hybrid.
   Adds access_count to CacheBlock and popleft_lfu() which examines a
   window of 8 LRU candidates and evicts the one with lowest frequency.
   System prompt blocks accessed 100x are no longer evicted as easily as
   one-shot blocks.

3. KV block reuse ordering: When multiple blocks share the same hash,
   get_best_block() returns the one with highest access_count instead of
   an arbitrary one, improving cache locality for hybrid models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prefix cache Turn 2 TTFT improved 21.5% (0.775s -> 0.608s), T4/T1 ratio
improved from 2.09x to 2.03x. Zero regressions in decode throughput,
tool calling accuracy, or long generation stability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@raullenchai raullenchai merged commit e262fa2 into main Feb 25, 2026
raullenchai pushed a commit that referenced this pull request Mar 21, 2026
- #2: /v1/models now returns both the full HF name and the alias used
  to start the model, so SDK users see a recognizable name
- #3: API responses return the actual loaded model name instead of
  echoing back whatever the client sent (prevents "gpt-4o" confusion)
- #4: SECURITY WARNING downgraded to debug — local inference doesn't
  need auth, the warning was causing unnecessary anxiety for new users
- Pass alias from CLI to server for /v1/models listing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai added a commit that referenced this pull request Mar 21, 2026
…dels (#41)

* fix: 4 UX friction points from user testing v2

- #2: /v1/models now returns both the full HF name and the alias used
  to start the model, so SDK users see a recognizable name
- #3: API responses return the actual loaded model name instead of
  echoing back whatever the client sent (prevents "gpt-4o" confusion)
- #4: SECURITY WARNING downgraded to debug — local inference doesn't
  need auth, the warning was causing unnecessary anxiety for new users
- Pass alias from CLI to server for /v1/models listing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: non-streaming Anthropic response also returns actual model name

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>