feat: Tier 1 optimizations — streaming tool fix, frequency-aware cache, block reuse #2

Merged
raullenchai merged 2 commits into main from feat-minimax-parser
Feb 25, 2026
Conversation

@raullenchai
Owner

Summary

Implements the top 3 Tier 1 optimizations for MiniMax-M2.5 + OpenClaw on M3 Ultra, building on the merged Tier 0 work (GC control, pinned prefix cache, schema hardening).

1. Fix Streaming Tool Truncation

  • Problem: The fallback in stream_chat_completion() only checked for "<tool_call>" in accumulated text. MiniMax uses <minimax:tool_call>, Llama uses <function=, and Mistral uses [TOOL_CALLS] — so the fallback never triggered for these parsers, and incomplete tool calls were silently lost.
  • Fix: Added has_pending_tool_call() to ToolParser base class with parser-specific overrides. Server fallback now uses tool_parser.has_pending_tool_call() instead of hardcoded string check.
  • Bonus: Hardened MiniMax extract_tool_calls() with partial regex patterns (INVOKE_PARTIAL, PARAM_PARTIAL) to extract tool calls even when </parameter> or </invoke> closing tags are missing due to truncated streams.
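The parser-aware fallback can be sketched as follows. Only the names `ToolParser`, `has_pending_tool_call()`, and the three marker strings come from the PR; the class layout, method bodies, and the `needs_fallback()` helper are illustrative assumptions, and the partial-regex hardening of `extract_tool_calls()` is omitted here.

```python
class ToolParser:
    """Base class; the old server fallback hardcoded this check."""

    def has_pending_tool_call(self, text: str) -> bool:
        # Default marker — works for parsers that use <tool_call>.
        return "<tool_call>" in text


class MiniMaxToolParser(ToolParser):
    def has_pending_tool_call(self, text: str) -> bool:
        # Note: "<minimax:tool_call>" does NOT contain the substring
        # "<tool_call>", which is exactly why the old check never fired.
        return "<minimax:tool_call>" in text


class LlamaToolParser(ToolParser):
    def has_pending_tool_call(self, text: str) -> bool:
        return "<function=" in text


class MistralToolParser(ToolParser):
    def has_pending_tool_call(self, text: str) -> bool:
        return "[TOOL_CALLS]" in text


def needs_fallback(tool_parser: ToolParser, accumulated: str) -> bool:
    # Hypothetical server-side helper: delegate to the parser instead of
    # the hardcoded '"<tool_call>" in accumulated' string check.
    return tool_parser.has_pending_tool_call(accumulated)
```

With the base-class check alone, a truncated MiniMax stream would report no pending tool call and the partial output would be dropped; the override makes the fallback trigger.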

2. Frequency-Aware Cache Eviction (LRU-LFU Hybrid)

  • Problem: Pure LRU eviction doesn't account for block reuse frequency. A system prompt block accessed 100 times gets evicted just as easily as a one-shot block if it hasn't been touched recently.
  • Fix: Added access_count field to CacheBlock, incremented on every touch(). New popleft_lfu() method examines a window of 8 LRU candidates and picks the one with the lowest access_count. Used in allocate_block() and evict_lru_blocks() when prefix caching is enabled.
  • Result: High-frequency blocks (system prompts, common prefixes) survive much longer under memory pressure.
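A minimal sketch of the LRU-LFU hybrid described above. The `access_count` field, the `popleft_lfu()` name, and the window size of 8 follow the PR text; the queue structure and everything else are assumptions.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class CacheBlock:
    block_id: int
    access_count: int = 0  # incremented on every touch()

    def touch(self) -> None:
        self.access_count += 1


class FreeBlockQueue:
    """Hypothetical free-list: insertion order == LRU order."""

    LFU_WINDOW = 8  # number of LRU candidates to compare

    def __init__(self) -> None:
        self._queue: "OrderedDict[int, CacheBlock]" = OrderedDict()

    def append(self, block: CacheBlock) -> None:
        self._queue[block.block_id] = block

    def popleft_lfu(self) -> CacheBlock:
        # Examine the 8 least-recently-used blocks and evict the one
        # with the lowest access_count, so hot blocks (system prompts,
        # shared prefixes) survive longer under memory pressure.
        window = list(self._queue.values())[: self.LFU_WINDOW]
        victim = min(window, key=lambda b: b.access_count)
        del self._queue[victim.block_id]
        return victim
```

Because only a small fixed window is scanned, eviction stays O(1) per call while still preferring cold blocks over hot ones within that window.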

3. KV Block Reuse Ordering

  • Problem: When multiple blocks share the same hash (e.g., in hybrid models), get_block() returned an arbitrary block via next(iter(blocks.values())).
  • Fix: Added get_best_block() to BlockHashToBlockMap that returns the block with the highest access_count. Used in get_computed_blocks() for better cache locality.
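The lookup change can be sketched like this. Only `get_best_block()`, the `access_count` tiebreak, and the old `next(iter(blocks.values()))` behavior come from the PR; the map internals and the `add()` helper are assumptions.

```python
from dataclasses import dataclass


@dataclass
class CacheBlock:
    block_id: int
    access_count: int = 0


class BlockHashToBlockMap:
    """Hypothetical hash -> {block_id: block} mapping."""

    def __init__(self) -> None:
        self._map: dict = {}

    def add(self, block_hash, block: CacheBlock) -> None:
        self._map.setdefault(block_hash, {})[block.block_id] = block

    def get_best_block(self, block_hash):
        blocks = self._map.get(block_hash)
        if not blocks:
            return None
        # Previously: next(iter(blocks.values())) — an arbitrary block.
        # Prefer the most-reused block for better cache locality.
        return max(blocks.values(), key=lambda b: b.access_count)
```

When several physical blocks share a hash (as in hybrid models), this deterministically reuses the hottest copy instead of whichever happens to iterate first.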

Files Changed

  • vllm_mlx/server.py: Use tool_parser.has_pending_tool_call() in streaming fallback
  • vllm_mlx/tool_parsers/abstract_tool_parser.py: Add has_pending_tool_call() base method
  • vllm_mlx/tool_parsers/minimax_tool_parser.py: Override has_pending_tool_call(); add partial regex patterns for truncated input
  • vllm_mlx/tool_parsers/llama_tool_parser.py: Override has_pending_tool_call() for <function=
  • vllm_mlx/tool_parsers/mistral_tool_parser.py: Override has_pending_tool_call() for [TOOL_CALLS]
  • vllm_mlx/paged_cache.py: Add access_count to CacheBlock, popleft_lfu(), get_best_block(), frequency-aware eviction
  • README.md: Update optimization roadmap — Tier 0 marked merged, Tier 1 top 3 marked merged

Test plan

  • pytest tests/test_paged_cache.py tests/test_prefix_cache.py tests/test_tool_parsers.py — all 136 tests pass
  • Full test suite: 902 passed (20 pre-existing async infra failures, unrelated)
  • Manual: Start server with --tool-call-parser minimax --enable-auto-tool-choice, send streaming tool call requests, verify fallback triggers on truncated streams
  • Benchmark: Run benchmark_minmax.py to compare prefix cache Turn4/Turn1 ratio and tool call reliability vs Tier 0 baseline

🤖 Generated with Claude Code

Raullen and others added 2 commits February 24, 2026 21:35
…cache, block reuse ordering

Three high-impact improvements for MiniMax-M2.5 + OpenClaw on M3 Ultra:

1. Fix streaming tool truncation: The fallback in stream_chat_completion()
   only checked for "<tool_call>" — MiniMax uses <minimax:tool_call>, so
   incomplete tool calls were silently lost. Now uses parser-aware
   has_pending_tool_call() with overrides for MiniMax, Llama, and Mistral.
   Also hardens MiniMax extract_tool_calls() to handle missing closing tags.

2. Frequency-aware cache eviction: Replace pure LRU with LRU-LFU hybrid.
   Adds access_count to CacheBlock and popleft_lfu() which examines a
   window of 8 LRU candidates and evicts the one with lowest frequency.
   System prompt blocks accessed 100x are no longer evicted as easily as
   one-shot blocks.

3. KV block reuse ordering: When multiple blocks share the same hash,
   get_best_block() returns the one with highest access_count instead of
   an arbitrary one, improving cache locality for hybrid models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prefix cache Turn 2 TTFT improved 21.5% (0.775s -> 0.608s), T4/T1 ratio
improved from 2.09x to 2.03x. Zero regressions in decode throughput,
tool calling accuracy, or long generation stability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@raullenchai raullenchai merged commit e262fa2 into main Feb 25, 2026
raullenchai pushed a commit that referenced this pull request Mar 21, 2026
- #2: /v1/models now returns both the full HF name and the alias used
  to start the model, so SDK users see a recognizable name
- #3: API responses return the actual loaded model name instead of
  echoing back whatever the client sent (prevents "gpt-4o" confusion)
- #4: SECURITY WARNING downgraded to debug — local inference doesn't
  need auth, the warning was causing unnecessary anxiety for new users
- Pass alias from CLI to server for /v1/models listing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai added a commit that referenced this pull request Mar 21, 2026
…dels (#41)

* fix: 4 UX friction points from user testing v2

- #2: /v1/models now returns both the full HF name and the alias used
  to start the model, so SDK users see a recognizable name
- #3: API responses return the actual loaded model name instead of
  echoing back whatever the client sent (prevents "gpt-4o" confusion)
- #4: SECURITY WARNING downgraded to debug — local inference doesn't
  need auth, the warning was causing unnecessary anxiety for new users
- Pass alias from CLI to server for /v1/models listing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: non-streaming Anthropic response also returns actual model name

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>