feat: Tier 1 optimizations — streaming tool fix, frequency-aware cache, block reuse #2
Merged
raullenchai merged 2 commits into main on Feb 25, 2026
Conversation
…cache, block reuse ordering

Three high-impact improvements for MiniMax-M2.5 + OpenClaw on M3 Ultra:

1. Fix streaming tool truncation: The fallback in stream_chat_completion() only checked for "<tool_call>" — MiniMax uses <minimax:tool_call>, so incomplete tool calls were silently lost. Now uses parser-aware has_pending_tool_call() with overrides for MiniMax, Llama, and Mistral. Also hardens MiniMax extract_tool_calls() to handle missing closing tags.

2. Frequency-aware cache eviction: Replace pure LRU with an LRU-LFU hybrid. Adds access_count to CacheBlock and popleft_lfu(), which examines a window of 8 LRU candidates and evicts the one with the lowest frequency. System-prompt blocks accessed 100x are no longer evicted as easily as one-shot blocks.

3. KV block reuse ordering: When multiple blocks share the same hash, get_best_block() returns the one with the highest access_count instead of an arbitrary one, improving cache locality for hybrid models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
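The parser-aware pending check in fix 1 can be sketched as follows. This is a minimal illustration: only has_pending_tool_call() and the marker strings come from the PR; the class layout and the open/close counting heuristic are assumptions.

```python
# Sketch of the parser-aware fallback check. Only has_pending_tool_call()
# and the marker strings are from the PR; the counting logic is illustrative.
class ToolParser:
    """Base parser: generic <tool_call> marker (what the old hardcoded check used)."""
    open_marker = "<tool_call>"
    close_marker = "</tool_call>"

    def has_pending_tool_call(self, text: str) -> bool:
        # Pending = an opening marker that has not been closed yet.
        return text.count(self.open_marker) > text.count(self.close_marker)


class MiniMaxToolParser(ToolParser):
    open_marker = "<minimax:tool_call>"
    close_marker = "</minimax:tool_call>"


class LlamaToolParser(ToolParser):
    open_marker = "<function="
    close_marker = "</function>"


class MistralToolParser(ToolParser):
    open_marker = "[TOOL_CALLS]"

    def has_pending_tool_call(self, text: str) -> bool:
        # [TOOL_CALLS] has no closing tag; simplified here to marker presence.
        return self.open_marker in text
```

Note that the base-class marker "<tool_call>" is not a substring of "<minimax:tool_call>", which is exactly why the old hardcoded check never fired for MiniMax streams.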
Prefix cache Turn 2 TTFT improved 21.5% (0.775s -> 0.608s); T4/T1 ratio improved from 2.09x to 2.03x. Zero regressions in decode throughput, tool calling accuracy, or long generation stability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
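The frequency-aware eviction behind these cache numbers can be sketched as below. CacheBlock, access_count, and popleft_lfu() are named in the PR; the FreeQueue wrapper and other details are illustrative, not the project's actual structures.

```python
from collections import OrderedDict
from dataclasses import dataclass
from itertools import islice

LFU_WINDOW = 8  # examine the 8 least-recently-used candidates (per the PR)


@dataclass
class CacheBlock:
    block_id: int
    access_count: int = 0  # incremented on every touch()


class FreeQueue:
    """Illustrative LRU queue with a frequency-aware pop."""

    def __init__(self) -> None:
        self._queue = OrderedDict()  # insertion order = LRU order

    def append(self, block: CacheBlock) -> None:
        self._queue[block.block_id] = block

    def popleft_lfu(self) -> CacheBlock:
        # Pure LRU would evict the head. Instead, look at a small window of
        # LRU candidates and evict the least-frequently-used one, so a
        # system-prompt block touched 100x outlives one-shot blocks.
        window = list(islice(self._queue.values(), LFU_WINDOW))
        victim = min(window, key=lambda b: b.access_count)
        return self._queue.pop(victim.block_id)
```

The window keeps the policy bounded: recency still dominates overall (only the oldest 8 blocks compete), while frequency breaks ties within that window.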
raullenchai pushed a commit that referenced this pull request on Mar 21, 2026
- #2: /v1/models now returns both the full HF name and the alias used to start the model, so SDK users see a recognizable name
- #3: API responses return the actual loaded model name instead of echoing back whatever the client sent (prevents "gpt-4o" confusion)
- #4: SECURITY WARNING downgraded to debug — local inference doesn't need auth; the warning was causing unnecessary anxiety for new users
- Pass alias from CLI to server for /v1/models listing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
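The /v1/models change in #2 might look roughly like this. This is a hypothetical sketch: the function name and payload fields are illustrative, and only the "return both the HF name and the alias" behavior comes from the commit message.

```python
from typing import Optional

# Hypothetical sketch: list both the full HF repo name and the CLI alias
# as /v1/models entries, so SDK clients see a name they recognize.
def list_models(hf_name: str, alias: Optional[str]) -> dict:
    ids = [hf_name]
    if alias and alias != hf_name:
        ids.append(alias)
    return {
        "object": "list",
        "data": [
            {"id": model_id, "object": "model", "owned_by": "local"}
            for model_id in ids
        ],
    }
```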
raullenchai added a commit that referenced this pull request on Mar 21, 2026
…dels (#41)

* fix: 4 UX friction points from user testing v2

  - #2: /v1/models now returns both the full HF name and the alias used to start the model, so SDK users see a recognizable name
  - #3: API responses return the actual loaded model name instead of echoing back whatever the client sent (prevents "gpt-4o" confusion)
  - #4: SECURITY WARNING downgraded to debug — local inference doesn't need auth; the warning was causing unnecessary anxiety for new users
  - Pass alias from CLI to server for /v1/models listing

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: non-streaming Anthropic response also returns actual model name

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Implements the top 3 Tier 1 optimizations for MiniMax-M2.5 + OpenClaw on M3 Ultra, building on the merged Tier 0 work (GC control, pinned prefix cache, schema hardening).
1. Fix Streaming Tool Truncation
Problem: stream_chat_completion() only checked for "<tool_call>" in accumulated text. MiniMax uses <minimax:tool_call>, Llama uses <function=, and Mistral uses [TOOL_CALLS] — so the fallback never triggered for these parsers, and incomplete tool calls were silently lost.

Fix: Add has_pending_tool_call() to the ToolParser base class with parser-specific overrides. The server fallback now uses tool_parser.has_pending_tool_call() instead of a hardcoded string check. Also hardens extract_tool_calls() with partial regex patterns (INVOKE_PARTIAL, PARAM_PARTIAL) to extract tool calls even when </parameter> or </invoke> closing tags are missing due to truncated streams.

2. Frequency-Aware Cache Eviction (LRU-LFU Hybrid)
Adds an access_count field to CacheBlock, incremented on every touch(). A new popleft_lfu() method examines a window of 8 LRU candidates and picks the one with the lowest access_count. Used in allocate_block() and evict_lru_blocks() when prefix caching is enabled.

3. KV Block Reuse Ordering
Problem: get_block() returned an arbitrary block via next(iter(blocks.values())).

Fix: Add get_best_block() to BlockHashToBlockMap, returning the block with the highest access_count. Used in get_computed_blocks() for better cache locality.

Files Changed
- vllm_mlx/server.py: use tool_parser.has_pending_tool_call() in streaming fallback
- vllm_mlx/tool_parsers/abstract_tool_parser.py: add has_pending_tool_call() base method
- vllm_mlx/tool_parsers/minimax_tool_parser.py: override has_pending_tool_call(), add partial regex patterns for truncated input
- vllm_mlx/tool_parsers/llama_tool_parser.py: override has_pending_tool_call() for <function=
- vllm_mlx/tool_parsers/mistral_tool_parser.py: override has_pending_tool_call() for [TOOL_CALLS]
- vllm_mlx/paged_cache.py: access_count on CacheBlock, popleft_lfu(), get_best_block(), frequency-aware eviction
- README.md

Test plan
- pytest tests/test_paged_cache.py tests/test_prefix_cache.py tests/test_tool_parsers.py — all 136 tests pass
- Start the server with --tool-call-parser minimax --enable-auto-tool-choice, send streaming tool call requests, verify the fallback triggers on truncated streams
- Run benchmark_minmax.py to compare prefix cache Turn4/Turn1 ratio and tool call reliability vs the Tier 0 baseline

🤖 Generated with Claude Code
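The reuse-ordering change in item 3 can be sketched as below. The names get_best_block, BlockHashToBlockMap, and access_count come from the PR; the internals are simplified assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheBlock:
    block_id: int
    access_count: int = 0


class BlockHashToBlockMap:
    """Simplified map from block hash to the blocks sharing that hash."""

    def __init__(self) -> None:
        self._map = {}  # block_hash -> {block_id: CacheBlock}

    def add(self, block_hash: int, block: CacheBlock) -> None:
        self._map.setdefault(block_hash, {})[block.block_id] = block

    def get_best_block(self, block_hash: int) -> Optional[CacheBlock]:
        blocks = self._map.get(block_hash)
        if not blocks:
            return None
        # Old behavior: next(iter(blocks.values())) — an arbitrary pick.
        # New behavior: prefer the most-frequently-accessed block, which is
        # likelier to be hot in the cache hierarchy.
        return max(blocks.values(), key=lambda b: b.access_count)
```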