feat: TTFT cache fix, MiniMax reasoning parser, logprobs API, tool logits#7

Merged
raullenchai merged 1 commit into main from feat/grammar-error-handling
Feb 25, 2026
Conversation

@raullenchai (Owner)

Summary

  • Fix broken prompt cache — _save_cache_snapshot() was called after the yield in a generator, and the caller breaks before it executes; the save now happens before the yield. Results: 10K token prompt 127s → 0.32s (113x speedup on cache hit)
  • MiniMax reasoning parser — Heuristic pattern matching for inline reasoning (no <think> tags). 0/10 reasoning leaks in benchmarks
  • Logprobs API — top_logprobs support in chat completions (streaming + non-streaming)
  • Structural tag constraint — Parameter-level JSON schema validation in tool logits processor
  • prompt_tokens fix — Accurate reporting via StreamingOutput.prompt_tokens (removed double tokenization)
  • Tool-use system prompt — Auto-inject concise tool-use instructions when tools are provided
  • CLI args — --prefill-step-size, --kv-bits, --kv-group-size for TTFT tuning
  • TTFT timing logs — Breakdown of tokenize, prefill, and total time per request
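The cache fix above hinges on generator semantics: code placed after a generator's final `yield` never runs when the consumer breaks out of its loop on the last chunk, because the generator is never resumed past that yield. A minimal sketch (the chunk values and the `saved` list are illustrative, not the project's actual API):

```python
def generate_broken(chunks, saved):
    """Bug: the cache snapshot is taken AFTER the final yield."""
    for chunk in chunks:
        yield chunk
    # Never reached if the caller breaks on the last chunk: the
    # generator stays suspended at the final yield and is then closed.
    saved.append("snapshot")

def generate_fixed(chunks, saved):
    """Fix: snapshot BEFORE yielding the final chunk."""
    for i, chunk in enumerate(chunks):
        if i == len(chunks) - 1:
            saved.append("snapshot")  # runs even if the caller stops here
        yield chunk

def consume(gen):
    # Typical streaming caller: breaks as soon as the final chunk arrives.
    for chunk in gen:
        if chunk == "done":
            break

broken_saves, fixed_saves = [], []
consume(generate_broken(["a", "b", "done"], broken_saves))
consume(generate_fixed(["a", "b", "done"], fixed_saves))
print(broken_saves)  # []
print(fixed_saves)   # ['snapshot']
```

This is why every request fell back to full prefill: the snapshot-saving line existed but was dead code on the streaming path.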

Benchmark Results

| Prompt Size | Cold | Cache Hit | Speedup |
|---|---|---|---|
| 500 tokens | 2.76s | 0.22s | 12.5x |
| 2,000 tokens | 6.42s | 0.24s | 26.7x |
| 5,000 tokens | 16.4s | 0.28s | 58.2x |
| 10,000 tokens | 36.4s | 0.32s | 113.7x |

Test plan

  • 899 unit tests pass (all non-async tests)
  • Tool calling: 5/5 correct in real-world tests
  • Reasoning separation: 0/10 leaks
  • Streaming + non-streaming verified
  • Cache hit verified across prompt sizes

🤖 Generated with Claude Code

The prompt cache was saving state AFTER yielding the finished chunk,
but the caller breaks before the generator resumes — so
_save_cache_snapshot() never executed. Every request did full prefill
regardless of cache state (only 7 template tokens matched).

Fix: move _save_cache_snapshot() BEFORE the final yield.

Results (10K token prompt): 127s → 0.32s (113x speedup on cache hit)
Results (5K token prompt): 21s → 0.28s (58x speedup on cache hit)
Partial cache hits also work: same system prompt + different user
message → 0.45s vs 16.4s cold (36x speedup).
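The partial-hit numbers follow from prefix matching on tokens: only the longest common token prefix between the cached prompt and the new prompt is reusable, so a shared system prompt is skipped and only the new user message needs prefill. A sketch of the idea (the helper name and token values are illustrative, not the engine's actual cache API):

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Length of the longest shared token prefix (illustrative)."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Same system prompt, different user message: the prefix is reusable.
cached = [101, 7, 7, 7, 42, 43, 44]  # system prompt + old user message
new = [101, 7, 7, 7, 99, 98]         # system prompt + new user message
reused = common_prefix_len(cached, new)
to_prefill = new[reused:]            # only these tokens need prefill
print(reused, to_prefill)  # 4 [99, 98]
```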

Additional changes:
- Add prompt_tokens to StreamingOutput, removing double tokenization
  in SimpleEngine (was encoding the full prompt twice per request)
- Add --prefill-step-size CLI arg (tune prefill chunk size)
- Add --kv-bits / --kv-group-size CLI args (KV cache quantization)
- Add TTFT breakdown logging (tokenize, prefill, total times)
- Set _reasoning_parser_name in cli.py (was only set in server.py)
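The TTFT breakdown logging splits time-to-first-token into its two dominant phases. A hedged sketch of the measurement (function names and the log format are assumptions, not the project's actual logging code):

```python
import time

def timed_prefill(tokenize, prefill, prompt):
    """Illustrative TTFT breakdown: tokenize vs prefill vs total."""
    t0 = time.perf_counter()
    tokens = tokenize(prompt)
    t1 = time.perf_counter()
    state = prefill(tokens)  # model forward pass over the prompt
    t2 = time.perf_counter()
    log = {"tokenize_s": t1 - t0, "prefill_s": t2 - t1, "total_s": t2 - t0}
    return state, log

# Stand-in tokenizer/prefill so the sketch runs on its own.
state, log = timed_perfill_demo = timed_prefill(
    lambda p: p.split(), lambda t: len(t), "a b c"
)
print(sorted(log))  # ['prefill_s', 'tokenize_s', 'total_s']
```

Logging the two phases separately shows whether a slow TTFT is tokenizer-bound or prefill-bound, which is what the --prefill-step-size knob tunes.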

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@raullenchai raullenchai merged commit 10ab7ad into main Feb 25, 2026
raullenchai added a commit that referenced this pull request Mar 15, 2026
…ing edge case

Addresses review findings #1, #3, #7:

1. Thread safety (high): Split step() into _step_no_queue() (GPU work,
   safe for executor thread) and _distribute_outputs() (queue writes,
   event loop thread only). _process_loop now calls _step_no_queue in
   the executor and distributes outputs on the event loop thread,
   preventing races on asyncio.Queue which is not thread-safe.

2. Stop-sequence streaming (high): When a stop string appears mid-token,
   the valid prefix before the marker is now emitted in new_text instead
   of being silently dropped. Streaming clients no longer lose content.

3. Empty-string truthiness (medium): Stop-string finalization now uses
   an explicit `stop_trimmed` flag instead of `if not request.output_text`,
   which is falsy for empty string. A stop match at position 0 no longer
   re-decodes the full token sequence and leaks the stop text.
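Findings 2 and 3 above can be illustrated together: when a stop string lands mid-chunk, emit only the valid prefix before the marker rather than dropping the whole pending text, and signal the stop with an explicit flag so an empty prefix (stop match at position 0) is not confused with "no text yet". A minimal sketch (the function name is illustrative):

```python
def split_on_stop(pending_text, stop_strings):
    """Return (text_to_emit, stopped). On a stop match, emit the valid
    prefix before the marker instead of silently dropping it."""
    for stop in stop_strings:
        idx = pending_text.find(stop)
        if idx != -1:
            # Explicit flag: a match at index 0 yields ("", True),
            # which `if not text` would wrongly treat as "no stop".
            return pending_text[:idx], True
    return pending_text, False

# A decoded chunk can contain text plus the stop marker:
print(split_on_stop("Hello!</s>", ["</s>"]))  # ('Hello!', True)
print(split_on_stop("Hello!", ["</s>"]))      # ('Hello!', False)
# Stop at position 0: empty prefix, but the flag still says stop.
print(split_on_stop("</s>extra", ["</s>"]))   # ('', True)
```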

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>