feat: TTFT cache fix, MiniMax reasoning parser, logprobs API, tool logits #7
Merged
raullenchai merged 1 commit into main on Feb 25, 2026
Conversation
The prompt cache was saving state AFTER yielding the finished chunk, but the caller breaks out of the loop before the generator resumes — so _save_cache_snapshot() never executed. Every request did a full prefill regardless of cache state (only 7 template tokens matched).

Fix: move _save_cache_snapshot() BEFORE the final yield.

Results:
- 10K token prompt: 127s → 0.32s (113x speedup on cache hit)
- 5K token prompt: 21s → 0.28s (58x speedup on cache hit)
- Partial cache hits also work: same system prompt + different user message → 0.45s vs 16.4s cold (36x speedup)

Additional changes:
- Add prompt_tokens to StreamingOutput, removing double tokenization in SimpleEngine (the full prompt was encoded twice per request)
- Add --prefill-step-size CLI arg (tune prefill chunk size)
- Add --kv-bits / --kv-group-size CLI args (KV cache quantization)
- Add TTFT breakdown logging (tokenize, prefill, total times)
- Set _reasoning_parser_name in cli.py (previously set only in server.py)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
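The bug above can be reproduced with a minimal sketch (hypothetical names; only the yield-ordering pattern is taken from the PR): any code placed after the final `yield` is dead if the consumer `break`s on the finished chunk, because a `break` never resumes the generator.

```python
saved = []  # stands in for the persisted cache snapshot


def consume(gen):
    """Typical streaming caller: breaks as soon as it sees the finished chunk."""
    out = []
    for chunk in gen:
        out.append(chunk["text"])
        if chunk["finished"]:
            break  # generator is never resumed past this yield
    return "".join(out)


def buggy_stream():
    yield {"text": "hi", "finished": False}
    yield {"text": "", "finished": True}
    saved.append("snapshot")  # never reached for a breaking caller


def fixed_stream():
    yield {"text": "hi", "finished": False}
    saved.append("snapshot")  # save BEFORE the final yield
    yield {"text": "", "finished": True}


consume(buggy_stream())
assert saved == []  # snapshot was silently skipped
consume(fixed_stream())
assert saved == ["snapshot"]  # snapshot taken before the caller can break
```

This is why moving `_save_cache_snapshot()` before the final yield turns every repeat request into a cache hit: the snapshot now runs unconditionally on the last resume the caller actually performs.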
raullenchai added a commit that referenced this pull request on Mar 15, 2026
…ing edge case

Addresses review findings #1, #3, #7:

1. Thread safety (high): Split step() into _step_no_queue() (GPU work, safe for the executor thread) and _distribute_outputs() (queue writes, event-loop thread only). _process_loop now calls _step_no_queue in the executor and distributes outputs on the event-loop thread, preventing races on asyncio.Queue, which is not thread-safe.

2. Stop-sequence streaming (high): When a stop string appears mid-token, the valid prefix before the marker is now emitted in new_text instead of being silently dropped. Streaming clients no longer lose content.

3. Empty-string truthiness (medium): Stop-string finalization now uses an explicit `stop_trimmed` flag instead of `if not request.output_text`, which is falsy for the empty string. A stop match at position 0 no longer re-decodes the full token sequence and leaks the stop text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
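The thread-safety split in finding 1 can be sketched as follows (a minimal model, not the engine's actual code; class and method names follow the commit's naming, everything else is assumed): compute runs in an executor thread, while all `asyncio.Queue` writes stay on the event-loop thread.

```python
import asyncio


class Engine:
    """Minimal sketch of the step()/distribute split from finding 1."""

    def __init__(self):
        self.out_queue = asyncio.Queue()

    def _step_no_queue(self):
        # GPU/compute work only -- touches no asyncio objects,
        # so it is safe to run in a worker thread.
        return ["token"]

    def _distribute_outputs(self, outputs):
        # asyncio.Queue is not thread-safe; writes must happen
        # on the event-loop thread, never in the executor.
        for out in outputs:
            self.out_queue.put_nowait(out)

    async def _process_loop(self, steps=3):
        loop = asyncio.get_running_loop()
        for _ in range(steps):
            # Heavy work off-loop, queue writes back on-loop.
            outputs = await loop.run_in_executor(None, self._step_no_queue)
            self._distribute_outputs(outputs)


async def main():
    eng = Engine()
    await eng._process_loop()
    return eng.out_queue.qsize()


print(asyncio.run(main()))  # 3
```

The key design point is that `run_in_executor` returns control to the event loop, so `_distribute_outputs` always executes on the loop thread and no lock is needed around the queue.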
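Findings 2 and 3 can be illustrated together with a hypothetical helper (not the PR's actual code): when a stop marker lands mid-chunk, emit the valid prefix, and report the match with an explicit flag rather than relying on the truthiness of the emitted text.

```python
def split_on_stop(new_text, stop_strings):
    """Return (text_to_emit, stop_trimmed).

    Sketch of the stop-string handling described in the commit:
    the prefix before a stop marker is emitted instead of dropped,
    and an explicit boolean flag records the match.
    """
    for stop in stop_strings:
        idx = new_text.find(stop)
        if idx != -1:
            # idx may be 0, making the prefix "" -- the explicit flag
            # avoids the empty-string truthiness bug from finding 3.
            return new_text[:idx], True
    return new_text, False


# Mid-chunk stop: the prefix "Hello" is emitted, not dropped.
assert split_on_stop("Hello</s>world", ["</s>"]) == ("Hello", True)
# Stop match at position 0: prefix is "" but stop_trimmed is still True.
assert split_on_stop("</s>rest", ["</s>"]) == ("", True)
# No stop string present: pass the chunk through unchanged.
assert split_on_stop("plain", ["</s>"]) == ("plain", False)
```

A caller testing `if emitted_text:` would treat the second case as "no stop found", which is exactly the class of bug the `stop_trimmed` flag removes.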
Summary
_save_cache_snapshot() was called after yield in a generator; the caller breaks before it executes. Moved before the yield. Results: 10K token prompt, 127s → 0.32s (113x speedup on cache hit).

- MiniMax reasoning parser (`<think>` tags); 0/10 reasoning leaks in benchmark
- top_logprobs support in chat completions (streaming + non-streaming)
- StreamingOutput.prompt_tokens (removed double tokenization)
- --prefill-step-size, --kv-bits, --kv-group-size for TTFT tuning

Benchmark Results
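For the top_logprobs item, a sketch of the response shape is useful, assuming the server mirrors the standard OpenAI-style chat-completions logprobs format (each emitted token carries its logprob plus a ranked list of alternatives); the helper name here is hypothetical.

```python
import math


def logprobs_entry(token, logprob, top):
    """Build one OpenAI-style logprobs entry for a sampled token."""
    return {
        "token": token,
        "logprob": logprob,
        "top_logprobs": [{"token": t, "logprob": lp} for t, lp in top],
    }


entry = logprobs_entry("Hello", -0.12, [("Hello", -0.12), ("Hi", -2.3)])
# Probabilities are recovered with exp(logprob).
assert math.isclose(math.exp(entry["logprob"]), 0.8869, rel_tol=1e-3)
assert entry["top_logprobs"][1]["token"] == "Hi"
```

In streaming mode the same entries would arrive one per chunk; in non-streaming mode they are collected into a single list per choice.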
Test plan
🤖 Generated with Claude Code