feat: TTFT cache fix, MiniMax reasoning parser, logprobs API, tool logits#7

Merged
raullenchai merged 1 commit into main from feat/grammar-error-handling
Feb 25, 2026
Conversation

@raullenchai (Owner)

Summary

  • Fix broken prompt cache — _save_cache_snapshot() was called after the yield in a generator, and the caller breaks before it executes; the save now happens before the yield. Results: 10K token prompt 127s → 0.32s (113x speedup on cache hit)
  • MiniMax reasoning parser — Heuristic pattern matching for inline reasoning (no <think> tags). 0/10 reasoning leaks in benchmarks
  • Logprobs API — top_logprobs support in chat completions (streaming + non-streaming)
  • Structural tag constraint — Parameter-level JSON schema validation in tool logits processor
  • prompt_tokens fix — Accurate reporting via StreamingOutput.prompt_tokens (removed double tokenization)
  • Tool-use system prompt — Auto-inject concise tool-use instructions when tools are provided
  • CLI args — --prefill-step-size, --kv-bits, --kv-group-size for TTFT tuning
  • TTFT timing logs — Breakdown of tokenize, prefill, and total time per request
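The cache fix above hinges on generator semantics: code placed after a generator's final `yield` never runs when the consumer breaks out of its loop on the last chunk, because the generator is never resumed past that yield. A minimal sketch (the chunk values and the `saved` list are illustrative, not the project's actual API):

```python
def generate_broken(chunks, saved):
    """Bug: the cache snapshot is taken AFTER the final yield."""
    for chunk in chunks:
        yield chunk
    # Never reached if the caller breaks on the last chunk: the
    # generator stays suspended at the final yield and is then closed.
    saved.append("snapshot")

def generate_fixed(chunks, saved):
    """Fix: snapshot BEFORE yielding the final chunk."""
    for i, chunk in enumerate(chunks):
        if i == len(chunks) - 1:
            saved.append("snapshot")  # runs even if the caller stops here
        yield chunk

def consume(gen):
    # Typical streaming caller: breaks as soon as the final chunk arrives.
    for chunk in gen:
        if chunk == "done":
            break

broken_saves, fixed_saves = [], []
consume(generate_broken(["a", "b", "done"], broken_saves))
consume(generate_fixed(["a", "b", "done"], fixed_saves))
print(broken_saves)  # []
print(fixed_saves)   # ['snapshot']
```

This is why every request fell back to full prefill: the snapshot-saving line existed but was dead code on the streaming path.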

Benchmark Results

| Prompt Size | Cold | Cache Hit | Speedup |
|---|---|---|---|
| 500 tokens | 2.76s | 0.22s | 12.5x |
| 2,000 tokens | 6.42s | 0.24s | 26.7x |
| 5,000 tokens | 16.4s | 0.28s | 58.2x |
| 10,000 tokens | 36.4s | 0.32s | 113.7x |

Test plan

  • 899 unit tests pass (all non-async tests)
  • Tool calling: 5/5 correct in real-world tests
  • Reasoning separation: 0/10 leaks
  • Streaming + non-streaming verified
  • Cache hit verified across prompt sizes

🤖 Generated with Claude Code

The prompt cache was saving state AFTER yielding the finished chunk,
but the caller breaks before the generator resumes — so
_save_cache_snapshot() never executed. Every request did full prefill
regardless of cache state (only 7 template tokens matched).

Fix: move _save_cache_snapshot() BEFORE the final yield.

Results (10K token prompt): 127s → 0.32s (113x speedup on cache hit)
Results (5K token prompt): 21s → 0.28s (58x speedup on cache hit)
Partial cache hits also work: same system prompt + different user
message → 0.45s vs 16.4s cold (36x speedup).
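The partial-hit numbers follow from prefix matching on tokens: only the longest common token prefix between the cached prompt and the new prompt is reusable, so a shared system prompt is skipped and only the new user message needs prefill. A sketch of the idea (the helper name and token values are illustrative, not the engine's actual cache API):

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Length of the longest shared token prefix (illustrative)."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Same system prompt, different user message: the prefix is reusable.
cached = [101, 7, 7, 7, 42, 43, 44]  # system prompt + old user message
new = [101, 7, 7, 7, 99, 98]         # system prompt + new user message
reused = common_prefix_len(cached, new)
to_prefill = new[reused:]            # only these tokens need prefill
print(reused, to_prefill)  # 4 [99, 98]
```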

Additional changes:
- Add prompt_tokens to StreamingOutput, removing double tokenization
  in SimpleEngine (was encoding the full prompt twice per request)
- Add --prefill-step-size CLI arg (tune prefill chunk size)
- Add --kv-bits / --kv-group-size CLI args (KV cache quantization)
- Add TTFT breakdown logging (tokenize, prefill, total times)
- Set _reasoning_parser_name in cli.py (was only set in server.py)
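The TTFT breakdown logging splits time-to-first-token into its two dominant phases. A hedged sketch of the measurement (function names and the log format are assumptions, not the project's actual logging code):

```python
import time

def timed_prefill(tokenize, prefill, prompt):
    """Illustrative TTFT breakdown: tokenize vs prefill vs total."""
    t0 = time.perf_counter()
    tokens = tokenize(prompt)
    t1 = time.perf_counter()
    state = prefill(tokens)  # model forward pass over the prompt
    t2 = time.perf_counter()
    log = {"tokenize_s": t1 - t0, "prefill_s": t2 - t1, "total_s": t2 - t0}
    return state, log

# Stand-in tokenizer/prefill so the sketch runs on its own.
state, log = timed_perfill_demo = timed_prefill(
    lambda p: p.split(), lambda t: len(t), "a b c"
)
print(sorted(log))  # ['prefill_s', 'tokenize_s', 'total_s']
```

Logging the two phases separately shows whether a slow TTFT is tokenizer-bound or prefill-bound, which is what the --prefill-step-size knob tunes.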

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@raullenchai raullenchai merged commit 10ab7ad into main Feb 25, 2026
raullenchai added a commit that referenced this pull request Mar 15, 2026
…ing edge case

Addresses review findings #1, #3, #7:

1. Thread safety (high): Split step() into _step_no_queue() (GPU work,
   safe for executor thread) and _distribute_outputs() (queue writes,
   event loop thread only). _process_loop now calls _step_no_queue in
   the executor and distributes outputs on the event loop thread,
   preventing races on asyncio.Queue which is not thread-safe.

2. Stop-sequence streaming (high): When a stop string appears mid-token,
   the valid prefix before the marker is now emitted in new_text instead
   of being silently dropped. Streaming clients no longer lose content.

3. Empty-string truthiness (medium): Stop-string finalization now uses
   an explicit `stop_trimmed` flag instead of `if not request.output_text`,
   which is falsy for empty string. A stop match at position 0 no longer
   re-decodes the full token sequence and leaks the stop text.
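Findings 2 and 3 above can be illustrated together: when a stop string lands mid-chunk, emit only the valid prefix before the marker rather than dropping the whole pending text, and signal the stop with an explicit flag so an empty prefix (stop match at position 0) is not confused with "no text yet". A minimal sketch (the function name is illustrative):

```python
def split_on_stop(pending_text, stop_strings):
    """Return (text_to_emit, stopped). On a stop match, emit the valid
    prefix before the marker instead of silently dropping it."""
    for stop in stop_strings:
        idx = pending_text.find(stop)
        if idx != -1:
            # Explicit flag: a match at index 0 yields ("", True),
            # which `if not text` would wrongly treat as "no stop".
            return pending_text[:idx], True
    return pending_text, False

# A decoded chunk can contain text plus the stop marker:
print(split_on_stop("Hello!</s>", ["</s>"]))  # ('Hello!', True)
print(split_on_stop("Hello!", ["</s>"]))      # ('Hello!', False)
# Stop at position 0: empty prefix, but the flag still says stop.
print(split_on_stop("</s>extra", ["</s>"]))   # ('', True)
```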

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>