feat: Prompt cache for SimpleEngine + tool logits safety #3
Merged

raullenchai merged 1 commit into main on Feb 25, 2026

Conversation
Add KV cache reuse across requests in SimpleEngine mode. When consecutive requests share a common prefix (system prompt + tools), only the new suffix tokens are processed, dramatically reducing prefill.

- MLXLanguageModel: persistent prompt_cache with prefix matching
- Trim cache based on actual offset (includes generated tokens)
- Handle empty suffix (exact repeat) by re-processing last token
- Tool logits: add escape hatch (max 50 consecutive biased tokens)

OpenClaw benchmark: 23-30s → 7-10s per request (3-4x speedup), saving 12,000-18,000 tokens of prefill per warm request.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
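The prefix-matching, trimming, and empty-suffix handling described above could look like the following minimal sketch. All function and variable names here are hypothetical stand-ins, not the actual implementation:

```python
# Hypothetical sketch of prefix reuse: compare the new prompt against the
# tokens already covered by the cache, trim the cache back to the shared
# prefix, and prefill only the remaining suffix tokens.

def common_prefix_len(cached: list[int], prompt: list[int]) -> int:
    """Length of the shared token prefix between the cache and a new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

def plan_prefill(cached_tokens: list[int], prompt: list[int]):
    """Return (tokens_to_trim_from_cache, suffix_tokens_to_process)."""
    shared = common_prefix_len(cached_tokens, prompt)
    # Trim everything past the shared prefix. The cache offset includes
    # previously generated tokens, which are never part of the new prompt.
    trim = len(cached_tokens) - shared
    suffix = prompt[shared:]
    if not suffix:
        # Exact repeat of a cached prompt: re-process the last token so
        # generation still has at least one input step.
        trim += 1
        suffix = prompt[-1:]
    return trim, suffix
```

A warm request with a long shared system prompt then only pays prefill for the few suffix tokens, which is where the 3-4x speedup in the benchmark above comes from.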
raullenchai added a commit that referenced this pull request on Mar 15, 2026
…ing edge case

Addresses review findings #1, #3, #7:

1. Thread safety (high): Split step() into _step_no_queue() (GPU work, safe for the executor thread) and _distribute_outputs() (queue writes, event loop thread only). _process_loop now calls _step_no_queue in the executor and distributes outputs on the event loop thread, preventing races on asyncio.Queue, which is not thread-safe.
2. Stop-sequence streaming (high): When a stop string appears mid-token, the valid prefix before the marker is now emitted in new_text instead of being silently dropped. Streaming clients no longer lose content.
3. Empty-string truthiness (medium): Stop-string finalization now uses an explicit `stop_trimmed` flag instead of `if not request.output_text`, which is falsy for an empty string. A stop match at position 0 no longer re-decodes the full token sequence and leaks the stop text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
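The thread-safety split in finding 1 can be sketched roughly as below. Only the _step_no_queue/_distribute_outputs division and the executor-vs-loop rationale come from the commit message; the class shape and bodies are illustrative stand-ins:

```python
import asyncio

class EngineLoop:
    """Illustrative sketch: GPU-bound work runs in an executor thread,
    while asyncio.Queue writes stay on the event loop thread, because
    asyncio.Queue is not thread-safe."""

    def __init__(self):
        self.queues: dict[str, asyncio.Queue] = {}

    def _step_no_queue(self):
        # Heavy, executor-safe work only (stand-in for a model forward
        # pass). Must not touch any asyncio.Queue.
        return [("req-1", "token")]

    def _distribute_outputs(self, outputs):
        # Event-loop thread only: push each output to its request queue.
        for request_id, text in outputs:
            self.queues.setdefault(request_id, asyncio.Queue()).put_nowait(text)

    async def _process_loop(self):
        loop = asyncio.get_running_loop()
        # One iteration: compute off-thread, then distribute back on the
        # event loop thread via the awaited executor result.
        outputs = await loop.run_in_executor(None, self._step_no_queue)
        self._distribute_outputs(outputs)
```

Because `_distribute_outputs` only ever runs after the `await` resumes on the event loop, no queue is touched from the executor thread.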
raullenchai pushed a commit that referenced this pull request on Mar 21, 2026
- #2: /v1/models now returns both the full HF name and the alias used to start the model, so SDK users see a recognizable name
- #3: API responses return the actual loaded model name instead of echoing back whatever the client sent (prevents "gpt-4o" confusion)
- #4: SECURITY WARNING downgraded to debug — local inference doesn't need auth; the warning was causing unnecessary anxiety for new users
- Pass the alias from the CLI to the server for the /v1/models listing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai added a commit that referenced this pull request on Mar 21, 2026
…dels (#41)

* fix: 4 UX friction points from user testing v2

  - #2: /v1/models now returns both the full HF name and the alias used to start the model, so SDK users see a recognizable name
  - #3: API responses return the actual loaded model name instead of echoing back whatever the client sent (prevents "gpt-4o" confusion)
  - #4: SECURITY WARNING downgraded to debug — local inference doesn't need auth, the warning was causing unnecessary anxiety for new users
  - Pass alias from CLI to server for /v1/models listing

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: non-streaming Anthropic response also returns actual model name

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Summary

Performance Impact (OpenClaw workload)

How it works

- MLXLanguageModel maintains a persistent _prompt_cache (mlx-lm native KV cache)
- stream_generate() is called with the pre-populated cache

Test plan
🤖 Generated with Claude Code
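The tool-logits escape hatch from the headline commit (stop biasing after 50 consecutive biased tokens) could be sketched as follows; all names are hypothetical and the bias strategy is a stand-in for whatever the engine actually does:

```python
# Hypothetical sketch: bias sampling toward allowed tool tokens, but give
# up after 50 consecutive biased steps so generation can never get stuck
# forcing tool syntax forever.

MAX_BIASED_STEPS = 50

class ToolLogitsBias:
    def __init__(self, allowed_token_ids: set[int]):
        self.allowed = allowed_token_ids
        self.consecutive_biased = 0

    def apply(self, logits: list[float]) -> list[float]:
        """Mask disallowed tokens, unless the escape hatch has tripped."""
        if self.consecutive_biased >= MAX_BIASED_STEPS:
            return logits  # escape hatch: pass logits through unchanged
        self.consecutive_biased += 1
        neg_inf = float("-inf")
        return [x if i in self.allowed else neg_inf
                for i, x in enumerate(logits)]

    def reset(self):
        # Call when biasing is no longer required (e.g. the forced tool
        # structure has been completed), re-arming the counter.
        self.consecutive_biased = 0
```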