feat: Prompt cache for SimpleEngine + tool logits safety #3

Merged
raullenchai merged 1 commit into main from feat/prompt-cache-simple-engine
Feb 25, 2026
Conversation

@raullenchai
Owner

Summary

  • Prompt cache for SimpleEngine: Persistent KV cache across requests with automatic prefix matching. When consecutive requests share a common prefix (system prompt + tools), only new suffix tokens are processed.
  • Tool logits safety: Added escape hatch (max 50 consecutive biased tokens) to prevent stuck patterns in the logits processor.
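
The escape-hatch idea can be sketched as a small logits processor that counts consecutive biased steps and backs off once the limit is hit. All names here (`ToolLogitsProcessor`, `boosted_token_ids`) are illustrative assumptions, not the project's actual API:

```python
# Sketch of the escape-hatch pattern described above (names are assumptions).
MAX_CONSECUTIVE_BIASED = 50

class ToolLogitsProcessor:
    """Biases logits toward tool-call tokens, with an escape hatch so the
    model cannot get stuck emitting the boosted pattern forever."""

    def __init__(self, boosted_token_ids, bias=5.0):
        self.boosted_token_ids = set(boosted_token_ids)
        self.bias = bias
        self.consecutive_biased = 0

    def __call__(self, token_ids, logits):
        # Escape hatch: after too many biased tokens in a row, skip the
        # bias once and reset the counter.
        if self.consecutive_biased >= MAX_CONSECUTIVE_BIASED:
            self.consecutive_biased = 0
            return logits
        last = token_ids[-1] if token_ids else None
        if last in self.boosted_token_ids:
            self.consecutive_biased += 1
        else:
            self.consecutive_biased = 0
        for tid in self.boosted_token_ids:
            logits[tid] += self.bias
        return logits
```

The counter resets whenever the model emits a non-boosted token, so the hatch only fires on genuinely stuck runs.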

Performance Impact (OpenClaw workload)

| Metric | Before | After |
| --- | --- | --- |
| Warm request latency | 23-30s | 7-10s (3-4x faster) |
| Prefill tokens saved | 0 | 12,000-18,000 per request |
| Effective throughput | 4-10 tok/s | 22-31 tok/s |

How it works

  • MLXLanguageModel maintains a persistent _prompt_cache (mlx-lm native KV cache)
  • Each request tokenizes the full prompt, finds the common prefix with the cached state
  • Trims the cache to the common prefix (accounting for generated tokens from the previous call)
  • Passes only the suffix tokens to stream_generate() with the pre-populated cache
  • Handles edge case: empty suffix (exact repeat) by trimming 1 token and re-processing
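
The prefix-matching and empty-suffix steps can be sketched as plain token-list logic. Function names are illustrative, not the actual `MLXLanguageModel` implementation; the real code would trim the mlx-lm KV cache to `keep` entries before calling `stream_generate()`:

```python
# Illustrative sketch of the prefix-reuse flow (names are assumptions).

def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the shared token prefix between two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prepare_suffix(cached_tokens: list[int], new_tokens: list[int]):
    """Return (tokens_to_keep_in_cache, suffix_to_process)."""
    keep = common_prefix_len(cached_tokens, new_tokens)
    suffix = new_tokens[keep:]
    if not suffix:
        # Exact-repeat edge case: trim one token so stream_generate()
        # still has at least one input token to process.
        keep -= 1
        suffix = new_tokens[keep:]
    return keep, suffix
```

With this split, a warm request only prefills `len(suffix)` tokens instead of the full prompt.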

Test plan

  • Cold request works correctly (no cache)
  • Warm request with shared prefix hits cache
  • Exact repeat prompt works (empty suffix edge case)
  • Tool calls parse correctly with cache
  • Multi-tool calls work end-to-end
  • OpenClaw heartbeat running 2+ hours with stable cache hits (12k-18k tokens saved)
  • Cache invalidation works when prefix changes (e.g., after compaction)
  • Tool logits safety escape hatch resets after 50 consecutive biased tokens
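
The cold/warm/exact-repeat/invalidation cases above can be exercised against a toy cache model (a sketch for illustration only; `ToyPromptCache` is hypothetical and not the real engine):

```python
# Hedged pytest-style sketch of the cache test cases, using a toy model.

class ToyPromptCache:
    def __init__(self):
        self.tokens: list[int] = []

    def process(self, prompt: list[int]) -> int:
        """Return how many suffix tokens actually had to be prefilled."""
        keep = 0
        for x, y in zip(self.tokens, prompt):
            if x != y:
                break
            keep += 1
        if keep == len(prompt):  # exact repeat: re-process the last token
            keep -= 1
        self.tokens = list(prompt)
        return len(prompt) - keep

def test_cold_then_warm():
    cache = ToyPromptCache()
    assert cache.process([1, 2, 3, 4]) == 4      # cold: full prefill
    assert cache.process([1, 2, 3, 4, 5]) == 1   # warm: suffix only

def test_exact_repeat_and_invalidation():
    cache = ToyPromptCache()
    cache.process([1, 2, 3])
    assert cache.process([1, 2, 3]) == 1         # empty-suffix edge case
    assert cache.process([9, 9]) == 2            # prefix changed: full prefill
```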

🤖 Generated with Claude Code

Add KV cache reuse across requests in SimpleEngine mode. When
consecutive requests share a common prefix (system prompt + tools),
only new suffix tokens are processed, dramatically reducing prefill.

- MLXLanguageModel: persistent prompt_cache with prefix matching
- Trim cache based on actual offset (includes generated tokens)
- Handle empty suffix (exact repeat) by re-processing last token
- Tool logits: add escape hatch (max 50 consecutive biased tokens)

OpenClaw benchmark: 23-30s → 7-10s per request (3-4x speedup),
saving 12,000-18,000 tokens of prefill per warm request.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@raullenchai raullenchai merged commit 2bc9aaa into main Feb 25, 2026
raullenchai added a commit that referenced this pull request Mar 15, 2026
…ing edge case

Addresses review findings #1, #3, #7:

1. Thread safety (high): Split step() into _step_no_queue() (GPU work,
   safe for executor thread) and _distribute_outputs() (queue writes,
   event loop thread only). _process_loop now calls _step_no_queue in
   the executor and distributes outputs on the event loop thread,
   preventing races on asyncio.Queue which is not thread-safe.

2. Stop-sequence streaming (high): When a stop string appears mid-token,
   the valid prefix before the marker is now emitted in new_text instead
   of being silently dropped. Streaming clients no longer lose content.

3. Empty-string truthiness (medium): Stop-string finalization now uses
   an explicit `stop_trimmed` flag instead of `if not request.output_text`,
   which is falsy for empty string. A stop match at position 0 no longer
   re-decodes the full token sequence and leaks the stop text.
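
Items 2 and 3 above can be sketched together: emit the valid prefix before a stop match instead of dropping the chunk, and signal the match with an explicit flag rather than testing `output_text` for truthiness. Function and field names here are assumptions, not the actual SimpleEngine code:

```python
# Sketch of stop-string finalization (names are illustrative).

def finalize_stop(output_text: str, new_text: str, stop_strings: list[str]):
    """Scan new output for a stop string.

    Returns (emit_text, done, stop_trimmed): the text safe to stream,
    whether generation should stop, and an explicit flag for the match.
    """
    combined = output_text + new_text
    for stop in stop_strings:
        idx = combined.find(stop)
        if idx != -1:
            # Item 2: emit the valid prefix of new_text that precedes the
            # stop marker instead of silently dropping the whole chunk.
            emit = combined[len(output_text):idx]
            # Item 3: return an explicit flag. Checking `not output_text`
            # would misfire when the match lands at position 0, since the
            # empty string is falsy.
            return emit, True, True
    return new_text, False, False
```

The explicit `stop_trimmed` flag means a position-0 match is handled identically to any other match, with no re-decode of the full token sequence.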

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
raullenchai pushed a commit that referenced this pull request Mar 21, 2026
- #2: /v1/models now returns both the full HF name and the alias used
  to start the model, so SDK users see a recognizable name
- #3: API responses return the actual loaded model name instead of
  echoing back whatever the client sent (prevents "gpt-4o" confusion)
- #4: SECURITY WARNING downgraded to debug — local inference doesn't
  need auth, the warning was causing unnecessary anxiety for new users
- Pass alias from CLI to server for /v1/models listing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai added a commit that referenced this pull request Mar 21, 2026
…dels (#41)

* fix: 4 UX friction points from user testing v2

- #2: /v1/models now returns both the full HF name and the alias used
  to start the model, so SDK users see a recognizable name
- #3: API responses return the actual loaded model name instead of
  echoing back whatever the client sent (prevents "gpt-4o" confusion)
- #4: SECURITY WARNING downgraded to debug — local inference doesn't
  need auth, the warning was causing unnecessary anxiety for new users
- Pass alias from CLI to server for /v1/models listing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: non-streaming Anthropic response also returns actual model name

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>