feat: Prompt cache for SimpleEngine + tool logits safety #3

Merged
raullenchai merged 1 commit into main from feat/prompt-cache-simple-engine
Feb 25, 2026
Conversation

@raullenchai
Owner

Summary

  • Prompt cache for SimpleEngine: Persistent KV cache across requests with automatic prefix matching. When consecutive requests share a common prefix (system prompt + tools), only new suffix tokens are processed.
  • Tool logits safety: Added escape hatch (max 50 consecutive biased tokens) to prevent stuck patterns in the logits processor.
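
The escape-hatch idea can be sketched as a small logits processor that counts consecutive biased steps and backs off once the limit is hit. All names here (`ToolLogitsProcessor`, `boosted_token_ids`) are illustrative assumptions, not the project's actual API:

```python
# Sketch of the escape-hatch pattern described above (names are assumptions).
MAX_CONSECUTIVE_BIASED = 50

class ToolLogitsProcessor:
    """Biases logits toward tool-call tokens, with an escape hatch so the
    model cannot get stuck emitting the boosted pattern forever."""

    def __init__(self, boosted_token_ids, bias=5.0):
        self.boosted_token_ids = set(boosted_token_ids)
        self.bias = bias
        self.consecutive_biased = 0

    def __call__(self, token_ids, logits):
        # Escape hatch: after too many biased tokens in a row, skip the
        # bias once and reset the counter.
        if self.consecutive_biased >= MAX_CONSECUTIVE_BIASED:
            self.consecutive_biased = 0
            return logits
        last = token_ids[-1] if token_ids else None
        if last in self.boosted_token_ids:
            self.consecutive_biased += 1
        else:
            self.consecutive_biased = 0
        for tid in self.boosted_token_ids:
            logits[tid] += self.bias
        return logits
```

The counter resets whenever the model emits a non-boosted token, so the hatch only fires on genuinely stuck runs.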

Performance Impact (OpenClaw workload)

| Metric | Before | After |
| --- | --- | --- |
| Warm request latency | 23-30s | 7-10s (3-4x faster) |
| Prefill tokens saved | 0 | 12,000-18,000 per request |
| Effective throughput | 4-10 tok/s | 22-31 tok/s |

How it works

  • MLXLanguageModel maintains a persistent _prompt_cache (mlx-lm native KV cache)
  • Each request tokenizes the full prompt, finds the common prefix with the cached state
  • Trims the cache to the common prefix (accounting for generated tokens from the previous call)
  • Passes only the suffix tokens to stream_generate() with the pre-populated cache
  • Handles edge case: empty suffix (exact repeat) by trimming 1 token and re-processing
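
The prefix-matching and empty-suffix steps can be sketched as plain token-list logic. Function names are illustrative, not the actual `MLXLanguageModel` implementation; the real code would trim the mlx-lm KV cache to `keep` entries before calling `stream_generate()`:

```python
# Illustrative sketch of the prefix-reuse flow (names are assumptions).

def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the shared token prefix between two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prepare_suffix(cached_tokens: list[int], new_tokens: list[int]):
    """Return (tokens_to_keep_in_cache, suffix_to_process)."""
    keep = common_prefix_len(cached_tokens, new_tokens)
    suffix = new_tokens[keep:]
    if not suffix:
        # Exact-repeat edge case: trim one token so stream_generate()
        # still has at least one input token to process.
        keep -= 1
        suffix = new_tokens[keep:]
    return keep, suffix
```

With this split, a warm request only prefills `len(suffix)` tokens instead of the full prompt.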

Test plan

  • Cold request works correctly (no cache)
  • Warm request with shared prefix hits cache
  • Exact repeat prompt works (empty suffix edge case)
  • Tool calls parse correctly with cache
  • Multi-tool calls work end-to-end
  • OpenClaw heartbeat running 2+ hours with stable cache hits (12k-18k tokens saved)
  • Cache invalidation works when prefix changes (e.g., after compaction)
  • Tool logits safety escape hatch resets after 50 consecutive biased tokens
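
The cold/warm/exact-repeat/invalidation cases above can be exercised against a toy cache model (a sketch for illustration only; `ToyPromptCache` is hypothetical and not the real engine):

```python
# Hedged pytest-style sketch of the cache test cases, using a toy model.

class ToyPromptCache:
    def __init__(self):
        self.tokens: list[int] = []

    def process(self, prompt: list[int]) -> int:
        """Return how many suffix tokens actually had to be prefilled."""
        keep = 0
        for x, y in zip(self.tokens, prompt):
            if x != y:
                break
            keep += 1
        if keep == len(prompt):  # exact repeat: re-process the last token
            keep -= 1
        self.tokens = list(prompt)
        return len(prompt) - keep

def test_cold_then_warm():
    cache = ToyPromptCache()
    assert cache.process([1, 2, 3, 4]) == 4      # cold: full prefill
    assert cache.process([1, 2, 3, 4, 5]) == 1   # warm: suffix only

def test_exact_repeat_and_invalidation():
    cache = ToyPromptCache()
    cache.process([1, 2, 3])
    assert cache.process([1, 2, 3]) == 1         # empty-suffix edge case
    assert cache.process([9, 9]) == 2            # prefix changed: full prefill
```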

🤖 Generated with Claude Code

Add KV cache reuse across requests in SimpleEngine mode. When
consecutive requests share a common prefix (system prompt + tools),
only new suffix tokens are processed, dramatically reducing prefill.

- MLXLanguageModel: persistent prompt_cache with prefix matching
- Trim cache based on actual offset (includes generated tokens)
- Handle empty suffix (exact repeat) by re-processing last token
- Tool logits: add escape hatch (max 50 consecutive biased tokens)

OpenClaw benchmark: 23-30s → 7-10s per request (3-4x speedup),
saving 12,000-18,000 tokens of prefill per warm request.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@raullenchai raullenchai merged commit 2bc9aaa into main Feb 25, 2026
raullenchai added a commit that referenced this pull request Mar 15, 2026
…ing edge case

Addresses review findings #1, #3, #7:

1. Thread safety (high): Split step() into _step_no_queue() (GPU work,
   safe for executor thread) and _distribute_outputs() (queue writes,
   event loop thread only). _process_loop now calls _step_no_queue in
   the executor and distributes outputs on the event loop thread,
   preventing races on asyncio.Queue which is not thread-safe.

2. Stop-sequence streaming (high): When a stop string appears mid-token,
   the valid prefix before the marker is now emitted in new_text instead
   of being silently dropped. Streaming clients no longer lose content.

3. Empty-string truthiness (medium): Stop-string finalization now uses
   an explicit `stop_trimmed` flag instead of `if not request.output_text`,
   which is falsy for empty string. A stop match at position 0 no longer
   re-decodes the full token sequence and leaks the stop text.
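
Items 2 and 3 above can be sketched together: emit the valid prefix before a stop match instead of dropping the chunk, and signal the match with an explicit flag rather than testing `output_text` for truthiness. Function and field names here are assumptions, not the actual SimpleEngine code:

```python
# Sketch of stop-string finalization (names are illustrative).

def finalize_stop(output_text: str, new_text: str, stop_strings: list[str]):
    """Scan new output for a stop string.

    Returns (emit_text, done, stop_trimmed): the text safe to stream,
    whether generation should stop, and an explicit flag for the match.
    """
    combined = output_text + new_text
    for stop in stop_strings:
        idx = combined.find(stop)
        if idx != -1:
            # Item 2: emit the valid prefix of new_text that precedes the
            # stop marker instead of silently dropping the whole chunk.
            emit = combined[len(output_text):idx]
            # Item 3: return an explicit flag. Checking `not output_text`
            # would misfire when the match lands at position 0, since the
            # empty string is falsy.
            return emit, True, True
    return new_text, False, False
```

The explicit `stop_trimmed` flag means a position-0 match is handled identically to any other match, with no re-decode of the full token sequence.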

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
raullenchai pushed a commit that referenced this pull request Mar 21, 2026
- #2: /v1/models now returns both the full HF name and the alias used
  to start the model, so SDK users see a recognizable name
- #3: API responses return the actual loaded model name instead of
  echoing back whatever the client sent (prevents "gpt-4o" confusion)
- #4: SECURITY WARNING downgraded to debug — local inference doesn't
  need auth, the warning was causing unnecessary anxiety for new users
- Pass alias from CLI to server for /v1/models listing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raullenchai added a commit that referenced this pull request Mar 21, 2026
…dels (#41)

* fix: 4 UX friction points from user testing v2

- #2: /v1/models now returns both the full HF name and the alias used
  to start the model, so SDK users see a recognizable name
- #3: API responses return the actual loaded model name instead of
  echoing back whatever the client sent (prevents "gpt-4o" confusion)
- #4: SECURITY WARNING downgraded to debug — local inference doesn't
  need auth, the warning was causing unnecessary anxiety for new users
- Pass alias from CLI to server for /v1/models listing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: non-streaming Anthropic response also returns actual model name

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>