
fix: init crash, JSON corruption, GC leak, cloud routing gaps #11

Merged
raullenchai merged 7 commits into main from fix/p1-review-fixes
Mar 3, 2026

Conversation

@raullenchai
Owner

Summary

5 bugs found during code review, all verified against source:

  • P1: _inject_shared_model crash — MLXLanguageModel.__new__ skips __init__, leaving 6 attributes unset (prefill_step_size, kv_bits, kv_group_size, _prompt_cache, _cached_token_ids, _cache_lock). HybridEngine always uses this path for non-MLLM models, so the first stream_generate call would crash with AttributeError
  • P1: JSON extraction corrupts normal responses — extract_json_from_response ran unconditionally on all non-streaming responses. Any response ending in } or ] (code, prose) would be truncated to just the trailing JSON-like substring. Now guarded by if response_format
  • P1: GC permanently disabled after stream disconnect — gc.disable() ran at generator entry, but gc.enable() only at normal exit. A client disconnect abandons the generator, so GC stays off for the entire process. Now wrapped in try/finally
  • P2: Cloud streaming lacks disconnect guard — Local streaming uses _disconnect_guard, cloud streaming didn't. Now wrapped
  • P2: Cloud routing skips response_format — Cloud kwargs didn't include response_format, so structured output requests routed to cloud would ignore the format constraint. Now forwarded
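
The GC bug is the subtle one. A minimal sketch of the pattern (the generator name is illustrative, not the real endpoint code):

```python
import gc

def stream_tokens(chunks):
    # gc.disable() reduces pause jitter while streaming tokens, but the
    # matching gc.enable() must sit in a finally block: when the client
    # disconnects, the abandoned generator is finalized via GeneratorExit,
    # the finally clause still runs, and GC is never left off process-wide.
    gc.disable()
    try:
        for chunk in chunks:
            yield chunk
    finally:
        gc.enable()
```

Closing the generator early (as happens on client disconnect) still restores GC, which is exactly what the pre-fix code failed to do.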

Test plan

  • 924 tests pass (1 pre-existing failure: test_platform.py requires torch, which is not installed)
  • Syntax validated on both changed files
  • No test assertions changed

🤖 Generated with Claude Code

Your Name and others added 7 commits March 3, 2026 10:57
…cloud gaps

1. SimpleEngine._inject_shared_model: set missing MLXLanguageModel attributes
   (prefill_step_size, kv_bits, kv_group_size, _prompt_cache, _cached_token_ids,
   _cache_lock) that __new__ skips, preventing AttributeError on first generate

2. Non-streaming chat: guard extract_json_from_response with `if response_format`
   so plain text responses aren't corrupted by JSON extraction

3. stream_chat_completion: wrap generator body in try/finally so gc.enable()
   runs even on client disconnect, preventing permanent GC disable

4. Cloud streaming: wrap with _disconnect_guard like local streaming path

5. Cloud routing: forward response_format to cloud provider so structured
   output works consistently regardless of routing decision
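
The attribute injection in item 1 comes down to how Python constructs objects. A simplified stand-in class (the real MLXLanguageModel carries more state) shows why __new__ alone is not enough:

```python
class Model:
    def __init__(self):
        # attributes normally set during construction
        self.prefill_step_size = 512
        self._prompt_cache = {}

# object allocation without initialization: __init__ never runs
shared = Model.__new__(Model)
assert not hasattr(shared, "_prompt_cache")  # would AttributeError later

# the fix: explicitly inject the attributes __init__ would have set
shared.prefill_step_size = 512
shared._prompt_cache = {}
assert shared.prefill_step_size == 512
```

Any attribute added to __init__ later must also be mirrored in the injection path, which is why the bug surfaced as six separate missing attributes.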

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…_choice

Cloud routing was using locally-mutated messages (tool→user conversion,
developer→system normalization, suffix injection) instead of original
OpenAI-format messages. Also forward stop and tool_choice parameters.
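
A minimal illustration of the copy-then-mutate pattern (message contents are made up; the real conversion also handles tool→user rewriting and suffix injection):

```python
original = [
    {"role": "developer", "content": "be terse"},
    {"role": "user", "content": "hi"},
]

# mutate only per-message copies for the local engine
local = [dict(m) for m in original]
for m in local:
    if m["role"] == "developer":
        m["role"] = "system"  # local normalization only

# cloud routing sends `original`, so the provider sees untouched
# OpenAI-format messages
assert original[0]["role"] == "developer"
assert local[0]["role"] == "system"
```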

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PrefixCacheManager.pin_prefix was silently undone by store_cache and
_touch_lru re-adding entries to the LRU. Added a _pinned set to track pinned
entries, ensuring they stay out of LRU. Pinned entries now count toward
capacity to prevent unbounded cache growth.
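
A minimal sketch of the pinning fix, assuming an OrderedDict-backed LRU (the real PrefixCacheManager API differs): pinned keys live in a separate set, are skipped by store, and count toward capacity.

```python
from collections import OrderedDict

class PrefixCache:
    def __init__(self, max_size):
        self.max_size = max_size
        self._lru = OrderedDict()  # evictable entries, oldest first
        self._pinned = set()

    def pin(self, key):
        self._lru.pop(key, None)   # pinned entries leave the LRU entirely
        self._pinned.add(key)

    def unpin(self, key):
        self._pinned.discard(key)
        self._lru[key] = True      # evictable again

    def store(self, key):
        if key in self._pinned:
            return                 # a store must not undo a pin
        self._lru[key] = True
        self._lru.move_to_end(key)
        # pinned entries count toward capacity, bounding total growth
        while len(self._lru) + len(self._pinned) > self.max_size and self._lru:
            self._lru.popitem(last=False)
```

The key design point matches the commit: eviction only ever touches `_lru`, so a pin survives any number of store/touch calls, while the capacity check sums both structures.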

Fixed generate_json/generate_json_object return type from str to str|None
to match actual behavior (returns None on failure).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RateLimiter._requests dict grew unbounded with unique client keys that
stopped making requests. Added periodic purge of stale keys when dict
exceeds 100 entries.
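
A sketch of the stale-key purge, assuming a sliding-window limiter that keeps per-client timestamp lists (names and threshold handling are illustrative):

```python
import time

class RateLimiter:
    PURGE_THRESHOLD = 100

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self._requests = {}  # client_key -> [timestamps]

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        # without this purge, keys for clients that stopped sending
        # requests would accumulate forever
        if len(self._requests) > self.PURGE_THRESHOLD:
            self._purge(now)
        self._requests.setdefault(key, []).append(now)
        return True  # a real limiter would also check the window count

    def _purge(self, now):
        stale = [k for k, ts in self._requests.items()
                 if not ts or now - ts[-1] > self.window]
        for k in stale:
            del self._requests[k]
```

The purge is amortized: it only runs once the dict exceeds the threshold, so the hot path stays a dict lookup plus an append.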

Demoted user message preview logging from INFO to DEBUG to prevent
PII/sensitive content from appearing in production logs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. CloudRouter._build_call_kwargs now forwards response_format to
   litellm so structured output works on cloud-routed requests.

2. _inject_shared_model uses engine config (self._prefill_step_size,
   self._kv_bits, self._kv_group_size) instead of hardcoded defaults.

3. pin_prefix rejects when pinned count reaches max_size, preventing
   capacity from becoming unenforceable.
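
The forwarding pattern in item 1 might look like this (function name and signature are illustrative stand-ins, not the real CloudRouter._build_call_kwargs):

```python
def build_call_kwargs(model, messages, response_format=None,
                      stop=None, tool_choice=None):
    """Illustrative sketch: build provider kwargs without dropping
    optional parameters."""
    kwargs = {"model": model, "messages": messages}
    # These were previously dropped on the cloud path; structured output
    # and stop/tool behavior must survive the routing decision.
    for name, value in (("response_format", response_format),
                        ("stop", stop),
                        ("tool_choice", tool_choice)):
        if value is not None:
            kwargs[name] = value
    return kwargs
```

Only parameters the caller actually set are forwarded, so the cloud provider never receives explicit None values.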

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- test_passes_through_response_format: verifies response_format is
  forwarded through _build_call_kwargs (was silently dropped)
- TestInjectSharedModelConfig: verifies _inject_shared_model propagates
  engine config (prefill_step_size, kv_bits, kv_group_size) instead of
  hardcoded defaults
- TestPrefixCachePinning: verifies pin survives store/touch, capacity
  guard rejects at max_size, unpin restores evictability, clear resets

Also adds docstring note to pin_prefix about capacity policy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- test_rate_limiter_stale_key_purge: verifies stale client keys are
  purged when dict exceeds 100 entries
- TestExtractJsonFromResponse: documents why extract_json_from_response
  must be guarded by `if response_format` — it corrupts plain text
  that ends with balanced braces
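
The corruption mode is easy to reproduce with a naive trailing-JSON extractor (hypothetical; the real extract_json_from_response may differ in detail):

```python
import json

def extract_trailing_json(text):
    # return the earliest suffix that parses as JSON, else the text as-is
    for i in range(len(text)):
        candidate = text[i:].strip()
        if candidate.startswith(("{", "[")):
            try:
                json.loads(candidate)
                return candidate
            except json.JSONDecodeError:
                continue
    return text

prose = 'Here is the config you asked for: {"debug": true}'
# Unguarded, the explanatory prose is silently discarded:
assert extract_trailing_json(prose) == '{"debug": true}'
```

This is why the extractor must only run when the caller actually requested structured output via response_format.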

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@raullenchai raullenchai merged commit 3e6548f into main Mar 3, 2026
@raullenchai raullenchai deleted the fix/p1-review-fixes branch March 3, 2026 19:52