fix: init crash, JSON corruption, GC leak, cloud routing gaps #11
Merged
raullenchai merged 7 commits into main on Mar 3, 2026
Conversation
…cloud gaps

1. `SimpleEngine._inject_shared_model`: set the missing `MLXLanguageModel` attributes (`prefill_step_size`, `kv_bits`, `kv_group_size`, `_prompt_cache`, `_cached_token_ids`, `_cache_lock`) that `__new__` skips, preventing `AttributeError` on the first generate.
2. Non-streaming chat: guard `extract_json_from_response` with `if response_format` so plain-text responses aren't corrupted by JSON extraction.
3. `stream_chat_completion`: wrap the generator body in `try/finally` so `gc.enable()` runs even on client disconnect, preventing permanently disabled GC.
4. Cloud streaming: wrap with `_disconnect_guard` like the local streaming path.
5. Cloud routing: forward `response_format` to the cloud provider so structured output works consistently regardless of the routing decision.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
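The GC fix in item 3 comes down to a standard `try/finally` pattern inside the generator. A minimal, hypothetical sketch (function and variable names are illustrative, not the project's actual code):

```python
import gc

def stream_chat_completion(chunks):
    """Illustrative sketch of the fix: GC is disabled while streaming, and
    try/finally guarantees gc.enable() runs even if the client disconnects
    and the generator is abandoned (which raises GeneratorExit at the yield).
    """
    gc.disable()
    try:
        for chunk in chunks:
            yield chunk
    finally:
        # Runs on normal exhaustion, on exceptions, and on generator
        # close()/garbage collection, so GC can never stay permanently off.
        gc.enable()
```

Without the `finally`, a disconnect mid-stream would skip `gc.enable()` entirely and leave garbage collection disabled for the lifetime of the process.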
…_choice

Cloud routing was using locally mutated messages (tool→user conversion, developer→system normalization, suffix injection) instead of the original OpenAI-format messages. Also forward the `stop` and `tool_choice` parameters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
`PrefixCacheManager.pin_prefix` was silently undone by `store_cache` and `_touch_lru` re-adding entries to the LRU. Added a `_pinned` set to track pinned entries, ensuring they stay out of the LRU. Pinned entries now count toward capacity to prevent unbounded cache growth.

Fixed the `generate_json`/`generate_json_object` return type from `str` to `str | None` to match actual behavior (returns `None` on failure).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The `RateLimiter._requests` dict grew unbounded with unique client keys that stopped making requests. Added a periodic purge of stale keys when the dict exceeds 100 entries.

Demoted user-message preview logging from INFO to DEBUG to prevent PII/sensitive content from appearing in production logs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. `CloudRouter._build_call_kwargs` now forwards `response_format` to litellm so structured output works on cloud-routed requests.
2. `_inject_shared_model` uses engine config (`self._prefill_step_size`, `self._kv_bits`, `self._kv_group_size`) instead of hardcoded defaults.
3. `pin_prefix` rejects when the pinned count reaches `max_size`, preventing capacity from becoming unenforceable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
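The kwargs-forwarding fix amounts to copying optional OpenAI-style parameters into the provider call only when the client set them. A hypothetical sketch (the function signature and field list are assumptions based on the commits above, not the project's real code):

```python
def build_call_kwargs(request: dict, base_kwargs: dict) -> dict:
    """Forward optional OpenAI-style parameters to the cloud provider.

    Previously response_format (and stop / tool_choice) were silently
    dropped, so structured-output requests routed to the cloud ignored
    the format constraint.
    """
    kwargs = dict(base_kwargs)
    for field in ("response_format", "stop", "tool_choice"):
        value = request.get(field)
        if value is not None:       # only forward fields the client set
            kwargs[field] = value
    return kwargs
```

Only copying non-`None` values matters: passing `response_format=None` explicitly can behave differently from omitting the key with some provider SDKs.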
- `test_passes_through_response_format`: verifies `response_format` is forwarded through `_build_call_kwargs` (it was silently dropped)
- `TestInjectSharedModelConfig`: verifies `_inject_shared_model` propagates engine config (`prefill_step_size`, `kv_bits`, `kv_group_size`) instead of hardcoded defaults
- `TestPrefixCachePinning`: verifies a pin survives store/touch, the capacity guard rejects at `max_size`, unpin restores evictability, and clear resets

Also adds a docstring note to `pin_prefix` about the capacity policy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- `test_rate_limiter_stale_key_purge`: verifies stale client keys are purged when the dict exceeds 100 entries
- `TestExtractJsonFromResponse`: documents why `extract_json_from_response` must be guarded by `if response_format`: it corrupts plain text that ends with balanced braces

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
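To see why the guard is needed, here is a toy stand-in for trailing-brace extraction (not the project's actual extractor) showing how plain text that merely ends with balanced braces gets truncated:

```python
def naive_extract_json(text: str) -> str:
    """Toy stand-in for unguarded JSON extraction: if the text ends with a
    closing brace or bracket, return only the trailing JSON-like substring.
    """
    for open_ch, close_ch in (("{", "}"), ("[", "]")):
        if text.rstrip().endswith(close_ch):
            start = text.rfind(open_ch)
            if start != -1:
                return text[start:].strip()
    return text

# A plain-text answer that happens to end with a dict literal gets cut
# down to just the literal, discarding the actual answer -- hence the
# `if response_format` guard before running extraction at all.
plain = "Use a dict literal, e.g. config = {'retries': 3}"
```

Running extraction only when the client asked for structured output (`response_format` is set) sidesteps the problem entirely for ordinary prose and code responses.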
Summary
5 bugs found during code review, all verified against source:
- `_inject_shared_model` crash: `MLXLanguageModel.__new__` skips `__init__`, leaving 6 attributes missing (`prefill_step_size`, `kv_bits`, `kv_group_size`, `_prompt_cache`, `_cached_token_ids`, `_cache_lock`). HybridEngine always uses this path for non-MLLM models, so the first `stream_generate` call would crash with `AttributeError`.
- JSON corruption: `extract_json_from_response` ran unconditionally on all non-streaming responses. Any response ending in `}` or `]` (code, prose) would get truncated to just the trailing JSON-like substring. Now guarded by `if response_format`.
- GC leak: `gc.disable()` ran at generator entry, with `gc.enable()` only at normal exit. A client disconnect abandons the generator, so GC stayed off for the entire process. Wrapped in `try/finally`.
- Cloud streaming guard: the local streaming path was wrapped in `_disconnect_guard`; cloud streaming wasn't. Now wrapped.
- Missing `response_format`: cloud kwargs didn't include `response_format`, so structured-output requests routed to the cloud would ignore the format constraint. Now forwarded.

Test plan
- `test_platform.py` (missing torch)

🤖 Generated with Claude Code