fix: init crash, JSON corruption, GC leak, cloud routing gaps #11
Merged
raullenchai merged 7 commits into main on Mar 3, 2026
Conversation
…cloud gaps

1. `SimpleEngine._inject_shared_model`: set the missing `MLXLanguageModel` attributes (`prefill_step_size`, `kv_bits`, `kv_group_size`, `_prompt_cache`, `_cached_token_ids`, `_cache_lock`) that `__new__` skips, preventing `AttributeError` on the first generate.
2. Non-streaming chat: guard `extract_json_from_response` with `if response_format` so plain-text responses aren't corrupted by JSON extraction.
3. `stream_chat_completion`: wrap the generator body in `try/finally` so `gc.enable()` runs even on client disconnect, preventing permanently disabled GC.
4. Cloud streaming: wrap with `_disconnect_guard` like the local streaming path.
5. Cloud routing: forward `response_format` to the cloud provider so structured output works consistently regardless of the routing decision.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
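The GC fix in item 3 comes down to a standard `try/finally` pattern inside the generator. A minimal, hypothetical sketch (function and variable names are illustrative, not the project's actual code):

```python
import gc

def stream_chat_completion(chunks):
    """Illustrative sketch of the fix: GC is disabled while streaming, and
    try/finally guarantees gc.enable() runs even if the client disconnects
    and the generator is abandoned (which raises GeneratorExit at the yield).
    """
    gc.disable()
    try:
        for chunk in chunks:
            yield chunk
    finally:
        # Runs on normal exhaustion, on exceptions, and on generator
        # close()/garbage collection, so GC can never stay permanently off.
        gc.enable()
```

Without the `finally`, a disconnect mid-stream would skip `gc.enable()` entirely and leave garbage collection disabled for the lifetime of the process.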
…_choice

Cloud routing was using locally mutated messages (tool→user conversion, developer→system normalization, suffix injection) instead of the original OpenAI-format messages. Also forward the `stop` and `tool_choice` parameters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
`PrefixCacheManager.pin_prefix` was silently undone by `store_cache` and `_touch_lru` re-adding entries to the LRU. Added a `_pinned` set to track pinned entries, ensuring they stay out of the LRU. Pinned entries now count toward capacity to prevent unbounded cache growth.

Fixed the `generate_json`/`generate_json_object` return type from `str` to `str | None` to match actual behavior (returns `None` on failure).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The `RateLimiter._requests` dict grew unbounded with unique client keys that stopped making requests. Added a periodic purge of stale keys when the dict exceeds 100 entries.

Demoted user-message preview logging from INFO to DEBUG to prevent PII/sensitive content from appearing in production logs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. `CloudRouter._build_call_kwargs` now forwards `response_format` to litellm so structured output works on cloud-routed requests.
2. `_inject_shared_model` uses engine config (`self._prefill_step_size`, `self._kv_bits`, `self._kv_group_size`) instead of hardcoded defaults.
3. `pin_prefix` rejects when the pinned count reaches `max_size`, preventing capacity from becoming unenforceable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
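The kwargs-forwarding fix amounts to copying optional OpenAI-style parameters into the provider call only when the client set them. A hypothetical sketch (the function signature and field list are assumptions based on the commits above, not the project's real code):

```python
def build_call_kwargs(request: dict, base_kwargs: dict) -> dict:
    """Forward optional OpenAI-style parameters to the cloud provider.

    Previously response_format (and stop / tool_choice) were silently
    dropped, so structured-output requests routed to the cloud ignored
    the format constraint.
    """
    kwargs = dict(base_kwargs)
    for field in ("response_format", "stop", "tool_choice"):
        value = request.get(field)
        if value is not None:       # only forward fields the client set
            kwargs[field] = value
    return kwargs
```

Only copying non-`None` values matters: passing `response_format=None` explicitly can behave differently from omitting the key with some provider SDKs.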
- `test_passes_through_response_format`: verifies `response_format` is forwarded through `_build_call_kwargs` (it was silently dropped)
- `TestInjectSharedModelConfig`: verifies `_inject_shared_model` propagates engine config (`prefill_step_size`, `kv_bits`, `kv_group_size`) instead of hardcoded defaults
- `TestPrefixCachePinning`: verifies a pin survives store/touch, the capacity guard rejects at `max_size`, unpin restores evictability, and clear resets

Also adds a docstring note to `pin_prefix` about the capacity policy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- `test_rate_limiter_stale_key_purge`: verifies stale client keys are purged when the dict exceeds 100 entries
- `TestExtractJsonFromResponse`: documents why `extract_json_from_response` must be guarded by `if response_format`: it corrupts plain text that ends with balanced braces

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
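To see why the guard is needed, here is a toy stand-in for trailing-brace extraction (not the project's actual extractor) showing how plain text that merely ends with balanced braces gets truncated:

```python
def naive_extract_json(text: str) -> str:
    """Toy stand-in for unguarded JSON extraction: if the text ends with a
    closing brace or bracket, return only the trailing JSON-like substring.
    """
    for open_ch, close_ch in (("{", "}"), ("[", "]")):
        if text.rstrip().endswith(close_ch):
            start = text.rfind(open_ch)
            if start != -1:
                return text[start:].strip()
    return text

# A plain-text answer that happens to end with a dict literal gets cut
# down to just the literal, discarding the actual answer -- hence the
# `if response_format` guard before running extraction at all.
plain = "Use a dict literal, e.g. config = {'retries': 3}"
```

Running extraction only when the client asked for structured output (`response_format` is set) sidesteps the problem entirely for ordinary prose and code responses.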
Summary
5 bugs found during code review, all verified against source:
- `_inject_shared_model` crash: `MLXLanguageModel.__new__` skips `__init__`, leaving 6 attributes missing (`prefill_step_size`, `kv_bits`, `kv_group_size`, `_prompt_cache`, `_cached_token_ids`, `_cache_lock`). HybridEngine always uses this path for non-MLLM models, so the first `stream_generate` call would crash with `AttributeError`.
- JSON corruption: `extract_json_from_response` ran unconditionally on all non-streaming responses. Any response ending in `}` or `]` (code, prose) would get truncated to just the trailing JSON-like substring. Now guarded by `if response_format`.
- GC leak: `gc.disable()` ran at generator entry, with `gc.enable()` only at normal exit. A client disconnect abandons the generator, so GC stayed off for the entire process. Wrapped in `try/finally`.
- Cloud streaming guard: the local streaming path was wrapped in `_disconnect_guard`; cloud streaming wasn't. Now wrapped.
- Missing `response_format`: cloud kwargs didn't include `response_format`, so structured-output requests routed to the cloud would ignore the format constraint. Now forwarded.

Test plan
- `test_platform.py` (missing torch)

🤖 Generated with Claude Code