Merged
94 commits
ddced67
fix: Use streaming detokenizer for UTF-8-safe incremental decode
janhilgard Feb 24, 2026
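The commit above replaces naive per-chunk decoding with a streaming detokenizer. A minimal sketch of the underlying problem, using only the standard library (the function name `stream_decode` is illustrative, not the project's API): a multi-byte UTF-8 character whose bytes straddle a chunk boundary breaks `bytes.decode()` per chunk, while an incremental decoder buffers the incomplete tail until the next chunk arrives.

```python
import codecs

def stream_decode(byte_chunks):
    """Yield text incrementally without splitting multi-byte UTF-8 sequences.

    A per-chunk bytes.decode() raises (or emits U+FFFD) when a character's
    bytes are split across chunks; the incremental decoder holds the
    incomplete tail bytes and completes the character on the next chunk.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in byte_chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# "é" is two bytes (0xC3 0xA9); split it across chunk boundaries:
pieces = list(stream_decode([b"caf", b"\xc3", b"\xa9!"]))
assert "".join(pieces) == "café!"
```

The same principle applies token-by-token: a detokenizer that holds partial byte sequences between steps never emits replacement characters mid-stream.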
85bae64
Add --served-model-name CLI parameter
otarkhan Feb 28, 2026
41b4e76
Fix prefix cache dir using served name instead of model path
otarkhan Feb 28, 2026
7ca702d
Add Qwen3.5 model support with text-only loading and fix reasoning+to…
otarkhan Feb 28, 2026
e765db8
fix: check trim method existence before calling
Mar 11, 2026
a445b23
fix(batched): add exclude_none=True to model_dump in image extraction
kol22 Mar 11, 2026
295d690
fix: filter None values from dict() fallback and api/utils.py seriali…
kol22 Mar 12, 2026
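The two commits above fix `None` values leaking into serialized API payloads. A minimal sketch of the fallback-path fix, assuming nothing about the project's internals (the helper name `drop_none` is hypothetical): it mirrors the effect of Pydantic's `model_dump(exclude_none=True)` for plain-dict fallbacks.

```python
def drop_none(obj):
    """Recursively remove None values from dicts and lists, mirroring
    Pydantic's model_dump(exclude_none=True) for plain-dict fallbacks."""
    if isinstance(obj, dict):
        return {k: drop_none(v) for k, v in obj.items() if v is not None}
    if isinstance(obj, list):
        return [drop_none(v) for v in obj if v is not None]
    return obj

msg = {
    "role": "user",
    "content": [{"type": "text", "text": "hi", "image_url": None}],
    "name": None,
}
assert drop_none(msg) == {
    "role": "user",
    "content": [{"type": "text", "text": "hi"}],
}
```

Downstream parsers that iterate message content by key (e.g. image extraction) then never trip over explicit `None` entries.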
8670c38
fix: pass size to ArraysCache in BatchMambaCache for Qwen3.5 hybrid m…
neomody77 Mar 14, 2026
31a3cc5
fix: compatibility with mlx-lm 0.31.x (prompt_checkpoints tuple)
hkstrongside Mar 20, 2026
80c6849
fix(mllm_scheduler): add adaptive periodic cache clearing (#157)
kol22 Mar 20, 2026
b353aab
fix: rename platform.py to vllm_platform.py to avoid stdlib shadowing
dan-j-cooper Mar 20, 2026
0e8ac18
fix: handle video_url content type and fix video frame token counting
patanet7 Mar 10, 2026
cf9a753
feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE
patanet7 Mar 10, 2026
eb56c7d
style: ruff format + lint fixes for new code
patanet7 Mar 10, 2026
92b3556
Fix video native init, import guard, empty source and has_media detec…
waybarrios Mar 12, 2026
f518c07
feat: SpecPrefill — attention-based sparse prefill for TTFT reduction…
Thump604 Mar 21, 2026
d90486e
remove streaming tool fix (covered by #148) and fix eos_token_ids in …
waybarrios Mar 21, 2026
90eac21
Add Qwen3.5 text-only loading and dynamic memory threshold (#127)
waybarrios Mar 21, 2026
7b3f875
fix: address PR #150 review — tool forwarding, kwargs safety, video_g…
patanet7 Mar 21, 2026
913bfd0
fix lint CI to use python 3.13 for black compatibility
waybarrios Mar 21, 2026
0b07872
format engine_core.py long line
waybarrios Mar 21, 2026
6e413f6
resolve merge conflicts with main
waybarrios Mar 21, 2026
c609b59
Merge pull request #125 from otarkhan/feature/served-model-name
waybarrios Mar 21, 2026
c70b80b
fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-…
janhilgard Feb 18, 2026
35c77ec
resolve merge conflicts with main
waybarrios Mar 21, 2026
ede4e30
format test_video.py
waybarrios Mar 21, 2026
2a79216
Merge pull request #150 from patanet7/feat/native-video-support
waybarrios Mar 21, 2026
74c2f02
remove dead code in _load_strict_false
waybarrios Mar 22, 2026
d235c37
Merge pull request #97 from janhilgard/fix/hybrid-model-batching-mtp-…
waybarrios Mar 22, 2026
8dd33e7
Don’t truncate base64 images before hashing.
BelieveDiffusion Mar 22, 2026
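The commit above is about cache-key correctness. A sketch of why truncating base64 payloads before hashing is dangerous (the function name `image_cache_key` is illustrative): two distinct images often share a long common prefix (same format header, same dimensions), so a prefix hash silently collides.

```python
import hashlib

def image_cache_key(b64_data: str) -> str:
    """Hash the FULL base64 payload; hashing a truncated prefix makes
    distinct images with a shared header collide in the cache."""
    return hashlib.sha256(b64_data.encode("ascii")).hexdigest()

# Two payloads sharing a long common prefix but differing at the tail:
a = "iVBORw0KGgo" * 1000 + "AAAA"
b = "iVBORw0KGgo" * 1000 + "BBBB"
assert image_cache_key(a) != image_cache_key(b)
# A truncated-prefix hash would wrongly treat them as the same image:
assert hashlib.sha256(a[:256].encode()).hexdigest() == \
       hashlib.sha256(b[:256].encode()).hexdigest()
```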
d8601d7
fix: bump mlx-lm minimum to 0.31.0 for hybrid model batching
krystophny Mar 25, 2026
5b4042d
merge: sync upstream origin/main — SpecPrefill, native video, MTP inj…
Mar 26, 2026
140958e
fix: alias validation, Hub model MTP routing, non-streaming text path…
Mar 26, 2026
1328d7f
fix: non-streaming text-only MTP deadlock and accumulation bug
Mar 26, 2026
cd08bb2
fix: forward stop sequences to text-only MTP generation path
Mar 26, 2026
4ce9f23
fix: truncate new_text on stop hit so SSE streams omit stop sequence
Mar 26, 2026
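The two commits above deal with stop-sequence handling in the streaming path. A minimal sketch of the truncation logic, under the assumption (not confirmed by the source) that the engine tracks already-streamed text so a stop sequence straddling two chunks is still caught:

```python
def truncate_at_stop(streamed: str, new_text: str, stop_sequences):
    """Return (text_to_emit, hit): the slice of new_text safe to stream,
    cut at the first stop sequence so SSE output never leaks it.
    `streamed` is the text already sent, needed when a stop sequence
    spans the boundary between the previous chunk and this one."""
    full = streamed + new_text
    cut = len(full)
    for stop in stop_sequences:
        idx = full.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    # emit only what lies past the already-streamed prefix, up to the cut
    return full[len(streamed):cut], cut < len(full)

text, hit = truncate_at_stop("Hello, wor", "ld\n\nUser:", ["\n\nUser:"])
assert text == "ld" and hit
```

On `hit`, the caller stops generation and closes the stream without emitting the stop sequence itself.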
63b999a
fix: use self.max_kv_size instead of None in _make_cache call
Mar 27, 2026
38479fe
fix: report prompt_tokens correctly for LLM models in SimpleEngine
sjswerdloff Mar 30, 2026
9c92428
Merge pull request #153 from kol22/fix/batched-engine-exclude-none
waybarrios Mar 31, 2026
f6fb594
Merge pull request #152 from Jah-yee/fix/arrayscache-trim-attributeerror
waybarrios Mar 31, 2026
053f270
format scheduler.py trim checks from PR #152
waybarrios Mar 31, 2026
54b4d65
cleanup: remove redundant fallback tokenization and defensive hasattr…
waybarrios Mar 31, 2026
b64f12c
Merge pull request #236 from sjswerdloff/fix/prompt-token-counting
waybarrios Mar 31, 2026
5f4593b
bump version to 0.2.7
waybarrios Mar 31, 2026
a301766
Merge pull request #206 from BelieveDiffusion/fix/dont-truncate-base6…
waybarrios Mar 31, 2026
4d8c21b
Merge pull request #183 from hkstrongside/fix/mlx-lm-031-scheduler-co…
waybarrios Mar 31, 2026
7b0fc7f
Merge pull request #227 from computor-org/fix/bump-mlx-lm-for-hybrid-…
waybarrios Mar 31, 2026
ecfa8be
format scheduler.py _make_cache call from PR #183
waybarrios Mar 31, 2026
6b22f32
remove unused HAS_MAMBA_CACHE flag
waybarrios Mar 31, 2026
80d1cbf
Merge pull request #160 from neomody77/fix/qwen35-arrayscache-batching
waybarrios Mar 31, 2026
682ec4a
Merge pull request #185 from dan-j-cooper/fix/platform-rename
waybarrios Mar 31, 2026
a19cbac
fix: clean up detokenizer pool in abort, reset, and error recovery paths
waybarrios Mar 31, 2026
0197873
fix: skip stop tokens in mllm_scheduler detokenizer to match schedule…
waybarrios Mar 31, 2026
4ede902
Merge pull request #109 from janhilgard/fix/streaming-utf8-detokenizer
waybarrios Mar 31, 2026
951b8b7
fix: suppress tool call XML from streaming text content (#129)
sjswerdloff Mar 29, 2026
55c61b9
fix: also filter Qwen3 bracket-style tool calls from streaming
sjswerdloff Mar 29, 2026
af632ad
fix: filter all tool call format variants from streaming
sjswerdloff Mar 29, 2026
7c64416
fix: add Llama function format to streaming filter
sjswerdloff Mar 29, 2026
03b81b1
feat: route <think> blocks to Anthropic thinking content blocks
sjswerdloff Mar 29, 2026
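The commit above routes `<think>` spans into Anthropic-style thinking content blocks. A non-streaming sketch of the split (the streaming version must additionally buffer partial tags across chunk boundaries; block shapes follow the Anthropic Messages content-block convention):

```python
import re

def split_think_blocks(text: str):
    """Split model output into content blocks: <think>...</think> spans
    become 'thinking' blocks, everything else becomes 'text' blocks."""
    blocks = []
    pos = 0
    for m in re.finditer(r"<think>(.*?)</think>", text, flags=re.DOTALL):
        if m.start() > pos:
            blocks.append({"type": "text", "text": text[pos:m.start()]})
        blocks.append({"type": "thinking", "thinking": m.group(1)})
        pos = m.end()
    if pos < len(text):
        blocks.append({"type": "text", "text": text[pos:]})
    return blocks

out = split_think_blocks("<think>plan the answer</think>The answer is 4.")
assert out == [
    {"type": "thinking", "thinking": "plan the answer"},
    {"type": "text", "text": "The answer is 4."},
]
```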
030721c
chore: remove uv.lock from PR
sjswerdloff Mar 30, 2026
c516a7d
fix: track prompt_tokens in Anthropic streaming endpoint
sjswerdloff Mar 30, 2026
1cad029
address review: add ThinkRouter tests, integration tests, refactor bl…
sjswerdloff Mar 31, 2026
32dcecd
style: apply black formatting to pass CI lint
sjswerdloff Mar 31, 2026
23222f0
fix: address 3 IMPORTANT items from medical-grade review
sjswerdloff Mar 31, 2026
b4fa030
Merge pull request #232 from sjswerdloff/fix/streaming-tool-call-cont…
Thump604 Apr 1, 2026
d23e393
merge: sync upstream origin/main — streaming filters, Anthropic think…
Apr 3, 2026
2e339b5
fix: missing return in load_model_with_fallback success path
Apr 3, 2026
11b660b
bench: re-run benchmarks post-merge on 4 models
Apr 3, 2026
03b51f8
bench: comprehensive 14-model benchmark + agent integration tests
Apr 3, 2026
3fae3b3
feat: add Gemma 4 sanitize monkey-patch for mlx-vlm weight loading
Apr 4, 2026
d2f7c05
revert: remove Gemma 4 monkey-patch, use mlx-vlm 0.4.3 as-is
Apr 4, 2026
64da00d
fix: MLLM streaming crash — build_prompt not supported for VLM models
Apr 4, 2026
6b0d1c8
fix: graceful handling of MLLM stream_chat cleanup errors
Apr 4, 2026
2ecb9a0
deps: bump mlx-vlm minimum to 0.4.4 for Gemma 4 support
Apr 5, 2026
874d1b9
feat: add Gemma 4 tool call parser
Apr 5, 2026
12fc35e
fix: add global exception handler + MLLM error logging for prod resil…
Apr 5, 2026
e0c0ecf
feat: Gemma 4 text-only LLM path — prompt cache + all optimizations
Apr 5, 2026
c1a1713
bench: Gemma 4 31B — LLM path 5.2x faster than MLLM path
Apr 5, 2026
83483ab
fix: mixed quantization support for Gemma 4 E4B/E2B models
Apr 5, 2026
ac36c3f
fix: strip Gemma 4 thinking tags and turn markers from output
Apr 5, 2026
b4a0d5b
fix: mixed quant path matching + filter override keys
Apr 5, 2026
8944fe3
docs: add Gemma 4 benchmarks to README
Apr 6, 2026
176b866
feat: Gemma 4 reasoning parser — streaming thought/content separation
Apr 6, 2026
372d6f8
feat: token-level OutputRouter — config-driven output channel routing
Apr 6, 2026
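The commit above introduces a token-level OutputRouter. A minimal sketch of config-driven channel routing, with illustrative marker tokens and channel names (the real router's configuration format is not shown in this PR): marker tokens switch the active channel and are suppressed; all other tokens accumulate into the current channel.

```python
class OutputRouter:
    """Token-level channel routing sketch: marker tokens switch the active
    output channel and are never emitted; ordinary tokens accumulate into
    whichever channel is currently active."""

    def __init__(self, markers=None):
        # marker token -> channel it opens (illustrative defaults)
        self.markers = markers or {"<think>": "thinking", "</think>": "content"}
        self.channel = "content"
        self.out = {"thinking": [], "content": []}

    def feed(self, token: str):
        if token in self.markers:
            self.channel = self.markers[token]  # switch; suppress the marker
        else:
            self.out[self.channel].append(token)

router = OutputRouter()
for tok in ["<think>", "step ", "1", "</think>", "Answer: ", "4"]:
    router.feed(tok)
assert "".join(router.out["thinking"]) == "step 1"
assert "".join(router.out["content"]) == "Answer: 4"
```

Operating on tokens rather than accumulated text is what lets the router run inside the streaming pipeline without re-scanning the whole output on every step.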
c0eec6e
feat: integrate OutputRouter into streaming pipeline
Apr 6, 2026
ac02bbc
fix: 6 issues from self-review on OutputRouter integration
Apr 6, 2026
c719eb0
docs: update Gemma 4 benchmarks — full lineup with OutputRouter
Apr 6, 2026
46800dc
release: v0.4.0 — Gemma 4 day-0 support + token-level OutputRouter
Apr 6, 2026
4490b79
fix: suppress orphan tool/response tokens in OutputRouter
Apr 6, 2026
a714411
fix: recover text-format tool calls from degraded Gemma 4 output
Apr 6, 2026
54ac43e
feat: final sanitizer — last-mile catch-all against markup leakage
Apr 6, 2026
5fb31c4
fix: 3 bugs from self-review on final sanitizer
Apr 6, 2026
14e75d2
bench: final Gemma 4 verification — 0% leak, 6/6 agent tests, all models
Apr 6, 2026
6cb0f38
Merge remote-tracking branch 'raullenchai/main' into feat/upstream-sy…
Apr 6, 2026
15 changes: 12 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
@@ -299,11 +299,14 @@ All 17 parsers include automatic recovery — if a quantized model outputs broke
| **GPT-OSS 20B** | **127** tok/s · 100% tools | 79 (mlx-lm serve) | **1.6x** |
| **Qwen3.5-9B** | **108** tok/s | 46 (Ollama) | **2.3x** |
| **Kimi-Linear-48B** | **94** tok/s · 100% tools | — (only engine) | — |
| 🆕 **Gemma 4 26B-A4B** | **94** tok/s · 100% tools | — (day-0, only engine) | — |
| 🆕 **Gemma 4 E4B** | **83** tok/s · 100% tools | — (day-0, only engine) | — |
| **Qwen3.5-35B-A3B** | **83** tok/s · 100% tools | 75 (oMLX) | **1.1x** |
| **Qwen3-Coder 80B** | **74** tok/s · 100% tools | 69 (mlx-lm serve) | **1.1x** |
| **Qwen3.5-122B** | **44** tok/s · 100% tools | 43 (mlx-lm serve) | ~1.0x |
| 🆕 **Gemma 4 31B** | **31** tok/s · 100% tools | 10.9 (mlx-vlm bf16) | **2.8x** |

*Full benchmark data with all 18 models, TTFT tables, DeltaNet snapshots, and engine comparison below.*
*Full benchmark data with all models, TTFT tables, DeltaNet snapshots, and engine comparison below.*

<details>
<summary><strong>TTFT — Prompt Cache Advantage</strong></summary>
@@ -325,6 +328,9 @@ Prompt cache keeps multi-turn conversations fast. For standard transformers, KV
| Qwen3-Coder-Next 80B | **0.16s** | 0.27s | 1.7x |
| GPT-OSS 20B | **0.16s** | 0.27s | 1.7x |
| Qwen3.5-9B | **0.22s** | 0.26s | 1.2x |
| 🆕 Gemma 4 E4B | **0.25s** | — (day-0) | — |
| 🆕 Gemma 4 26B-A4B | **0.25s** | — (day-0) | — |
| 🆕 Gemma 4 31B | **0.34s** | 0.57s (mlx-vlm bf16) | **1.7x** |

**DeltaNet state snapshots (hybrid RNN + attention):**

@@ -368,7 +374,7 @@ Qwen3.5 uses Gated DeltaNet (75% RNN) + full attention (25% KV). Other engines r
| **DeltaNet state snapshots** | Deep-copy RNN state at prefix boundary, restore in ~0.1ms | Qwen3.5 (4B, 9B, 27B, 35B, 122B), Qwen3-Coder-Next |
| **Hybrid cache sync** | Keep trimmable KV + non-trimmable RNN layers in sync | Qwen3.5 (Gated DeltaNet + attention) |
| **Tool logits bias** | Jump-forward decoding — bias logits toward structured tokens | All models with `--enable-tool-logits-bias` |
| **Auto tool recovery** | Detect broken text-format tool calls, convert to structured | All 17 parser formats |
| **Auto tool recovery** | Detect broken text-format tool calls, convert to structured | All 18 parser formats (incl. Gemma 4) |
| **Speculative decoding** | Draft model generates candidates, main model verifies | Any model + `--draft-model` |
| **KV quantization** | 4/8-bit KV cache for longer contexts in less memory | All models with `--kv-bits` |
| **Prefill chunking** | Configurable step size for large-prompt throughput | All models |
@@ -379,10 +385,13 @@ Qwen3.5 uses Gated DeltaNet (75% RNN) + full attention (25% KV). Other engines r
<details>
<summary><strong>Eval benchmarks (17 models, 4 suites)</strong></summary>

17 models across tool calling (30 scenarios), coding (HumanEval+), reasoning (MATH-500), and general knowledge (MMLU-Pro). All with `enable_thinking: false` on M3 Ultra.
19 models across tool calling (30 scenarios), coding (HumanEval+), reasoning (MATH-500), and general knowledge (MMLU-Pro). All with `enable_thinking: false` on M3 Ultra. 🆕 = Gemma 4 (day-0 support).

| Model | Quant | RAM | Decode | Tools | Code | Reason | General | Avg |
|-------|-------|-----|--------|-------|------|--------|---------|-----|
| 🆕 Gemma 4 26B-A4B | 4bit | 14.4 GB | 94 t/s | **100%** | — | — | — | — |
| 🆕 Gemma 4 E4B | 4bit | 6.4 GB | 83 t/s | **100%** | — | — | — | — |
| 🆕 Gemma 4 31B | 4bit | 17.0 GB | 31 t/s | **100%** | — | — | — | — |
| Qwen3.5-122B-A10B | 8bit | 129.8 GB | 44 t/s | 87% | **90%** | **90%** | **90%** | **89%** |
| Qwen3.5-122B-A10B | mxfp4 | 65.0 GB | 57 t/s | **90%** | **90%** | 80% | **90%** | 88% |
| Qwen3.5-35B-A3B | 8bit | 36.9 GB | 83 t/s | **90%** | **90%** | 80% | 80% | 85% |
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "rapid-mlx"
version = "0.2.7"
version = "0.4.0"
description = "Rapid-MLX — AI inference for Apple Silicon. Drop-in OpenAI API, 2-4x faster than Ollama."
readme = "README.md"
license = {text = "Apache-2.0"}
@@ -31,7 +31,7 @@ dependencies = [
# Core — these are all you need for `rapid-mlx serve <text-model>`
"mlx>=0.29.0",
"mlx-lm>=0.31.0", # 0.31+ required for ArraysCache native batching (hybrid models)
"mlx-vlm>=0.1.0", # VLM support
"mlx-vlm>=0.4.4", # 0.4.4+ required for Gemma 4 support
"transformers>=5.0.0", # mlx-lm 0.30.5+ requires transformers 5.0 (rc3 bug fixed in stable)
"tokenizers>=0.19.0",
"huggingface-hub>=0.23.0",
34 changes: 34 additions & 0 deletions reports/benchmarks/gemma4-26b-a4b-4bit.json
@@ -0,0 +1,34 @@
[
{
"engine": "Rapid-MLX",
"model": "/Volumes/Extreme SSD/mlx-models/gemma-4-26b-a4b-it-4bit",
"short_decode_tps": {
"mean": 91.99284441095028,
"median": 91.9997184259893,
"min": 91.90088449538918,
"max": 92.07793031147239
},
"short_prefill_tps": {
"median": 104.6567741055699
},
"long_decode_tps": {
"mean": 90.20737344249109,
"median": 90.15377328174793,
"min": 90.10046743781253,
"max": 90.36787960791281
},
"long_prefill_tps": {
"median": 404.29261432785694
},
"ttft_cold_s": 0.6939874159870669,
"ttft_cached_s": 0.25729072900139727,
"multi_turn_ttft_cold_s": 0.3698056659777649,
"multi_turn_ttft_cached_s": 0.257844791514799,
"peak_ram_mb": 14697.5,
"tool_call_rate": 1.0,
"recovery_rate": 1.0,
"leak_rate": 0.0,
"vision": true,
"audio": false
}
]
34 changes: 34 additions & 0 deletions reports/benchmarks/gemma4-31b-4bit.json
@@ -0,0 +1,34 @@
[
{
"engine": "Rapid-MLX",
"model": "/Volumes/Extreme SSD/mlx-models/gemma-4-31b-it-4bit-local",
"short_decode_tps": {
"mean": 30.658969382957626,
"median": 30.650768843603235,
"min": 30.636982911649486,
"max": 30.68915639362016
},
"short_prefill_tps": {
"median": 77.82051641509116
},
"long_decode_tps": {
"mean": 29.81875147521854,
"median": 29.834616923274258,
"min": 29.772306347596366,
"max": 29.849331154785002
},
"long_prefill_tps": {
"median": 318.28190673479276
},
"ttft_cold_s": 9.772502000036184,
"ttft_cached_s": 0.34381089551607147,
"multi_turn_ttft_cold_s": 0.7450880000251345,
"multi_turn_ttft_cached_s": 0.34492891700938344,
"peak_ram_mb": 17363.453125,
"tool_call_rate": 1.0,
"recovery_rate": 1.0,
"leak_rate": 0.0,
"vision": true,
"audio": false
}
]
34 changes: 34 additions & 0 deletions reports/benchmarks/gemma4-31b-bf16-mllm.json
@@ -0,0 +1,34 @@
[
{
"engine": "Rapid-MLX",
"model": "/Volumes/Extreme SSD/mlx-models/gemma-4-31b-it-bf16",
"short_decode_tps": {
"mean": 7.684495219859486,
"median": 7.685015108337882,
"min": 7.683350416504045,
"max": 7.685120134736532
},
"short_prefill_tps": {
"median": 49.61073354493354
},
"long_decode_tps": {
"mean": 6.150148014216069,
"median": 6.149420465554755,
"min": 6.148029410337342,
"max": 6.152994166756111
},
"long_prefill_tps": {
"median": 130.33556741428563
},
"ttft_cold_s": 0.8671420829778071,
"ttft_cached_s": 0.503123354021227,
"multi_turn_ttft_cold_s": 0.878063625015784,
"multi_turn_ttft_cached_s": 0.8742528125003446,
"peak_ram_mb": 60796.328125,
"tool_call_rate": 1.0,
"recovery_rate": 1.0,
"leak_rate": 0.0,
"vision": false,
"audio": false
}
]
34 changes: 34 additions & 0 deletions reports/benchmarks/gemma4-31b-bf16.json
@@ -0,0 +1,34 @@
[
{
"engine": "Rapid-MLX",
"model": "/Volumes/Extreme SSD/mlx-models/gemma-4-31b-it-bf16",
"short_decode_tps": {
"mean": 10.877661903575952,
"median": 10.881409537294747,
"min": 10.8682413908779,
"max": 10.883334782555206
},
"short_prefill_tps": {
"median": 46.99511568078357
},
"long_decode_tps": {
"mean": 10.730247908489643,
"median": 10.733271737703564,
"min": 10.722421680460178,
"max": 10.735050307305189
},
"long_prefill_tps": {
"median": 186.58741680810584
},
"ttft_cold_s": 76.44581962499069,
"ttft_cached_s": 0.5739832909894176,
"multi_turn_ttft_cold_s": 1.105832208006177,
"multi_turn_ttft_cached_s": 0.5784412914945278,
"peak_ram_mb": 59444.0625,
"tool_call_rate": 1.0,
"recovery_rate": 1.0,
"leak_rate": 0.0,
"vision": true,
"audio": false
}
]
34 changes: 34 additions & 0 deletions reports/benchmarks/gemma4-e4b-4bit.json
@@ -0,0 +1,34 @@
[
{
"engine": "Rapid-MLX",
"model": "/Volumes/Extreme SSD/mlx-models/gemma-4-e4b-it-4bit-local",
"short_decode_tps": {
"mean": 82.22621304400961,
"median": 82.2157561516956,
"min": 82.17578086740563,
"max": 82.28710211292758
},
"short_prefill_tps": {
"median": 101.84400488869173
},
"long_decode_tps": {
"mean": 79.74346950172897,
"median": 80.09642988741999,
"min": 78.86214671758383,
"max": 80.27183190018309
},
"long_prefill_tps": {
"median": 349.3339133508353
},
"ttft_cold_s": 2.396504874981474,
"ttft_cached_s": 0.2615705000353046,
"multi_turn_ttft_cold_s": 0.3181427090312354,
"multi_turn_ttft_cached_s": 0.25800837500719354,
"peak_ram_mb": 6519.265625,
"tool_call_rate": 1.0,
"recovery_rate": 1.0,
"leak_rate": 0.0,
"vision": true,
"audio": false
}
]