Merged

69 commits
ddced67
fix: Use streaming detokenizer for UTF-8-safe incremental decode
janhilgard Feb 24, 2026
85bae64
Add --served-model-name CLI parameter
otarkhan Feb 28, 2026
41b4e76
Fix prefix cache dir using served name instead of model path
otarkhan Feb 28, 2026
7ca702d
Add Qwen3.5 model support with text-only loading and fix reasoning+to…
otarkhan Feb 28, 2026
e765db8
fix: check trim method existence before calling
Mar 11, 2026
a445b23
fix(batched): add exclude_none=True to model_dump in image extraction
kol22 Mar 11, 2026
295d690
fix: filter None values from dict() fallback and api/utils.py seriali…
kol22 Mar 12, 2026
8670c38
fix: pass size to ArraysCache in BatchMambaCache for Qwen3.5 hybrid m…
neomody77 Mar 14, 2026
31a3cc5
fix: compatibility with mlx-lm 0.31.x (prompt_checkpoints tuple)
hkstrongside Mar 20, 2026
80c6849
fix(mllm_scheduler): add adaptive periodic cache clearing (#157)
kol22 Mar 20, 2026
b353aab
fix: rename platform.py to vllm_platform.py to avoid stdlib shadowing
dan-j-cooper Mar 20, 2026
0e8ac18
fix: handle video_url content type and fix video frame token counting
patanet7 Mar 10, 2026
cf9a753
feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE
patanet7 Mar 10, 2026
eb56c7d
style: ruff format + lint fixes for new code
patanet7 Mar 10, 2026
92b3556
Fix video native init, import guard, empty source and has_media detec…
waybarrios Mar 12, 2026
f518c07
feat: SpecPrefill — attention-based sparse prefill for TTFT reduction…
Thump604 Mar 21, 2026
d90486e
remove streaming tool fix (covered by #148) and fix eos_token_ids in …
waybarrios Mar 21, 2026
90eac21
Add Qwen3.5 text-only loading and dynamic memory threshold (#127)
waybarrios Mar 21, 2026
7b3f875
fix: address PR #150 review — tool forwarding, kwargs safety, video_g…
patanet7 Mar 21, 2026
913bfd0
fix lint CI to use python 3.13 for black compatibility
waybarrios Mar 21, 2026
0b07872
format engine_core.py long line
waybarrios Mar 21, 2026
6e413f6
resolve merge conflicts with main
waybarrios Mar 21, 2026
c609b59
Merge pull request #125 from otarkhan/feature/served-model-name
waybarrios Mar 21, 2026
c70b80b
fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-…
janhilgard Feb 18, 2026
35c77ec
resolve merge conflicts with main
waybarrios Mar 21, 2026
ede4e30
format test_video.py
waybarrios Mar 21, 2026
2a79216
Merge pull request #150 from patanet7/feat/native-video-support
waybarrios Mar 21, 2026
74c2f02
remove dead code in _load_strict_false
waybarrios Mar 22, 2026
d235c37
Merge pull request #97 from janhilgard/fix/hybrid-model-batching-mtp-…
waybarrios Mar 22, 2026
8dd33e7
Don’t truncate base64 images before hashing.
BelieveDiffusion Mar 22, 2026
d8601d7
fix: bump mlx-lm minimum to 0.31.0 for hybrid model batching
krystophny Mar 25, 2026
5b4042d
merge: sync upstream origin/main — SpecPrefill, native video, MTP inj…
Mar 26, 2026
140958e
fix: alias validation, Hub model MTP routing, non-streaming text path…
Mar 26, 2026
1328d7f
fix: non-streaming text-only MTP deadlock and accumulation bug
Mar 26, 2026
cd08bb2
fix: forward stop sequences to text-only MTP generation path
Mar 26, 2026
4ce9f23
fix: truncate new_text on stop hit so SSE streams omit stop sequence
Mar 26, 2026
63b999a
fix: use self.max_kv_size instead of None in _make_cache call
Mar 27, 2026
38479fe
fix: report prompt_tokens correctly for LLM models in SimpleEngine
sjswerdloff Mar 30, 2026
9c92428
Merge pull request #153 from kol22/fix/batched-engine-exclude-none
waybarrios Mar 31, 2026
f6fb594
Merge pull request #152 from Jah-yee/fix/arrayscache-trim-attributeerror
waybarrios Mar 31, 2026
053f270
format scheduler.py trim checks from PR #152
waybarrios Mar 31, 2026
54b4d65
cleanup: remove redundant fallback tokenization and defensive hasattr…
waybarrios Mar 31, 2026
b64f12c
Merge pull request #236 from sjswerdloff/fix/prompt-token-counting
waybarrios Mar 31, 2026
5f4593b
bump version to 0.2.7
waybarrios Mar 31, 2026
a301766
Merge pull request #206 from BelieveDiffusion/fix/dont-truncate-base6…
waybarrios Mar 31, 2026
4d8c21b
Merge pull request #183 from hkstrongside/fix/mlx-lm-031-scheduler-co…
waybarrios Mar 31, 2026
7b0fc7f
Merge pull request #227 from computor-org/fix/bump-mlx-lm-for-hybrid-…
waybarrios Mar 31, 2026
ecfa8be
format scheduler.py _make_cache call from PR #183
waybarrios Mar 31, 2026
6b22f32
remove unused HAS_MAMBA_CACHE flag
waybarrios Mar 31, 2026
80d1cbf
Merge pull request #160 from neomody77/fix/qwen35-arrayscache-batching
waybarrios Mar 31, 2026
682ec4a
Merge pull request #185 from dan-j-cooper/fix/platform-rename
waybarrios Mar 31, 2026
a19cbac
fix: clean up detokenizer pool in abort, reset, and error recovery paths
waybarrios Mar 31, 2026
0197873
fix: skip stop tokens in mllm_scheduler detokenizer to match schedule…
waybarrios Mar 31, 2026
4ede902
Merge pull request #109 from janhilgard/fix/streaming-utf8-detokenizer
waybarrios Mar 31, 2026
951b8b7
fix: suppress tool call XML from streaming text content (#129)
sjswerdloff Mar 29, 2026
55c61b9
fix: also filter Qwen3 bracket-style tool calls from streaming
sjswerdloff Mar 29, 2026
af632ad
fix: filter all tool call format variants from streaming
sjswerdloff Mar 29, 2026
7c64416
fix: add Llama function format to streaming filter
sjswerdloff Mar 29, 2026
03b81b1
feat: route <think> blocks to Anthropic thinking content blocks
sjswerdloff Mar 29, 2026
030721c
chore: remove uv.lock from PR
sjswerdloff Mar 30, 2026
c516a7d
fix: track prompt_tokens in Anthropic streaming endpoint
sjswerdloff Mar 30, 2026
1cad029
address review: add ThinkRouter tests, integration tests, refactor bl…
sjswerdloff Mar 31, 2026
32dcecd
style: apply black formatting to pass CI lint
sjswerdloff Mar 31, 2026
23222f0
fix: address 3 IMPORTANT items from medical-grade review
sjswerdloff Mar 31, 2026
b4fa030
Merge pull request #232 from sjswerdloff/fix/streaming-tool-call-cont…
Thump604 Apr 1, 2026
d23e393
merge: sync upstream origin/main — streaming filters, Anthropic think…
Apr 3, 2026
2e339b5
fix: missing return in load_model_with_fallback success path
Apr 3, 2026
11b660b
bench: re-run benchmarks post-merge on 4 models
Apr 3, 2026
03b51f8
bench: comprehensive 14-model benchmark + agent integration tests
Apr 3, 2026
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -15,7 +15,7 @@ jobs:
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
python-version: "3.13"

- name: Install dependencies
run: |
7 changes: 4 additions & 3 deletions pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "rapid-mlx"
version = "0.3.12"
version = "0.2.7"
description = "Rapid-MLX — AI inference for Apple Silicon. Drop-in OpenAI API, 2-4x faster than Ollama."
readme = "README.md"
license = {text = "Apache-2.0"}
@@ -30,8 +30,9 @@ classifiers = [
dependencies = [
# Core — these are all you need for `rapid-mlx serve <text-model>`
"mlx>=0.29.0",
"mlx-lm>=0.30.5",
"transformers>=5.0.0",
"mlx-lm>=0.31.0", # 0.31+ required for ArraysCache native batching (hybrid models)
"mlx-vlm>=0.1.0", # VLM support
"transformers>=5.0.0", # mlx-lm 0.30.5+ requires transformers 5.0 (rc3 bug fixed in stable)
"tokenizers>=0.19.0",
"huggingface-hub>=0.23.0",
"numpy>=1.24.0",
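The floor bump to `mlx-lm>=0.31.0` (required for ArraysCache native batching with hybrid models) can also be guarded at runtime. A minimal sketch, assuming plain dotted numeric version strings (no pre-release suffixes); `meets_minimum` and `check_mlx_lm` are hypothetical helpers, not part of this repo:

```python
from importlib.metadata import PackageNotFoundError, version

def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare dotted numeric versions, e.g. '0.31.2' >= '0.31.0'."""
    def parse(v: str) -> tuple[int, ...]:
        return tuple(int(part) for part in v.split("."))
    return parse(installed) >= parse(minimum)

def check_mlx_lm(minimum: str = "0.31.0") -> None:
    """Fail fast with an actionable message instead of a deep batching error."""
    try:
        installed = version("mlx-lm")
    except PackageNotFoundError:
        raise RuntimeError("mlx-lm is not installed; pip install 'mlx-lm>=0.31.0'")
    if not meets_minimum(installed, minimum):
        raise RuntimeError(
            f"mlx-lm {installed} found; >= {minimum} is required for "
            "ArraysCache native batching with hybrid models"
        )
```

Checking once at engine startup surfaces the incompatibility before any hybrid-model request is scheduled.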
34 changes: 34 additions & 0 deletions reports/benchmarks/devstral-24b.json
@@ -0,0 +1,34 @@
[
{
"engine": "Rapid-MLX",
"model": "/Volumes/Extreme SSD/LMStudio-Models/mlx-community/Devstral-Small-2-24B-Instruct-2512-4bit",
"short_decode_tps": {
"mean": 29.546224945131634,
"median": 29.5528121992651,
"min": 29.523008740000495,
"max": 29.562853896129308
},
"short_prefill_tps": {
"median": 78.50722284202406
},
"long_decode_tps": {
"mean": 29.190642870240584,
"median": 29.196905393551695,
"min": 29.17665464544716,
"max": 29.198368571722895
},
"long_prefill_tps": {
"median": 555.7740024656049
},
"ttft_cold_s": 0.3981518749933457,
"ttft_cached_s": 0.17480293699918548,
"multi_turn_ttft_cold_s": 0.5294090829993365,
"multi_turn_ttft_cached_s": 0.1781606875010766,
"peak_ram_mb": 13286.703125,
"tool_call_rate": 0.0,
"recovery_rate": 0,
"leak_rate": 0.0,
"vision": true,
"audio": false
}
]
31 changes: 31 additions & 0 deletions reports/benchmarks/glm45-air.json
@@ -0,0 +1,31 @@
[
{
"engine": "Rapid-MLX",
"model": "/Volumes/Extreme SSD/mlx-models/GLM-4.5-Air-MLX-4bit",
"short_decode_tps": {
"mean": 0.27992552863199427,
"median": 0.27974669915489814,
"min": 0.27946756029371045,
"max": 0.2805623264473743
},
"long_decode_tps": {
"mean": 16.547349065413208,
"median": 0.11041879751955017,
"min": 0.10995964577211123,
"max": 49.421668752947966
},
"long_prefill_tps": {
"median": 720.5478369401173
},
"ttft_cold_s": 0.689568749992759,
"ttft_cached_s": 0.13348579200101085,
"multi_turn_ttft_cold_s": 0.5098579159966903,
"multi_turn_ttft_cached_s": 0.13851270850136643,
"peak_ram_mb": 58026.484375,
"tool_call_rate": 1.0,
"recovery_rate": 1.0,
"leak_rate": 0.0,
"vision": true,
"audio": false
}
]
34 changes: 17 additions & 17 deletions reports/benchmarks/gpt-oss-20b.json
@@ -1,34 +1,34 @@
[
{
"engine": "Rapid-MLX",
"model": "default",
"model": "/Volumes/Extreme SSD/LMStudio-Models/mlx-community/gpt-oss-20b-MXFP4-Q8",
"short_decode_tps": {
"mean": 122.94637623722063,
"median": 122.80723229657188,
"min": 121.5261824468642,
"max": 124.50571396822582
"mean": 59.23904874483382,
"median": 58.460670099788786,
"min": 58.3865845503003,
"max": 60.86989158441239
},
"short_prefill_tps": {
"median": 658.6315031611078
"median": 180.25974764376866
},
"long_decode_tps": {
"mean": 123.56692922034931,
"median": 123.55530128557712,
"min": 123.47945313209522,
"max": 123.6660332433756
"mean": 59.21073424542031,
"median": 59.209885676983895,
"min": 59.014062596876016,
"max": 59.40825446240103
},
"long_prefill_tps": {
"median": 1413.1066416100443
"median": 452.4523802707095
},
"ttft_cold_s": 0.3050392910372466,
"ttft_cached_s": 0.112084332969971,
"multi_turn_ttft_cold_s": 0.3241620830958709,
"multi_turn_ttft_cached_s": 0.11514252098277211,
"peak_ram_mb": 12061.125,
"ttft_cold_s": 0.43270891599240713,
"ttft_cached_s": 1.639991458003351,
"multi_turn_ttft_cold_s": 0.46583316699252464,
"multi_turn_ttft_cached_s": 0.2966201045055641,
"peak_ram_mb": 12314.140625,
"tool_call_rate": 0.0,
"recovery_rate": 0,
"leak_rate": 0.0,
"vision": true,
"vision": false,
"audio": false
}
]
32 changes: 16 additions & 16 deletions reports/benchmarks/hermes3-llama31-8b.json
@@ -1,30 +1,30 @@
[
{
"engine": "Rapid-MLX",
"model": "default",
"model": "/Volumes/Extreme SSD/LMStudio-Models/mlx-community/Hermes-3-Llama-3.1-8B-4bit",
"short_decode_tps": {
"mean": 124.08034651519463,
"median": 124.36739745875063,
"min": 123.39537136101775,
"max": 124.4782707258155
"mean": 123.8564234267494,
"median": 123.42606066686089,
"min": 123.31491031465332,
"max": 124.82829929873398
},
"short_prefill_tps": {
"median": 247.71035750769664
"median": 190.87991603464098
},
"long_decode_tps": {
"mean": 122.80223356011224,
"median": 122.77578368076514,
"min": 122.77411550738096,
"max": 122.85680149219063
"mean": 123.22420766220122,
"median": 122.97184960224764,
"min": 122.68944338901483,
"max": 124.0113299953412
},
"long_prefill_tps": {
"median": 1351.9491427231967
"median": 980.9118874888566
},
"ttft_cold_s": 0.3690061660017818,
"ttft_cached_s": 0.08006683352869004,
"multi_turn_ttft_cold_s": 0.23345162498299032,
"multi_turn_ttft_cached_s": 0.0775428750202991,
"peak_ram_mb": 4711.265625,
"ttft_cold_s": 0.14250754201202653,
"ttft_cached_s": 0.10385847950237803,
"multi_turn_ttft_cold_s": 0.19274274999042973,
"multi_turn_ttft_cached_s": 0.10551808349555358,
"peak_ram_mb": 4940.90625,
"tool_call_rate": 0.0,
"recovery_rate": 0.0,
"leak_rate": 0.0,
32 changes: 16 additions & 16 deletions reports/benchmarks/llama32-3b.json
@@ -1,30 +1,30 @@
[
{
"engine": "Rapid-MLX",
"model": "default",
"model": "/Volumes/Extreme SSD/LMStudio-Models/mlx-community/Llama-3.2-3B-Instruct-4bit",
"short_decode_tps": {
"mean": 225.29973710147297,
"median": 225.3064908515301,
"min": 224.56576334387896,
"max": 226.02695710900986
"mean": 226.52675688833313,
"median": 226.5153487509708,
"min": 226.30568834910332,
"max": 226.75923356492524
},
"short_prefill_tps": {
"median": 684.1153251386287
"median": 475.41882123651436
},
"long_decode_tps": {
"mean": 219.96123540145632,
"median": 219.99973600342884,
"min": 219.82171612576647,
"max": 220.06225407517366
"mean": 220.57800958459168,
"median": 220.6666855788961,
"min": 220.287971451054,
"max": 220.7793717238249
},
"long_prefill_tps": {
"median": 1912.1065873841285
"median": 1328.9680471540132
},
"ttft_cold_s": 0.12779508298262954,
"ttft_cached_s": 0.06659756251610816,
"multi_turn_ttft_cold_s": 0.1073445410002023,
"multi_turn_ttft_cached_s": 0.06532658298965544,
"peak_ram_mb": 2120.53125,
"ttft_cold_s": 0.12346550000074785,
"ttft_cached_s": 0.0960028545014211,
"multi_turn_ttft_cold_s": 0.15037445900088642,
"multi_turn_ttft_cached_s": 0.09493745800136821,
"peak_ram_mb": 2348.03125,
"tool_call_rate": 0.0,
"recovery_rate": 0.0,
"leak_rate": 0.0,
34 changes: 17 additions & 17 deletions reports/benchmarks/minimax-m25.json
@@ -1,33 +1,33 @@
[
{
"engine": "Rapid-MLX",
"model": "default",
"model": "/Volumes/Extreme SSD/mlx-models/MiniMax-M2.5-MLX-4bit",
"short_decode_tps": {
"mean": 51.84681788276677,
"median": 51.86400138987982,
"min": 51.76456896445916,
"max": 51.91188329396134
"mean": 51.67176982233886,
"median": 51.65256027149127,
"min": 51.61833640541185,
"max": 51.74441279011345
},
"short_prefill_tps": {
"median": 373.7141658642875
"median": 137.99610273516524
},
"long_decode_tps": {
"mean": 50.95070780816328,
"median": 50.97072297393578,
"min": 50.88414682013725,
"max": 50.9972536304168
"mean": 51.14445303382068,
"median": 51.20566577490106,
"min": 51.00525329304959,
"max": 51.222440033511404
},
"long_prefill_tps": {
"median": 993.181355104207
"median": 347.7178246515437
},
"ttft_cold_s": 1.1762665420000076,
"ttft_cached_s": 0.13059816650002176,
"multi_turn_ttft_cold_s": 0.49031412499994076,
"multi_turn_ttft_cached_s": 0.13266239600000063,
"peak_ram_mb": 123113.28125,
"ttft_cold_s": 1.5327279580087634,
"ttft_cached_s": 0.47744062499987194,
"multi_turn_ttft_cold_s": 1.049200916007976,
"multi_turn_ttft_cached_s": 0.4468816875014454,
"peak_ram_mb": 123325.296875,
"tool_call_rate": 1.0,
"recovery_rate": 1.0,
"leak_rate": 0.8,
"leak_rate": 0.0,
"vision": true,
"audio": false
}
32 changes: 16 additions & 16 deletions reports/benchmarks/mistral-small-24b.json
@@ -1,30 +1,30 @@
[
{
"engine": "Rapid-MLX",
"model": "default",
"model": "/Volumes/Extreme SSD/LMStudio-Models/lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-MLX-4bit",
"short_decode_tps": {
"mean": 48.45506038907022,
"median": 48.444075696160446,
"min": 48.43399621727914,
"max": 48.487109253771074
"mean": 48.376265707848326,
"median": 48.39510538749555,
"min": 48.28727936639898,
"max": 48.44641236965046
},
"short_prefill_tps": {
"median": 3075.293887102689
"median": 2439.4643799749797
},
"long_decode_tps": {
"mean": 47.902364420568546,
"median": 47.9061503187498,
"min": 47.85985295159887,
"max": 47.941089991356975
"mean": 41.24710965899371,
"median": 47.605138559651195,
"min": 28.290110300102153,
"max": 47.84608011722777
},
"long_prefill_tps": {
"median": 4034.624900255464
"median": 3025.09303781437
},
"ttft_cold_s": 1.1419446660438552,
"ttft_cached_s": 0.1070051045389846,
"multi_turn_ttft_cold_s": 0.392231083009392,
"multi_turn_ttft_cached_s": 0.0995690205018036,
"peak_ram_mb": 13007.984375,
"ttft_cold_s": 1.164492041003541,
"ttft_cached_s": 0.13583050000306685,
"multi_turn_ttft_cold_s": 0.5366168750042561,
"multi_turn_ttft_cached_s": 0.18406710399722215,
"peak_ram_mb": 13272.046875,
"tool_call_rate": 0.0,
"recovery_rate": 0.0,
"leak_rate": 0.0,
34 changes: 34 additions & 0 deletions reports/benchmarks/phi4-mini.json
@@ -0,0 +1,34 @@
[
{
"engine": "Rapid-MLX",
"model": "/Volumes/Extreme SSD/LMStudio-Models/lmstudio-community/Phi-4-mini-reasoning-MLX-4bit",
"short_decode_tps": {
"mean": 174.0167093209013,
"median": 174.0297507031804,
"min": 173.96538436660595,
"max": 174.0549928929175
},
"short_prefill_tps": {
"median": 212.18942802486131
},
"long_decode_tps": {
"mean": 169.99027159980744,
"median": 169.88826976220724,
"min": 169.87693495786576,
"max": 170.20561007934936
},
"long_prefill_tps": {
"median": 840.6429382159101
},
"ttft_cold_s": 0.1561205420002807,
"ttft_cached_s": 0.13479174999520183,
"multi_turn_ttft_cold_s": 0.19009308399108704,
"multi_turn_ttft_cached_s": 0.13267095800256357,
"peak_ram_mb": 2651.3125,
"tool_call_rate": 0.0,
"recovery_rate": 0,
"leak_rate": 1.0,
"vision": true,
"audio": false
}
]
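All of the reports under `reports/benchmarks/` share the one-element JSON array shape shown above, so comparing models is a short script. A sketch under that assumption; `summarize` and `load_reports` are hypothetical helpers, not part of the benchmark harness:

```python
import json
from pathlib import Path

def summarize(report: list[dict]) -> dict:
    """Reduce one benchmark report (a one-element JSON array) to key figures."""
    r = report[0]
    return {
        "model": r["model"].rsplit("/", 1)[-1],  # basename of the model path
        "short_decode_median_tps": r["short_decode_tps"]["median"],
        "long_prefill_median_tps": r["long_prefill_tps"]["median"],
        "ttft_cold_s": r["ttft_cold_s"],
        "peak_ram_gb": r["peak_ram_mb"] / 1024,
    }

def load_reports(directory: str) -> list[dict]:
    """Summarize every *.json report, fastest short-context decode first."""
    rows = [
        summarize(json.loads(p.read_text()))
        for p in sorted(Path(directory).glob("*.json"))
    ]
    return sorted(rows, key=lambda row: row["short_decode_median_tps"], reverse=True)
```

Run against this PR's reports, the ranking would put `llama32-3b` at the top and `glm45-air` at the bottom on short-context decode, with RAM figures normalized to GB for side-by-side reading.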