
Support ArraysCache-only hybrid caches (mlx-lm 0.30.6+) #110

Merged
ericcurtin merged 5 commits into vllm-project:main from LxYuan0420:fix/mlx-lm-arrayscache
Feb 22, 2026

Conversation


LxYuan0420 (Collaborator) commented on Feb 21, 2026

This PR:

  • Restores compatibility with mlx-lm>=0.30.6, where MambaCache was removed, by treating hybrid-layer caches as ArraysCache.
  • Keeps batched decode working on older mlx-lm versions that lack ArraysCache.merge/extract by providing local merge/extract helpers for ArraysCache (see the sketch after this list).
  • Bumps the minimum MLX deps to the first versions that support batched KV cache merge/extract (mlx-lm>=0.28.4, mlx>=0.29.2).
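
For illustration, here is a minimal sketch of such local helpers, assuming ArraysCache stores its per-layer state arrays in a .cache list with the batch dimension on axis 0 (as in recent mlx-lm); the helper names are hypothetical, not the PR's actual implementations:

import mlx.core as mx
from mlx_lm.models.cache import ArraysCache


def merge_arrays_caches(caches):
    # Stack per-request state arrays into a single batched ArraysCache.
    # Assumes every cache has the same number of slots and that all
    # slots were filled by a prior prefill pass.
    merged = ArraysCache(len(caches[0].cache))
    for i in range(len(merged.cache)):
        merged.cache[i] = mx.concatenate([c.cache[i] for c in caches], axis=0)
    return merged


def extract_arrays_cache(batch_cache, index):
    # Slice one request's state back out, keeping a batch dimension of 1.
    single = ArraysCache(len(batch_cache.cache))
    for i, arr in enumerate(batch_cache.cache):
        single.cache[i] = arr[index : index + 1]
    return single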

How to test:

ruff check .
mypy vllm_metal/
pytest tests/

Minimal end-to-end smoke test:

vllm serve HuggingFaceTB/SmolLM2-135M-Instruct --port 8000

In another terminal:

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"HuggingFaceTB/SmolLM2-135M-Instruct","messages":[{"role":"user","content":"hi"}],"max_tokens":16}' \
  | python -m json.tool >/dev/null

Optional: Hybrid ArraysCache smoke test (downloads ~500MB once)

import mlx.core as mx
from mlx_lm import load
from mlx_lm.models.cache import make_prompt_cache, KVCache, ArraysCache
import vllm_metal.v1.model_runner as mr

MODEL = "state-spaces/mamba-130m-hf"

model, tok = load(MODEL)


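# Prefill each request separately to build its per-layer prompt cache.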
def prefill(prompt: str):
    token_ids = tok.encode(prompt)
    cache = make_prompt_cache(model)
    _ = model(mx.array([token_ids], dtype=mx.int32), cache=cache)
    return token_ids, cache


(req0_toks, req0_cache) = prefill("Hello")
(req1_toks, req1_cache) = prefill("Hi")

print("KV layers:", sum(isinstance(c, KVCache) for c in req0_cache))
print("Arrays layers:", sum(isinstance(c, ArraysCache) for c in req0_cache))

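# Merge the per-request caches into one batched cache, run a single
# batched decode step, then extract each request's cache back out.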
batch_cache = mr._merge_kv_caches([req0_cache, req1_cache])
batched_input = mx.array([req0_toks[-1], req1_toks[-1]], dtype=mx.int32)[:, None]
out = model(batched_input, cache=batch_cache)
logits = out.logits if hasattr(out, "logits") else out
mx.eval(logits)

_ = mr._extract_kv_cache(batch_cache, 0)
_ = mr._extract_kv_cache(batch_cache, 1)
print("batched forward + extract OK")

"""
output:

KV layers: 30
Arrays layers: 0
batched forward + extract OK
"""

Related: #100

Signed-off-by: Yuan Lik Xun <lxyuan0420@gmail.com>
LxYuan0420 (Collaborator, Author) commented:

Additional note:

mlx-lm>=0.28.4 will typically resolve to the newest available release (e.g. 0.30.6, 0.30.7, …), so that is fine with this fix.

Why this works across versions:

  • We no longer import or reference MambaCache. Hybrid-layer caches are treated as ArraysCache.
  • On mlx-lm<=0.29.x, MambaCache still exists and is a subclass of ArraysCache, so isinstance(cache, ArraysCache) continues to work and our batching path applies.
  • On mlx-lm>=0.30.6, MambaCache is removed and hybrid caches are ArraysCache directly, which is exactly what we handle (verified with mlx-lm==0.30.6); see the sketch below.
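
A minimal sketch of that check, with an illustrative helper name; it works on both version ranges without importing MambaCache:

from mlx_lm.models.cache import ArraysCache


def is_arrays_layer(layer_cache):
    # On mlx-lm<=0.29.x, MambaCache subclasses ArraysCache, so this
    # matches; on mlx-lm>=0.30.6, hybrid layers are ArraysCache directly.
    return isinstance(layer_cache, ArraysCache)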

What we don’t support (by design):

  • mlx-lm==0.28.3 and below, because BatchKVCache.merge/extract don’t exist and we rely on them for batched decode.

Versions set by this PR:

  • vllm-metal (Darwin/arm64): mlx>=0.29.2, mlx-lm>=0.28.4
  • vLLM:
    • pyproject.toml extra: vllm>=0.14.0
    • install.sh pins vllm==0.14.1 (default for installer users); an example install command follows below
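
For reference, a manual install within these ranges might look like this (illustrative; installer users get the install.sh pin automatically):

pip install 'mlx>=0.29.2' 'mlx-lm>=0.28.4' 'vllm==0.14.1'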

ericcurtin merged commit d68b90a into vllm-project:main on Feb 22, 2026
5 checks passed