
Support ArraysCache-only hybrid caches (mlx-lm 0.30.6+) #110

Merged
ericcurtin merged 5 commits into vllm-project:main from LxYuan0420:fix/mlx-lm-arrayscache
Feb 22, 2026

Conversation


LxYuan0420 (Collaborator) commented on Feb 21, 2026

This PR:

  • Restores compatibility with mlx-lm>=0.30.6, where MambaCache was removed, by treating hybrid-layer caches as ArraysCache.
  • Keeps batched decode working on older mlx-lm versions that lack ArraysCache.merge/extract by providing local merge/extract helpers for ArraysCache (see the sketch after this list).
  • Bumps the minimum MLX deps to the first versions that support batched KV cache merge/extract (mlx-lm>=0.28.4, mlx>=0.29.2).
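
For illustration, here is a minimal sketch of such local helpers, assuming ArraysCache stores its per-layer state arrays in a .cache list with the batch dimension on axis 0 (as in recent mlx-lm); the helper names are hypothetical, not the PR's actual implementations:

import mlx.core as mx
from mlx_lm.models.cache import ArraysCache


def merge_arrays_caches(caches):
    # Stack per-request state arrays into a single batched ArraysCache.
    # Assumes every cache has the same number of slots and that all
    # slots were filled by a prior prefill pass.
    merged = ArraysCache(len(caches[0].cache))
    for i in range(len(merged.cache)):
        merged.cache[i] = mx.concatenate([c.cache[i] for c in caches], axis=0)
    return merged


def extract_arrays_cache(batch_cache, index):
    # Slice one request's state back out, keeping a batch dimension of 1.
    single = ArraysCache(len(batch_cache.cache))
    for i, arr in enumerate(batch_cache.cache):
        single.cache[i] = arr[index : index + 1]
    return single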

How to test:

ruff check .
mypy vllm_metal/
pytest tests/

Minimal end-to-end smoke test:

vllm serve HuggingFaceTB/SmolLM2-135M-Instruct --port 8000

In another terminal:

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"HuggingFaceTB/SmolLM2-135M-Instruct","messages":[{"role":"user","content":"hi"}],"max_tokens":16}' \
  | python -m json.tool >/dev/null

Optional: Hybrid ArraysCache smoke test (downloads ~500MB once)

import mlx.core as mx
from mlx_lm import load
from mlx_lm.models.cache import make_prompt_cache, KVCache, ArraysCache
import vllm_metal.v1.model_runner as mr

MODEL = "state-spaces/mamba-130m-hf"

model, tok = load(MODEL)


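# Prefill each request separately to build its per-layer prompt cache.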
def prefill(prompt: str):
    token_ids = tok.encode(prompt)
    cache = make_prompt_cache(model)
    _ = model(mx.array([token_ids], dtype=mx.int32), cache=cache)
    return token_ids, cache


(req0_toks, req0_cache) = prefill("Hello")
(req1_toks, req1_cache) = prefill("Hi")

print("KV layers:", sum(isinstance(c, KVCache) for c in req0_cache))
print("Arrays layers:", sum(isinstance(c, ArraysCache) for c in req0_cache))

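# Merge the per-request caches into one batched cache, run a single
# batched decode step, then extract each request's cache back out.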
batch_cache = mr._merge_kv_caches([req0_cache, req1_cache])
batched_input = mx.array([req0_toks[-1], req1_toks[-1]], dtype=mx.int32)[:, None]
out = model(batched_input, cache=batch_cache)
logits = out.logits if hasattr(out, "logits") else out
mx.eval(logits)

_ = mr._extract_kv_cache(batch_cache, 0)
_ = mr._extract_kv_cache(batch_cache, 1)
print("batched forward + extract OK")

"""
output:

KV layers: 30
Arrays layers: 0
batched forward + extract OK
"""

Related: #100

Signed-off-by: Yuan Lik Xun <lxyuan0420@gmail.com>
LxYuan0420 (Collaborator, Author) commented:

Additional note:

mlx-lm>=0.28.4 will typically resolve to the newest available release (e.g. 0.30.6, 0.30.7, …), so that is fine with this fix.

Why this works across versions:

  • We no longer import or reference MambaCache. Hybrid-layer caches are treated as ArraysCache.
  • On mlx-lm<=0.29.x, MambaCache still exists and is a subclass of ArraysCache, so isinstance(cache, ArraysCache) continues to work and our batching path applies.
  • On mlx-lm>=0.30.6, MambaCache is removed and hybrid caches are ArraysCache directly, which is exactly what we handle (verified with mlx-lm==0.30.6); see the sketch below.
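
A minimal sketch of that check, with an illustrative helper name; it works on both version ranges without importing MambaCache:

from mlx_lm.models.cache import ArraysCache


def is_arrays_layer(layer_cache):
    # On mlx-lm<=0.29.x, MambaCache subclasses ArraysCache, so this
    # matches; on mlx-lm>=0.30.6, hybrid layers are ArraysCache directly.
    return isinstance(layer_cache, ArraysCache)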

What we don’t support (by design):

  • mlx-lm==0.28.3 and below, because BatchKVCache.merge/extract don’t exist and we rely on them for batched decode.

Versions set by this PR:

  • vllm-metal (Darwin/arm64): mlx>=0.29.2, mlx-lm>=0.28.4
  • vLLM:
    • pyproject.toml extra: vllm>=0.14.0
    • install.sh pins vllm==0.14.1 (default for installer users); an example install command follows below
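
For reference, a manual install within these ranges might look like this (illustrative; installer users get the install.sh pin automatically):

pip install 'mlx>=0.29.2' 'mlx-lm>=0.28.4' 'vllm==0.14.1'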

ericcurtin merged commit d68b90a into vllm-project:main on Feb 22, 2026
5 checks passed