feat: Add --gpu-memory-utilization for configurable memory limits by janhilgard · Pull Request #108 · waybarrios/vllm-mlx

janhilgard · 2026-02-23T22:51:16Z

Summary

Add --gpu-memory-utilization CLI flag (float, 0.0-1.0, default 0.90) that controls both the Metal soft allocation limit (mx.set_memory_limit) and the emergency cache clear threshold in the engine loop
Default 0.90 preserves identical behavior to current upstream
For large models (200GB+), --gpu-memory-utilization 0.95 eliminates cache thrashing caused by the hardcoded 200GB emergency threshold, recovering ~3.5x throughput (e.g. 11 → 38 tok/s on Qwen3.5-397B)

Motivation

Two hardcoded memory thresholds cause severe performance degradation on large models:

engine_core.py:154 — Emergency threshold fixed at 200GB triggers mx.clear_cache() every 64 steps when the model footprint exceeds 200GB
engine/batched.py:286 — Metal soft limit fixed at 90% of max_recommended_working_set_size

For a 209GB model on a 256GB machine, the emergency threshold fires continuously, clearing the cache and forcing re-computation. The fix: make both limits configurable via a single flag, following the same --gpu-memory-utilization convention used by vLLM (GPU).

Changes

File	Change
`vllm_mlx/engine_core.py`	Add `gpu_memory_utilization` to `EngineConfig`; compute emergency threshold dynamically as `device_memory × min(util + 0.05, 0.99)` with 200GB fallback
`vllm_mlx/engine/batched.py`	Accept `gpu_memory_utilization` in `__init__`; use it for `mx.set_memory_limit` and pass to `EngineConfig`
`vllm_mlx/cli.py`	Add `--gpu-memory-utilization` argument with validation
`vllm_mlx/server.py`	Pass through `gpu_memory_utilization` to `BatchedEngine`
`docs/reference/cli.md`	Document the new flag with example for large models

Threshold logic

--gpu-memory-utilization 0.90 (default):
  Metal soft limit = 90% of max_recommended_working_set_size
  Emergency threshold = 95% of device_memory (= 0.90 + 0.05)

--gpu-memory-utilization 0.95 (large models):
  Metal soft limit = 95% of max_recommended_working_set_size
  Emergency threshold = 99% of device_memory (= min(0.95 + 0.05, 0.99))

The 5% gap between soft limit and emergency is intentional — it gives MLX room for temporary allocations before the hard cache clear kicks in.

Test plan

Verify default (0.90) produces identical behavior to upstream (same log messages, same thresholds)
Verify --gpu-memory-utilization 0.95 raises both limits in logs
Verify validation rejects values outside (0.0, 1.0]
uvx black --check vllm_mlx/ passes
uvx ruff check vllm_mlx/ — no new errors (pre-existing issues only)

🤖 Generated with Claude Code

Add a single CLI flag to control both the Metal soft allocation limit (mx.set_memory_limit) and the emergency cache clear threshold in the engine loop. Default 0.90 preserves existing behavior. For large models (200GB+), the previous hardcoded 200GB emergency threshold and fixed 90% soft limit caused excessive cache clearing, resulting in ~3.5x slowdown. With --gpu-memory-utilization 0.95 both limits scale to the actual device memory, eliminating the thrashing. The emergency threshold is always 5% above the soft limit (capped at 99%) to give MLX headroom for temporary allocations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Thump604 · 2026-04-08T00:25:57Z

@waybarrios, @janhilgard: brief endorsement.

This is a real perf fix with concrete evidence. Two hardcoded memory thresholds (Metal soft allocation limit + emergency cache clear) cause severe degradation on large models, and the 11 to 38 tok/s recovery on Qwen3.5-397B at --gpu-memory-utilization 0.95 is hard evidence that the existing 200GB hardcoded threshold is wrong for systems with more memory.

The CLI flag is the right shape: --gpu-memory-utilization float 0.0-1.0 default 0.90, controlling both mx.set_memory_limit and the emergency-clear threshold. Default preserves identical current behavior for users who do not set the flag. Mergeable on current main.

Last activity Mar 21, ~3 weeks ago. Worth a status check.

janhilgard · 2026-04-11T14:18:20Z

@Thump604 Superseded — --gpu-memory-utilization is already in main via #278. Closing.

janhilgard force-pushed the feat/gpu-memory-utilization branch from 2be1926 to 1ac9bd6 Compare March 21, 2026 22:21

janhilgard closed this Apr 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add --gpu-memory-utilization for configurable memory limits#108

feat: Add --gpu-memory-utilization for configurable memory limits#108
janhilgard wants to merge 1 commit intowaybarrios:mainfrom
janhilgard:feat/gpu-memory-utilization

janhilgard commented Feb 23, 2026

Uh oh!

Thump604 commented Apr 8, 2026

Uh oh!

janhilgard commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

janhilgard commented Feb 23, 2026

Summary

Motivation

Changes

Threshold logic

Test plan

Uh oh!

Thump604 commented Apr 8, 2026

Uh oh!

janhilgard commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants