Skip to content

feat: Add --gpu-memory-utilization for configurable memory limits#108

Closed
janhilgard wants to merge 1 commit intowaybarrios:mainfrom
janhilgard:feat/gpu-memory-utilization
Closed

feat: Add --gpu-memory-utilization for configurable memory limits#108
janhilgard wants to merge 1 commit intowaybarrios:mainfrom
janhilgard:feat/gpu-memory-utilization

Conversation

@janhilgard
Copy link
Copy Markdown
Collaborator

Summary

  • Add --gpu-memory-utilization CLI flag (float, 0.0-1.0, default 0.90) that controls both the Metal soft allocation limit (mx.set_memory_limit) and the emergency cache clear threshold in the engine loop
  • Default 0.90 preserves identical behavior to current upstream
  • For large models (200GB+), --gpu-memory-utilization 0.95 eliminates cache thrashing caused by the hardcoded 200GB emergency threshold, recovering ~3.5x throughput (e.g. 11 → 38 tok/s on Qwen3.5-397B)

Motivation

Two hardcoded memory thresholds cause severe performance degradation on large models:

  1. engine_core.py:154 — Emergency threshold fixed at 200GB triggers mx.clear_cache() every 64 steps when the model footprint exceeds 200GB
  2. engine/batched.py:286 — Metal soft limit fixed at 90% of max_recommended_working_set_size

For a 209GB model on a 256GB machine, the emergency threshold fires continuously, clearing the cache and forcing re-computation. The fix: make both limits configurable via a single flag, following the same --gpu-memory-utilization convention used by vLLM (GPU).

Changes

File Change
vllm_mlx/engine_core.py Add gpu_memory_utilization to EngineConfig; compute emergency threshold dynamically as device_memory × min(util + 0.05, 0.99) with 200GB fallback
vllm_mlx/engine/batched.py Accept gpu_memory_utilization in __init__; use it for mx.set_memory_limit and pass to EngineConfig
vllm_mlx/cli.py Add --gpu-memory-utilization argument with validation
vllm_mlx/server.py Pass through gpu_memory_utilization to BatchedEngine
docs/reference/cli.md Document the new flag with example for large models

Threshold logic

--gpu-memory-utilization 0.90 (default):
  Metal soft limit = 90% of max_recommended_working_set_size
  Emergency threshold = 95% of device_memory (= 0.90 + 0.05)

--gpu-memory-utilization 0.95 (large models):
  Metal soft limit = 95% of max_recommended_working_set_size
  Emergency threshold = 99% of device_memory (= min(0.95 + 0.05, 0.99))

The 5% gap between soft limit and emergency is intentional — it gives MLX room for temporary allocations before the hard cache clear kicks in.

Test plan

  • Verify default (0.90) produces identical behavior to upstream (same log messages, same thresholds)
  • Verify --gpu-memory-utilization 0.95 raises both limits in logs
  • Verify validation rejects values outside (0.0, 1.0]
  • uvx black --check vllm_mlx/ passes
  • uvx ruff check vllm_mlx/ — no new errors (pre-existing issues only)

🤖 Generated with Claude Code

Add a single CLI flag to control both the Metal soft allocation limit
(mx.set_memory_limit) and the emergency cache clear threshold in the
engine loop. Default 0.90 preserves existing behavior.

For large models (200GB+), the previous hardcoded 200GB emergency
threshold and fixed 90% soft limit caused excessive cache clearing,
resulting in ~3.5x slowdown. With --gpu-memory-utilization 0.95
both limits scale to the actual device memory, eliminating the
thrashing.

The emergency threshold is always 5% above the soft limit (capped
at 99%) to give MLX headroom for temporary allocations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@janhilgard janhilgard force-pushed the feat/gpu-memory-utilization branch from 2be1926 to 1ac9bd6 Compare March 21, 2026 22:21
@Thump604
Copy link
Copy Markdown
Collaborator

Thump604 commented Apr 8, 2026

@waybarrios, @janhilgard: brief endorsement.

This is a real perf fix with concrete evidence. Two hardcoded memory thresholds (Metal soft allocation limit + emergency cache clear) cause severe degradation on large models, and the 11 to 38 tok/s recovery on Qwen3.5-397B at --gpu-memory-utilization 0.95 is hard evidence that the existing 200GB hardcoded threshold is wrong for systems with more memory.

The CLI flag is the right shape: --gpu-memory-utilization float 0.0-1.0 default 0.90, controlling both mx.set_memory_limit and the emergency-clear threshold. Default preserves identical current behavior for users who do not set the flag. Mergeable on current main.

Last activity Mar 21, ~3 weeks ago. Worth a status check.

@janhilgard
Copy link
Copy Markdown
Collaborator Author

@Thump604 Superseded — --gpu-memory-utilization is already in main via #278. Closing.

@janhilgard janhilgard closed this Apr 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants