feat: Add --gpu-memory-utilization for configurable memory limits #108
Closed
janhilgard wants to merge 1 commit into waybarrios:main
Conversation
Add a single CLI flag to control both the Metal soft allocation limit (mx.set_memory_limit) and the emergency cache clear threshold in the engine loop. Default 0.90 preserves existing behavior.

For large models (200GB+), the previous hardcoded 200GB emergency threshold and fixed 90% soft limit caused excessive cache clearing, resulting in a ~3.5x slowdown. With --gpu-memory-utilization 0.95, both limits scale with the actual device memory, eliminating the thrashing. The emergency threshold is always 5% above the soft limit (capped at 99%) to give MLX headroom for temporary allocations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
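Concretely, the arithmetic described above can be sketched as follows. This is a minimal illustration, not code from the PR; memory_limits is a hypothetical helper name, and the 5%/99% constants come from the description above.

```python
def memory_limits(device_memory_bytes, gpu_memory_utilization=0.90):
    """Derive both memory limits from a single utilization fraction (sketch)."""
    # Metal soft allocation limit; in the PR this value would be handed
    # to mx.set_memory_limit.
    soft_limit = device_memory_bytes * gpu_memory_utilization
    # Emergency cache-clear threshold: always 5% above the soft limit,
    # capped at 99% of device memory.
    emergency_fraction = min(gpu_memory_utilization + 0.05, 0.99)
    emergency_threshold = device_memory_bytes * emergency_fraction
    return soft_limit, emergency_threshold

# On a 256GB machine with the 0.95 setting from the description:
soft, emergency = memory_limits(256 * 2**30, 0.95)
```

With the default 0.90, the emergency threshold lands at 95% of device memory; at 0.95 it hits the 99% cap, so a 209GB-footprint model no longer trips a fixed 200GB line.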
2be1926 to 1ac9bd6
Collaborator
@waybarrios, @janhilgard: brief endorsement. This is a real perf fix with concrete evidence: two hardcoded memory thresholds (Metal soft allocation limit + emergency cache clear) cause severe degradation on large models, and the 11 to 38 tok/s recovery on Qwen3.5-397B backs it up. The CLI flag is the right shape. Last activity Mar 21, ~3 weeks ago. Worth a status check.
Summary

- Adds a --gpu-memory-utilization CLI flag (float, 0.0-1.0, default 0.90) that controls both the Metal soft allocation limit (mx.set_memory_limit) and the emergency cache clear threshold in the engine loop
- --gpu-memory-utilization 0.95 eliminates cache thrashing caused by the hardcoded 200GB emergency threshold, recovering ~3.5x throughput (e.g. 11 → 38 tok/s on Qwen3.5-397B)

Motivation
Two hardcoded memory thresholds cause severe performance degradation on large models:
- engine_core.py:154 - emergency threshold fixed at 200GB triggers mx.clear_cache() every 64 steps when the model footprint exceeds 200GB
- engine/batched.py:286 - Metal soft limit fixed at 90% of max_recommended_working_set_size

For a 209GB model on a 256GB machine, the emergency threshold fires continuously, clearing the cache and forcing re-computation. The fix: make both limits configurable via a single flag, following the same --gpu-memory-utilization convention used by vLLM (GPU).

Changes
- vllm_mlx/engine_core.py - add gpu_memory_utilization to EngineConfig; compute the emergency threshold dynamically as device_memory × min(util + 0.05, 0.99), with a 200GB fallback
- vllm_mlx/engine/batched.py - accept gpu_memory_utilization in __init__; use it for mx.set_memory_limit and pass it to EngineConfig
- vllm_mlx/cli.py - add the --gpu-memory-utilization argument with validation
- vllm_mlx/server.py - pass gpu_memory_utilization to BatchedEngine
- docs/reference/cli.md

Threshold logic
The 5% gap between soft limit and emergency is intentional — it gives MLX room for temporary allocations before the hard cache clear kicks in.
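In the engine loop, the emergency check amounts to a periodic guard. A sketch under the PR's stated parameters: the 64-step interval comes from the Motivation section, while maybe_clear_cache and the injected callbacks are hypothetical names used so the logic is testable in isolation.

```python
def maybe_clear_cache(step, get_active_bytes, emergency_threshold,
                      clear_cache, interval=64):
    """Clear the allocator cache only when usage crosses the emergency line."""
    # Check every `interval` steps; in the PR this would guard mx.clear_cache(),
    # with get_active_bytes standing in for the live memory query.
    if step % interval == 0 and get_active_bytes() > emergency_threshold:
        clear_cache()
        return True
    return False
```

Because the threshold now scales with device memory instead of sitting at a fixed 200GB, a 209GB-footprint model on a 256GB machine stays below the line and the cache survives between steps.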
Test plan
- --gpu-memory-utilization 0.95 raises both limits in logs
- uvx black --check vllm_mlx/ passes
- uvx ruff check vllm_mlx/ - no new errors (pre-existing issues only)

🤖 Generated with Claude Code
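As a closing illustration, the CLI validation mentioned in the changes could be sketched with argparse. The validator name, accepted range boundary, and error message here are assumptions for illustration, not the PR's actual code.

```python
import argparse

def utilization_fraction(value):
    # Hypothetical validator: accept floats in (0.0, 1.0] for the flag.
    f = float(value)
    if not 0.0 < f <= 1.0:
        raise argparse.ArgumentTypeError(
            f"--gpu-memory-utilization must be in (0.0, 1.0], got {value}"
        )
    return f

parser = argparse.ArgumentParser()
parser.add_argument("--gpu-memory-utilization", type=utilization_fraction,
                    default=0.90)
args = parser.parse_args(["--gpu-memory-utilization", "0.95"])
```

Invalid values are rejected at parse time, so a typo like 9.5 fails fast instead of producing a nonsensical memory limit deep in the engine.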