feat: memory-aware admission controller #207

Closed
Thump604 wants to merge 1 commit into waybarrios:main from Thump604:feat/admission-controller

@Thump604
Collaborator

Summary

Standalone admission controller that gates request admission based on available GPU memory. Prevents OOM under concurrent load by queuing requests until memory is available.

Zero dependencies on other PRs — this is a self-contained module.

Components

  • MemoryMonitor: reads Metal GPU memory via mx.get_active_memory()/mx.get_cache_memory()
  • RequestQueue: pluggable policies — fifo, shortest_first (with 120s starvation guard), priority
  • AdmissionController: gates requests by free_memory >= prefill_bytes + headroom
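A minimal sketch of how the monitor and the gate might fit together. The class and function names here are hypothetical and are not taken from vllm_mlx/admission.py. The memory readers are injected so the sketch runs without MLX, and the rule that both active and cached bytes count as used is an assumption.

```python
from typing import Callable


def can_admit(free_bytes: int, prefill_bytes: int, headroom_bytes: int) -> bool:
    """The gate from the description: free_memory >= prefill_bytes + headroom."""
    return free_bytes >= prefill_bytes + headroom_bytes


class SketchMemoryMonitor:
    """Hypothetical monitor; not the class shipped in this PR."""

    def __init__(
        self,
        budget_bytes: int,
        read_active: Callable[[], int],
        read_cache: Callable[[], int],
    ) -> None:
        self._budget = budget_bytes
        self._read_active = read_active
        self._read_cache = read_cache

    def free_bytes(self) -> int:
        # Assumption: active and cached bytes both count against the budget.
        used = self._read_active() + self._read_cache()
        return max(0, self._budget - used)


GIB = 1024**3
# Fixed readers stand in for the real memory queries.
monitor = SketchMemoryMonitor(16 * GIB, lambda: 10 * GIB, lambda: 2 * GIB)
# 4 GiB free, 5 GiB needed, so this prints False.
print(can_admit(monitor.free_bytes(), prefill_bytes=3 * GIB, headroom_bytes=2 * GIB))
```

On a machine with MLX installed, the two readers would presumably be mx.get_active_memory and mx.get_cache_memory, as named in the component list.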

Design principle

Load affects latency, never quality. Once a request starts generating, it runs to completion. The controller only decides WHEN to start.
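One way to picture this principle is an asyncio gate that delays only the start of a request and never interrupts it afterwards. This is a sketch under stated assumptions: SketchGate, run_request, and the byte counts are hypothetical, and the real controller reads GPU memory rather than a counter.

```python
import asyncio
from typing import Awaitable, Callable


class SketchGate:
    """Hypothetical gate: holds a request until enough memory is free."""

    def __init__(self, free_bytes: int) -> None:
        self._free = free_bytes
        self._cond = asyncio.Condition()

    async def acquire(self, need_bytes: int, timeout: float) -> None:
        async with self._cond:
            # The predicate is re-checked each time release() notifies waiters.
            await asyncio.wait_for(
                self._cond.wait_for(lambda: self._free >= need_bytes), timeout
            )
            self._free -= need_bytes

    async def release(self, need_bytes: int) -> None:
        async with self._cond:
            self._free += need_bytes
            self._cond.notify_all()


async def run_request(
    gate: SketchGate,
    need_bytes: int,
    generate: Callable[[], Awaitable[str]],
    timeout: float = 600.0,
) -> str:
    # Waiting happens only here; a timeout or cancellation propagates as an exception.
    await gate.acquire(need_bytes, timeout)
    try:
        # Once started, generation is not paused or degraded by the gate.
        return await generate()
    finally:
        await gate.release(need_bytes)


async def demo() -> list:
    gate = SketchGate(free_bytes=100)
    order = []

    async def job(name: str) -> str:
        order.append(f"start {name}")
        await asyncio.sleep(0.01)
        order.append(f"end {name}")
        return name

    # Both requests need 80 of 100 bytes, so the second waits for the first.
    await asyncio.gather(
        run_request(gate, 80, lambda: job("a")),
        run_request(gate, 80, lambda: job("b")),
    )
    return order


order = asyncio.run(demo())
print(order)
```

The second request sees higher latency because it waits for memory, but its generation is identical to what it would have been on an idle machine.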

Files

  • vllm_mlx/admission.py — 376 lines
  • tests/test_admission.py — 30 tests

Status

Draft — running functional and quality validation. Test results incoming.

Standalone admission controller that gates request admission based on
available GPU memory. Prevents OOM under concurrent load by queuing
requests until memory is available.

Components:
- MemoryMonitor: reads Metal GPU memory via mx.get_active_memory()/mx.get_cache_memory()
- RequestQueue: pluggable policies (fifo, shortest_first, priority)
- AdmissionController: gates requests by free_memory >= prefill + headroom
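The prefill term in the gate has to come from an estimate of KV cache size. The sketch below uses the standard per-token KV size for attention layers (keys plus values, per layer, per KV head). It is an assumption that the PR's estimator works this way; the function names and model shapes are hypothetical.

```python
def kv_bytes_per_token(
    n_attention_layers: int,
    n_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,
) -> int:
    """Bytes of KV cache one token adds: K and V for every attention layer."""
    return 2 * n_attention_layers * n_kv_heads * head_dim * dtype_bytes


def estimate_prefill_bytes(prompt_tokens: int, per_token_bytes: int) -> int:
    """KV cache needed to hold the whole prompt."""
    return prompt_tokens * per_token_bytes


# Hypothetical dense model: 32 attention layers, 8 KV heads, head_dim 128, fp16.
dense = kv_bytes_per_token(32, 8, 128)
# Hypothetical hybrid model: only 8 of its 32 layers are attention layers,
# so only those 8 contribute per-token KV cache.
hybrid = kv_bytes_per_token(8, 8, 128)

print(dense)                                # 131072 bytes (128 KiB) per token
print(hybrid)                               # 32768 bytes (32 KiB) per token
print(estimate_prefill_bytes(4096, dense))  # 536870912 bytes (512 MiB)
```

Only attention layers are counted because they are the ones that store keys and values per token, which is why a hybrid model with few attention layers needs far less prefill memory for the same prompt.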

Key design: load affects latency, never quality. Once a request starts
generating, it runs to completion. The controller only decides WHEN to start.

Features:
- kv_per_token estimation for MoE, dense, and hybrid (Mamba/Attention) models
- shortest_first policy with 120s starvation guard
- priority policy with numeric priority + FIFO tiebreak
- Eviction callback for prefix cache pressure relief
- Cancellation-safe (raises CancelledError instead of silently proceeding)
- 600s admission timeout (raises TimeoutError)
- 30 unit tests
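
The three queue policies listed above could be sketched as follows. Pending and SketchQueue are hypothetical names, not the PR's RequestQueue. The sketch assumes that a lower priority number is served first (the source does not say which direction wins) and that the starvation guard promotes the oldest request once it has waited 120 seconds.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Pending:
    name: str
    prompt_tokens: int
    priority: int = 0
    enqueued_at: float = field(default_factory=time.monotonic)


class SketchQueue:
    POLICIES = ("fifo", "shortest_first", "priority")

    def __init__(self, policy: str = "fifo", starvation_s: float = 120.0) -> None:
        if policy not in self.POLICIES:
            raise ValueError(f"unknown policy: {policy}")
        self.policy = policy
        self.starvation_s = starvation_s
        self._items: List[Pending] = []  # kept in arrival order

    def push(self, item: Pending) -> None:
        self._items.append(item)

    def pop(self, now: Optional[float] = None) -> Pending:
        if not self._items:
            raise IndexError("pop from empty queue")
        now = time.monotonic() if now is None else now
        idx = 0  # fifo default: the oldest request
        if self.policy == "shortest_first":
            # Starvation guard: keep idx = 0 once the oldest has waited too long.
            if now - self._items[0].enqueued_at < self.starvation_s:
                idx = min(
                    range(len(self._items)),
                    key=lambda i: self._items[i].prompt_tokens,
                )
        elif self.policy == "priority":
            # min() returns the first minimal index, which gives the FIFO tiebreak.
            idx = min(range(len(self._items)), key=lambda i: self._items[i].priority)
        return self._items.pop(idx)


q = SketchQueue("shortest_first")
q.push(Pending("long", 5000, enqueued_at=0.0))
q.push(Pending("short", 10, enqueued_at=100.0))
print(q.pop(now=110.0).name)  # short: the oldest has waited only 110 s
q.push(Pending("short2", 10, enqueued_at=115.0))
print(q.pop(now=121.0).name)  # long: it has now waited 121 s, past the guard
```

Scanning a list is linear per pop, which keeps the sketch short; a heap would be the usual choice if the queue grew large.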

Thump604 force-pushed the feat/admission-controller branch from 16fa54a to 272d7a7 on March 23, 2026 00:33

@Thump604
Collaborator Author

Closing: this BatchedEngine admission controller was built on unreproducible venv state. Will resubmit once the BatchedEngine work is properly tracked and tested.

Thump604 closed this on Mar 24, 2026