feat: BatchedEngine — admission, MTP routing, cooperative SpecPrefill by Thump604 · Pull Request #209 · waybarrios/vllm-mlx

Thump604 · 2026-03-23T00:28:10Z

Summary

Integrates the admission controller (#207), cooperative specprefill (#208), and MLLM+MTP per-request routing into the BatchedEngine for production multi-user serving on Apple Silicon.

Depends on: #207 (admission), #208 (cooperative specprefill), #165 (MLLM hybrid batching), #180 (SpecPrefill, merged)

Key features

Admission gates on all 4 public methods (chat, stream_chat, generate, stream_generate) with try/finally cleanup
MLLM+MTP routing: text-only → mlx_lm TextModel with MTP, media → mlx_vlm MLLM
System KV cache: prefix boundary detection + snapshot/restore (7x speedup on hits)
Cooperative specprefill: draft scoring outside the generation lock
Thread-safe snapshot access + stale cache-hit re-verification under lock
CLI: --scheduler-policy, --scheduler-headroom-gb
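
For reviewers, here is a minimal sketch of the admission-gate pattern with try/finally cleanup. It is not the code in `batched.py`: `AdmissionController`, `try_admit`, `release`, and the GB-based cost model are hypothetical names chosen for illustration, and the generation step is a placeholder.

```python
import threading


class AdmissionController:
    """Hypothetical stand-in: tracks reserved memory against a fixed budget."""

    def __init__(self, budget_gb: float, headroom_gb: float = 0.0):
        # Requests are admitted only while reserved memory stays under this limit.
        self._limit = budget_gb - headroom_gb
        self._reserved = 0.0
        self._lock = threading.Lock()

    def try_admit(self, cost_gb: float) -> bool:
        with self._lock:
            if self._reserved + cost_gb > self._limit:
                return False
            self._reserved += cost_gb
            return True

    def release(self, cost_gb: float) -> None:
        with self._lock:
            self._reserved -= cost_gb

    @property
    def reserved_gb(self) -> float:
        with self._lock:
            return self._reserved


def generate(controller: AdmissionController, prompt: str, cost_gb: float) -> str:
    """Gate one request; the reservation is released even if generation fails."""
    if not controller.try_admit(cost_gb):
        raise RuntimeError("rejected: not enough memory headroom")
    try:
        if not prompt:
            raise ValueError("empty prompt")
        return prompt.upper()  # placeholder for the real generation call
    finally:
        controller.release(cost_gb)
```

The same shape applies to the streaming methods: if the try/finally wraps the generator body, the reservation is also released when the client disconnects mid-stream and the generator is closed.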

Files

vllm_mlx/engine/batched.py — core integration (~1900 lines)
vllm_mlx/specprefill.py — SpecPrefill scoring (builds on #180, "feat: SpecPrefill — attention-based sparse prefill for TTFT reduction")
vllm_mlx/text_model_from_vlm.py — zero-copy TextModel from VLM backbone
vllm_mlx/cli.py — scheduler CLI flags
vllm_mlx/scheduler.py — SchedulerConfig

Status

Draft — running functional and quality validation across 2B/4B/35B/122B at context lengths up to 1M (YaRN). Test results incoming.

Integrates the admission controller, cooperative specprefill, and MLLM+MTP per-request routing into the BatchedEngine for production multi-user serving.

Key changes:
- BatchedEngine: admission gates on all 4 public methods (chat, stream_chat, generate, stream_generate) with try/finally cleanup
- MLLM+MTP routing: text-only requests → mlx_lm TextModel with MTP speculative decoding, media requests → mlx_vlm MLLM path
- System KV cache: prefix boundary detection + snapshot/restore for repeated system prompts (7x speedup on cache hits)
- Cooperative specprefill: draft scoring outside the generation lock, yielding between chunks for concurrent request progress
- Thread-safe snapshot access (threading.Lock for cross-thread reads/writes)
- Cache-hit re-verification under lock (prevents stale flag after queuing)
- MLLM error loop: breaks after 10 consecutive errors (no infinite loop)
- CLI: --scheduler-policy, --scheduler-headroom-gb flags

Depends on: admission controller PR, cooperative specprefill PR, waybarrios#165, waybarrios#180

New files:
- specprefill.py: SpecPrefill scoring + sparse prefill (builds on merged waybarrios#180)
- text_model_from_vlm.py: zero-copy TextModel construction from VLM backbone
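
The lock-guarded snapshot access and cache-hit re-verification mentioned above could look roughly like the sketch below. It is illustrative only: `SnapshotStore`, `run_request`, and the `on_queued` hook are hypothetical names, and a plain object stands in for a real KV-cache snapshot.

```python
import threading


class SnapshotStore:
    """Hypothetical prefix-keyed snapshot cache; every access takes the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._snapshots = {}

    def put(self, prefix_key, snapshot):
        with self._lock:
            self._snapshots[prefix_key] = snapshot

    def evict(self, prefix_key):
        with self._lock:
            self._snapshots.pop(prefix_key, None)

    def peek(self, prefix_key):
        with self._lock:
            return prefix_key in self._snapshots

    def get(self, prefix_key):
        with self._lock:
            return self._snapshots.get(prefix_key)


def run_request(store, generation_lock, prefix_key, on_queued=None):
    """Return (early_hint, path_taken) for one request."""
    # Early check, made before queuing. It may be stale by the time we run.
    hit_hint = store.peek(prefix_key)

    # Stand-in for time spent waiting; other threads may evict here.
    if on_queued is not None:
        on_queued()

    with generation_lock:
        # Re-verify while holding the generation lock instead of trusting the hint.
        snapshot = store.get(prefix_key)
        path = "restored" if snapshot is not None else "full_prefill"
    return hit_hint, path
```

If the snapshot is evicted while the request waits, the early hint still reads as a hit, but the request falls back to a full prefill because the decision is made from the lookup under the lock.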

Thump604 · 2026-03-24T00:14:24Z

Closing — this combined BatchedEngine PR bundled too many features and was built on an unreproducible venv state. Will resubmit as focused, reproducible PRs after venv stabilization.

Thump604 mentioned this pull request Mar 23, 2026

feat: memory-aware admission controller for multi-user serving #204

Closed

6 tasks

Thump604 closed this Mar 24, 2026

Thump604 wants to merge 1 commit into waybarrios:main from Thump604:feat/batched-engine-phase-b
