feat: BatchedEngine — admission, MTP routing, cooperative SpecPrefill#209
Closed
Thump604 wants to merge 1 commit intowaybarrios:mainfrom
Closed
feat: BatchedEngine — admission, MTP routing, cooperative SpecPrefill#209Thump604 wants to merge 1 commit intowaybarrios:mainfrom
Thump604 wants to merge 1 commit intowaybarrios:mainfrom
Conversation
Integrates the admission controller, cooperative specprefill, and MLLM+MTP per-request routing into the BatchedEngine for production multi-user serving. Key changes: - BatchedEngine: admission gates on all 4 public methods (chat, stream_chat, generate, stream_generate) with try/finally cleanup - MLLM+MTP routing: text-only requests → mlx_lm TextModel with MTP speculative decoding, media requests → mlx_vlm MLLM path - System KV cache: prefix boundary detection + snapshot/restore for repeated system prompts (7x speedup on cache hits) - Cooperative specprefill: draft scoring outside the generation lock, yielding between chunks for concurrent request progress - Thread-safe snapshot access (threading.Lock for cross-thread reads/writes) - Cache-hit re-verification under lock (prevents stale flag after queuing) - MLLM error loop: breaks after 10 consecutive errors (no infinite loop) - CLI: --scheduler-policy, --scheduler-headroom-gb flags Depends on: admission controller PR, cooperative specprefill PR, waybarrios#165, waybarrios#180 New files: - specprefill.py: SpecPrefill scoring + sparse prefill (builds on merged waybarrios#180) - text_model_from_vlm.py: zero-copy TextModel construction from VLM backbone
6 tasks
Collaborator
Author
|
Closing — combined BatchedEngine PR bundled too many features and was built on unreproducible venv state. Will resubmit as focused, reproducible PRs after venv stabilization. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Integrates the admission controller (#207), cooperative specprefill (#208), and MLLM+MTP per-request routing into the BatchedEngine for production multi-user serving on Apple Silicon.
Depends on: #207 (admission), #208 (cooperative specprefill), #165 (MLLM hybrid batching), #180 (SpecPrefill, merged)
Key features
chat,stream_chat,generate,stream_generate) withtry/finallycleanup--scheduler-policy,--scheduler-headroom-gbFiles
vllm_mlx/engine/batched.py— core integration (~1900 lines)vllm_mlx/specprefill.py— SpecPrefill scoring (builds on feat: SpecPrefill — attention-based sparse prefill for TTFT reduction #180)vllm_mlx/text_model_from_vlm.py— zero-copy TextModel from VLM backbonevllm_mlx/cli.py— scheduler CLI flagsvllm_mlx/scheduler.py— SchedulerConfigStatus
Draft — running functional and quality validation across 2B/4B/35B/122B at context lengths up to 1M (YaRN). Test results incoming.