
feat: BatchedEngine — admission, MTP routing, cooperative SpecPrefill#209

Closed
Thump604 wants to merge 1 commit into waybarrios:main from Thump604:feat/batched-engine-phase-b

Conversation

@Thump604
Collaborator

Summary

Integrates the admission controller (#207), cooperative specprefill (#208), and MLLM+MTP per-request routing into the BatchedEngine for production multi-user serving on Apple Silicon.

Depends on: #207 (admission), #208 (cooperative specprefill), #165 (MLLM hybrid batching), #180 (SpecPrefill, merged)

Key features

  • Admission gates on all 4 public methods (chat, stream_chat, generate, stream_generate) with try/finally cleanup
  • MLLM+MTP routing: text-only → mlx_lm TextModel with MTP, media → mlx_vlm MLLM
  • System KV cache: prefix boundary detection + snapshot/restore (7x speedup on hits)
  • Cooperative specprefill: draft scoring outside the generation lock
  • Thread-safe snapshot access + stale cache-hit re-verification under lock
  • CLI: --scheduler-policy, --scheduler-headroom-gb
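The admission-gate pattern described above can be sketched as follows. This is an illustrative mock, not the PR's actual API: the `AdmissionController` class, its `acquire`/`release` methods, and the `BatchedEngineSketch` wrapper are hypothetical names chosen to show the try/finally cleanup guarantee on each public method.

```python
# Sketch (assumed names): each public method acquires an admission slot
# before generating and releases it in a finally block, so a request
# that raises mid-generation cannot leak engine capacity.
import threading


class AdmissionController:
    """Hypothetical stand-in for the admission controller from #207."""

    def __init__(self, max_concurrent: int):
        self._sem = threading.Semaphore(max_concurrent)

    def acquire(self) -> None:
        self._sem.acquire()

    def release(self) -> None:
        self._sem.release()


class BatchedEngineSketch:
    def __init__(self, admission: AdmissionController):
        self._admission = admission

    def generate(self, prompt: str) -> str:
        self._admission.acquire()
        try:
            # ... real generation would happen here ...
            return f"response to: {prompt}"
        finally:
            # Always release, even if generation raises.
            self._admission.release()
```

The same acquire/try/finally wrapper would apply to `chat`, `stream_chat`, and `stream_generate`.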

Status

Draft — running functional and quality validation across 2B/4B/35B/122B at context lengths up to 1M (YaRN). Test results incoming.

Integrates the admission controller, cooperative specprefill, and MLLM+MTP
per-request routing into the BatchedEngine for production multi-user serving.

Key changes:
- BatchedEngine: admission gates on all 4 public methods (chat, stream_chat,
  generate, stream_generate) with try/finally cleanup
- MLLM+MTP routing: text-only requests → mlx_lm TextModel with MTP
  speculative decoding, media requests → mlx_vlm MLLM path
- System KV cache: prefix boundary detection + snapshot/restore for
  repeated system prompts (7x speedup on cache hits)
- Cooperative specprefill: draft scoring outside the generation lock,
  yielding between chunks for concurrent request progress
- Thread-safe snapshot access (threading.Lock for cross-thread reads/writes)
- Cache-hit re-verification under lock (prevents stale flag after queuing)
- MLLM error loop: breaks after 10 consecutive errors (no infinite loop)
- CLI: --scheduler-policy, --scheduler-headroom-gb flags
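The cache-hit re-verification item above follows a double-check pattern: a request may observe a system-cache hit while queued, but the snapshot can be evicted before the request actually runs, so the hit flag is verified again once the generation lock is held. A minimal sketch, assuming hypothetical names (`SystemKVCacheSketch`, `store`, `generate`):

```python
# Sketch (assumed names): re-check a pre-queue cache hit under the lock
# so a snapshot evicted while the request waited does not cause a stale
# restore; fall back to a full prefill instead.
import threading


class SystemKVCacheSketch:
    def __init__(self):
        self._lock = threading.Lock()
        self._snapshot = None  # cached KV state for the system prefix

    def store(self, snapshot) -> None:
        with self._lock:
            self._snapshot = snapshot

    def generate(self, had_hit: bool):
        with self._lock:
            # Re-verify under the lock: the flag set before queuing may
            # be stale if the snapshot was evicted in the meantime.
            if had_hit and self._snapshot is not None:
                return ("restore", self._snapshot)
            return ("full_prefill", None)
```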

Depends on: waybarrios#207 (admission controller), waybarrios#208 (cooperative specprefill), waybarrios#165, waybarrios#180

New files:
- specprefill.py: SpecPrefill scoring + sparse prefill (builds on merged waybarrios#180)
- text_model_from_vlm.py: zero-copy TextModel construction from VLM backbone
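The cooperative specprefill behavior listed above (draft scoring outside the generation lock, yielding between chunks) can be sketched like this. The lock, chunk size, and function names are assumptions for illustration; the real `specprefill.py` builds on waybarrios#180 and will differ.

```python
# Sketch (assumed names): draft-model scoring runs in chunks with no
# lock held, yielding between chunks so concurrent requests can make
# progress; only the final bookkeeping takes the generation lock.
import threading
import time

generation_lock = threading.Lock()


def cooperative_specprefill(tokens, score_chunk, chunk_size=512):
    """Score a prompt chunk-by-chunk without holding the generation lock."""
    scores = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        scores.extend(score_chunk(chunk))  # draft scoring, lock-free
        time.sleep(0)  # cooperative yield between chunks
    # Only the final result publication needs the generation lock.
    with generation_lock:
        return scores
```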
@Thump604
Collaborator Author

Closing — combined BatchedEngine PR bundled too many features and was built on unreproducible venv state. Will resubmit as focused, reproducible PRs after venv stabilization.

@Thump604 Thump604 closed this Mar 24, 2026