Skip to content

feat: BatchedEngine parity — MTP routing, normalization, SpecPrefill#203

Closed
Thump604 wants to merge 1 commit intowaybarrios:mainfrom
Thump604:feat/batched-engine-parity-v2
Closed

feat: BatchedEngine parity — MTP routing, normalization, SpecPrefill#203
Thump604 wants to merge 1 commit intowaybarrios:mainfrom
Thump604:feat/batched-engine-parity-v2

Conversation

@Thump604
Copy link
Copy Markdown
Collaborator

Replaces #192 (rebased against main after merge of #180, #97, #127).

Ports SimpleEngine features to BatchedEngine for continuous batching mode:

  • Per-request MTP routing: text-only → TextModel with MTP speculative decoding, media → MLLM. Zero-copy weight sharing from VLM backbone.
  • message_utils.py: Shared _normalize_messages() — maps developer→system, merges consecutive same-role messages, hoists system to position [0]. Required for Qwen 3.5 templates that reject malformed sequences.
  • SpecPrefill: Draft model lifecycle, CLI arg wiring, per-request API in BatchedEngine.
  • System KV cache: ChatML boundary detection, hash-based snapshot/restore.
  • Tests: MTP routing, TextModel construction, speculative decoding, smoke test.

Context

PR #180 (SpecPrefill) merged SimpleEngine support. This PR extends the same features to BatchedEngine, which is the production path for continuous batching mode.

Test plan

  • Start with --continuous-batching --enable-mtp --mllm
  • Text-only request routes to TextModel+MTP
  • Media request routes to MLLM
  • SpecPrefill activates on long prompts
  • System prompt cached across turns
  • _normalize_messages prevents template crashes on malformed input

Port SimpleEngine features to BatchedEngine for continuous batching:

- Per-request MTP routing: text-only → TextModel (MTP), media → MLLM
- message_utils.py: shared _normalize_messages (developer→system,
  merge consecutive same-role, hoist system to [0])
- SpecPrefill config + draft model lifecycle in BatchedEngine
- System KV cache with ChatML boundary detection

Replaces PR waybarrios#192 (rebased against main after merge of waybarrios#180, waybarrios#97).
@Thump604
Copy link
Copy Markdown
Collaborator Author

Superseded by #204 (memory-aware scheduler includes BatchedEngine parity + admission control).

@Thump604 Thump604 closed this Mar 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant