refactor(simple): SimpleEngine.generate() thin accumulator over stream_generate by Thump604 · Pull Request #266 · waybarrios/vllm-mlx

Thump604 · 2026-04-08T13:26:48Z

Summary

SimpleEngine.generate() is now a thin accumulator that iterates self.stream_generate() and returns the last GenerationOutput. This closes a real /v1/completions contract gap: the old direct self._model.generate() path silently dropped per-request specprefill / specprefill_keep_pct overrides because mlx_lm.generate() does not consume those kwargs.

Why

stream_generate() is the only code path that pops specprefill / specprefill_keep_pct from kwargs, does the threshold check, and routes through _stream_generate_specprefill(). Non-streaming clients that send {"extra_body": {"specprefill": true}} to /v1/completions expected SpecPrefill to engage on the non-streaming route. Before this PR, the server (with the companion change in #265) forwarded the overrides into engine.generate(), and engine.generate() forwarded them via **kwargs into _model.generate(), which ignored them. The feature was advertised end-to-end but was a silent no-op at the engine boundary.
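The gap described above can be illustrated with a minimal sketch. It is not the project's code: `model_generate` and `should_use_specprefill` are hypothetical names, the 8192 threshold is the value quoted in this PR's verification notes, and the treatment of `specprefill=False` as "force off" is an assumption.

```python
from typing import Any, Optional, Tuple

SPECPREFILL_THRESHOLD = 8192  # value quoted in this PR; assumed to be a fixed constant here


def model_generate(prompt: str, max_tokens: int = 16, **kwargs: Any) -> dict:
    """Stand-in for a backend that accepts **kwargs but never reads the overrides."""
    # specprefill / specprefill_keep_pct arrive in kwargs and are simply ignored.
    return {"text": f"echo:{prompt[:8]}", "specprefill_engaged": False}


def should_use_specprefill(
    prompt_tokens: int,
    kwargs: dict,
    default_enabled: bool = True,
) -> Tuple[bool, Optional[float]]:
    """Pop the per-request overrides, then apply the threshold check."""
    override = kwargs.pop("specprefill", None)
    keep_pct = kwargs.pop("specprefill_keep_pct", None)
    if override is True:
        return True, keep_pct   # forced on, even below the threshold
    if override is False:
        return False, keep_pct  # assumption: an explicit False forces it off
    return default_enabled and prompt_tokens >= SPECPREFILL_THRESHOLD, keep_pct
```

With a 6007-token prompt and no override, the sketch declines SpecPrefill; with `specprefill=True` it engages, which mirrors the behavior the PR verifies on the streaming path.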

Matches the accumulator-over-streaming pattern established by #222 for tool-enabled chat.

Changes

vllm_mlx/engine/simple.py:

SimpleEngine.generate() body replaced with an async for loop over self.stream_generate(**kwargs) that captures the last GenerationOutput. Empty-stream edge case returns GenerationOutput(text="", finish_reason="stop") rather than raising.
Text is run through clean_output_text() the same way the old path did.
Returned GenerationOutput preserves tokens, prompt_tokens, completion_tokens, and finish_reason from the last yielded chunk, and sets finished=True.
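The accumulator shape described in the list above might look roughly like the following sketch. The `GenerationOutput` dataclass, the `clean_output_text` body, and the toy `stream_generate` are stand-ins written for illustration; only the field names and the empty-stream default come from this PR. It also assumes each streamed chunk carries the cumulative text so far.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, AsyncIterator, List, Optional


@dataclass
class GenerationOutput:
    # Hypothetical stand-in; field names follow the ones listed in this PR.
    text: str = ""
    tokens: List[int] = field(default_factory=list)
    prompt_tokens: int = 0
    completion_tokens: int = 0
    finish_reason: Optional[str] = None
    finished: bool = False


def clean_output_text(text: str) -> str:
    # Placeholder: the real helper's behavior is not shown in the PR.
    return text.strip()


class SketchEngine:
    async def stream_generate(self, prompt: str, **kwargs: Any) -> AsyncIterator[GenerationOutput]:
        # Toy stream: one chunk per word, each carrying the cumulative text.
        words = prompt.split()
        for i in range(1, len(words) + 1):
            yield GenerationOutput(
                text=" ".join(words[:i]) + " ",
                tokens=list(range(i)),
                prompt_tokens=len(words),
                completion_tokens=i,
                finish_reason="stop" if i == len(words) else None,
            )

    async def generate(self, prompt: str, **kwargs: Any) -> GenerationOutput:
        last: Optional[GenerationOutput] = None
        async for chunk in self.stream_generate(prompt, **kwargs):
            last = chunk  # keep only the most recent chunk
        if last is None:
            # Empty stream: return a safe default rather than raising.
            return GenerationOutput(text="", finish_reason="stop")
        return GenerationOutput(
            text=clean_output_text(last.text),
            tokens=last.tokens,
            prompt_tokens=last.prompt_tokens,
            completion_tokens=last.completion_tokens,
            finish_reason=last.finish_reason,
            finished=True,
        )
```

Because `generate()` forwards `**kwargs` untouched, any override handling that lives in `stream_generate()` applies to the non-streaming path as well.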

tests/test_simple_engine.py:

New test_generate_accumulates_over_stream_generate stubs stream_generate with an async generator that yields two chunks, calls engine.generate() with specprefill=True and specprefill_keep_pct=0.2, and asserts (a) the final output fields match the last yielded chunk and (b) both SpecPrefill overrides reached stream_generate.
New test_generate_empty_stream_returns_safe_default covers the empty-stream edge case.
Extended the mock_model fixture with a stream_generate side effect that tracks concurrency the same way the existing generate side effect does, so test_lock_prevents_concurrent_generate continues to observe serialization through the new accumulator path without behavior change.
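For readers unfamiliar with stubbing an async generator, the first test above could be approximated as follows. This is a sketch under assumptions, not the test from tests/test_simple_engine.py: `TinyEngine` and `run_accumulator_check` are hypothetical, and the chunks are plain `SimpleNamespace` objects rather than real `GenerationOutput` instances.

```python
import asyncio
from types import SimpleNamespace


class TinyEngine:
    """Hypothetical minimal engine; only the accumulator shape matters here."""

    async def generate(self, **kwargs):
        last = None
        async for chunk in self.stream_generate(**kwargs):
            last = chunk
        return last


def run_accumulator_check():
    seen = {}

    async def fake_stream_generate(**kwargs):
        # Record what reached the streaming path, then yield two chunks.
        seen.update(kwargs)
        yield SimpleNamespace(text="Hel", completion_tokens=1, finish_reason=None)
        yield SimpleNamespace(text="Hello", completion_tokens=2, finish_reason="stop")

    engine = TinyEngine()
    # Assigned on the instance, so the stub is called without `self`.
    engine.stream_generate = fake_stream_generate
    out = asyncio.run(
        engine.generate(prompt="hi", specprefill=True, specprefill_keep_pct=0.2)
    )
    return out, seen
```

The two assertions of interest are then that `out` matches the last yielded chunk and that `seen` contains both SpecPrefill overrides.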

Scope

Only generate() changes in this PR. chat() stays on its current path. Extending chat() to the full accumulator pattern is a separate follow-up that will layer on top of #222 once it merges.

Verification

Unit tests: the new test_generate_accumulates_over_stream_generate and test_generate_empty_stream_returns_safe_default both pass against a local checkout.

Live against Qwen 3.5 4B SimpleEngine + SpecPrefill on an M2 Ultra 128GB with extra_body.specprefill=true forcing SpecPrefill below the 8192-token threshold:

curl /v1/completions
  body: {"model":"qwen3.5-4b","prompt":"<~6000 tokens filler> Summarize in one sentence:","max_tokens":30,"extra_body":{"specprefill":true}}

response: usage.prompt_tokens=6007 (was silent 0 pre-PR)
stderr:   SpecPrefill: scored 6007 tokens in 5.3s, sparse prefill 1815/6007 (keep=30%) in 1.1s

Before this PR the same request returned coherent content but never engaged SpecPrefill (no SpecPrefill: log line), and prompt_tokens was silently 0.
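A minimal sketch of building the same request from Python's standard library follows. The base URL is an assumption (the PR does not state a host or port), and `build_completion_request` and `read_prompt_tokens` are hypothetical helper names; the payload fields mirror the request body shown above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: host and port are not given in the PR


def build_completion_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a non-streaming /v1/completions request with the SpecPrefill override."""
    payload = {
        "model": "qwen3.5-4b",
        "prompt": prompt,
        "max_tokens": 30,
        "extra_body": {"specprefill": True},
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def read_prompt_tokens(response_body: bytes) -> int:
    """Extract usage.prompt_tokens from a completion response body."""
    return json.loads(response_body)["usage"]["prompt_tokens"]
```

Against a running server, the request can be sent with `urllib.request.urlopen(req)` and the body passed to `read_prompt_tokens` to check that the count is no longer 0.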

Related

feat(api): per-request SpecPrefill overrides on /v1/completions #265 (same author): adds the server-side CompletionRequest schema fields and the create_completion handler plumbing that forwards specprefill / specprefill_keep_pct into engine.generate(**gen_kwargs). This PR closes the wire on the engine side. Landing in either order is fine; landing both gives the end-to-end plumbing.
fix: replace manual decode loop with pipelined generation in SpecPrefill Phase 4 #248 (Vigilans): Phase 4 decode pipelining fix. The accumulator routes through _stream_generate_specprefill which routes through Phase 4, so both PRs compound.
simple-engine: keep tool chat on the streaming execution path #222 (krystophny): the accumulator-over-streaming pattern established for tool chat. This PR applies the same architectural shape to generate().

Thump604 · 2026-04-10T12:20:04Z

Quick merge signal: this is the engine-side half of the /v1/completions SpecPrefill override work paired with #265. The branch is mergeable, CI is green, and the same accumulator shape is already carrying the local runtime path cleanly on our side.

If you're doing a small sweep, #265 + #266 are the pair to take together.

janhilgard · 2026-04-11T16:07:18Z

Nice refactor @Thump604 — the accumulator-over-stream pattern is clean and fixes the real SpecPrefill gap in the non-streaming path.

PR has merge conflicts with current main though (simple.py changed in recent merges). Would need a rebase.

…er stream_generate

stream_generate() is the only code path that consumes per-request SpecPrefill overrides (`specprefill`, `specprefill_keep_pct`) and routes through _stream_generate_specprefill() when engaged. The prior direct self._model.generate() path silently dropped those overrides: server.py's create_completion() extracts them from extra_body and forwards to engine.generate(), engine.generate() forwards via **kwargs to _model.generate(), but _model.generate() (mlx_lm.generate) does not consume them. Non-streaming /v1/completions clients that sent `{"extra_body": {"specprefill": true}}` had their overrides silently no-op'd.

Fix: make SimpleEngine.generate() a thin accumulator that iterates self.stream_generate() and returns the last GenerationOutput. Matches the pattern PR waybarrios#222 established for tool-enabled chat().

Non-streaming clients now get:
- SpecPrefill engagement when `specprefill=true` is set (top-level or extra_body fallback via whatever helper server.py uses)
- Accurate `prompt_tokens` reporting (the old path returned 0 because mlx_lm.generate never populates it)
- Chat-template and reasoning-parser behavior consistent with the streaming path
- Same thread-safety (stream_generate holds self._generation_lock around the MLX call)

Scope: only generate() changes. chat() stays on its current path; extending chat() to the full accumulator pattern is a separate follow-up on top of PR waybarrios#222.

Tests:
- New test_generate_accumulates_over_stream_generate stubs stream_generate with an async generator, calls generate() with per-request specprefill kwargs, and asserts:
  * final output fields (text, tokens, prompt_tokens, completion_tokens, finish_reason, finished) match the last yielded chunk
  * specprefill / specprefill_keep_pct were forwarded through to stream_generate
- New test_generate_empty_stream_returns_safe_default covers the empty-stream edge case (returns GenerationOutput(text="", finish_reason="stop") rather than raising)
- Existing mock_model fixture extended with stream_generate tracking so test_lock_prevents_concurrent_generate still observes serialization through the new accumulator path

Verified live against Qwen3.5-4B SimpleEngine + SpecPrefill on M2 Ultra with a ~6K token prompt and extra_body.specprefill=true forcing SpecPrefill below the 8192 threshold:

  SpecPrefill: scored 6007 tokens in 5.3s, sparse prefill 1815/6007 (keep=30%) in 1.1s

prompt_tokens reporting is now 6007 (was always 0 before).

Related: companion PR waybarrios#265 (CompletionRequest schema + server-side extract_body -> gen_kwargs threading) which opens the wire from /v1/completions to engine.generate(). This PR closes the wire on the engine side.

Thump604 · 2026-04-11T16:16:20Z

Restacked on current main.

This keeps generate() as a thin accumulator over stream_generate() so the non-streaming /v1/completions path actually shares the same per-request SpecPrefill override path and prompt-token accounting as streaming.

Local verification on the rebased branch:

python -m py_compile vllm_mlx/engine/simple.py tests/test_simple_engine.py
pytest -q tests/test_simple_engine.py -k 'generate_accumulates_over_stream_generate or generate_empty_stream_returns_safe_default or lock_prevents_concurrent_generate'

janhilgard

LGTM — clean refactor that fixes a real SpecPrefill gap in the non-streaming path.

Verified:

Serialization maintained: both code paths in stream_generate() properly acquire _generation_lock
SpecPrefill kwargs (specprefill, specprefill_keep_pct) now correctly flow through stream_generate() → _stream_generate_specprefill() for non-streaming /v1/completions callers
Empty stream edge case handled gracefully
Tests cover accumulator behavior, empty stream, and concurrency serialization
Rebase on current main is clean
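The serialization point in the review can be pictured with a small sketch: if the streaming path holds a lock for the duration of a generation, concurrent `generate()` calls through the accumulator run one at a time. `LockedEngine` and its counters are hypothetical; only the `_generation_lock` name comes from this PR, and the real engine's locking details may differ.

```python
import asyncio


class LockedEngine:
    """Hypothetical engine that tracks how many generations overlap."""

    def __init__(self):
        self._generation_lock = asyncio.Lock()
        self.active = 0
        self.max_active = 0

    async def stream_generate(self, n):
        # The lock is held for the whole stream, so generations cannot interleave.
        async with self._generation_lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            try:
                for i in range(n):
                    await asyncio.sleep(0)  # let other tasks try to run
                    yield i
            finally:
                self.active -= 1

    async def generate(self, n):
        last = None
        async for item in self.stream_generate(n):
            last = item
        return last


async def run_concurrently():
    engine = LockedEngine()
    results = await asyncio.gather(*(engine.generate(3) for _ in range(4)))
    return results, engine.max_active
```

Running `asyncio.run(run_concurrently())` should show every call completing while the peak number of overlapping generations stays at one.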

Accumulator-over-streaming pattern matches the established shape from #222.

Thump604 mentioned this pull request Apr 10, 2026, in feat(api): per-request SpecPrefill overrides on /v1/completions #265 (merged)

Thump604 force-pushed the codex/simpleengine-generate-accumulator branch from 59420a8 to 7bb55f5 Compare April 11, 2026 16:16

janhilgard approved these changes Apr 12, 2026

janhilgard merged commit ab68d94 into waybarrios:main Apr 12, 2026
7 checks passed
