refactor(simple): SimpleEngine.generate() thin accumulator over stream_generate#266
Conversation
|
Quick merge signal: this is the engine-side half of the If you're doing a small sweep, #265 + #266 are the pair to take together. |
|
Nice refactor @Thump604 — the accumulator-over-stream pattern is clean and fixes the real SpecPrefill gap in the non-streaming path. PR has merge conflicts with current |
…er stream_generate
stream_generate() is the only code path that consumes per-request
SpecPrefill overrides (`specprefill`, `specprefill_keep_pct`) and
routes through _stream_generate_specprefill() when engaged. The prior
direct self._model.generate() path silently dropped those overrides:
server.py's create_completion() extracts them from extra_body and
forwards to engine.generate(), engine.generate() forwards via **kwargs
to _model.generate(), but _model.generate() (mlx_lm.generate) does not
consume them. Non-streaming /v1/completions clients that sent
`{"extra_body": {"specprefill": true}}` had their overrides silently
no-op'd.
Fix: make SimpleEngine.generate() a thin accumulator that iterates
self.stream_generate() and returns the last GenerationOutput. Matches
the pattern PR waybarrios#222 established for tool-enabled chat(). Non-streaming
clients now get:
- SpecPrefill engagement when `specprefill=true` is set (top-level or
extra_body fallback via whatever helper server.py uses)
- Accurate `prompt_tokens` reporting (the old path returned 0 because
mlx_lm.generate never populates it)
- Chat-template and reasoning-parser behavior consistent with the
streaming path
- Same thread-safety (stream_generate holds self._generation_lock
around the MLX call)
Scope: only generate() changes. chat() stays on its current path;
extending chat() to the full accumulator pattern is a separate
follow-up on top of PR waybarrios#222.
Tests:
- New test_generate_accumulates_over_stream_generate stubs
stream_generate with an async generator, calls generate() with
per-request specprefill kwargs, and asserts:
* final output fields (text, tokens, prompt_tokens,
completion_tokens, finish_reason, finished) match the last yielded
chunk
* specprefill / specprefill_keep_pct were forwarded through to
stream_generate
- New test_generate_empty_stream_returns_safe_default covers the
empty-stream edge case (returns GenerationOutput(text="",
finish_reason="stop") rather than raising)
- Existing mock_model fixture extended with stream_generate tracking
so test_lock_prevents_concurrent_generate still observes
serialization through the new accumulator path
Verified live against Qwen3.5-4B SimpleEngine + SpecPrefill on M2
Ultra with a ~6K token prompt and extra_body.specprefill=true forcing
SpecPrefill below the 8192 threshold:
SpecPrefill: scored 6007 tokens in 5.3s, sparse prefill 1815/6007 (keep=30%) in 1.1s
prompt_tokens reporting is now 6007 (was always 0 before).
Related: companion PR waybarrios#265 (CompletionRequest schema + server-side
extract_body -> gen_kwargs threading) which opens the wire from
/v1/completions to engine.generate(). This PR closes the wire on the
engine side.
59420a8 to
7bb55f5
Compare
|
Restacked on current main. This keeps Local verification on the rebased branch:
|
janhilgard
left a comment
There was a problem hiding this comment.
LGTM — clean refactor that fixes a real SpecPrefill gap in the non-streaming path.
Verified:
- Serialization maintained: both code paths in
stream_generate()properly acquire_generation_lock - SpecPrefill kwargs (
specprefill,specprefill_keep_pct) now correctly flow throughstream_generate()→_stream_generate_specprefill()for non-streaming/v1/completionscallers - Empty stream edge case handled gracefully
- Tests cover accumulator behavior, empty stream, and concurrency serialization
- Rebase on current
mainis clean
Accumulator-over-streaming pattern matches the established shape from #222.
Summary
SimpleEngine.generate()is now a thin accumulator that iteratesself.stream_generate()and returns the lastGenerationOutput. This closes a real/v1/completionscontract gap: the old directself._model.generate()path silently dropped per-requestspecprefill/specprefill_keep_pctoverrides becausemlx_lm.generate()does not consume those kwargs.Why
stream_generate()is the only code path that popsspecprefill/specprefill_keep_pctfrom kwargs, does the threshold check, and routes through_stream_generate_specprefill(). Non-streaming clients that send{"extra_body": {"specprefill": true}}to/v1/completionsexpected SpecPrefill to engage on the non-streaming route. Before this PR, the server (with the companion change in #265) forwarded the overrides intoengine.generate(), andengine.generate()forwarded them via**kwargsinto_model.generate(), which ignored them. The feature was advertised end-to-end but was a silent no-op at the engine boundary.Matches the accumulator-over-streaming pattern established by #222 for tool-enabled chat.
Changes
vllm_mlx/engine/simple.py:SimpleEngine.generate()body replaced with anasync forloop overself.stream_generate(**kwargs)that captures the lastGenerationOutput. Empty-stream edge case returnsGenerationOutput(text="", finish_reason="stop")rather than raising.clean_output_text()the same way the old path did.GenerationOutputpreservestokens,prompt_tokens,completion_tokens, andfinish_reasonfrom the last yielded chunk, and setsfinished=True.tests/test_simple_engine.py:test_generate_accumulates_over_stream_generatestubsstream_generatewith an async generator that yields two chunks, callsengine.generate()withspecprefill=Trueandspecprefill_keep_pct=0.2, and asserts (a) the final output fields match the last yielded chunk and (b) both SpecPrefill overrides reachedstream_generate.test_generate_empty_stream_returns_safe_defaultcovers the empty-stream edge case.mock_modelfixture with astream_generateside effect that tracks concurrency the same way the existinggenerateside effect does, sotest_lock_prevents_concurrent_generatecontinues to observe serialization through the new accumulator path without behavior change.Scope
Only
generate()changes in this PR.chat()stays on its current path. Extendingchat()to the full accumulator pattern is a separate follow-up that will layer on top of #222 once it merges.Verification
Unit tests: the new
test_generate_accumulates_over_stream_generateandtest_generate_empty_stream_returns_safe_defaultboth pass against a local checkout.Live against Qwen 3.5 4B SimpleEngine + SpecPrefill on an M2 Ultra 128GB with
extra_body.specprefill=trueforcing SpecPrefill below the 8192-token threshold:Before this PR the same request returned coherent content but never engaged SpecPrefill (no
SpecPrefill:log line), andprompt_tokenswas silently 0.Related
CompletionRequestschema fields and thecreate_completionhandler plumbing that forwardsspecprefill/specprefill_keep_pctintoengine.generate(**gen_kwargs). This PR closes the wire on the engine side. Landing in either order is fine; landing both gives the end-to-end plumbing._stream_generate_specprefillwhich routes through Phase 4, so both PRs compound.generate().