
refactor(simple): SimpleEngine.generate() thin accumulator over stream_generate#266

Merged
janhilgard merged 1 commit into waybarrios:main from Thump604:codex/simpleengine-generate-accumulator
Apr 12, 2026

Conversation

Collaborator

@Thump604 Thump604 commented Apr 8, 2026

Summary

SimpleEngine.generate() is now a thin accumulator that iterates self.stream_generate() and returns the last GenerationOutput. This closes a real /v1/completions contract gap: the old direct self._model.generate() path silently dropped per-request specprefill / specprefill_keep_pct overrides because mlx_lm.generate() does not consume those kwargs.

Why

stream_generate() is the only code path that pops specprefill / specprefill_keep_pct from kwargs, does the threshold check, and routes through _stream_generate_specprefill(). Non-streaming clients that send {"extra_body": {"specprefill": true}} to /v1/completions expected SpecPrefill to engage on the non-streaming route. Before this PR, the server (with the companion change in #265) forwarded the overrides into engine.generate(), and engine.generate() forwarded them via **kwargs into _model.generate(), which ignored them. The feature was advertised end-to-end but was a silent no-op at the engine boundary.
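The failure mode above can be illustrated with a minimal sketch (hypothetical function, not the real mlx_lm API): a generator function that accepts `**kwargs` but never reads `specprefill` discards it silently, raising no error.

```python
# Hypothetical stand-in for mlx_lm.generate: accepts **kwargs but never
# consumes specprefill / specprefill_keep_pct, so they vanish silently.
def model_generate(prompt, max_tokens=128, **kwargs):
    # specprefill / specprefill_keep_pct land in kwargs and are never read
    return f"generated {max_tokens} tokens for {len(prompt)} chars"

out = model_generate("some long prompt", specprefill=True, specprefill_keep_pct=0.2)
print(out)  # no exception, no SpecPrefill engagement: overrides are dropped
```

Because Python swallows unknown keyword arguments into `**kwargs` without complaint, nothing at the call site signals that the feature never engaged.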

Matches the accumulator-over-streaming pattern established by #222 for tool-enabled chat.

Changes

vllm_mlx/engine/simple.py:

  • SimpleEngine.generate() body replaced with an async for loop over self.stream_generate(**kwargs) that captures the last GenerationOutput. Empty-stream edge case returns GenerationOutput(text="", finish_reason="stop") rather than raising.
  • Text is run through clean_output_text() the same way the old path did.
  • Returned GenerationOutput preserves tokens, prompt_tokens, completion_tokens, and finish_reason from the last yielded chunk, and sets finished=True.
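
The bullets above can be sketched as follows. This is a hypothetical reconstruction, not the actual `vllm_mlx/engine/simple.py` code: the `GenerationOutput` fields follow the names listed above, `clean_output_text` is stubbed, and `stream_generate` is replaced by a two-chunk toy stream for demonstration.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional

def clean_output_text(text: str) -> str:
    return text.strip()  # stand-in for the real helper

@dataclass
class GenerationOutput:
    text: str = ""
    tokens: list = field(default_factory=list)
    prompt_tokens: int = 0
    completion_tokens: int = 0
    finish_reason: Optional[str] = None
    finished: bool = False

class SimpleEngine:
    async def stream_generate(self, **kwargs):
        # Toy stream standing in for the real streaming path.
        yield GenerationOutput(text="partial", completion_tokens=1)
        yield GenerationOutput(text="partial full", prompt_tokens=7,
                               completion_tokens=2, finish_reason="stop")

    async def generate(self, **kwargs) -> GenerationOutput:
        last: Optional[GenerationOutput] = None
        async for chunk in self.stream_generate(**kwargs):
            last = chunk  # keep only the most recent chunk
        if last is None:
            # Empty-stream edge case: safe default instead of raising.
            return GenerationOutput(text="", finish_reason="stop", finished=True)
        last.text = clean_output_text(last.text)
        last.finished = True
        return last

out = asyncio.run(SimpleEngine().generate())
print(out.text, out.finish_reason, out.finished)
```

Because all kwargs pass straight through to `stream_generate(**kwargs)`, the SpecPrefill overrides reach the one code path that actually consumes them.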

tests/test_simple_engine.py:

  • New test_generate_accumulates_over_stream_generate stubs stream_generate with an async generator that yields two chunks, calls engine.generate() with specprefill=True and specprefill_keep_pct=0.2, and asserts (a) the final output fields match the last yielded chunk and (b) both SpecPrefill overrides reached stream_generate.
  • New test_generate_empty_stream_returns_safe_default covers the empty-stream edge case.
  • Extended the mock_model fixture with a stream_generate side effect that tracks concurrency the same way the existing generate side effect does, so test_lock_prevents_concurrent_generate continues to observe serialization through the new accumulator path without behavior change.
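
The shape of the first test can be sketched self-contained with plain `asyncio` (the real test uses the repo's pytest fixtures; names here are illustrative):

```python
import asyncio
from types import SimpleNamespace

captured = {}

async def fake_stream_generate(**kwargs):
    # Async-generator stub that records the kwargs it receives,
    # then yields two chunks like the real streaming path.
    captured.update(kwargs)
    yield SimpleNamespace(text="a", finish_reason=None)
    yield SimpleNamespace(text="ab", finish_reason="stop")

async def generate(**kwargs):
    # Thin accumulator under test: return the last yielded chunk.
    last = None
    async for chunk in fake_stream_generate(**kwargs):
        last = chunk
    return last

out = asyncio.run(generate(specprefill=True, specprefill_keep_pct=0.2))
assert out.text == "ab"                         # matches last yielded chunk
assert out.finish_reason == "stop"
assert captured["specprefill"] is True          # override reached the stub
assert captured["specprefill_keep_pct"] == 0.2
```

The assertions mirror the two claims in the test description: the final output fields come from the last chunk, and both SpecPrefill overrides are forwarded through to `stream_generate`.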

Scope

Only generate() changes in this PR. chat() stays on its current path. Extending chat() to the full accumulator pattern is a separate follow-up that will layer on top of #222 once it merges.

Verification

Unit tests: the new test_generate_accumulates_over_stream_generate and test_generate_empty_stream_returns_safe_default both pass against a local checkout.

Live test against Qwen3.5-4B on SimpleEngine + SpecPrefill (M2 Ultra, 128 GB), with extra_body.specprefill=true forcing SpecPrefill engagement below the 8192-token threshold:

curl /v1/completions
  body: {"model":"qwen3.5-4b","prompt":"<~6000 tokens filler> Summarize in one sentence:","max_tokens":30,"extra_body":{"specprefill":true}}

response: usage.prompt_tokens=6007 (was silent 0 pre-PR)
stderr:   SpecPrefill: scored 6007 tokens in 5.3s, sparse prefill 1815/6007 (keep=30%) in 1.1s

Before this PR the same request returned coherent content but never engaged SpecPrefill (no SpecPrefill: log line), and prompt_tokens was silently 0.

Related

@Thump604
Collaborator Author

Quick merge signal: this is the engine-side half of the /v1/completions SpecPrefill override work paired with #265. The branch is mergeable, CI is green, and the same accumulator shape is already carrying the local runtime path cleanly on our side.

If you're doing a small sweep, #265 + #266 are the pair to take together.

@janhilgard
Collaborator

Nice refactor @Thump604 — the accumulator-over-stream pattern is clean and fixes the real SpecPrefill gap in the non-streaming path.

PR has merge conflicts with current main though (simple.py changed in recent merges). Would need a rebase.

refactor(simple): SimpleEngine.generate() thin accumulator over stream_generate

stream_generate() is the only code path that consumes per-request
SpecPrefill overrides (`specprefill`, `specprefill_keep_pct`) and
routes through _stream_generate_specprefill() when engaged. The prior
direct self._model.generate() path silently dropped those overrides:
server.py's create_completion() extracts them from extra_body and
forwards to engine.generate(), engine.generate() forwards via **kwargs
to _model.generate(), but _model.generate() (mlx_lm.generate) does not
consume them. Non-streaming /v1/completions clients that sent
`{"extra_body": {"specprefill": true}}` had their overrides silently
no-op'd.

Fix: make SimpleEngine.generate() a thin accumulator that iterates
self.stream_generate() and returns the last GenerationOutput. Matches
the pattern PR waybarrios#222 established for tool-enabled chat(). Non-streaming
clients now get:

- SpecPrefill engagement when `specprefill=true` is set (top-level or
  extra_body fallback via whatever helper server.py uses)
- Accurate `prompt_tokens` reporting (the old path returned 0 because
  mlx_lm.generate never populates it)
- Chat-template and reasoning-parser behavior consistent with the
  streaming path
- Same thread-safety (stream_generate holds self._generation_lock
  around the MLX call)

Scope: only generate() changes. chat() stays on its current path;
extending chat() to the full accumulator pattern is a separate
follow-up on top of PR waybarrios#222.

Tests:
- New test_generate_accumulates_over_stream_generate stubs
  stream_generate with an async generator, calls generate() with
  per-request specprefill kwargs, and asserts:
  * final output fields (text, tokens, prompt_tokens,
    completion_tokens, finish_reason, finished) match the last yielded
    chunk
  * specprefill / specprefill_keep_pct were forwarded through to
    stream_generate
- New test_generate_empty_stream_returns_safe_default covers the
  empty-stream edge case (returns GenerationOutput(text="",
  finish_reason="stop") rather than raising)
- Existing mock_model fixture extended with stream_generate tracking
  so test_lock_prevents_concurrent_generate still observes
  serialization through the new accumulator path

Verified live against Qwen3.5-4B SimpleEngine + SpecPrefill on M2
Ultra with a ~6K token prompt and extra_body.specprefill=true forcing
SpecPrefill below the 8192 threshold:

  SpecPrefill: scored 6007 tokens in 5.3s, sparse prefill 1815/6007 (keep=30%) in 1.1s

prompt_tokens reporting is now 6007 (was always 0 before).

Related: companion PR waybarrios#265 (CompletionRequest schema + server-side
extract_body -> gen_kwargs threading) which opens the wire from
/v1/completions to engine.generate(). This PR closes the wire on the
engine side.
@Thump604 Thump604 force-pushed the codex/simpleengine-generate-accumulator branch from 59420a8 to 7bb55f5 on April 11, 2026 at 16:16
@Thump604
Collaborator Author

Restacked on current main.

This keeps generate() as a thin accumulator over stream_generate() so the non-streaming /v1/completions path actually shares the same per-request SpecPrefill override path and prompt-token accounting as streaming.

Local verification on the rebased branch:

  • python -m py_compile vllm_mlx/engine/simple.py tests/test_simple_engine.py
  • pytest -q tests/test_simple_engine.py -k 'generate_accumulates_over_stream_generate or generate_empty_stream_returns_safe_default or lock_prevents_concurrent_generate'

Collaborator

@janhilgard janhilgard left a comment


LGTM — clean refactor that fixes a real SpecPrefill gap in the non-streaming path.

Verified:

  • Serialization maintained: both code paths in stream_generate() properly acquire _generation_lock
  • SpecPrefill kwargs (specprefill, specprefill_keep_pct) now correctly flow through stream_generate() → _stream_generate_specprefill() for non-streaming /v1/completions callers
  • Empty stream edge case handled gracefully
  • Tests cover accumulator behavior, empty stream, and concurrency serialization
  • Rebase on current main is clean

Accumulator-over-streaming pattern matches the established shape from #222.

@janhilgard janhilgard merged commit ab68d94 into waybarrios:main Apr 12, 2026
7 checks passed
