feat(api): per-request SpecPrefill overrides on /v1/completions #265
Conversation
…er stream_generate
stream_generate() is the only code path that consumes per-request
SpecPrefill overrides (`specprefill`, `specprefill_keep_pct`) and
routes through _stream_generate_specprefill() when engaged. The prior
direct self._model.generate() path silently dropped those overrides:
server.py's create_completion() extracts them from extra_body and
forwards to engine.generate(), engine.generate() forwards via **kwargs
to _model.generate(), but _model.generate() (mlx_lm.generate) does not
consume them. Non-streaming /v1/completions clients that sent
`{"extra_body": {"specprefill": true}}` had their overrides silently
no-op'd.
Fix: make SimpleEngine.generate() a thin accumulator that iterates
self.stream_generate() and returns the last GenerationOutput (see the
sketch after this list). Matches the pattern PR waybarrios#222
established for tool-enabled chat(). Non-streaming clients now get:
- SpecPrefill engagement when `specprefill=true` is set (top-level or
extra_body fallback via whatever helper server.py uses)
- Accurate `prompt_tokens` reporting (the old path returned 0 because
mlx_lm.generate never populates it)
- Chat-template and reasoning-parser behavior consistent with the
streaming path
- Same thread-safety (stream_generate holds self._generation_lock
around the MLX call)
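A minimal sketch of the accumulator shape described above, assuming an async stream_generate and the GenerationOutput fields named in this message; the real SimpleEngine.generate() signature carries more parameters:

```python
# Hedged sketch: iterate the streaming path and return the last chunk.
# stream_generate() already holds self._generation_lock around the MLX
# call, so this inherits its thread-safety and SpecPrefill routing.
async def generate(self, prompt, **kwargs):
    last = None
    async for chunk in self.stream_generate(prompt, **kwargs):
        last = chunk
    if last is None:
        # Empty-stream edge case: return a safe default instead of raising.
        return GenerationOutput(text="", finish_reason="stop")
    return last
```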
Scope: only generate() changes. chat() stays on its current path;
extending chat() to the full accumulator pattern is a separate
follow-up on top of PR waybarrios#222.
Tests:
- New test_generate_accumulates_over_stream_generate stubs
stream_generate with an async generator, calls generate() with
per-request specprefill kwargs, and asserts:
* final output fields (text, tokens, prompt_tokens,
completion_tokens, finish_reason, finished) match the last yielded
chunk
* specprefill / specprefill_keep_pct were forwarded through to
stream_generate
- New test_generate_empty_stream_returns_safe_default covers the
empty-stream edge case (returns GenerationOutput(text="",
finish_reason="stop") rather than raising)
- Existing mock_model fixture extended with stream_generate tracking
so test_lock_prevents_concurrent_generate still observes
serialization through the new accumulator path
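To make those assertions concrete, here is a self-contained stand-in that mirrors what the new tests check; the real tests run against SimpleEngine and the mock_model fixture, and the real GenerationOutput has more fields than this stub:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class GenerationOutput:  # stand-in for the real dataclass
    text: str = ""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    finish_reason: str | None = None
    finished: bool = False

class StubEngine:
    """Records stream_generate kwargs and replays canned chunks."""
    def __init__(self, chunks):
        self.chunks = chunks
        self.seen_kwargs = None

    async def stream_generate(self, prompt, **kwargs):
        self.seen_kwargs = kwargs
        for chunk in self.chunks:
            yield chunk

    async def generate(self, prompt, **kwargs):
        # Same accumulator shape as the fix under test.
        last = None
        async for chunk in self.stream_generate(prompt, **kwargs):
            last = chunk
        return last or GenerationOutput(text="", finish_reason="stop")

async def main():
    final = GenerationOutput("full text", 7, 3, "stop", True)
    engine = StubEngine([GenerationOutput("full", finished=False), final])
    out = await engine.generate("hi", specprefill=True, specprefill_keep_pct=0.3)
    assert out is final  # last yielded chunk wins
    assert engine.seen_kwargs == {"specprefill": True, "specprefill_keep_pct": 0.3}
    # Empty stream returns the safe default rather than raising.
    empty = await StubEngine([]).generate("hi")
    assert (empty.text, empty.finish_reason) == ("", "stop")

asyncio.run(main())
```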
Verified live against Qwen3.5-4B SimpleEngine + SpecPrefill on M2
Ultra with a ~6K token prompt and extra_body.specprefill=true forcing
SpecPrefill below the 8192 threshold:
SpecPrefill: scored 6007 tokens in 5.3s, sparse prefill 1815/6007 (keep=30%) in 1.1s
prompt_tokens reporting is now 6007 (was always 0 before).
Related: companion PR waybarrios#265 (CompletionRequest schema + server-side
extra_body -> gen_kwargs threading) which opens the wire from
/v1/completions to engine.generate(). This PR closes the wire on the
engine side.
…Prefill

This commit is a local superset of upstream PR waybarrios#265
(feat(api): per-request SpecPrefill overrides on /v1/completions). It
landed locally first to unblock /v1/completions on this runtime, which
was returning 500 AttributeError on every request because
server.py::create_completion accesses request.top_k / min_p /
presence_penalty / repetition_penalty while the local CompletionRequest
schema was missing those fields after a merge/rebase lost the bdf7dcc
additions.

Scope vs upstream PR waybarrios#265:

Upstream PR waybarrios#265 (minimal, portable): adds `specprefill` and
`specprefill_keep_pct` to CompletionRequest, extracts them in
create_completion / stream_completion, threads as gen_kwargs.

Local additions on top of PR waybarrios#265:
- Restores `top_k`, `min_p`, `presence_penalty`, `repetition_penalty`
  to CompletionRequest (lost in rebase; were present in bdf7dcc).
  These match ChatCompletionRequest's local shape and unblock the
  existing create_completion access pattern.
- Adds `stream_options: StreamOptions | None = None` for future
  include-usage support.
- Adds `extra_body: dict[str, Any] | None = None` for OpenAI client
  compatibility.
- Updates the `_resolve_request_field` type hint to accept both
  ChatCompletionRequest and CompletionRequest; uses `getattr` to safely
  handle the extra_body fallback on either type.
- create_completion uses _resolve_request_field instead of direct
  attribute access, so per-request SpecPrefill can come from either
  top-level fields or extra_body.

When PR waybarrios#265 merges upstream, the specprefill /
specprefill_keep_pct parts of this commit become a rebase no-op. The
local-only additions (top_k/min_p/etc. restoration, extra_body helper,
_resolve_request_field extension) stay as runtime-local fixes until
they ship as separate upstream PRs.

Verification:
- curl /v1/completions with plain prompt: 200 with content
- curl /v1/completions with extra_body.specprefill=true: 200 with content
- curl /v1/completions with top_k/min_p/presence_penalty: 200 with
  content (was 500 AttributeError on all three before this fix)

Upstream: waybarrios#265
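A hedged sketch of the schema and helper shapes this commit describes; field names follow the commit text, while defaults, ordering, and the StreamOptions type are assumptions about the real vllm_mlx/api/models.py and server.py:

```python
from typing import Any
from pydantic import BaseModel

class CompletionRequest(BaseModel):
    prompt: str | list[str]
    max_tokens: int | None = None
    temperature: float | None = None
    top_p: float | None = None
    # Restored from bdf7dcc (lost in the rebase):
    top_k: int | None = None
    min_p: float | None = None
    presence_penalty: float | None = None
    repetition_penalty: float | None = None
    # Upstream PR waybarrios#265 scope:
    specprefill: bool | None = None
    specprefill_keep_pct: float | None = None
    # Local additions (StreamOptions shown as a plain dict here so the
    # sketch stays self-contained):
    stream_options: dict[str, Any] | None = None
    extra_body: dict[str, Any] | None = None

def _resolve_request_field(request: BaseModel, name: str) -> Any:
    """Prefer the top-level field, then fall back to extra_body.
    getattr keeps this safe on both ChatCompletionRequest and
    CompletionRequest, per the commit's description."""
    value = getattr(request, name, None)
    if value is not None:
        return value
    extra = getattr(request, "extra_body", None) or {}
    return extra.get(name)
```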
…l generate+stream_generate

Pre-existing regression from an earlier rebase that dropped bdf7dcc's
llm.py additions. The server.py request handlers still pass top_k,
min_p, presence_penalty, repetition_penalty through to SimpleEngine,
which forwards them via **kwargs to MLXLanguageModel.chat() (which
accepts **kwargs), which then calls self.generate(..., **kwargs). But
MLXLanguageModel.generate() and stream_generate() had been left with
only (temperature, top_p, repetition_penalty) in their signatures, so
any non-MLLM SimpleEngine request crashed with:

TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Observed as 0/6 on simple-base, simple-mtp, and simple-spec profiles
in the feature matrix regression sweep after the Session 87
cherry-picks of PRs waybarrios#248, waybarrios#229, waybarrios#218, and
waybarrios#222 landed. The cherry-picks did not cause this regression;
they exposed it by finally running the LLM-path tests that no one had
exercised since the rebase. Confirmed via stderr.log:

TypeError: MLXLanguageModel.generate() got an unexpected keyword argument 'top_k'
TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Fix: restore the signatures and bodies of _create_sampler,
_create_logits_processors, generate, and stream_generate to match
bdf7dcc's original intent. Preserves PR waybarrios#248's prompt_cache
parameter and non-str prompt support on stream_generate. Adds **kwargs
to both generate and stream_generate so future param additions degrade
gracefully instead of crashing.

This is a runtime-local fix. The equivalent upstream fix lives in
bdf7dcc, which was never upstreamed (confirmed via git merge-base
--is-ancestor bdf7dcc upstream/main). A follow-up PR to upstream could
carry this forward.

Verification:
- bin/verify-patches: 33/33 clean
- Full feature matrix regression sweep pending re-run after this commit.

Related: runtime PR waybarrios#265 fixed the CompletionRequest schema
side of the same bdf7dcc drop; this commit fixes the engine-model side.
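A hedged, signature-level sketch of the restored shape; parameter names come from the commit text, while defaults and bodies are elided and the real llm.py differs in detail:

```python
class MLXLanguageModel:
    def generate(self, prompt, max_tokens=None, temperature=None,
                 top_p=None, top_k=None, min_p=None,
                 presence_penalty=None, repetition_penalty=None,
                 **kwargs):
        # **kwargs absorbs future params so unknown keys degrade
        # gracefully instead of raising "unexpected keyword argument".
        ...

    def stream_generate(self, prompt, prompt_cache=None, **kwargs):
        # prompt may be a str or pre-tokenized input, and prompt_cache
        # is preserved per PR waybarrios#248; sampling params route
        # through _create_sampler / _create_logits_processors as in
        # bdf7dcc.
        ...
```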
waybarrios
left a comment
Minor style note: in create_chat_completion the specprefill kwargs are added directly to the existing chat_kwargs dict, but here a separate gen_kwargs dict is created and spread with **gen_kwargs. Works the same, just a slightly different pattern — not a blocker, just a stylistic inconsistency.
waybarrios
left a comment
To clarify my earlier comment — here's what I mean by the style difference.
In create_chat_completion, all kwargs go into a single dict:
chat_kwargs = {
"max_tokens": request.max_tokens or _default_max_tokens,
"temperature": _resolve_temperature(request.temperature),
"top_p": _resolve_top_p(request.top_p),
}
if request.specprefill is not None:
chat_kwargs["specprefill"] = request.specprefill
if request.specprefill_keep_pct is not None:
chat_kwargs["specprefill_keep_pct"] = request.specprefill_keep_pct
await engine.chat(..., **chat_kwargs)

In this PR, the base args are passed inline and the specprefill ones go into a separate dict that gets spread:
gen_kwargs = {}
if request.specprefill is not None:
gen_kwargs["specprefill"] = request.specprefill
if request.specprefill_keep_pct is not None:
gen_kwargs["specprefill_keep_pct"] = request.specprefill_keep_pct
await engine.generate(
prompt=prompt,
max_tokens=...,
temperature=...,
top_p=...,
stop=...,
**gen_kwargs,
)

For consistency, you could unify into a single dict like chat does:
gen_kwargs = {
"max_tokens": request.max_tokens or _default_max_tokens,
"temperature": _resolve_temperature(request.temperature),
"top_p": _resolve_top_p(request.top_p),
"stop": request.stop,
}
if request.specprefill is not None:
gen_kwargs["specprefill"] = request.specprefill
if request.specprefill_keep_pct is not None:
gen_kwargs["specprefill_keep_pct"] = request.specprefill_keep_pct
await engine.generate(prompt=prompt, **gen_kwargs)

Just a nit — the PR works fine as-is.
I aligned the /v1/completions SpecPrefill override handling style in 77066d9 so it now builds and mutates a single generation kwargs dict, matching the chat-completions path. Behavior is unchanged; this just removes the stylistic inconsistency you called out.
ChatCompletionRequest already accepts per-request `specprefill` and
`specprefill_keep_pct` overrides, and /v1/chat/completions threads them
into engine.chat(). CompletionRequest does not, so /v1/completions
clients cannot opt a single request into (or out of) SpecPrefill, nor
tune the keep percentage per request.

Changes:
- vllm_mlx/api/models.py: add specprefill and specprefill_keep_pct to
  CompletionRequest, matching the existing ChatCompletionRequest fields.
- vllm_mlx/server.py::create_completion: extract both and thread into
  engine.generate(**gen_kwargs), mirroring the pattern used at
  server.py:1421 in create_chat_completion.
- vllm_mlx/server.py::stream_completion: apply the same extraction so
  streaming /v1/completions clients get the same control.

Both new fields default to None, so existing behavior is unchanged for
clients that do not set them. No schema changes to
ChatCompletionRequest. No engine-side changes needed:
SimpleEngine.stream_generate already consumes these kwargs (see
simple.py:307-308).
Force-pushed from 77066d9 to b310a08
Restacked on current main. I kept the single-dict kwargs style you called out but preserved the current completions-path behavior. Local verification on the rebased branch:
…er stream_generate (#266)
janhilgard
left a comment
LGTM — clean API-side complement to #266 (already merged).
- `specprefill` and `specprefill_keep_pct` fields on `CompletionRequest` mirror the existing `ChatCompletionRequest` schema symmetrically
- Both `create_completion` and `stream_completion` forward overrides via a single `generate_kwargs` dict, matching the chat endpoint style
- Default `None` preserves existing behavior for clients that don't set these fields
- Rebase on current `main` is clean
Summary
/v1/chat/completions already accepts per-request `specprefill` and
`specprefill_keep_pct` overrides and threads them into engine.chat().
/v1/completions does not. This PR wires the same two fields through to
/v1/completions on both the non-streaming and streaming paths.

Why

A SpecPrefill user with mixed /v1/chat/completions and /v1/completions
traffic currently cannot opt individual /v1/completions requests into
or out of SpecPrefill, and cannot tune `specprefill_keep_pct` per
request on that endpoint. The server config is the only control. This
closes that gap symmetrically against the existing chat endpoint.

Changes

- vllm_mlx/api/models.py: add `specprefill: bool | None` and
  `specprefill_keep_pct: float | None` to CompletionRequest, matching
  the fields already on ChatCompletionRequest.
- vllm_mlx/server.py::create_completion: extract both and pass as
  `gen_kwargs` into engine.generate(), mirroring the pattern at
  server.py:1421 in create_chat_completion.
- vllm_mlx/server.py::stream_completion: apply the same extraction so
  streaming clients have the same control.

Both new fields default to None, so existing behavior is unchanged for
clients that don't set them. No schema changes to
ChatCompletionRequest. No engine-side changes needed:
SimpleEngine.stream_generate already pops `specprefill` and
`specprefill_keep_pct` from kwargs (simple.py:307-308).

Total delta: +20 lines across 2 files.
Test plan
- POST /v1/completions with `prompt` + `max_tokens` but no
  `specprefill` field still returns 200 with content (default behavior
  unchanged).
- POST /v1/completions with `{"specprefill": true}` engages SpecPrefill
  on an eligible SimpleEngine model (non-MLLM with
  `--specprefill-draft-model`).
- POST /v1/completions with `{"specprefill": false}` bypasses
  SpecPrefill on the same eligible model.
- POST /v1/completions with `{"specprefill_keep_pct": 0.2}` uses 20%
  keep instead of the server default.
- POST /v1/completions with `stream=true` respects all three
  per-request controls above.

I've verified the default-behavior case locally on a runtime whose
CompletionRequest schema is a strict superset of this PR's changes. The
minimal version in this branch is structurally a subset and should
behave identically for the default path.
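For reference, a hedged client-side example of exercising these controls through the OpenAI Python client; the base URL, API key, and model id are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# extra_body entries are merged into the request JSON, so the server
# picks the overrides up either as top-level fields or via its
# extra_body fallback.
resp = client.completions.create(
    model="my-model",  # placeholder model id
    prompt="Summarize: ...",
    max_tokens=64,
    extra_body={"specprefill": True, "specprefill_keep_pct": 0.2},
)
print(resp.choices[0].text)
```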