feat(api): per-request SpecPrefill overrides on /v1/completions #265
Conversation
…er stream_generate
stream_generate() is the only code path that consumes per-request
SpecPrefill overrides (`specprefill`, `specprefill_keep_pct`) and
routes through _stream_generate_specprefill() when engaged. The prior
direct self._model.generate() path silently dropped those overrides:
server.py's create_completion() extracts them from extra_body and
forwards to engine.generate(), engine.generate() forwards via **kwargs
to _model.generate(), but _model.generate() (mlx_lm.generate) does not
consume them. Non-streaming /v1/completions clients that sent
`{"extra_body": {"specprefill": true}}` had their overrides silently
no-op'd.
Fix: make SimpleEngine.generate() a thin accumulator that iterates
self.stream_generate() and returns the last GenerationOutput (see the
sketch after this list). Matches the pattern PR waybarrios#222
established for tool-enabled chat(). Non-streaming clients now get:
- SpecPrefill engagement when `specprefill=true` is set (top-level or
extra_body fallback via whatever helper server.py uses)
- Accurate `prompt_tokens` reporting (the old path returned 0 because
mlx_lm.generate never populates it)
- Chat-template and reasoning-parser behavior consistent with the
streaming path
- Same thread-safety (stream_generate holds self._generation_lock
around the MLX call)
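A minimal sketch of the accumulator shape described above, assuming an async stream_generate and the GenerationOutput fields named in this message; the real SimpleEngine.generate() signature carries more parameters:

```python
# Hedged sketch: iterate the streaming path and return the last chunk.
# stream_generate() already holds self._generation_lock around the MLX
# call, so this inherits its thread-safety and SpecPrefill routing.
async def generate(self, prompt, **kwargs):
    last = None
    async for chunk in self.stream_generate(prompt, **kwargs):
        last = chunk
    if last is None:
        # Empty-stream edge case: return a safe default instead of raising.
        return GenerationOutput(text="", finish_reason="stop")
    return last
```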
Scope: only generate() changes. chat() stays on its current path;
extending chat() to the full accumulator pattern is a separate
follow-up on top of PR waybarrios#222.
Tests:
- New test_generate_accumulates_over_stream_generate stubs
stream_generate with an async generator, calls generate() with
per-request specprefill kwargs, and asserts:
* final output fields (text, tokens, prompt_tokens,
completion_tokens, finish_reason, finished) match the last yielded
chunk
* specprefill / specprefill_keep_pct were forwarded through to
stream_generate
- New test_generate_empty_stream_returns_safe_default covers the
empty-stream edge case (returns GenerationOutput(text="",
finish_reason="stop") rather than raising)
- Existing mock_model fixture extended with stream_generate tracking
so test_lock_prevents_concurrent_generate still observes
serialization through the new accumulator path
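To make those assertions concrete, here is a self-contained stand-in that mirrors what the new tests check; the real tests run against SimpleEngine and the mock_model fixture, and the real GenerationOutput has more fields than this stub:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class GenerationOutput:  # stand-in for the real dataclass
    text: str = ""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    finish_reason: str | None = None
    finished: bool = False

class StubEngine:
    """Records stream_generate kwargs and replays canned chunks."""
    def __init__(self, chunks):
        self.chunks = chunks
        self.seen_kwargs = None

    async def stream_generate(self, prompt, **kwargs):
        self.seen_kwargs = kwargs
        for chunk in self.chunks:
            yield chunk

    async def generate(self, prompt, **kwargs):
        # Same accumulator shape as the fix under test.
        last = None
        async for chunk in self.stream_generate(prompt, **kwargs):
            last = chunk
        return last or GenerationOutput(text="", finish_reason="stop")

async def main():
    final = GenerationOutput("full text", 7, 3, "stop", True)
    engine = StubEngine([GenerationOutput("full", finished=False), final])
    out = await engine.generate("hi", specprefill=True, specprefill_keep_pct=0.3)
    assert out is final  # last yielded chunk wins
    assert engine.seen_kwargs == {"specprefill": True, "specprefill_keep_pct": 0.3}
    # Empty stream returns the safe default rather than raising.
    empty = await StubEngine([]).generate("hi")
    assert (empty.text, empty.finish_reason) == ("", "stop")

asyncio.run(main())
```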
Verified live against Qwen3.5-4B SimpleEngine + SpecPrefill on M2
Ultra with a ~6K token prompt and extra_body.specprefill=true forcing
SpecPrefill below the 8192 threshold:
SpecPrefill: scored 6007 tokens in 5.3s, sparse prefill 1815/6007 (keep=30%) in 1.1s
prompt_tokens reporting is now 6007 (was always 0 before).
Related: companion PR waybarrios#265 (CompletionRequest schema + server-side
extra_body -> gen_kwargs threading) which opens the wire from
/v1/completions to engine.generate(). This PR closes the wire on the
engine side.
…Prefill

This commit is a local superset of upstream PR waybarrios#265
(feat(api): per-request SpecPrefill overrides on /v1/completions). It
landed locally first to unblock /v1/completions on this runtime, which
was returning 500 AttributeError on every request because
server.py::create_completion accesses request.top_k / min_p /
presence_penalty / repetition_penalty while the local CompletionRequest
schema was missing those fields after a merge/rebase lost the bdf7dcc
additions.

Scope vs upstream PR waybarrios#265:

Upstream PR waybarrios#265 (minimal, portable): adds `specprefill` and
`specprefill_keep_pct` to CompletionRequest, extracts them in
create_completion / stream_completion, threads as gen_kwargs.

Local additions on top of PR waybarrios#265:
- Restores `top_k`, `min_p`, `presence_penalty`, `repetition_penalty`
  to CompletionRequest (lost in rebase; were present in bdf7dcc).
  These match ChatCompletionRequest's local shape and unblock the
  existing create_completion access pattern.
- Adds `stream_options: StreamOptions | None = None` for future
  include-usage support.
- Adds `extra_body: dict[str, Any] | None = None` for OpenAI client
  compatibility.
- Updates the `_resolve_request_field` type hint to accept both
  ChatCompletionRequest and CompletionRequest; uses `getattr` to safely
  handle the extra_body fallback on either type.
- create_completion uses _resolve_request_field instead of direct
  attribute access, so per-request SpecPrefill can come from either
  top-level fields or extra_body.

When PR waybarrios#265 merges upstream, the specprefill /
specprefill_keep_pct parts of this commit become a rebase no-op. The
local-only additions (top_k/min_p/etc. restoration, extra_body helper,
_resolve_request_field extension) stay as runtime-local fixes until
they ship as separate upstream PRs.

Verification:
- curl /v1/completions with plain prompt: 200 with content
- curl /v1/completions with extra_body.specprefill=true: 200 with content
- curl /v1/completions with top_k/min_p/presence_penalty: 200 with
  content (was 500 AttributeError on all three before this fix)

Upstream: waybarrios#265
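A hedged sketch of the schema and helper shapes this commit describes; field names follow the commit text, while defaults, ordering, and the StreamOptions type are assumptions about the real vllm_mlx/api/models.py and server.py:

```python
from typing import Any
from pydantic import BaseModel

class CompletionRequest(BaseModel):
    prompt: str | list[str]
    max_tokens: int | None = None
    temperature: float | None = None
    top_p: float | None = None
    # Restored from bdf7dcc (lost in the rebase):
    top_k: int | None = None
    min_p: float | None = None
    presence_penalty: float | None = None
    repetition_penalty: float | None = None
    # Upstream PR waybarrios#265 scope:
    specprefill: bool | None = None
    specprefill_keep_pct: float | None = None
    # Local additions (StreamOptions shown as a plain dict here so the
    # sketch stays self-contained):
    stream_options: dict[str, Any] | None = None
    extra_body: dict[str, Any] | None = None

def _resolve_request_field(request: BaseModel, name: str) -> Any:
    """Prefer the top-level field, then fall back to extra_body.
    getattr keeps this safe on both ChatCompletionRequest and
    CompletionRequest, per the commit's description."""
    value = getattr(request, name, None)
    if value is not None:
        return value
    extra = getattr(request, "extra_body", None) or {}
    return extra.get(name)
```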
…l generate+stream_generate

Pre-existing regression from an earlier rebase that dropped bdf7dcc's
llm.py additions. The server.py request handlers still pass top_k,
min_p, presence_penalty, repetition_penalty through to SimpleEngine,
which forwards them via **kwargs to MLXLanguageModel.chat() (which
accepts **kwargs), which then calls self.generate(..., **kwargs). But
MLXLanguageModel.generate() and stream_generate() had been left with
only (temperature, top_p, repetition_penalty) in their signatures, so
any non-MLLM SimpleEngine request crashed with:

TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Observed as 0/6 on simple-base, simple-mtp, and simple-spec profiles
in the feature matrix regression sweep after the Session 87
cherry-picks of PRs waybarrios#248, waybarrios#229, waybarrios#218, and
waybarrios#222 landed. The cherry-picks did not cause this regression;
they exposed it by finally running the LLM-path tests that no one had
exercised since the rebase. Confirmed via stderr.log:

TypeError: MLXLanguageModel.generate() got an unexpected keyword argument 'top_k'
TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Fix: restore the signatures and bodies of _create_sampler,
_create_logits_processors, generate, and stream_generate to match
bdf7dcc's original intent. Preserves PR waybarrios#248's prompt_cache
parameter and non-str prompt support on stream_generate. Adds **kwargs
to both generate and stream_generate so future param additions degrade
gracefully instead of crashing.

This is a runtime-local fix. The equivalent upstream fix lives in
bdf7dcc, which was never upstreamed (confirmed via git merge-base
--is-ancestor bdf7dcc upstream/main). A follow-up PR to upstream could
carry this forward.

Verification:
- bin/verify-patches: 33/33 clean
- Full feature matrix regression sweep pending re-run after this commit.

Related: runtime PR waybarrios#265 fixed the CompletionRequest schema
side of the same bdf7dcc drop; this commit fixes the engine-model side.
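A hedged, signature-level sketch of the restored shape; parameter names come from the commit text, while defaults and bodies are elided and the real llm.py differs in detail:

```python
class MLXLanguageModel:
    def generate(self, prompt, max_tokens=None, temperature=None,
                 top_p=None, top_k=None, min_p=None,
                 presence_penalty=None, repetition_penalty=None,
                 **kwargs):
        # **kwargs absorbs future params so unknown keys degrade
        # gracefully instead of raising "unexpected keyword argument".
        ...

    def stream_generate(self, prompt, prompt_cache=None, **kwargs):
        # prompt may be a str or pre-tokenized input, and prompt_cache
        # is preserved per PR waybarrios#248; sampling params route
        # through _create_sampler / _create_logits_processors as in
        # bdf7dcc.
        ...
```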
waybarrios
left a comment
Minor style note: in create_chat_completion the specprefill kwargs are added directly to the existing chat_kwargs dict, but here a separate gen_kwargs dict is created and spread with **gen_kwargs. Works the same, just a slightly different pattern — not a blocker, just a stylistic inconsistency.
waybarrios
left a comment
To clarify my earlier comment — here's what I mean by the style difference.
In create_chat_completion, all kwargs go into a single dict:
chat_kwargs = {
"max_tokens": request.max_tokens or _default_max_tokens,
"temperature": _resolve_temperature(request.temperature),
"top_p": _resolve_top_p(request.top_p),
}
if request.specprefill is not None:
chat_kwargs["specprefill"] = request.specprefill
if request.specprefill_keep_pct is not None:
chat_kwargs["specprefill_keep_pct"] = request.specprefill_keep_pct
await engine.chat(..., **chat_kwargs)

In this PR, the base args are passed inline and the specprefill ones go into a separate dict that gets spread:
gen_kwargs = {}
if request.specprefill is not None:
gen_kwargs["specprefill"] = request.specprefill
if request.specprefill_keep_pct is not None:
gen_kwargs["specprefill_keep_pct"] = request.specprefill_keep_pct
await engine.generate(
prompt=prompt,
max_tokens=...,
temperature=...,
top_p=...,
stop=...,
**gen_kwargs,
)

For consistency, you could unify into a single dict like chat does:
gen_kwargs = {
"max_tokens": request.max_tokens or _default_max_tokens,
"temperature": _resolve_temperature(request.temperature),
"top_p": _resolve_top_p(request.top_p),
"stop": request.stop,
}
if request.specprefill is not None:
gen_kwargs["specprefill"] = request.specprefill
if request.specprefill_keep_pct is not None:
gen_kwargs["specprefill_keep_pct"] = request.specprefill_keep_pct
await engine.generate(prompt=prompt, **gen_kwargs)

Just a nit — the PR works fine as-is.
I aligned the /v1/completions SpecPrefill override handling style in 77066d9 so it now builds and mutates a single generation kwargs dict, matching the chat-completions path. Behavior is unchanged; this just removes the stylistic inconsistency you called out.
ChatCompletionRequest already accepts per-request `specprefill` and
`specprefill_keep_pct` overrides, and /v1/chat/completions threads them
into engine.chat(). CompletionRequest does not, so /v1/completions
clients cannot opt a single request into (or out of) SpecPrefill, nor
tune the keep percentage per request.

Changes:
- vllm_mlx/api/models.py: add specprefill and specprefill_keep_pct to
  CompletionRequest, matching the existing ChatCompletionRequest fields.
- vllm_mlx/server.py::create_completion: extract both and thread into
  engine.generate(**gen_kwargs), mirroring the pattern used at
  server.py:1421 in create_chat_completion.
- vllm_mlx/server.py::stream_completion: apply the same extraction so
  streaming /v1/completions clients get the same control.

Both new fields default to None, so existing behavior is unchanged for
clients that do not set them. No schema changes to
ChatCompletionRequest. No engine-side changes needed:
SimpleEngine.stream_generate already consumes these kwargs (see
simple.py:307-308).
Force-pushed from 77066d9 to b310a08
Restacked on current main. I kept the single-dict kwargs style you called out but preserved the current completions-path behavior. Local verification on the rebased branch:
…er stream_generate (#266)
janhilgard
left a comment
LGTM — clean API-side complement to #266 (already merged).
- `specprefill` and `specprefill_keep_pct` fields on `CompletionRequest` mirror the existing `ChatCompletionRequest` schema symmetrically
- Both `create_completion` and `stream_completion` forward overrides via a single `generate_kwargs` dict, matching the chat endpoint style
- Default `None` preserves existing behavior for clients that don't set these fields
- Rebase on current `main` is clean
Summary
/v1/chat/completions already accepts per-request `specprefill` and
`specprefill_keep_pct` overrides and threads them into engine.chat().
/v1/completions does not. This PR wires the same two fields through to
/v1/completions on both the non-streaming and streaming paths.

Why

A SpecPrefill user with mixed /v1/chat/completions and /v1/completions
traffic currently cannot opt individual /v1/completions requests into
or out of SpecPrefill, and cannot tune `specprefill_keep_pct` per
request on that endpoint. The server config is the only control. This
closes that gap symmetrically against the existing chat endpoint.

Changes

- vllm_mlx/api/models.py: add `specprefill: bool | None` and
  `specprefill_keep_pct: float | None` to CompletionRequest, matching
  the fields already on ChatCompletionRequest.
- vllm_mlx/server.py::create_completion: extract both and pass as
  `gen_kwargs` into engine.generate(), mirroring the pattern at
  server.py:1421 in create_chat_completion.
- vllm_mlx/server.py::stream_completion: apply the same extraction so
  streaming clients have the same control.

Both new fields default to None, so existing behavior is unchanged for
clients that don't set them. No schema changes to
ChatCompletionRequest. No engine-side changes needed:
SimpleEngine.stream_generate already pops `specprefill` and
`specprefill_keep_pct` from kwargs (simple.py:307-308).

Total delta: +20 lines across 2 files.
Test plan
- POST /v1/completions with `prompt` + `max_tokens` but no
  `specprefill` field still returns 200 with content (default behavior
  unchanged).
- POST /v1/completions with `{"specprefill": true}` engages SpecPrefill
  on an eligible SimpleEngine model (non-MLLM with
  `--specprefill-draft-model`).
- POST /v1/completions with `{"specprefill": false}` bypasses
  SpecPrefill on the same eligible model.
- POST /v1/completions with `{"specprefill_keep_pct": 0.2}` uses 20%
  keep instead of the server default.
- POST /v1/completions with `stream=true` respects all three
  per-request controls above.

I've verified the default-behavior case locally on a runtime whose
CompletionRequest schema is a strict superset of this PR's changes. The
minimal version in this branch is structurally a subset and should
behave identically for the default path.
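For reference, a hedged client-side example of exercising these controls through the OpenAI Python client; the base URL, API key, and model id are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# extra_body entries are merged into the request JSON, so the server
# picks the overrides up either as top-level fields or via its
# extra_body fallback.
resp = client.completions.create(
    model="my-model",  # placeholder model id
    prompt="Summarize: ...",
    max_tokens=64,
    extra_body={"specprefill": True, "specprefill_keep_pct": 0.2},
)
print(resp.choices[0].text)
```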