
feat(api): per-request SpecPrefill overrides on /v1/completions #265

Merged
janhilgard merged 2 commits into waybarrios:main from
Thump604:codex/specprefill-completions-endpoint
Apr 12, 2026

Conversation

@Thump604
Collaborator

@Thump604 Thump604 commented Apr 8, 2026

Summary

/v1/chat/completions already accepts per-request specprefill and specprefill_keep_pct overrides and threads them into engine.chat(). /v1/completions does not. This PR wires the same two fields through to /v1/completions on both the non-streaming and streaming paths.

Why

A SpecPrefill user with mixed /v1/chat/completions and /v1/completions traffic currently cannot opt individual /v1/completions requests into or out of SpecPrefill, and cannot tune specprefill_keep_pct per request on that endpoint. The server config is the only control. This PR closes that gap and makes /v1/completions symmetric with the existing chat endpoint.

Changes

  • vllm_mlx/api/models.py: add specprefill: bool | None and specprefill_keep_pct: float | None to CompletionRequest, matching the fields already on ChatCompletionRequest.
  • vllm_mlx/server.py::create_completion: extract both and pass as gen_kwargs into engine.generate(), mirroring the pattern at server.py:1421 in create_chat_completion.
  • vllm_mlx/server.py::stream_completion: apply the same extraction so streaming clients have the same control.

Both new fields default to None, so existing behavior is unchanged for clients that don't set them. No schema changes to ChatCompletionRequest. No engine-side changes needed: SimpleEngine.stream_generate already pops specprefill and specprefill_keep_pct from kwargs (simple.py:307-308).
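
For reference, a minimal sketch of the schema change, assuming CompletionRequest is a Pydantic model as in vllm_mlx/api/models.py. The surrounding fields are illustrative; only specprefill and specprefill_keep_pct are from this PR:

from pydantic import BaseModel

class CompletionRequest(BaseModel):
    model: str
    prompt: str | list[str]
    max_tokens: int | None = None
    # Per-request SpecPrefill overrides. None defers to the server-level
    # SpecPrefill configuration, so existing clients are unaffected.
    specprefill: bool | None = None
    specprefill_keep_pct: float | None = None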

Total delta: +20 lines across 2 files.

Test plan

  • POST /v1/completions with prompt + max_tokens but no specprefill field still returns 200 with content (default behavior unchanged).
  • POST /v1/completions with {"specprefill": true} engages SpecPrefill on an eligible SimpleEngine model (non-MLLM with --specprefill-draft-model).
  • POST /v1/completions with {"specprefill": false} bypasses SpecPrefill on the same eligible model.
  • POST /v1/completions with {"specprefill_keep_pct": 0.2} uses 20% keep instead of the server default.
  • POST /v1/completions with stream=true respects all three per-request controls above.
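
As a concrete illustration of the overrides exercised above, a sketch using the requests library against a locally running server (the host, port, and model name are assumptions):

import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "my-model",          # assumption: whichever model the server loaded
        "prompt": "Summarize the following document:",
        "max_tokens": 64,
        "specprefill": True,          # per-request opt-in (new in this PR)
        "specprefill_keep_pct": 0.2,  # keep 20% of prompt tokens during prefill
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])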

I've verified the default-behavior case locally on a runtime whose CompletionRequest schema is a strict superset of this PR's. Since the minimal version in this branch is structurally a subset of that schema, it should behave identically on the default path.

Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Apr 8, 2026
…er stream_generate

stream_generate() is the only code path that consumes per-request
SpecPrefill overrides (`specprefill`, `specprefill_keep_pct`) and
routes through _stream_generate_specprefill() when engaged. The prior
direct self._model.generate() path silently dropped those overrides:
server.py's create_completion() extracts them from extra_body and
forwards to engine.generate(), engine.generate() forwards via **kwargs
to _model.generate(), but _model.generate() (mlx_lm.generate) does not
consume them. Non-streaming /v1/completions clients that sent
`{"extra_body": {"specprefill": true}}` had their overrides silently
no-op'd.

Fix: make SimpleEngine.generate() a thin accumulator that iterates
self.stream_generate() and returns the last GenerationOutput. Matches
the pattern PR waybarrios#222 established for tool-enabled chat(). Non-streaming
clients now get:

- SpecPrefill engagement when `specprefill=true` is set (top-level or
  extra_body fallback via whatever helper server.py uses)
- Accurate `prompt_tokens` reporting (the old path returned 0 because
  mlx_lm.generate never populates it)
- Chat-template and reasoning-parser behavior consistent with the
  streaming path
- Same thread-safety (stream_generate holds self._generation_lock
  around the MLX call)
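
A rough sketch of the accumulator described above (GenerationOutput is stubbed with assumed fields; the real signature in simple.py may differ):

from dataclasses import dataclass

@dataclass
class GenerationOutput:
    # Minimal stand-in for the real GenerationOutput; fields assumed.
    text: str = ""
    finish_reason: str | None = None

class SimpleEngine:
    async def generate(self, prompt, **kwargs):
        # Thin accumulator: route through stream_generate() so per-request
        # SpecPrefill overrides and the generation lock apply to
        # non-streaming calls as well.
        last = None
        async for chunk in self.stream_generate(prompt, **kwargs):
            last = chunk
        if last is None:
            # Empty-stream edge case: return a safe default, don't raise.
            return GenerationOutput(text="", finish_reason="stop")
        return last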

Scope: only generate() changes. chat() stays on its current path;
extending chat() to the full accumulator pattern is a separate
follow-up on top of PR waybarrios#222.

Tests:
- New test_generate_accumulates_over_stream_generate stubs
  stream_generate with an async generator, calls generate() with
  per-request specprefill kwargs, and asserts:
  * final output fields (text, tokens, prompt_tokens,
    completion_tokens, finish_reason, finished) match the last yielded
    chunk
  * specprefill / specprefill_keep_pct were forwarded through to
    stream_generate
- New test_generate_empty_stream_returns_safe_default covers the
  empty-stream edge case (returns GenerationOutput(text="",
  finish_reason="stop") rather than raising)
- Existing mock_model fixture extended with stream_generate tracking
  so test_lock_prevents_concurrent_generate still observes
  serialization through the new accumulator path

Verified live against Qwen3.5-4B SimpleEngine + SpecPrefill on M2
Ultra with a ~6K token prompt and extra_body.specprefill=true forcing
SpecPrefill below the 8192 threshold:

  SpecPrefill: scored 6007 tokens in 5.3s, sparse prefill 1815/6007 (keep=30%) in 1.1s

prompt_tokens reporting is now 6007 (was always 0 before).

Related: companion PR waybarrios#265 (CompletionRequest schema + server-side
extract_body -> gen_kwargs threading) which opens the wire from
/v1/completions to engine.generate(). This PR closes the wire on the
engine side.
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Apr 9, 2026
…Prefill

This commit is a local superset of upstream PR waybarrios#265
(feat(api): per-request SpecPrefill overrides
on /v1/completions). It landed locally first to unblock /v1/completions
on this runtime, which was returning 500 AttributeError on every
request because server.py::create_completion accesses request.top_k /
min_p / presence_penalty / repetition_penalty while the local
CompletionRequest schema was missing those fields after a merge/rebase
lost the bdf7dcc additions.

Scope vs upstream PR waybarrios#265:

  Upstream PR waybarrios#265 (minimal, portable): adds `specprefill` and
  `specprefill_keep_pct` to CompletionRequest, extracts them in
  create_completion / stream_completion, threads as gen_kwargs.

  Local additions on top of PR waybarrios#265:
  - Restores `top_k`, `min_p`, `presence_penalty`, `repetition_penalty`
    to CompletionRequest (lost in rebase; were present in bdf7dcc).
    These match ChatCompletionRequest's local shape and unblock the
    existing create_completion access pattern.
  - Adds `stream_options: StreamOptions | None = None` for future
    include-usage support.
  - Adds `extra_body: dict[str, Any] | None = None` for OpenAI client
    compatibility.
  - Updates `_resolve_request_field` type hint to accept both
    ChatCompletionRequest and CompletionRequest; uses `getattr` to
    safely handle the extra_body fallback on either type.
  - create_completion uses _resolve_request_field instead of direct
    attribute access, so per-request SpecPrefill can come from either
    top-level fields or extra_body.
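
The extra_body fallback in the last two bullets might look roughly like this (a hypothetical sketch; _resolve_request_field is named in this commit but its body is not shown, so the internals are assumptions):

def _resolve_request_field(request, name, default=None):
    # Prefer the top-level field when present; otherwise fall back to
    # extra_body for OpenAI-client compatibility. getattr keeps this safe
    # for both ChatCompletionRequest and CompletionRequest, which may not
    # define every field.
    value = getattr(request, name, None)
    if value is not None:
        return value
    extra = getattr(request, "extra_body", None) or {}
    return extra.get(name, default)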

When PR waybarrios#265 merges upstream, the specprefill/specprefill_keep_pct
parts of this commit become a rebase no-op. The local-only additions
(top_k/min_p/etc. restoration, extra_body helper, _resolve_request_field
extension) stay as runtime-local fixes until they ship as separate
upstream PRs.

Verification:
  curl /v1/completions with plain prompt: 200 with content
  curl /v1/completions with extra_body.specprefill=true: 200 with content
  curl /v1/completions with top_k/min_p/presence_penalty: 200 with content
  (was 500 AttributeError on all three before this fix)

Upstream: waybarrios#265
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Apr 9, 2026
…l generate+stream_generate

Pre-existing regression from an earlier rebase that dropped bdf7dcc's
llm.py additions. The server.py request handlers still pass top_k,
min_p, presence_penalty, repetition_penalty through to SimpleEngine,
which forwards them via **kwargs to MLXLanguageModel.chat() (which
accepts **kwargs) which then calls self.generate(..., **kwargs). But
MLXLanguageModel.generate() and stream_generate() had been left with
only (temperature, top_p, repetition_penalty) in their signatures, so
any non-MLLM SimpleEngine request crashed with:

    TypeError: MLXLanguageModel.stream_generate() got an unexpected
    keyword argument 'top_k'

Observed as 0/6 on simple-base, simple-mtp, and simple-spec profiles in
the feature matrix regression sweep after the Session 87 cherry-picks
of PRs waybarrios#248, waybarrios#229, waybarrios#218, waybarrios#222 landed. The cherry-picks did not cause
this regression — they exposed it by finally running the LLM-path
tests that no one had exercised since the rebase happened. Confirmed
via stderr.log:

  TypeError: MLXLanguageModel.generate() got an unexpected keyword argument 'top_k'
  TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Fix: restore the signatures and bodies of _create_sampler,
_create_logits_processors, generate, and stream_generate to match
bdf7dcc's original intent. Preserves PR waybarrios#248's prompt_cache parameter
and non-str prompt support on stream_generate. Adds **kwargs to both
generate and stream_generate so future param additions degrade
gracefully instead of crashing.
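
A sketch of the restored, forward-compatible signatures (the parameter set is reconstructed from this commit's text; defaults and ordering are assumptions):

class MLXLanguageModel:
    def generate(self, prompt, temperature=0.7, top_p=1.0, top_k=0,
                 min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0,
                 prompt_cache=None, **kwargs):
        # Body elided. **kwargs absorbs unknown per-request params so
        # future additions degrade gracefully instead of raising TypeError.
        ...

    def stream_generate(self, prompt, temperature=0.7, top_p=1.0, top_k=0,
                        min_p=0.0, presence_penalty=0.0,
                        repetition_penalty=1.0, prompt_cache=None, **kwargs):
        # prompt may be a str or pre-tokenized input, preserving the
        # non-str prompt support from PR waybarrios#248 noted above.
        ...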

This is a runtime-local fix. The equivalent upstream fix lives in
bdf7dcc which was never upstreamed (confirmed via
git merge-base --is-ancestor bdf7dcc upstream/main). A follow-up PR
to upstream could carry this forward.

Verification:
  bin/verify-patches: 33/33 clean
  Full feature matrix regression sweep pending re-run after this commit.

Related: runtime PR waybarrios#265 fixed the
CompletionRequest schema side of the same bdf7dcc drop; this commit
fixes the engine-model side.
@Thump604
Collaborator Author

Quick merge signal: this is the API half of the /v1/completions SpecPrefill override pair with #266. The branch is mergeable, CI is green, and the local runtime validation on our side was clean once both halves were in place.

If you're doing a small sweep, #265 + #266 are the pair to take together.

Owner

@waybarrios waybarrios left a comment


Minor style note: in create_chat_completion the specprefill kwargs are added directly to the existing chat_kwargs dict, but here a separate gen_kwargs dict is created and spread with **gen_kwargs. Works the same, just a slightly different pattern — not a blocker, just a stylistic inconsistency.

Owner

@waybarrios waybarrios left a comment


To clarify my earlier comment — here's what I mean by the style difference.

In create_chat_completion, all kwargs go into a single dict:

chat_kwargs = {
    "max_tokens": request.max_tokens or _default_max_tokens,
    "temperature": _resolve_temperature(request.temperature),
    "top_p": _resolve_top_p(request.top_p),
}
if request.specprefill is not None:
    chat_kwargs["specprefill"] = request.specprefill
if request.specprefill_keep_pct is not None:
    chat_kwargs["specprefill_keep_pct"] = request.specprefill_keep_pct

await engine.chat(..., **chat_kwargs)

In this PR, the base args are passed inline and the specprefill ones go into a separate dict that gets spread:

gen_kwargs = {}
if request.specprefill is not None:
    gen_kwargs["specprefill"] = request.specprefill
if request.specprefill_keep_pct is not None:
    gen_kwargs["specprefill_keep_pct"] = request.specprefill_keep_pct

await engine.generate(
    prompt=prompt,
    max_tokens=...,
    temperature=...,
    top_p=...,
    stop=...,
    **gen_kwargs,
)

For consistency, you could unify into a single dict like chat does:

gen_kwargs = {
    "max_tokens": request.max_tokens or _default_max_tokens,
    "temperature": _resolve_temperature(request.temperature),
    "top_p": _resolve_top_p(request.top_p),
    "stop": request.stop,
}
if request.specprefill is not None:
    gen_kwargs["specprefill"] = request.specprefill
if request.specprefill_keep_pct is not None:
    gen_kwargs["specprefill_keep_pct"] = request.specprefill_keep_pct

await engine.generate(prompt=prompt, **gen_kwargs)

Just a nit — the PR works fine as-is.

@Thump604
Collaborator Author

I aligned the /v1/completions SpecPrefill override handling style in 77066d9 so it now builds and mutates a single generation kwargs dict, matching the chat-completions path. Behavior is unchanged; this just removes the stylistic inconsistency you called out.

ChatCompletionRequest already accepts per-request `specprefill` and
`specprefill_keep_pct` overrides, and /v1/chat/completions threads
them into engine.chat(). CompletionRequest does not, so /v1/completions
clients cannot opt a single request into (or out of) SpecPrefill, nor
tune the keep percentage per request.

Changes:

- vllm_mlx/api/models.py: add specprefill and specprefill_keep_pct to
  CompletionRequest, matching the existing ChatCompletionRequest fields.
- vllm_mlx/server.py::create_completion: extract both and thread into
  engine.generate(**gen_kwargs), mirroring the pattern used at
  server.py:1421 in create_chat_completion.
- vllm_mlx/server.py::stream_completion: apply the same extraction so
  streaming /v1/completions clients get the same control.

Both new fields default to None, so existing behavior is unchanged for
clients that do not set them. No schema changes to ChatCompletionRequest.
No engine-side changes needed: SimpleEngine.stream_generate already
consumes these kwargs (see simple.py:307-308).
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Apr 11, 2026
…er stream_generate

@Thump604 Thump604 force-pushed the codex/specprefill-completions-endpoint branch from 77066d9 to b310a08 on April 11, 2026 16:16
@Thump604
Collaborator Author

Restacked on current main.

I kept the single-dict kwargs style you called out, but preserved the current completions-path repetition_penalty handling while adding the per-request specprefill / specprefill_keep_pct overrides.

Local verification on the rebased branch:

  • python -m py_compile vllm_mlx/server.py vllm_mlx/api/models.py
  • pytest -q tests/test_api_models.py tests/test_server.py

janhilgard pushed a commit that referenced this pull request Apr 12, 2026
…er stream_generate (#266)

Collaborator

@janhilgard janhilgard left a comment


LGTM — clean API-side complement to #266 (already merged).

  • specprefill and specprefill_keep_pct fields on CompletionRequest mirror the existing ChatCompletionRequest schema symmetrically
  • Both create_completion and stream_completion forward overrides via a single generate_kwargs dict, matching the chat endpoint style
  • Default None preserves existing behavior for clients that don't set these fields
  • Rebase on current main is clean

@janhilgard janhilgard merged commit 0cafaab into waybarrios:main Apr 12, 2026
7 checks passed