chat: forward chat_template_kwargs on simple-engine paths#218

Merged
janhilgard merged 1 commit into waybarrios:main from computor-org:fix/chat-template-kwargs-forwarding
Apr 17, 2026
Merged

chat: forward chat_template_kwargs on simple-engine paths#218
janhilgard merged 1 commit intowaybarrios:mainfrom
computor-org:fix/chat-template-kwargs-forwarding

Conversation

@krystophny
Contributor

@krystophny krystophny commented Mar 24, 2026

Summary

Honor chat_template_kwargs on the simple-engine paths that still ignored it and run the regression coverage in Apple Silicon CI.

Why

Before this branch, chat_template_kwargs was only reliably honored on the batched path and the plain LLM chat path. Simple-engine multimodal chat, multimodal stream chat, and the text-only MTP route still dropped the field.

What changed

  • forward chat_template_kwargs through simple-engine multimodal chat()
  • forward chat_template_kwargs through simple-engine multimodal stream_chat()
  • forward chat_template_kwargs into _stream_generate_text() for the text-only MTP route
  • include tests/test_chat_template_kwargs.py in Apple Silicon CI
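
The four bullets reduce to one pattern: pop `chat_template_kwargs` out of `**kwargs` and hand it to the template-apply call instead of letting it fall through. A minimal standalone sketch (hypothetical helper names, not the actual vllm_mlx signatures):

```python
from typing import Any


def apply_chat_template(messages, **template_kwargs):
    # Stand-in for tokenizer.apply_chat_template.
    return f"{messages}|{sorted(template_kwargs.items())}"


def generate(prompt, **kwargs):
    # Stand-in for the downstream generation call.
    return prompt


def chat(messages: list[dict[str, Any]], **kwargs: Any) -> str:
    # Pop the field so it is consumed here rather than silently dropped
    # when **kwargs is forwarded to the generation call.
    chat_template_kwargs = dict(kwargs.pop("chat_template_kwargs", {}) or {})
    prompt = apply_chat_template(messages, **chat_template_kwargs)
    return generate(prompt, **kwargs)
```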

Status

  • refreshed onto current upstream main (b4fa030) on 2026-04-09
  • no logic changes beyond the base refresh

Files to review

  • vllm_mlx/engine/simple.py
  • .github/workflows/ci.yml
  • tests/test_chat_template_kwargs.py

Validation

  • python -m pytest tests/test_chat_template_kwargs.py -q -> 6 passed
  • note: the older tests/test_simple_engine.py validation command now depends on the separate async-harness refresh in #226 on current upstream, so validation here stays scoped to this PR's dedicated regression file

@krystophny krystophny changed the title Forward chat template kwargs in batched chat chat: forward chat_template_kwargs in batched path Mar 24, 2026
@krystophny krystophny changed the title chat: forward chat_template_kwargs in batched path chat: forward chat_template_kwargs on simple-engine paths Mar 24, 2026
Collaborator

@Thump604 Thump604 left a comment


Implementation is solid and addresses real coverage gaps. The forwarding is consistent across all simple-engine paths:

What works:

  • API model field properly declared with optional dict[str, Any]
  • SimpleEngine MLLM multimodal chat/stream_chat forward kwargs to model
  • SimpleEngine text-only MTP route in _stream_generate_text applies kwargs
  • LLMLanguageModel.chat applies kwargs with graceful TypeError fallback
  • BatchedEngine properly merges kwargs and propagates to prefix boundary computation
  • TypeError handling updated to remove arbitrary kwargs, not just tools

Pattern is defensive: chat_template_kwargs = dict(kwargs.pop("chat_template_kwargs", {}) or {}) safely handles None and creates fresh dict. Line 372 guard in BatchedEngine prevents tools being inserted twice when merging kwargs.
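
As a tiny standalone illustration of why that extraction is defensive (hypothetical helper name):

```python
def extract_chat_template_kwargs(kwargs: dict) -> dict:
    # `or {}` guards against a caller passing chat_template_kwargs=None,
    # and dict(...) makes a fresh copy so later mutation (e.g. popping
    # keys in a TypeError retry) cannot affect the caller's dict.
    return dict(kwargs.pop("chat_template_kwargs", {}) or {})


kwargs = {"chat_template_kwargs": None, "max_tokens": 32}
extracted = extract_chat_template_kwargs(kwargs)
# extracted is {} and the key is consumed out of kwargs
```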

Test coverage is comprehensive — all paths have mocks covering the assertion. Adding to Apple Silicon CI ensures regression detection.

One implementation detail: line 607 (SimpleEngine text path) and line 380 (BatchedEngine) both retry on TypeError by removing all user-provided template kwargs. This is correct but slightly more aggressive than the original "tools only" approach. The exception is rare enough that this won't be a problem, and if a template silently ignores an unknown kwarg instead of raising TypeError, the kwargs pass through untouched on the first try. This is an acceptable trade-off for simplicity.

Ready to merge from the implementation side.

@Thump604
Collaborator

Thump604 commented Apr 7, 2026

@waybarrios, @krystophny: independent technical review of this PR.

Verification of the fix

Confirmed against current upstream main (b4fa030). The diff plumbs chat_template_kwargs through every place it was previously dropped:

  1. vllm_mlx/api/models.py:172 adds the field to ChatCompletionRequest
  2. vllm_mlx/server.py:1422 forwards it from the request into chat_kwargs
  3. vllm_mlx/engine/simple.py forwards it through SimpleEngine chat() (LLM and MLLM branches), stream_chat() (MLLM and run_stream branches), and _stream_generate_text() (MTP path)
  4. vllm_mlx/engine/batched.py forwards it through BatchedEngine chat(), stream_chat(), and _compute_prefix_boundary() so per-template-kwargs prefix caching works correctly
  5. vllm_mlx/models/llm.py adds the parameter to MLXLanguageModel.chat() so the LLM path honors it

All template-apply call sites also gain a graceful fallback: if a tokenizer raises TypeError because it does not support a given kwarg, the failed kwargs are popped and the call retries.
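
That fallback can be sketched as follows (hypothetical shape; the real call site is `tokenizer.apply_chat_template` inside the engines):

```python
def render_prompt(apply_template, messages, tools=None, chat_template_kwargs=None):
    template_kwargs = dict(chat_template_kwargs or {})
    # Explicit template kwargs take precedence over the implicit tools arg.
    if tools and "tools" not in template_kwargs:
        template_kwargs["tools"] = tools
    try:
        return apply_template(messages, **template_kwargs)
    except TypeError:
        # The template rejected an extra kwarg: strip tools plus every
        # user-provided key, then retry with the bare call.
        for key in ["tools", *(chat_template_kwargs or {}).keys()]:
            template_kwargs.pop(key, None)
        return apply_template(messages, **template_kwargs)
```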

Test coverage

tests/test_chat_template_kwargs.py adds 7 tests covering Pydantic field preservation, BatchedEngine _apply_chat_template, the HTTP endpoint via FakeEngine + TestClient, LLM chat applying kwargs before generate, SimpleEngine MLLM chat forwarding, and SimpleEngine _stream_generate_text applying kwargs. The CI workflow is updated to run the new test in the Apple Silicon job.

Why this matters

Per the PR description, before this branch chat_template_kwargs was honored on the batched path and the plain LLM chat path but silently dropped on simple-engine multimodal chat(), simple-engine multimodal stream_chat(), and the text-only MTP _stream_generate_text route. That means enable_thinking=false in chat_template_kwargs was being silently ignored on those three paths, which can cause Qwen 3.5 thinking-tag leakage in multimodal and MTP responses.
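
For concreteness, the previously ignored field looks like this on the wire (illustrative request body; the model id is a placeholder, not one from this repo):

```python
import json

request_body = {
    "model": "qwen3.5-mlx",  # placeholder model id
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    # Before this PR, the three simple-engine paths above dropped this
    # field, so thinking tags could still appear in the response.
    "chat_template_kwargs": {"enable_thinking": False},
}
payload = json.dumps(request_body)
```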

Recommendation

Merge candidate. Real fix to a real silently-ignored API field, comprehensive plumbing across all relevant call sites, good test coverage, and the CI workflow update means the regression cannot return without someone disabling the test job.

Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Apr 9, 2026
…l generate+stream_generate

Pre-existing regression from an earlier rebase that dropped bdf7dcc's
llm.py additions. The server.py request handlers still pass top_k,
min_p, presence_penalty, repetition_penalty through to SimpleEngine,
which forwards them via **kwargs to MLXLanguageModel.chat() (which
accepts **kwargs) which then calls self.generate(..., **kwargs). But
MLXLanguageModel.generate() and stream_generate() had been left with
only (temperature, top_p, repetition_penalty) in their signatures, so
any non-MLLM SimpleEngine request crashed with:

    TypeError: MLXLanguageModel.stream_generate() got an unexpected
    keyword argument 'top_k'

Observed as 0/6 on simple-base, simple-mtp, and simple-spec profiles in
the feature matrix regression sweep after the Session 87 cherry-picks
of PRs waybarrios#248, waybarrios#229, waybarrios#218, waybarrios#222 landed. The cherry-picks did not cause
this regression — they exposed it by finally running the LLM-path
tests that no one had exercised since the rebase happened. Confirmed
via stderr.log:

  TypeError: MLXLanguageModel.generate() got an unexpected keyword argument 'top_k'
  TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Fix: restore the signatures and bodies of _create_sampler,
_create_logits_processors, generate, and stream_generate to match
bdf7dcc's original intent. Preserves PR waybarrios#248's prompt_cache parameter
and non-str prompt support on stream_generate. Adds **kwargs to both
generate and stream_generate so future param additions degrade
gracefully instead of crashing.

This is a runtime-local fix. The equivalent upstream fix lives in
bdf7dcc which was never upstreamed (confirmed via
git merge-base --is-ancestor bdf7dcc upstream/main). A follow-up PR
to upstream could carry this forward.

Verification:
  bin/verify-patches: 33/33 clean
  Full feature matrix regression sweep pending re-run after this commit.

Related: runtime PR waybarrios#265 (waybarrios#265) fixed the
CompletionRequest schema side of the same bdf7dcc drop; this commit
fixes the engine-model side.
@krystophny krystophny force-pushed the fix/chat-template-kwargs-forwarding branch from 1e17fb1 to be2ba60 on April 9, 2026 06:35
@krystophny
Contributor Author

Force-pushed a refresh onto current upstream main (b4fa030). No logic change beyond the base refresh. Validation: python -m pytest tests/test_chat_template_kwargs.py -q -> 6 passed. The older tests/test_simple_engine.py validation command now depends on the separate async-harness refresh in #226 on current upstream, so I kept validation scoped to this PR's dedicated regression file.

@Thump604
Collaborator

Thump604 commented Apr 9, 2026

Refresh confirmed on head 3c33f72 against upstream main b4fa030. The only delta on top of the previously approved be2ba60 is the "style: format chat template kwargs tests" commit, which is a no-op on the forwarding logic. The SimpleEngine multimodal chat, stream_chat, _stream_generate_text, BatchedEngine chat, models/llm.py, and server.py wiring all match the previously reviewed shape.

CI green on lint, type-check, test-matrix 3.10-3.12, test-apple-silicon, tests. tests/test_chat_template_kwargs.py -> 6 passed on head. Prior APPROVED review at be2ba60 applies to the refreshed head.

@Thump604
Collaborator

Hey @krystophny - this covers all the simple-engine paths cleanly and the test coverage is good. Currently has a merge conflict with main - can you rebase? Ready to merge after that.

@waybarrios
Owner

Merge Conflict Resolution

The branch conflicted with main in 7 conflict regions across two files. I've resolved all conflicts and pushed the result to waybarrios/vllm-mlx:fix/chat-template-kwargs-forwarding.

The core issue: main added enable_thinking support and refactored simple.py to use _run_blocking_serialized while this PR was adding chat_template_kwargs forwarding. Both changes touch the same call sites.


vllm_mlx/engine/batched.py (4 conflicts)

1. _apply_chat_template signature — both params are now accepted:

def _apply_chat_template(
    self,
    messages: list[dict[str, Any]],
    tools: list[dict] | None = None,
    num_images: int = 0,
    chat_template_kwargs: dict[str, Any] | None = None,
    enable_thinking: bool | None = None,
) -> str:

2. TypeError fallback — the retry loop now strips enable_thinking (from main) and any dynamic chat_template_kwargs keys (from this PR):

except TypeError as e:
    logger.debug(f"Chat template TypeError, retrying without extras: {e}")
    for key in ["tools", "enable_thinking", *(chat_template_kwargs or {}).keys()]:
        template_kwargs.pop(key, None)

3 & 4. chat() and stream_chat() call sites — both now forward both kwargs:

prompt = self._apply_chat_template(
    messages,
    template_tools,
    num_images=len(all_images),
    chat_template_kwargs=chat_template_kwargs,
    enable_thinking=enable_thinking,
)

vllm_mlx/engine/simple.py (3 conflicts)

1. chat() entry — kept the chat_template_kwargs extraction (this PR) and the tool-stall workaround from main that routes non-streaming tool chat through stream_chat. Also forwarded chat_template_kwargs into that workaround path:

chat_template_kwargs = dict(kwargs.pop("chat_template_kwargs", {}) or {})

# tool-stall workaround from main, now with chat_template_kwargs forwarded
if tools and not self._is_mllm:
    async for output in self.stream_chat(
        messages=messages,
        ...
        chat_template_kwargs=chat_template_kwargs,
        **kwargs,
    ):
        final_output = output
    ...

2. MLLM/LLM branching — adopted main's _run_blocking_serialized (replacing the old asyncio.to_thread + _generation_lock) and wove in chat_template_kwargs forwarding for both paths:

# MLLM path — injects into kwargs (mlx-vlm accepts it this way)
if self._is_mllm:
    if chat_template_kwargs:
        kwargs["chat_template_kwargs"] = chat_template_kwargs
    output = await self._run_blocking_serialized(self._model.chat, ...)

# LLM path — passes as named arg (mlx-lm accepts it directly)
else:
    output = await self._run_blocking_serialized(
        self._model.chat,
        ...,
        chat_template_kwargs=chat_template_kwargs,
        **kwargs,
    )

3. stream_chat MLLM path — adopted main's indentation (no lock context manager) and kept the PR's chat_template_kwargs forwarding via local_kwargs:

def run_stream():
    local_kwargs = dict(kwargs)
    if chat_template_kwargs:
        local_kwargs["chat_template_kwargs"] = chat_template_kwargs
    return list(
        self._model.stream_chat(
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature,
            tools=template_tools,
            **local_kwargs,
        )
    )

chunks = await self._run_blocking_serialized(run_stream)

Both files compile cleanly. To pull the resolved merge:

git remote add waybarrios https://github.com/waybarrios/vllm-mlx.git
git fetch waybarrios fix/chat-template-kwargs-forwarding
git checkout fix/chat-template-kwargs-forwarding
git reset --hard waybarrios/fix/chat-template-kwargs-forwarding
git push origin fix/chat-template-kwargs-forwarding --force

@krystophny
Contributor Author

Synced this branch to the maintainer-resolved conflict fix on current upstream/main and force-pushed. Re-ran python -m pytest tests/test_chat_template_kwargs.py -q locally (6 passed).

Collaborator

@Thump604 Thump604 left a comment


Rebased cleanly with the maintainer-resolved conflicts — the merge result looks correct. Both chat_template_kwargs and enable_thinking are forwarded through all four paths (batched chat/stream_chat, simple chat/stream_chat) and the TypeError fallback strips both gracefully. CI all green. Approving.

Collaborator

@janhilgard janhilgard left a comment


Code Review — chat_template_kwargs forwarding

This PR systematically plumbs chat_template_kwargs through every engine path that previously dropped it. The coverage is complete and the implementation is consistent.

What works well

  1. Consistent forwarding across all paths: SimpleEngine (LLM chat, MLLM chat, stream_chat, _stream_generate_text), BatchedEngine (chat, stream_chat, _compute_prefix_boundary), and MLXLanguageModel.chat all receive and forward the kwargs correctly.

  2. Graceful TypeError fallback: The existing fallback pattern for templates that don't accept extra kwargs is properly extended to strip chat_template_kwargs keys in addition to tools and enable_thinking. The unpacking via *(chat_template_kwargs or {}).keys() is clean.

  3. Test coverage: 6 dedicated tests covering the request model, BatchedEngine template application, the full endpoint round-trip, LLM.chat, SimpleEngine MLLM chat, and SimpleEngine _stream_generate_text. The tests use mocks appropriately and verify the actual call_args.

  4. Prefix boundary propagation: The _compute_prefix_boundary method also receives chat_template_kwargs, which is important because template kwargs can change the tokenized prefix — missing this would cause prefix cache misses.
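
The prefix-boundary point can be illustrated with a cache key that folds in the template kwargs (hypothetical sketch, not the actual _compute_prefix_boundary logic):

```python
import hashlib
import json


def prefix_cache_key(message_prefix, chat_template_kwargs=None):
    # Different template kwargs render different prompts, so they must
    # be part of the key; otherwise two requests that only differ in
    # chat_template_kwargs would wrongly share a cached prefix.
    payload = json.dumps(
        {"messages": message_prefix,
         "template_kwargs": chat_template_kwargs or {}},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```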

Minor observations (non-blocking)

  • In _apply_chat_template, when chat_template_kwargs includes a tools key, the explicit if tools and "tools" not in template_kwargs guard prevents overwriting. This is the right precedence (explicit kwargs override implicit tools). Worth noting in a comment for clarity, but not blocking.

  • The dict(kwargs.pop("chat_template_kwargs", {}) or {}) pattern appears in multiple places. A small helper like _extract_chat_template_kwargs(kwargs) could reduce repetition, but this is a style preference, not a correctness issue.

LGTM. Clean integration, good test coverage, ready to merge once the branch is clean against main.

Adds chat_template_kwargs to the Pydantic ChatCompletionRequest and plumbs
it through every engine path that previously dropped it: SimpleEngine chat
(LLM and MLLM), stream_chat, _stream_generate_text (MTP), BatchedEngine
chat / stream_chat / _compute_prefix_boundary, and MLXLanguageModel.chat.
apply_chat_template call sites gain a TypeError fallback that strips
tools, enable_thinking, and dynamic chat_template_kwargs keys before
retrying, so tokenizers that do not accept a given kwarg still render.

Adds tests/test_chat_template_kwargs.py (7 cases covering the request
model, BatchedEngine template application, the HTTP endpoint round-trip,
MLXLanguageModel.chat, SimpleEngine MLLM chat, and SimpleEngine
_stream_generate_text) and wires the new file into the Apple Silicon
CI job.
@krystophny krystophny force-pushed the fix/chat-template-kwargs-forwarding branch from 1721a6a to d85567b on April 17, 2026 13:17
@janhilgard janhilgard merged commit 39e40f5 into waybarrios:main Apr 17, 2026
9 checks passed
@krystophny
Contributor Author

Rebased onto current upstream/main (b0a79f5). Auto-merge completed cleanly with no conflicts against the newer SpecPrefill Phase 4 and rotating-cache work. Squashed the branch into a single commit so the rebase reads as one coherent change — net diff is unchanged (+244/-14 across the same 7 files).

Local verification: python -m py_compile vllm_mlx/{engine/batched,engine/simple,models/llm,api/models,server}.py tests/test_chat_template_kwargs.py passes. CI on the rebased head is green across lint, type-check, test-matrix 3.10-3.13, test-apple-silicon 3.11/3.13, and tests.

