chat: forward chat_template_kwargs on simple-engine paths#218

Merged
janhilgard merged 1 commit into waybarrios:main from computor-org:fix/chat-template-kwargs-forwarding
Apr 17, 2026
Merged

chat: forward chat_template_kwargs on simple-engine paths#218
janhilgard merged 1 commit intowaybarrios:mainfrom
computor-org:fix/chat-template-kwargs-forwarding

Conversation

@krystophny
Contributor

@krystophny krystophny commented Mar 24, 2026

Summary

Honor chat_template_kwargs on the simple-engine paths that still ignored it and run the regression coverage in Apple Silicon CI.

Why

Before this branch, chat_template_kwargs was only reliably honored on the batched path and the plain LLM chat path. Simple-engine multimodal chat, multimodal stream chat, and the text-only MTP route still dropped the field.

What changed

  • forward chat_template_kwargs through simple-engine multimodal chat()
  • forward chat_template_kwargs through simple-engine multimodal stream_chat()
  • forward chat_template_kwargs into _stream_generate_text() for the text-only MTP route
  • include tests/test_chat_template_kwargs.py in Apple Silicon CI
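
The four bullets reduce to one pattern: pop `chat_template_kwargs` out of `**kwargs` and hand it to the template-apply call instead of letting it fall through. A minimal standalone sketch (hypothetical helper names, not the actual vllm_mlx signatures):

```python
from typing import Any


def apply_chat_template(messages, **template_kwargs):
    # Stand-in for tokenizer.apply_chat_template.
    return f"{messages}|{sorted(template_kwargs.items())}"


def generate(prompt, **kwargs):
    # Stand-in for the downstream generation call.
    return prompt


def chat(messages: list[dict[str, Any]], **kwargs: Any) -> str:
    # Pop the field so it is consumed here rather than silently dropped
    # when **kwargs is forwarded to the generation call.
    chat_template_kwargs = dict(kwargs.pop("chat_template_kwargs", {}) or {})
    prompt = apply_chat_template(messages, **chat_template_kwargs)
    return generate(prompt, **kwargs)
```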

Status

  • refreshed onto current upstream main (b4fa030) on 2026-04-09
  • no logic changes beyond the base refresh

Files to review

  • vllm_mlx/engine/simple.py
  • .github/workflows/ci.yml
  • tests/test_chat_template_kwargs.py

Validation

  • python -m pytest tests/test_chat_template_kwargs.py -q -> 6 passed
  • note: the older tests/test_simple_engine.py validation command now depends on the separate async-harness refresh in #226 on current upstream, so validation here stays scoped to this PR's dedicated regression file

@krystophny krystophny changed the title Forward chat template kwargs in batched chat chat: forward chat_template_kwargs in batched path Mar 24, 2026
@krystophny krystophny changed the title chat: forward chat_template_kwargs in batched path chat: forward chat_template_kwargs on simple-engine paths Mar 24, 2026
Collaborator

@Thump604 Thump604 left a comment


Implementation is solid and addresses real coverage gaps. The forwarding is consistent across all simple-engine paths:

What works:

  • API model field properly declared with optional dict[str, Any]
  • SimpleEngine MLLM multimodal chat/stream_chat forward kwargs to model
  • SimpleEngine text-only MTP route in _stream_generate_text applies kwargs
  • LLMLanguageModel.chat applies kwargs with graceful TypeError fallback
  • BatchedEngine properly merges kwargs and propagates to prefix boundary computation
  • TypeError handling updated to remove arbitrary kwargs, not just tools

Pattern is defensive: chat_template_kwargs = dict(kwargs.pop("chat_template_kwargs", {}) or {}) safely handles None and creates fresh dict. Line 372 guard in BatchedEngine prevents tools being inserted twice when merging kwargs.
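
As a tiny standalone illustration of why that extraction is defensive (hypothetical helper name):

```python
def extract_chat_template_kwargs(kwargs: dict) -> dict:
    # `or {}` guards against a caller passing chat_template_kwargs=None,
    # and dict(...) makes a fresh copy so later mutation (e.g. popping
    # keys in a TypeError retry) cannot affect the caller's dict.
    return dict(kwargs.pop("chat_template_kwargs", {}) or {})


kwargs = {"chat_template_kwargs": None, "max_tokens": 32}
extracted = extract_chat_template_kwargs(kwargs)
# extracted is {} and the key is consumed out of kwargs
```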

Test coverage is comprehensive — all paths have mocks covering the assertion. Adding to Apple Silicon CI ensures regression detection.

One implementation detail: line 607 (SimpleEngine text path) and line 380 (BatchedEngine) both retry on TypeError by removing all user-provided template kwargs. This is correct but slightly more aggressive than the original "tools only" approach. The exception is rare enough that this won't be a problem, and if a template silently ignores an unknown kwarg instead of raising TypeError, the kwargs pass through untouched on the first try. This is an acceptable trade-off for simplicity.

Ready to merge from the implementation side.

@Thump604
Collaborator

Thump604 commented Apr 7, 2026

@waybarrios, @krystophny: independent technical review of this PR.

Verification of the fix

Confirmed against current upstream main (b4fa030). The diff plumbs chat_template_kwargs through every place it was previously dropped:

  1. vllm_mlx/api/models.py:172 adds the field to ChatCompletionRequest
  2. vllm_mlx/server.py:1422 forwards it from the request into chat_kwargs
  3. vllm_mlx/engine/simple.py forwards it through SimpleEngine chat() (LLM and MLLM branches), stream_chat() (MLLM and run_stream branches), and _stream_generate_text() (MTP path)
  4. vllm_mlx/engine/batched.py forwards it through BatchedEngine chat(), stream_chat(), and _compute_prefix_boundary() so per-template-kwargs prefix caching works correctly
  5. vllm_mlx/models/llm.py adds the parameter to MLXLanguageModel.chat() so the LLM path honors it

All template-apply call sites also gain a graceful fallback: if a tokenizer raises TypeError because it does not support a given kwarg, the failed kwargs are popped and the call retries.
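
That fallback can be sketched as follows (hypothetical shape; the real call site is `tokenizer.apply_chat_template` inside the engines):

```python
def render_prompt(apply_template, messages, tools=None, chat_template_kwargs=None):
    template_kwargs = dict(chat_template_kwargs or {})
    # Explicit template kwargs take precedence over the implicit tools arg.
    if tools and "tools" not in template_kwargs:
        template_kwargs["tools"] = tools
    try:
        return apply_template(messages, **template_kwargs)
    except TypeError:
        # The template rejected an extra kwarg: strip tools plus every
        # user-provided key, then retry with the bare call.
        for key in ["tools", *(chat_template_kwargs or {}).keys()]:
            template_kwargs.pop(key, None)
        return apply_template(messages, **template_kwargs)
```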

Test coverage

tests/test_chat_template_kwargs.py adds 7 tests covering Pydantic field preservation, BatchedEngine _apply_chat_template, the HTTP endpoint via FakeEngine + TestClient, LLM chat applying kwargs before generate, SimpleEngine MLLM chat forwarding, and SimpleEngine _stream_generate_text applying kwargs. The CI workflow is updated to run the new test in the Apple Silicon job.

Why this matters

Per the PR description, before this branch chat_template_kwargs was honored on the batched path and the plain LLM chat path but silently dropped on simple-engine multimodal chat(), simple-engine multimodal stream_chat(), and the text-only MTP _stream_generate_text route. That means enable_thinking=false in chat_template_kwargs was being silently ignored on those three paths, which can cause Qwen 3.5 thinking-tag leakage in multimodal and MTP responses.
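
For concreteness, the previously ignored field looks like this on the wire (illustrative request body; the model id is a placeholder, not one from this repo):

```python
import json

request_body = {
    "model": "qwen3.5-mlx",  # placeholder model id
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    # Before this PR, the three simple-engine paths above dropped this
    # field, so thinking tags could still appear in the response.
    "chat_template_kwargs": {"enable_thinking": False},
}
payload = json.dumps(request_body)
```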

Recommendation

Merge candidate. Real fix to a real silently-ignored API field, comprehensive plumbing across all relevant call sites, good test coverage, and the CI workflow update means the regression cannot return without someone disabling the test job.

Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Apr 9, 2026
…l generate+stream_generate

Pre-existing regression from an earlier rebase that dropped bdf7dcc's
llm.py additions. The server.py request handlers still pass top_k,
min_p, presence_penalty, repetition_penalty through to SimpleEngine,
which forwards them via **kwargs to MLXLanguageModel.chat() (which
accepts **kwargs) which then calls self.generate(..., **kwargs). But
MLXLanguageModel.generate() and stream_generate() had been left with
only (temperature, top_p, repetition_penalty) in their signatures, so
any non-MLLM SimpleEngine request crashed with:

    TypeError: MLXLanguageModel.stream_generate() got an unexpected
    keyword argument 'top_k'

Observed as 0/6 on simple-base, simple-mtp, and simple-spec profiles in
the feature matrix regression sweep after the Session 87 cherry-picks
of PRs waybarrios#248, waybarrios#229, waybarrios#218, waybarrios#222 landed. The cherry-picks did not cause
this regression — they exposed it by finally running the LLM-path
tests that no one had exercised since the rebase happened. Confirmed
via stderr.log:

  TypeError: MLXLanguageModel.generate() got an unexpected keyword argument 'top_k'
  TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Fix: restore the signatures and bodies of _create_sampler,
_create_logits_processors, generate, and stream_generate to match
bdf7dcc's original intent. Preserves PR waybarrios#248's prompt_cache parameter
and non-str prompt support on stream_generate. Adds **kwargs to both
generate and stream_generate so future param additions degrade
gracefully instead of crashing.

This is a runtime-local fix. The equivalent upstream fix lives in
bdf7dcc which was never upstreamed (confirmed via
git merge-base --is-ancestor bdf7dcc upstream/main). A follow-up PR
to upstream could carry this forward.

Verification:
  bin/verify-patches: 33/33 clean
  Full feature matrix regression sweep pending re-run after this commit.

Related: runtime PR waybarrios#265 (waybarrios#265) fixed the
CompletionRequest schema side of the same bdf7dcc drop; this commit
fixes the engine-model side.
@krystophny krystophny force-pushed the fix/chat-template-kwargs-forwarding branch from 1e17fb1 to be2ba60 on April 9, 2026 06:35
@krystophny
Contributor Author

Force-pushed a refresh onto current upstream main (b4fa030). No logic change beyond the base refresh. Validation: python -m pytest tests/test_chat_template_kwargs.py -q -> 6 passed. The older tests/test_simple_engine.py validation command now depends on the separate async-harness refresh in #226 on current upstream, so I kept validation scoped to this PR's dedicated regression file.

@Thump604
Collaborator

Thump604 commented Apr 9, 2026

Refresh confirmed on head 3c33f72 against upstream main b4fa030. The only delta on top of the previously approved be2ba60 is the "style: format chat template kwargs tests" commit, which is a no-op on the forwarding logic. The SimpleEngine multimodal chat, stream_chat, _stream_generate_text, BatchedEngine chat, models/llm.py, and server.py wiring all match the previously reviewed shape.

CI green on lint, type-check, test-matrix 3.10-3.12, test-apple-silicon, tests. tests/test_chat_template_kwargs.py -> 6 passed on head. Prior APPROVED review at be2ba60 applies to the refreshed head.

@Thump604
Collaborator

Hey @krystophny - this covers all the simple-engine paths cleanly and the test coverage is good. Currently has a merge conflict with main - can you rebase? Ready to merge after that.

@waybarrios
Owner

Merge Conflict Resolution

The branch conflicted with main in 7 conflict regions across two files. I've resolved all conflicts and pushed the result to waybarrios/vllm-mlx:fix/chat-template-kwargs-forwarding.

The core issue: main added enable_thinking support and refactored simple.py to use _run_blocking_serialized while this PR was adding chat_template_kwargs forwarding. Both changes touch the same call sites.


vllm_mlx/engine/batched.py (4 conflicts)

1. _apply_chat_template signature — both params are now accepted:

def _apply_chat_template(
    self,
    messages: list[dict[str, Any]],
    tools: list[dict] | None = None,
    num_images: int = 0,
    chat_template_kwargs: dict[str, Any] | None = None,
    enable_thinking: bool | None = None,
) -> str:

2. TypeError fallback — the retry loop now strips enable_thinking (from main) and any dynamic chat_template_kwargs keys (from this PR):

except TypeError as e:
    logger.debug(f"Chat template TypeError, retrying without extras: {e}")
    for key in ["tools", "enable_thinking", *(chat_template_kwargs or {}).keys()]:
        template_kwargs.pop(key, None)

3 & 4. chat() and stream_chat() call sites — both now forward both kwargs:

prompt = self._apply_chat_template(
    messages,
    template_tools,
    num_images=len(all_images),
    chat_template_kwargs=chat_template_kwargs,
    enable_thinking=enable_thinking,
)

vllm_mlx/engine/simple.py (3 conflicts)

1. chat() entry — kept the chat_template_kwargs extraction (this PR) and the tool-stall workaround from main that routes non-streaming tool chat through stream_chat. Also forwarded chat_template_kwargs into that workaround path:

chat_template_kwargs = dict(kwargs.pop("chat_template_kwargs", {}) or {})

# tool-stall workaround from main, now with chat_template_kwargs forwarded
if tools and not self._is_mllm:
    async for output in self.stream_chat(
        messages=messages,
        ...
        chat_template_kwargs=chat_template_kwargs,
        **kwargs,
    ):
        final_output = output
    ...

2. MLLM/LLM branching — adopted main's _run_blocking_serialized (replacing the old asyncio.to_thread + _generation_lock) and wove in chat_template_kwargs forwarding for both paths:

# MLLM path — injects into kwargs (mlx-vlm accepts it this way)
if self._is_mllm:
    if chat_template_kwargs:
        kwargs["chat_template_kwargs"] = chat_template_kwargs
    output = await self._run_blocking_serialized(self._model.chat, ...)

# LLM path — passes as named arg (mlx-lm accepts it directly)
else:
    output = await self._run_blocking_serialized(
        self._model.chat,
        ...,
        chat_template_kwargs=chat_template_kwargs,
        **kwargs,
    )

3. stream_chat MLLM path — adopted main's indentation (no lock context manager) and kept the PR's chat_template_kwargs forwarding via local_kwargs:

def run_stream():
    local_kwargs = dict(kwargs)
    if chat_template_kwargs:
        local_kwargs["chat_template_kwargs"] = chat_template_kwargs
    return list(
        self._model.stream_chat(
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature,
            tools=template_tools,
            **local_kwargs,
        )
    )

chunks = await self._run_blocking_serialized(run_stream)

Both files compile cleanly. To pull the resolved merge:

git remote add waybarrios https://github.com/waybarrios/vllm-mlx.git
git fetch waybarrios fix/chat-template-kwargs-forwarding
git checkout fix/chat-template-kwargs-forwarding
git reset --hard waybarrios/fix/chat-template-kwargs-forwarding
git push origin fix/chat-template-kwargs-forwarding --force

@krystophny
Contributor Author

Synced this branch to the maintainer-resolved conflict fix on current upstream/main and force-pushed. Re-ran python -m pytest tests/test_chat_template_kwargs.py -q locally (6 passed).

Collaborator

@Thump604 Thump604 left a comment


Rebased cleanly with the maintainer-resolved conflicts — the merge result looks correct. Both chat_template_kwargs and enable_thinking are forwarded through all four paths (batched chat/stream_chat, simple chat/stream_chat) and the TypeError fallback strips both gracefully. CI all green. Approving.

Collaborator

@janhilgard janhilgard left a comment


Code Review — chat_template_kwargs forwarding

This PR systematically plumbs chat_template_kwargs through every engine path that previously dropped it. The coverage is complete and the implementation is consistent.

What works well

  1. Consistent forwarding across all paths: SimpleEngine (LLM chat, MLLM chat, stream_chat, _stream_generate_text), BatchedEngine (chat, stream_chat, _compute_prefix_boundary), and MLXLanguageModel.chat all receive and forward the kwargs correctly.

  2. Graceful TypeError fallback: The existing fallback pattern for templates that don't accept extra kwargs is properly extended to strip chat_template_kwargs keys in addition to tools and enable_thinking. The unpacking via *(chat_template_kwargs or {}).keys() is clean.

  3. Test coverage: 6 dedicated tests covering the request model, BatchedEngine template application, the full endpoint round-trip, LLM.chat, SimpleEngine MLLM chat, and SimpleEngine _stream_generate_text. The tests use mocks appropriately and verify the actual call_args.

  4. Prefix boundary propagation: The _compute_prefix_boundary method also receives chat_template_kwargs, which is important because template kwargs can change the tokenized prefix — missing this would cause prefix cache misses.
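
The prefix-boundary point can be illustrated with a cache key that folds in the template kwargs (hypothetical sketch, not the actual _compute_prefix_boundary logic):

```python
import hashlib
import json


def prefix_cache_key(message_prefix, chat_template_kwargs=None):
    # Different template kwargs render different prompts, so they must
    # be part of the key; otherwise two requests that only differ in
    # chat_template_kwargs would wrongly share a cached prefix.
    payload = json.dumps(
        {"messages": message_prefix,
         "template_kwargs": chat_template_kwargs or {}},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```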

Minor observations (non-blocking)

  • In _apply_chat_template, when chat_template_kwargs includes a tools key, the explicit if tools and "tools" not in template_kwargs guard prevents overwriting. This is the right precedence (explicit kwargs override implicit tools). Worth noting in a comment for clarity, but not blocking.

  • The dict(kwargs.pop("chat_template_kwargs", {}) or {}) pattern appears in multiple places. A small helper like _extract_chat_template_kwargs(kwargs) could reduce repetition, but this is a style preference, not a correctness issue.

LGTM. Clean integration, good test coverage, ready to merge once the branch is clean against main.

Adds chat_template_kwargs to the Pydantic ChatCompletionRequest and plumbs
it through every engine path that previously dropped it: SimpleEngine chat
(LLM and MLLM), stream_chat, _stream_generate_text (MTP), BatchedEngine
chat / stream_chat / _compute_prefix_boundary, and MLXLanguageModel.chat.
apply_chat_template call sites gain a TypeError fallback that strips
tools, enable_thinking, and dynamic chat_template_kwargs keys before
retrying, so tokenizers that do not accept a given kwarg still render.

Adds tests/test_chat_template_kwargs.py (7 cases covering the request
model, BatchedEngine template application, the HTTP endpoint round-trip,
MLXLanguageModel.chat, SimpleEngine MLLM chat, and SimpleEngine
_stream_generate_text) and wires the new file into the Apple Silicon
CI job.
@krystophny krystophny force-pushed the fix/chat-template-kwargs-forwarding branch from 1721a6a to d85567b on April 17, 2026 13:17
@janhilgard janhilgard merged commit 39e40f5 into waybarrios:main Apr 17, 2026
9 checks passed
@krystophny
Contributor Author

Rebased onto current upstream/main (b0a79f5). Auto-merge completed cleanly with no conflicts against the newer SpecPrefill Phase 4 and rotating-cache work. Squashed the branch into a single commit so the rebase reads as one coherent change — net diff is unchanged (+244/-14 across the same 7 files).

Local verification: python -m py_compile vllm_mlx/{engine/batched,engine/simple,models/llm,api/models,server}.py tests/test_chat_template_kwargs.py passes. CI on the rebased head is green across lint, type-check, test-matrix 3.10-3.13, test-apple-silicon 3.11/3.13, and tests.

