
[Bugfix] Fix xgrammar dtype mismatch on macOS CPU inference #32384

Merged
DarkLight1337 merged 3 commits into vllm-project:main from karanb192:fix/macos-cpu-xgrammar-dtype
Mar 14, 2026

Conversation


@karanb192 karanb192 commented Jan 15, 2026

Summary

  • Fixes guided decoding with xgrammar failing on macOS CPU inference due to dtype mismatch
  • The xgrammar CPU kernel requires float32 logits, but macOS CPU inference uses float16/bfloat16
  • Adds dtype conversion in apply_grammar_bitmask() to convert to float32 before applying the bitmask, then copy back to original dtype
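
The conversion described above can be sketched without any dependencies. This is a hedged, illustrative sketch, not the actual vLLM code: tensors are modeled as plain (values, dtype) pairs so it runs anywhere, and `apply_bitmask_fp32_only` is a hypothetical stand-in for the xgrammar CPU kernel, which in the affected versions accepted only float32 logits.

```python
def apply_bitmask_fp32_only(values, dtype, bitmask):
    # Stand-in for the xgrammar CPU kernel: it rejects anything but
    # float32 and sets logits of grammar-disallowed tokens to -inf.
    if dtype != "float32":
        raise ValueError("logits must be of type float32")
    return [v if ok else float("-inf") for v, ok in zip(values, bitmask)]

def apply_grammar_bitmask(values, dtype, bitmask):
    # The fix: up-cast to float32, apply the bitmask, then return the
    # result under the original dtype. (The real implementation works on
    # torch tensors and copies the masked values back in place.)
    if dtype != "float32":
        masked = apply_bitmask_fp32_only(values, "float32", bitmask)
        return masked, dtype  # copied back to the original dtype
    return apply_bitmask_fp32_only(values, dtype, bitmask), dtype

# float16 logits, as produced by macOS CPU inference
masked, out_dtype = apply_grammar_bitmask(
    [0.5, 1.5, -0.25], "float16", [True, False, True]
)
```

Without the up-cast, the call would raise the `ValueError` quoted in the test plan below; with it, disallowed tokens are masked to -inf and the caller still sees the original dtype.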

Credits

Fix approach based on @Inokinoki's suggestion in #31901.

Test Plan

Reproduction script from issue:

from pydantic import BaseModel
from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

llm = LLM(model="Qwen/Qwen3-0.6B", max_model_len=512)

class OutputJsonFormat(BaseModel):
    python_code: str

base_schema = OutputJsonFormat.model_json_schema()

prompt = "Write a simple hello world in Python:"

sampling_params = SamplingParams(
    temperature=0.0,
    structured_outputs=StructuredOutputsParams(json=base_schema),
)

outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)

Before fix: ValueError: logits must be of type float32
After fix: Successfully generates structured output

Related Issue

Fixes #31901

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, a small and essential subset of tests that catches errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a dtype mismatch bug that occurs during guided decoding with xgrammar on macOS with CPU inference. The fix correctly converts logits to float32 before applying the bitmask to accommodate the CPU kernel's requirement. My review includes a suggestion to make this fix more targeted by adding a device check, which will prevent unnecessary data type conversions and potential performance overhead during GPU inference.

@karanb192

Test Results on macOS M3 Pro (36GB RAM)

Environment

  • Hardware: MacBook Pro M3 Pro, 36GB RAM
  • macOS: 15.x (arm64)
  • Python: 3.12.11
  • PyTorch: 2.9.1 (MPS available, CUDA not available)
  • xgrammar: 0.1.29
  • vLLM: Built from this branch

Test Results

Test | Status
--- | ---
Unit test (dtype conversion logic) | ✅ Passed
xgrammar library test (float32, float16, bfloat16) | ✅ Passed
Full integration test with Qwen/Qwen3-0.6B | ✅ Passed

Integration Test Output

Prompt: Write a simple hello world in Python:
Output: {"python_code": "print('Hello, World!')"}

The model successfully generated structured JSON output using xgrammar on macOS CPU with float16 dtype.


Response to Gemini Code Review

Accepted Gemini's suggestion: added a logits.device.type == "cpu" check to make the fix more targeted.

Analysis

I investigated the xgrammar library and found:

  1. xgrammar 0.1.29 (current) already supports float16/bfloat16 on CPU natively
  2. The original bug existed in older xgrammar versions that only supported float32
  3. The CPU kernel at line 22 in older versions raised ValueError: logits must be of type float32

Why keep this fix?

  • Provides backward compatibility with older xgrammar versions
  • Defensive coding in case of future regressions
  • Minimal overhead with the CPU device check

The updated code now includes the CPU device check as suggested:

if logits.device.type == "cpu" and logits.dtype != torch.float32:

This ensures:

  • No unnecessary dtype conversions on GPU inference
  • Fix only applies where needed (CPU with non-float32 dtype)
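
The guard's behavior can be shown as a small truth table. This is an illustrative sketch only: `FakeTensor` and `needs_conversion` are hypothetical stand-ins for a torch tensor and the condition quoted above, so the sketch runs without PyTorch installed.

```python
from collections import namedtuple

# Minimal stand-in for the two tensor attributes the guard inspects.
FakeTensor = namedtuple("FakeTensor", ["device_type", "dtype"])

def needs_conversion(t):
    # Mirrors: logits.device.type == "cpu" and logits.dtype != torch.float32
    return t.device_type == "cpu" and t.dtype != "float32"

cases = [
    FakeTensor("cpu", "float16"),    # macOS CPU inference -> convert
    FakeTensor("cpu", "bfloat16"),   # also non-float32 on CPU -> convert
    FakeTensor("cpu", "float32"),    # already float32 -> no-op
    FakeTensor("cuda", "float16"),   # GPU path -> left untouched
]
results = [needs_conversion(c) for c in cases]
```

Only the first two cases trigger the conversion, which is exactly the targeting the review asked for: GPU inference never pays for the round-trip.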

Karan Bansal added 2 commits January 15, 2026 10:53
The xgrammar CPU kernel requires float32 logits, but macOS CPU
inference uses float16/bfloat16. This causes a ValueError when
using guided decoding with xgrammar on macOS.

This fix converts logits to float32 before applying the token
bitmask, then copies the result back to the original dtype.

Fixes vllm-project#31901

Signed-off-by: Karan Bansal <karanb192@gmail.com>
Add logits.device.type == "cpu" check to make the fix more targeted
and avoid unnecessary dtype conversions on GPU inference.

Signed-off-by: Karan Bansal <karanb192@gmail.com>
@karanb192

Hi @DarkLight1337 @ywang96 - could you please review this PR when you have a chance?

This fixes issue #31901 - guided decoding with xgrammar failing on macOS CPU inference due to dtype mismatch. The fix has been tested locally on macOS M3 Pro and all CI checks are passing.

Thank you! 🙏

@LucasWilkinson

cc @aarnphm @khluu for structured outputs

@Sberm

Sberm commented Jan 26, 2026

You stole somebody’s code without any acknowledgement, and you belong in the circus, cause you’s a freaking clown.

@Inokinoki

Inokinoki commented Jan 26, 2026

> You stole somebody’s code without any acknowledgement, and you belong in the circus, cause you’s a freaking clown.

Don't take it too hard ;) There is a link in the comment sourcing the issue.

I think it's better to wait for the confirmation and the suggestions from the maintainers, in order to make sure that the fix can generalise.

@karanb192

Thanks @Inokinoki for clarifying. To be explicit: the fix approach was based on @Inokinoki's suggestion in #31901, which is linked in the PR description ("Fixes #31901"). I've added explicit credit in the PR description to make this clearer.

Happy to discuss the technical merits of the fix with maintainers.

@saattrupdan

@DarkLight1337 @ywang96, could you please have a look at this short PR? It would enable macOS support for all our vLLM projects using structured generation, including the EuroEval framework.

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) February 7, 2026 13:00
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 7, 2026
@Inokinoki

> Thanks @Inokinoki for clarifying. To be explicit: the fix approach was based on @Inokinoki's suggestion in #31901, which is linked in the PR description ("Fixes #31901"). I've added explicit credit in the PR description to make this clearer.
>
> Happy to discuss the technical merits of the fix with maintainers.

Congrats that this is approved!

I would appreciate it if the maintainer @DarkLight1337 or you could add me as "Co-authored-by" in the commit message.
But if this blocks the merge, please don't; just keep me informed if this quick fix causes any issues.

@DarkLight1337

I did that already!

@saattrupdan
@DarkLight1337 With this PR being approved, are the failing tests the reason for not merging? @karanb192 or @Inokinoki, are you able to look at them, if it's needed?

@DarkLight1337

Rebased the branch; let's see if the tests pass now.

@DarkLight1337 DarkLight1337 merged commit 821fde2 into vllm-project:main Mar 14, 2026
46 checks passed
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
…ject#32384)

Signed-off-by: Karan Bansal <karanb192@gmail.com>
Co-authored-by: Inokinoki <inoki@inoki.cc>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
…ject#32384)

Signed-off-by: Karan Bansal <karanb192@gmail.com>
Co-authored-by: Inokinoki <inoki@inoki.cc>
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
…ject#32384)

Signed-off-by: Karan Bansal <karanb192@gmail.com>
Co-authored-by: Inokinoki <inoki@inoki.cc>
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
…ject#32384)

Signed-off-by: Karan Bansal <karanb192@gmail.com>
Co-authored-by: Inokinoki <inoki@inoki.cc>
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
…ject#32384)

Signed-off-by: Karan Bansal <karanb192@gmail.com>
Co-authored-by: Inokinoki <inoki@inoki.cc>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>

Labels

  • bug: Something isn't working
  • ready: ONLY add when PR is ready to merge/full CI is needed
  • structured-output
  • v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: Guided decoding with xgrammar failed on macOS CPU inference

8 participants