[Bugfix] Fix structured output crash on CPU due to pin_memory=True#37706

Merged
njhill merged 8 commits into vllm-project:main from wjhrdy:fix/cpu-structured-output-pin-memory
Mar 24, 2026

Conversation

@wjhrdy
Contributor

@wjhrdy wjhrdy commented Mar 20, 2026

Essential Checks

  • PR title follows the pattern [Tag] Short description
  • I have searched for related issues and checked existing PRs
  • I have run linting/formatting locally

Purpose

Fix RuntimeError: pin_memory=True requires a CUDA or other accelerator backend crash when using structured output (guided decoding) on CPU-only deployments.

Fixes #37705

Problem

apply_grammar_bitmask() in vllm/v1/structured_output/utils.py crashes on CPU when handling mixed batches (concurrent structured + non-structured requests):

  1. pin_memory=True is hardcoded: torch.tensor(out_indices, ..., pin_memory=True) requires CUDA and fails on CPU-only systems.
  2. The xgrammar CPU kernel expects Sequence[int], not torch.Tensor: apply_token_bitmask_inplace_cpu() only accepts a Python list for the indices argument.

Note: the existing CPU float32 workaround (added in #31901) was never reachable because the pin_memory=True crash occurs first.
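The first failure can be reproduced in isolation. Below is a minimal standalone sketch (the helper name make_index_tensor is illustrative, not the actual vLLM code) contrasting the crashing construction with the safe one:

```python
import torch

def make_index_tensor(out_indices, pin_memory):
    # Mirrors the torch.tensor() call in apply_grammar_bitmask();
    # the function name here is illustrative only.
    return torch.tensor(
        out_indices, dtype=torch.int32, device="cpu", pin_memory=pin_memory
    )

try:
    make_index_tensor([0, 2, 5], pin_memory=True)
    print("pin_memory=True succeeded (an accelerator backend is available)")
except RuntimeError as err:
    # On CPU-only builds this raises:
    # "pin_memory=True requires a CUDA or other accelerator backend"
    print(f"reproduced: {err}")

# The unpinned construction works on any build.
print(make_index_tensor([0, 2, 5], pin_memory=False).tolist())
```

Whether the first branch succeeds or raises depends on the build; the unpinned path works either way.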

Fix

On CPU, pass out_indices as a plain Python list directly instead of converting to a pinned tensor. The GPU path with pinned memory is preserved.

Test Plan

Tested by starting vLLM on CPU with ibm-granite/granite-3.2-2b-instruct, then sending concurrent plain + structured output (response_format: json_schema) requests. Without the fix, both requests return 500 and the EngineCore dies. With the fix, both succeed and the server stays healthy.

import concurrent.futures
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
MODEL = "ibm-granite/granite-3.2-2b-instruct"

def plain_request():
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Tell me a story"}],
        max_tokens=200,
    )

def structured_request():
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "What is the capital of France?"}],
        max_tokens=50,
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "resp", "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {"capital": {"type": "string"}},
                    "required": ["capital"],
                    "additionalProperties": False,
                },
            },
        },
    )

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    f1 = executor.submit(plain_request)
    f2 = executor.submit(structured_request)
    print(f1.result())
    print(f2.result())

On CPU-only deployments, `apply_grammar_bitmask()` crashes with
`RuntimeError: pin_memory=True requires a CUDA or other accelerator
backend` when handling mixed batches of structured and non-structured
requests.

Two issues:
1. `pin_memory=True` is hardcoded in the `torch.tensor()` call for
   `out_indices` — this requires CUDA and fails on CPU.
2. The xgrammar CPU kernel (`apply_token_bitmask_inplace_cpu`)
   expects `Sequence[int]` for the `indices` argument, not a tensor.

Note: the existing CPU float32 workaround added in vllm-project#31901 was never
reachable because the `pin_memory=True` crash occurs first.

Fix: on CPU, pass `out_indices` as a plain Python list. The GPU path
with pinned memory is preserved.

Fixes vllm-project#37705

Signed-off-by: Willy Hardy <whardy@redhat.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively resolves a critical RuntimeError that occurred on CPU-only deployments due to pin_memory=True being hardcoded for torch.tensor creation. The changes correctly introduce conditional logic to handle CPU and GPU devices separately, ensuring that pin_memory=True is only applied when a CUDA device is available. Furthermore, it addresses the xgrammar CPU kernel's expectation of a Python list for indices by passing out_indices directly on CPU, which is a significant improvement for correctness and stability in mixed-batch scenarios. The updated type hint for indices also enhances code clarity.

Comment on lines +111 to +119
if logits.device.type == "cpu":
    # On CPU, pass indices as a plain list — pin_memory requires CUDA,
    # and the xgrammar CPU kernel expects Sequence[int], not a tensor.
    indices = out_indices
else:
    indices = torch.tensor(
        out_indices, dtype=torch.int32, device="cpu", pin_memory=True,
    )
    indices = indices.to(logits.device, non_blocking=True)
Contributor


critical

This conditional logic is a critical fix. By checking logits.device.type, the code now correctly avoids setting pin_memory=True on CPU, which was causing a RuntimeError. Additionally, passing out_indices as a plain Python list for CPU devices directly addresses the xgrammar CPU kernel's expectation for a Sequence[int], preventing potential issues with type mismatches.

Contributor

@dougbtv dougbtv left a comment


Looks excellent -- do we need any validation on the testing side?

Member

@mgoin mgoin left a comment


Seems reasonable to me, thanks!

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 20, 2026
@mergify

mergify bot commented Mar 20, 2026

Hi @wjhrdy, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: Willy Hardy <whardy@redhat.com>
@wjhrdy wjhrdy force-pushed the fix/cpu-structured-output-pin-memory branch from 030f141 to e97ab92 Compare March 20, 2026 20:54
Collaborator

@andy-neuma andy-neuma left a comment


thanks

wjhrdy and others added 2 commits March 23, 2026 09:35
- Use logits.is_cpu instead of logits.device.type == "cpu"
- Restore original comment explaining non_blocking tensor copy in else branch
- Consolidate tensor creation formatting

Signed-off-by: Will Hardy <whardy@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Will Hardy <whardy@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
if logits.is_cpu:
    # On CPU, pass indices as a plain list — pin_memory requires CUDA,
    # and the xgrammar CPU kernel expects Sequence[int], not a tensor.
    indices = out_indices
Collaborator


why rename this variable?

Address review feedback from benchislett: use is_pin_memory_available()
for pin_memory instead of branching on device type. This eliminates
the CPU-specific code path entirely. Also reverts variable name back
to index_tensor (original name).

Signed-off-by: Will Hardy <whardy@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@wjhrdy
Contributor Author

wjhrdy commented Mar 23, 2026

Addressed all review feedback:

  • @njhill: Used logits.is_cpu, restored the original comment in the else branch, cleaned up formatting
  • @benchislett: Simplified to use is_pin_memory_available() instead of device-type branching — no more CPU-specific code path needed. Reverted variable name back to index_tensor (original name).

Note: the CPU machine I normally test these changes on is currently down, so this latest update is untested. Will validate once the machine is back up.
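The simplified shape described above can be sketched as follows. This is a standalone illustration, not the merged code: vLLM's is_pin_memory_available() helper is stubbed here with a plain CUDA check, and build_index_tensor is a hypothetical name.

```python
import torch

def is_pin_memory_available() -> bool:
    # Stub standing in for vLLM's helper of the same name,
    # for illustration only.
    return torch.cuda.is_available()

def build_index_tensor(out_indices: list, device: torch.device) -> torch.Tensor:
    # Pin host memory only when an accelerator backend can use it,
    # so the same code path works on CPU-only builds.
    index_tensor = torch.tensor(
        out_indices,
        dtype=torch.int32,
        device="cpu",
        pin_memory=is_pin_memory_available(),
    )
    # non_blocking=True is a no-op when the target device is already CPU.
    return index_tensor.to(device, non_blocking=True)

print(build_index_tensor([0, 2], torch.device("cpu")).tolist())
```

With this shape there is a single code path; the only device-dependent decision is the pin_memory flag.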

njhill added 2 commits March 24, 2026 09:13
Signed-off-by: Nick Hill <nickhill123@gmail.com>
@njhill
Member

njhill commented Mar 24, 2026

Thanks @wjhrdy. I reworked it a bit to separate the CPU and non-CPU cases after all since I noticed that for CPU, xgrammar just converts the tensor back to a list, and there was already some cpu-specific logic.

@njhill njhill enabled auto-merge (squash) March 24, 2026 16:22
@njhill njhill merged commit 057fc94 into vllm-project:main Mar 24, 2026
49 checks passed
RhizoNymph pushed a commit to RhizoNymph/vllm that referenced this pull request Mar 26, 2026
…llm-project#37706)

Signed-off-by: Willy Hardy <whardy@redhat.com>
Signed-off-by: Will Hardy <whardy@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
HenryTangDev pushed a commit to HenryTangMain/vllm that referenced this pull request Mar 27, 2026
…llm-project#37706)

Signed-off-by: Willy Hardy <whardy@redhat.com>
Signed-off-by: Will Hardy <whardy@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
malaiwah pushed a commit to malaiwah/vllm that referenced this pull request Mar 27, 2026
…llm-project#37706)

Signed-off-by: Willy Hardy <whardy@redhat.com>
Signed-off-by: Will Hardy <whardy@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Michel Belleau <michel.belleau@malaiwah.com>
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
…llm-project#37706)

Signed-off-by: Willy Hardy <whardy@redhat.com>
Signed-off-by: Will Hardy <whardy@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
…llm-project#37706)

Signed-off-by: Willy Hardy <whardy@redhat.com>
Signed-off-by: Will Hardy <whardy@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
…llm-project#37706)

Signed-off-by: Willy Hardy <whardy@redhat.com>
Signed-off-by: Will Hardy <whardy@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
…llm-project#37706)

Signed-off-by: Willy Hardy <whardy@redhat.com>
Signed-off-by: Will Hardy <whardy@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026
…llm-project#37706)

Signed-off-by: Willy Hardy <whardy@redhat.com>
Signed-off-by: Will Hardy <whardy@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed structured-output v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: Structured output crashes on CPU with pin_memory=True in apply_grammar_bitmask()

6 participants