[V1][Feat] Fail request if FSM fails to advance #18780
atbe wants to merge 1 commit into vllm-project:main
Conversation
Signed-off-by: Ibrahim Ahmed <abeahmed2@gmail.com>
Force-pushed from 389a97c to 49f2024
aarnphm left a comment

One tiny comment.
FWIW, I think it is better to raise an exception and propagate it accordingly in the engine, but that is probably for another day.
```diff
-        # check above, so safe to ignore type warning
-        request.structured_output_request.grammar.accept_tokens(  # type: ignore[union-attr]
-            req_id, new_token_ids)
+        if not request.structured_output_request.grammar.accept_tokens(  # type: ignore[union-attr]
```
let's also add a note and create a bug for tracking this here.
Could we add a test to this PR?

@cadedaniel for visibility: https://vllm-dev.slack.com/archives/C07QQ8DAXMK/p1748388146836209. This happens in the case where "after a few hundred thousand requests are sent to the same instance". For tests, I think we might be able to reproduce something when we send the same requests repeatedly? It might need some fine-tuning for this regression test.

I think one could mock the output of the model to be an invalid token wrt the grammar.

I didn't understand why would
njhill left a comment

Thanks @atbe.
Agree with @cadedaniel that a test would be good.
```python
if not request.structured_output_request.grammar.accept_tokens(  # type: ignore[union-attr]
        req_id, new_token_ids):
```

Suggest using a variable here to make things a bit clearer:

```suggestion
accepted = request.structured_output_request.grammar.accept_tokens(  # type: ignore[union-attr]
    req_id, new_token_ids)
if not accepted:
```
Closes vllm-project#19493
Closes vllm-project#18376
Related to vllm-project#18780

Several people have noticed errors when using both the `xgrammar` and `guidance` backends where we would start generating invalid tokens for a request and they would be continuously rejected by the backend currently in use. The conditions seemed to be:

- Only impacts certain models
- Occurs with concurrent structured output requests

After further investigation once an easy way to reproduce was provided via vllm-project#19493, I identified more details about the failure:

- When the failure occurred in my test using a concurrency of 2, whichever request came in first was always successful. It was the second request that would fail.

Debugging further identified that the bitmask was not being applied correctly, but only for that second request. In the GPU model runner, this translates to the 2nd row in the bitmask tensor and the 2nd row of the logits tensor. I could see that a couple of bytes were left unmasked. I suspect the reason the issue appears to be model specific has to do with the vocab and what the tokens are that were left unmasked. I have not verified this part for sure.

The reason it occurred with both structured output backends is that we use the `xgrammar` library's implementation of applying the bitmask in all cases. Xgrammar on CUDA, by default, uses a Triton kernel for applying the bitmask. I identified that by forcing it to use the `torch.compile` implementation instead, the problem is resolved. The torch implementation is used for all other accelerator types in Xgrammar's logic, so it seems fine to just force the use of that implementation.

I have not yet narrowed down the problem in the Triton kernel, but this change works around the problem for vLLM. We can move back to Xgrammar's wrapper that chooses which implementation to use once we can verify everything is working properly again.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
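For intuition about what "a couple of bytes were left unmasked" means, here is a pure-Python sketch of how a packed token bitmask is applied to logits. The real work happens in xgrammar's Triton or torch.compile kernels on GPU tensors; this function is illustrative only, and its name and packing convention are assumptions for the example.

```python
# Sketch: a token bitmask packs 32 tokens per int32 word; token t is
# allowed iff bit (t % 32) of bitmask[t // 32] is set. Disallowed
# tokens have their logits forced to -inf so they can never be sampled.

NEG_INF = float("-inf")

def apply_token_bitmask(logits, bitmask):
    """Mask out logits for every token whose bit is 0 in the bitmask."""
    masked = list(logits)
    for t in range(len(logits)):
        word = bitmask[t // 32]
        if not (word >> (t % 32)) & 1:
            masked[t] = NEG_INF
    return masked

# Allow only tokens 0 and 2 of a 4-token vocab: bits 0 and 2 -> 0b0101.
logits = [1.0, 2.0, 3.0, 4.0]
print(apply_token_bitmask(logits, [0b0101]))  # [1.0, -inf, 3.0, -inf]
```

If a kernel bug leaves some bits (or whole bytes) of a row unapplied, the corresponding disallowed tokens keep finite logits and can be sampled, which is exactly the "invalid token rejected by the FSM" failure described above.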
PR related to the root cause of the problem that was occurring here: #19565
Friendly bump on this. I am also seeing the same log error, pointing to line 160.

What version? We fixed the underlying issue a while ago that led to this proposed change. If you see the FSM fail to advance, it indicates a problem we need to fix.
@russellb This is the error that I'm getting. This is for sure on v0.10.0; I need to check on the latest release. Inference is happening with the Instructor client and structured responses. Model is Llama 3.1 8B Instruct.

If you can come up with an easy way to reproduce it, please file an issue so we can look into it. Thanks!
This pull request has merge conflicts that must be resolved before it can be merged.

Friendly reminder about this PR. I think it would be extremely helpful at least to fail gracefully ;)
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Hi @atbe, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
This pull request has merge conflicts that must be resolved before it can be merged.
This is only being kept alive by bots. Closing as stale. If this feature still needs to be merged, feel free to re-open and update this PR or make a new one. |
Fix streaming requests hanging when structured output FSM fails to advance

Problem

When using structured outputs with the xgrammar backend, streaming requests would hang indefinitely if the FSM (Finite State Machine) failed to advance. This occurred when `accept_tokens()` returned `False` in the xgrammar backend, logging an error but not properly terminating the request.

Diagnosis

The issue was in the scheduler's `update_from_output()` method. When processing new tokens for structured output requests, the code called `accept_tokens()` but ignored its return value. When the xgrammar FSM encountered an invalid token sequence, it would:

- Log the error "Failed to advance FSM for request %s for tokens %s. Please file an issue."
- Return `False` from `accept_tokens()`

Since the scheduler didn't check the return value, it continued processing as if nothing was wrong, causing the streaming response to hang indefinitely without sending a completion signal.

Solution

The fix checks the return value of `accept_tokens()` and properly terminates the request when it returns `False`. This ensures that the request is finished with status `FINISHED_ABORTED` instead of hanging.
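The shape of that check can be sketched as follows. This is a simplified standalone illustration; the classes and the `FINISHED_ABORTED` constant here are stand-ins for vLLM's internals, not its actual API.

```python
# Illustrative sketch of the scheduler-side check; everything except
# accept_tokens() is a simplified stand-in for vLLM internals.

FINISHED_ABORTED = "finished_aborted"

class FakeGrammar:
    """Stub grammar whose FSM rejects any token not in its allowed set."""
    def __init__(self, allowed):
        self.allowed = set(allowed)

    def accept_tokens(self, req_id, token_ids):
        # Mirrors xgrammar's behavior: return False when the FSM
        # cannot advance on the given tokens.
        return all(t in self.allowed for t in token_ids)

def update_from_output(request, grammar, req_id, new_token_ids):
    """Before the fix, the return value was ignored and the request kept
    streaming forever. With the fix, the request is aborted instead."""
    if not grammar.accept_tokens(req_id, new_token_ids):
        request["status"] = FINISHED_ABORTED  # send completion signal
        return False
    return True

request = {"status": "running"}
ok = update_from_output(request, FakeGrammar({1, 2, 3}), "req-0", [1, 99])
print(ok, request["status"])  # False finished_aborted
```

Because the status transition happens as soon as `accept_tokens()` returns `False`, the streaming client receives a completion signal instead of waiting on a request that will never finish.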