[V1][Feat] Fail request if FSM fails to advance #18780
atbe wants to merge 1 commit into vllm-project:main
Conversation
Signed-off-by: Ibrahim Ahmed <abeahmed2@gmail.com>
Force-pushed from 389a97c to 49f2024
aarnphm left a comment

One tiny comment.
FWIW, I think it is better to raise an exception and propagate it accordingly in the engine, but that is probably for another day.
```diff
-        # check above, so safe to ignore type warning
-        request.structured_output_request.grammar.accept_tokens(  # type: ignore[union-attr]
-            req_id, new_token_ids)
+        if not request.structured_output_request.grammar.accept_tokens(  # type: ignore[union-attr]
```
let's also add a note and create a bug for tracking this here.
Could we add a test to this PR?

@cadedaniel for visibility: https://vllm-dev.slack.com/archives/C07QQ8DAXMK/p1748388146836209. This happens in the case where "after a few hundred thousand requests are sent to the same instance". For tests, I think we might be able to reproduce something when we send the same requests repeatedly? It might need some fine-tuning for this regression test.

I think one could mock the output of the model to be an invalid token wrt the grammar.

I didn't understand why would
njhill left a comment

Thanks @atbe.
Agree with @cadedaniel that a test would be good.
```python
if not request.structured_output_request.grammar.accept_tokens(  # type: ignore[union-attr]
        req_id, new_token_ids):
```

Suggest using a variable here to make things a bit clearer:

```suggestion
accepted = request.structured_output_request.grammar.accept_tokens(  # type: ignore[union-attr]
    req_id, new_token_ids)
if not accepted:
```
Closes vllm-project#19493
Closes vllm-project#18376
Related to vllm-project#18780

Several people have noticed errors when using both the `xgrammar` and `guidance` backends where we would start generating invalid tokens for a request and they would be continuously rejected by the backend currently in use. The conditions seemed to be:

- Only impacts certain models
- Occurs with concurrent structured output requests

After further investigation once an easy way to reproduce was provided via vllm-project#19493, I identified more details about the failure:

- When the failure occurred in my test using a concurrency of 2, whichever request came in first was always successful. It was the second request that would fail.

Debugging further identified that the bitmask was not being applied correctly, but only for that second request. In the GPU model runner, this translates to the 2nd row in the bitmask tensor and the 2nd row of the logits tensor. I could see that a couple of bytes were left unmasked. I suspect the reason the issue appears to be model specific has to do with the vocab and what the tokens are that were left unmasked. I have not verified this part for sure.

The reason it occurred with both structured output backends is that we use the `xgrammar` library's implementation of applying the bitmask in all cases. Xgrammar on CUDA, by default, uses a Triton kernel for applying the bitmask. I identified that by forcing it to use the `torch.compile` implementation instead, the problem is resolved. The torch implementation is used for all other accelerator types in Xgrammar's logic, so it seems fine to just force the use of that implementation.

I have not yet narrowed down the problem in the Triton kernel, but this change works around the problem for vLLM. We can move back to Xgrammar's wrapper that chooses which implementation to use once we can verify everything is working properly again.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
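For intuition about what "a couple of bytes were left unmasked" means, here is a pure-Python sketch of how a packed token bitmask is applied to logits. The real work happens in xgrammar's Triton or torch.compile kernels on GPU tensors; this function is illustrative only, and its name and packing convention are assumptions for the example.

```python
# Sketch: a token bitmask packs 32 tokens per int32 word; token t is
# allowed iff bit (t % 32) of bitmask[t // 32] is set. Disallowed
# tokens have their logits forced to -inf so they can never be sampled.

NEG_INF = float("-inf")

def apply_token_bitmask(logits, bitmask):
    """Mask out logits for every token whose bit is 0 in the bitmask."""
    masked = list(logits)
    for t in range(len(logits)):
        word = bitmask[t // 32]
        if not (word >> (t % 32)) & 1:
            masked[t] = NEG_INF
    return masked

# Allow only tokens 0 and 2 of a 4-token vocab: bits 0 and 2 -> 0b0101.
logits = [1.0, 2.0, 3.0, 4.0]
print(apply_token_bitmask(logits, [0b0101]))  # [1.0, -inf, 3.0, -inf]
```

If a kernel bug leaves some bits (or whole bytes) of a row unapplied, the corresponding disallowed tokens keep finite logits and can be sampled, which is exactly the "invalid token rejected by the FSM" failure described above.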
PR related to the root cause of the problem that was occurring here: #19565
Friendly bump on this. I am also seeing the same log error, pointing to line 160.

What version? We fixed the underlying issue a while ago that led to this proposed change. If you see the FSM fail to advance, it indicates a problem we need to fix.
@russellb This is the error that I'm getting. This is for sure on v0.10.0; I need to check on the latest release. Inference is happening with the Instructor client and structured responses. Model is Llama 3.1 8B Instruct.

If you can come up with an easy way to reproduce it, please file an issue so we can look into it. Thanks!
This pull request has merge conflicts that must be resolved before it can be merged.

Friendly reminder about this PR. I think it would be extremely helpful at least to fail gracefully ;)
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Hi @atbe, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
This pull request has merge conflicts that must be resolved before it can be merged.
This is only being kept alive by bots. Closing as stale. If this feature still needs to be merged, feel free to re-open and update this PR or make a new one. |
Fix streaming requests hanging when structured output FSM fails to advance

Problem

When using structured outputs with the xgrammar backend, streaming requests would hang indefinitely if the FSM (Finite State Machine) failed to advance. This occurred when `accept_tokens()` returned `False` in the xgrammar backend, logging an error but not properly terminating the request.

Diagnosis

The issue was in the scheduler's `update_from_output()` method. When processing new tokens for structured output requests, the code called `accept_tokens()` but ignored its return value. When the xgrammar FSM encountered an invalid token sequence, it would:

- Log the error "Failed to advance FSM for request %s for tokens %s. Please file an issue."
- Return `False` from `accept_tokens()`

Since the scheduler didn't check the return value, it continued processing as if nothing was wrong, causing the streaming response to hang indefinitely without sending a completion signal.

Solution

The fix checks the return value of `accept_tokens()` and properly terminates the request when it returns `False`. This ensures that the request is finished with status `FINISHED_ABORTED` instead of hanging.
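The shape of that check can be sketched as follows. This is a simplified standalone illustration; the classes and the `FINISHED_ABORTED` constant here are stand-ins for vLLM's internals, not its actual API.

```python
# Illustrative sketch of the scheduler-side check; everything except
# accept_tokens() is a simplified stand-in for vLLM internals.

FINISHED_ABORTED = "finished_aborted"

class FakeGrammar:
    """Stub grammar whose FSM rejects any token not in its allowed set."""
    def __init__(self, allowed):
        self.allowed = set(allowed)

    def accept_tokens(self, req_id, token_ids):
        # Mirrors xgrammar's behavior: return False when the FSM
        # cannot advance on the given tokens.
        return all(t in self.allowed for t in token_ids)

def update_from_output(request, grammar, req_id, new_token_ids):
    """Before the fix, the return value was ignored and the request kept
    streaming forever. With the fix, the request is aborted instead."""
    if not grammar.accept_tokens(req_id, new_token_ids):
        request["status"] = FINISHED_ABORTED  # send completion signal
        return False
    return True

request = {"status": "running"}
ok = update_from_output(request, FakeGrammar({1, 2, 3}), "req-0", [1, 99])
print(ok, request["status"])  # False finished_aborted
```

Because the status transition happens as soon as `accept_tokens()` returns `False`, the streaming client receives a completion signal instead of waiting on a request that will never finish.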