[Bugfix]: prevent crash when sampled tokens exceed max_model_len #25160
nicole-lihui wants to merge 2 commits into vllm-project:main from
Conversation
Code Review
This pull request addresses a bug that caused a crash when the number of sampled tokens exceeded the maximum model length. The fix correctly replaces an assertion with logic to truncate the sampled tokens, ensuring they fit within max_model_len. The implementation is sound and handles edge cases, such as when a sequence has already reached its maximum length. This change effectively resolves the reported issue.
Force-pushed from 3e4cbd0 to c0c438f
vllm/v1/worker/gpu_model_runner.py (outdated)

    continue
    # avoid overflow.
    if len(sampled_ids) > remaining_slots:
        sampled_ids = sampled_ids[:remaining_slots]
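For context, a minimal self-contained sketch of the behavior that snippet aims for; `append_sampled_ids`, its parameters, and the example values are illustrative names, not the actual `gpu_model_runner.py` code: the sampled ids are clipped to the slots still available under `max_model_len` instead of tripping an assert.

```python
# Hedged sketch (not the actual vLLM code): clip newly sampled token ids so a
# request never grows past max_model_len, instead of asserting on overflow.
def append_sampled_ids(token_ids, num_computed_tokens, sampled_ids, max_model_len):
    remaining_slots = max_model_len - num_computed_tokens
    if remaining_slots <= 0:
        # Request already sits at max_model_len; nothing more can be appended.
        return token_ids
    if len(sampled_ids) > remaining_slots:
        # avoid overflow: keep only the tokens that still fit.
        sampled_ids = sampled_ids[:remaining_slots]
    return token_ids + sampled_ids

# Example: 2 slots left but 3 tokens sampled (e.g. with speculative decoding).
print(append_sampled_ids([1, 2], num_computed_tokens=2,
                         sampled_ids=[7, 8, 9], max_model_len=4))
# -> [1, 2, 7, 8]
```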
Typically, when truncating the output, vLLM would set the reason/cause in the output object.
But in any case, it was extremely weird that this assert blew up.
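A rough sketch of that suggestion, assuming hypothetical names (`SampledOutput`, `finish_reason`) rather than vLLM's real output objects: when the sampled ids have to be cut, record the cause instead of crashing.

```python
# Illustrative only; the class and field names are made up for this sketch.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SampledOutput:
    token_ids: List[int]
    finish_reason: Optional[str] = None  # e.g. "length" when truncated

def truncate_to_fit(out: SampledOutput, remaining_slots: int) -> SampledOutput:
    if len(out.token_ids) > remaining_slots:
        out.token_ids = out.token_ids[:remaining_slots]
        out.finish_reason = "length"  # surface why the output was cut short
    return out

print(truncate_to_fit(SampledOutput([7, 8, 9]), remaining_slots=2))
# -> SampledOutput(token_ids=[7, 8], finish_reason='length')
```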
I noticed that #20291 changed one line in the scheduler from:
`self.max_model_len - request.num_computed_tokens`
to:
`self.max_model_len - 1 - request.num_computed_tokens`
I haven't directly identified the root cause of the overflow, but I did notice that the scheduler handles num_new_tokens inconsistently between running and waiting requests, which is very likely related.
@WoosukKwon Could you share some suggestions on this?
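For what it's worth, here is a small arithmetic sketch (made-up numbers, not scheduler code) of how the two expressions can differ once the token sampled in the same step is appended:

```python
# Assume one token is sampled and appended after the scheduled tokens are computed.
max_model_len, num_computed_tokens = 8, 7

old_budget = max_model_len - num_computed_tokens       # 1: schedule one more token;
# after the step the request holds 8 computed + 1 sampled = 9 tokens > max_model_len.

new_budget = max_model_len - 1 - num_computed_tokens   # 0: a slot is reserved for the
# token sampled in this step, so the request tops out at max_model_len.

print(old_budget, new_budget)  # 1 0
```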
Force-pushed from 7fccea0 to a823976
@yannicks1 Looks like we're fixing the same issue; I guessed another possible cause. I'm new to vLLM, so any feedback would be super helpful.
Force-pushed from a823976 to 299ae57
Hi @nicole-lihui, I found that "[Optimization] Cache sampled token ids in model runner" #20291 actually introduced this bug, and I addressed it accordingly in that part of the code. IMO that is safer and less intrusive than touching the scheduler as in your PR. I also provide a unit test to detect such an inconsistency in the future.
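Not the actual test from that PR, but a hedged sketch of the kind of invariant such a unit test could assert, with the function name and arguments made up for illustration:

```python
# Illustrative invariant: prompt plus generated tokens must never exceed max_model_len.
from typing import List

def assert_length_invariant(prompt_len: int, output_token_ids: List[int],
                            max_model_len: int) -> None:
    total = prompt_len + len(output_token_ids)
    assert total <= max_model_len, (
        f"request holds {total} tokens, exceeding max_model_len={max_model_len}")

assert_length_invariant(prompt_len=6, output_token_ids=[1, 2], max_model_len=8)  # ok
```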
@nicole-lihui Could you perhaps review #24446? To me, the fix looks simpler and the PR also includes a test case.
closed vllm-project#25120
Signed-off-by: nicole-lihui <nicole.li@daocloud.io>
Signed-off-by: nicole-lihui <nicole.li@daocloud.io>
Force-pushed from 299ae57 to 59b4f2b
Purpose
Closes #25120
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.