Fix: PrefillAdder.add_chunked_req with negative rem_total_tokens with pp #13698
strgrb wants to merge 2 commits into sgl-project:main
Conversation
Will review this PR today. @strgrb Can you test whether this bug is still happening in #11852? cc: @XucSh @whybeyoung |
OK, I'll try it. |
ShangmingCai left a comment
The changes don't look related to PP; they look more likely related to hybrid memory.
It's not related to hybrid memory; the change only touches the budget calculation, which happens to be where hybrid is used. The real cause is the budget itself, i.e. the budget for the chunked request becomes negative when pp is used. |
|
@ShangmingCai Should I move this budget check logic to |
@strgrb If it is related to pp only, maybe you could try reverting this commit locally: #13144. If the bug still exists, then it might not be related to pp. Also, you can try changing the attention backend to test whether this is a bug in flashinfer. |
|
Let's get this problem fixed after this refactoring to simplify the logic |
Motivation
I need to run the Ring-1T and Ling-1T models with tp8pp4, and I ran into an error.

The error message may differ, but new-token is negative every time. The error does not appear without pp.
This PR tries to fix this problem.
Finally, I found that self.rem_total_tokens becomes negative in PrefillAdder.add_chunked_req. It is computed as available_and_evictable - self.rem_total_token_offset, and the budgets for running requests are added to self.rem_total_token_offset. After last_batch and running_batch are merged, this budget increases, and if it grows larger than the available size, PrefillAdder.add_chunked_req computes a negative extend input length.
Modifications
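The budget arithmetic described under Motivation can be made concrete with a small sketch. This is a toy model under stated assumptions, not the actual SGLang implementation: the class BudgetSketch and the helpers reserve_for_running, can_extend_chunked_req and chunk_extend_len are hypothetical names introduced for illustration, and only rem_total_tokens, rem_total_token_offset and available_and_evictable mirror names used in this PR. The numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class BudgetSketch:
    """Toy stand-in for the budget bookkeeping (hypothetical, not PrefillAdder)."""

    available_and_evictable: int      # free plus evictable KV-cache tokens
    rem_total_token_offset: int = 0   # tokens reserved for running requests

    @property
    def rem_total_tokens(self) -> int:
        # Same formula as in the description: available minus reserved.
        return self.available_and_evictable - self.rem_total_token_offset

    def reserve_for_running(self, tokens: int) -> None:
        # Each running request adds its decode budget to the offset.
        self.rem_total_token_offset += tokens

    def can_extend_chunked_req(self) -> bool:
        # Hypothetical guard: only continue the chunked request when
        # tokens remain after the running requests are accounted for.
        return self.rem_total_tokens > 0


def chunk_extend_len(budget: BudgetSketch, remaining_prompt: int, chunk_size: int) -> int:
    # Assumed shape of the computation: the next chunk is capped by the
    # remaining budget, so a negative budget yields a negative length.
    return min(remaining_prompt, chunk_size, budget.rem_total_tokens)


budget = BudgetSketch(available_and_evictable=1000)
budget.reserve_for_running(600)   # budget of the existing running_batch
budget.reserve_for_running(700)   # extra budget after last_batch is merged in

print(budget.rem_total_tokens)                 # -300
print(chunk_extend_len(budget, 4096, 2048))    # -300
print(budget.can_extend_chunked_req())         # False
```

With a guard of this kind, the scheduler would skip the chunked request for this round and let decoding proceed until enough tokens are released.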
Since chunked_req's req_pool_idx is freed in Scheduler.get_next_batch_to_run, we should check the remaining tokens there, excluding the budgets for running requests, and avoid freeing chunked_req's req_pool_idx when the budget is insufficient. Since the budget is not enough at this point, it is time to decode, and chunked_req is held back until enough budget is available.
Accuracy Tests
Benchmarking and Profiling
Checklist