Fix: PrefillAdder.add_chunked_req with negative rem_total_tokens with pp #13698

Closed
strgrb wants to merge 2 commits into sgl-project:main from antgroup:fix/pp-negative-rem-tokens

Conversation

Collaborator

@strgrb strgrb commented Nov 21, 2025

Motivation

I need to run the Ring-1T and Ling-1T models with tp8pp4 and ran into an error:
(screenshot of the error message)
The exact error may differ, but new-token is negative every time. It does not occur without pp.
This PR tries to fix the problem.

I eventually found that self.rem_total_tokens becomes negative in PrefillAdder.add_chunked_req. It is computed as available_and_evictable - self.rem_total_token_offset, and the budgets for running requests are added to self.rem_total_token_offset.
After last_batch and running_batch are merged, this budget increases; if it grows beyond the available size, PrefillAdder.add_chunked_req computes a negative extend input length.
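A minimal sketch of the failure mode described above. The function and variable names are simplified stand-ins based on this description, not the actual SGLang code:

```python
# Sketch of the failure mode; names are illustrative, not the real
# SGLang implementation.

def rem_total_tokens(available_and_evictable: int,
                     rem_total_token_offset: int) -> int:
    # Budgets reserved for running requests accumulate in
    # rem_total_token_offset; after last_batch and running_batch are
    # merged, the offset can exceed the available pool size.
    return available_and_evictable - rem_total_token_offset

def chunked_extend_len(rem_tokens: int, remaining_input_len: int) -> int:
    # The extend length is capped by the remaining token budget; with a
    # negative budget, the result goes negative instead of pausing.
    return min(remaining_input_len, rem_tokens)

# Healthy case: budget is positive, extend length is positive.
assert chunked_extend_len(rem_total_tokens(1000, 400), 512) == 512

# Failure case: the offset grew past the pool size, so the budget and
# the resulting extend input length are both negative.
assert chunked_extend_len(rem_total_tokens(1000, 1200), 512) == -200
```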

Modifications

Since chunked_req's req_pool_idx is freed in Scheduler.get_next_batch_to_run, we should check the remaining tokens here while excluding the budgets for running requests, and avoid freeing chunked_req's req_pool_idx. Since the remaining budget is not sufficient now, it is time to run decoding instead, and chunked_req is left pending until the budget is sufficient again.
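The check above could be sketched as follows. The function and parameter names are hypothetical, chosen to mirror the description, not the actual patch:

```python
# Hypothetical guard for the fix described above; names are illustrative,
# not the actual SGLang code.

def should_pause_chunked_prefill(available_tokens: int,
                                 running_budget: int,
                                 next_chunk_len: int) -> bool:
    # Check the remaining tokens while excluding the budget reserved for
    # running requests. If the leftover cannot cover the next chunk, keep
    # chunked_req pending (do not free its req_pool_idx) and decode first.
    return available_tokens - running_budget < next_chunk_len

# Budget left (1000 - 400 = 600) covers a 512-token chunk: keep prefilling.
assert should_pause_chunked_prefill(1000, 400, 512) is False

# Running requests reserve more than the pool holds: pause and decode.
assert should_pause_chunked_prefill(1000, 1200, 512) is True
```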

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@ShangmingCai
Collaborator

ShangmingCai commented Nov 21, 2025

Will review this PR today. @strgrb Can you test whether this bug is still happening in #11852?

cc: @XucSh @whybeyoung

@strgrb
Collaborator Author

strgrb commented Nov 21, 2025

Will review this PR today. @strgrb Can you test whether this bug is still happening in #11852?

cc: @XucSh @whybeyoung

OK, I'll try it.

Collaborator

@ShangmingCai ShangmingCai left a comment


The changes don't look like they are related to PP; they are more likely related to hybrid memory.

CC: @xiezhq-hermann @yizhang2077

@strgrb
Collaborator Author

strgrb commented Nov 24, 2025

The changes don't look like they are related to PP; they are more likely related to hybrid memory.

CC: @xiezhq-hermann @yizhang2077

It's not related to hybrid memory; hybrid is only involved in the budget calculation. The real cause is the budget itself: the budget for the chunked request becomes negative in the pp case.

@strgrb
Collaborator Author

strgrb commented Nov 24, 2025

@ShangmingCai Should I move this budget-check logic into PrefillAdder, where rem_total_tokens() can be used directly?
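A rough sketch of what that move could look like. The class shape, fields, and method names are guesses based on the discussion, not the real PrefillAdder API:

```python
# Illustrative sketch only; fields and names are assumptions based on the
# discussion, not the real SGLang PrefillAdder.

class PrefillAdder:
    def __init__(self, available_and_evictable: int,
                 rem_total_token_offset: int):
        self.available_and_evictable = available_and_evictable
        self.rem_total_token_offset = rem_total_token_offset

    def rem_total_tokens(self) -> int:
        # Remaining budget after subtracting what running requests reserve.
        return self.available_and_evictable - self.rem_total_token_offset

    def can_add_chunked_req(self, next_chunk_len: int) -> bool:
        # The budget check lives inside PrefillAdder and reuses
        # rem_total_tokens() directly, as proposed in the comment above.
        return self.rem_total_tokens() >= next_chunk_len

# Enough budget for a 512-token chunk.
assert PrefillAdder(1000, 400).can_add_chunked_req(512)

# Negative budget: the chunked request must wait.
assert not PrefillAdder(1000, 1200).can_add_chunked_req(512)
```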

@ShangmingCai
Collaborator

@ShangmingCai Should I move this budget-check logic into PrefillAdder, where rem_total_tokens() can be used directly?

@strgrb If it is related to pp only, maybe you could try reverting this commit locally: #13144. If the bug still exists, then it might not be related to pp. Also, you could try changing the attention backend to test whether this is a flashinfer bug.

@xiezhq-hermann
Collaborator

Let's get this problem fixed after this refactoring, to simplify the logic.
