Fix streaming token ids data loss under load#19977
ishandhanani merged 4 commits into sgl-project:main from
Conversation
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Thanks for the fix @vladnosiv, this is important. For completeness, can you add a comment or update the PR description with a before/after?
Added a description in the PR and described the problem in more detail in the linked issue.
/tag-and-rerun-ci |
Only a single CI test failed in the previous run.
@hnyls2002 Hi!
Seems like in |
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com> Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
When handling multiple scheduler batches before `_wait_one_response` is naturally scheduled, intermediate outputs accumulate in `state.out_list`. Previously, the code consumed only the last element (`state.out_list[-1]`). While this masked the issue for cumulative text consumers, it introduced a severe bug for consumers reading `output_ids`. Since streaming token IDs are emitted as disjoint deltas, dropping intermediate entries from `out_list` caused silent, unrecoverable data loss (missing tokens, missing formatting in tool-calling, and output corruption).

This race condition is significantly amplified under high concurrency, especially when using `--skip-tokenizer-init` (e.g., in Dynamo environments), where the absence of a CPU-bound detokenizer allows ZMQ messages to arrive at wire speed and accumulate rapidly.

Proposed Changes

- `out_list` consumption: differentiate between streaming and non-streaming requests. For streams, we now drain and yield all pending output dicts sequentially. For non-streams, we preserve the previous behavior of taking only the latest cumulative output (`state.out_list[-1:]`).
- Mark only the final yielded chunk as last (`is_last`) when the state is `finished`.

Resolves #19976
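To illustrate the proposed behavior, here is a minimal, simplified sketch of the drain logic. The names (`ReqState`, `wait_one_response`, the `meta_info`/`is_last` fields) follow the PR description but are illustrative assumptions, not the actual sglang implementation:

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ReqState:
    # Hypothetical, simplified model of the per-request state:
    # the scheduler appends output dicts; a waiter consumes them.
    out_list: list = field(default_factory=list)
    finished: bool = False
    event: asyncio.Event = field(default_factory=asyncio.Event)


async def wait_one_response(state: ReqState, stream: bool):
    """Drain pending outputs without dropping intermediate deltas."""
    while True:
        await state.event.wait()
        state.event.clear()

        if stream:
            # Streaming: token-ID outputs are disjoint deltas, so every
            # accumulated entry must be yielded, not just the last one.
            outs, state.out_list = state.out_list, []
        else:
            # Non-streaming: outputs are cumulative, so keeping only the
            # latest entry preserves the previous behavior.
            outs, state.out_list = state.out_list[-1:], []

        for i, out in enumerate(outs):
            # Only the final chunk of a finished request is marked last.
            out["meta_info"]["is_last"] = state.finished and i == len(outs) - 1
            yield out

        if state.finished:
            return
```

With the old `out_list[-1]` behavior, two deltas arriving before the waiter wakes would lose the first one; here both are yielded in order and only the final chunk carries `is_last`.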
cc @ishandhanani