[BugFix]: Fix async scheduer transfer exceed KV cache#3318
Conversation
Signed-off-by: princepride <wangzhipeng628@gmail.com>
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
Related: #3306 @alex-jw-brooks PTAL, I think this fix can resolve the bug |
| @@ -622,7 +668,8 @@ def _free_request(self, request: Request, delay_free_blocks: bool = False) -> di | |||
| ) | |||
| else: | |||
There was a problem hiding this comment.
Related: _replace_session_with_streaming_update in omni_scheduler_mixin.py resets num_computed_tokens = 0 but does not reset num_output_placeholders. If this helper is ever called on a request that went through that path, it would return a negative value. Probably worth resetting num_output_placeholders = 0 there too for consistency.
hsliuustc0106
left a comment
There was a problem hiding this comment.
BLOCKER scan:
| Category | Result |
|---|---|
| Correctness | PASS |
| Reliability/Safety | PASS |
| Breaking Changes | PASS |
| Test Coverage | PASS |
| Documentation | PASS |
| Security | PASS |
Non-blocking suggestions:
-
Missing regression test: This is a bug fix for AsyncScheduler's KV transfer issue. Consider adding a regression test that specifically validates the fix - e.g., a test that would fail with the buggy behavior (reading uninitialized KV) but passes with the fix. The existing Bagel text2img tests validate the output is correct, but don't explicitly test the KV transfer correctness edge case.
-
Behavior change in
_update_request_with_output: You've switched from AsyncScheduler's implementation to SyncScheduler's implementation for token appending. This is intentional (to avoid the eager cache_blocks call), but it's worth documenting this deviation from the parent class behavior for future maintainers. -
Consider adding a helper method: The pattern
confirmed_computed = self._get_confirmed_num_computed_tokens(request)is repeated 3 times. Could extract to a local helper for cleaner code, though this is minor.
The fix looks correct and well-documented.
Signed-off-by: princepride <wangzhipeng628@gmail.com>
|
@natureofnature now img2img task will get a full of noise picture:
|
|
hsliuustc0106
left a comment
There was a problem hiding this comment.
lgtm, but I think we need a more concise design doc for multi-stage model serving
|
Thanks @princepride! LGTM, this is pretty much similar to what I had been thinking as well. I'll rebase the other PR on main later today. @hsliuustc0106 I agree, I think we need that and guidance on how to tune your own config for your device if you aren't using the same setup as the deploy configs (ideally with some utils to help do that automatically). Can help on some of these after 0.20.0 release if others don't pick them up first |



Purpose
AsyncScheduler._update_after_schedule()optimistically advancesrequest.num_computed_tokenswith output placeholders before GPU execution confirms the tokens. The KV transfer logic was using this inflatednum_computed_tokensasseq_len, causing the DiT stage to read uninitialized KV data for placeholder positions — resulting in wrong conditioning and different generated pixels.Fix: Override
_update_request_with_outputinOmniARSchedulerto explicitly manage the async placeholder protocol, and usenum_computed_tokens - num_output_placeholders(confirmed computed tokens) for all KV transferseq_lenvalues.Test Plan
Result