
[BugFix]: Fix async scheduler transfer exceed KV cache #3318

Merged: hsliuustc0106 merged 3 commits into vllm-project:main from princepride:fix-scheduler-bug on May 3, 2026

@princepride (Collaborator) commented May 3, 2026

Purpose

AsyncScheduler._update_after_schedule() optimistically advances request.num_computed_tokens to include output placeholders before GPU execution has confirmed those tokens. The KV transfer logic used this inflated num_computed_tokens as seq_len, so the DiT stage read uninitialized KV data at the placeholder positions. The result was wrong conditioning and different generated pixels.

Fix: Override _update_request_with_output in OmniARScheduler to manage the async placeholder protocol explicitly, and use the confirmed computed-token count (num_computed_tokens - num_output_placeholders) for every KV transfer seq_len value.
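A minimal sketch of the seq_len arithmetic described above, assuming only the two counters named in this PR (num_computed_tokens and num_output_placeholders). RequestState and kv_transfer_seq_len are hypothetical stand-ins, not vLLM or vLLM-Omni APIs; get_confirmed_num_computed_tokens echoes the _get_confirmed_num_computed_tokens helper mentioned in review, but its body here is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RequestState:
    """Hypothetical stand-in for the two request counters relevant to this fix."""

    num_computed_tokens: int  # advanced optimistically at schedule time
    num_output_placeholders: int = 0  # scheduled tokens not yet confirmed by the GPU


def get_confirmed_num_computed_tokens(request: RequestState) -> int:
    """Tokens whose KV entries are confirmed, i.e. excluding in-flight placeholders."""
    confirmed = request.num_computed_tokens - request.num_output_placeholders
    if confirmed < 0:
        raise ValueError(
            f"placeholders ({request.num_output_placeholders}) exceed "
            f"computed tokens ({request.num_computed_tokens})"
        )
    return confirmed


def kv_transfer_seq_len(request: RequestState) -> int:
    # Buggy version (assumed): seq_len = request.num_computed_tokens, which counts
    # placeholder positions whose KV slots have not been written yet.
    # Fixed version: transfer only confirmed positions.
    return get_confirmed_num_computed_tokens(request)


# One token in flight: 128 confirmed positions plus 1 placeholder.
req = RequestState(num_computed_tokens=129, num_output_placeholders=1)
print(kv_transfer_seq_len(req))  # 128
```

Raising on a negative count is a design choice for the sketch: it surfaces a counter mismatch loudly instead of silently truncating the transfer.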

Test Plan

pytest -v tests/e2e/offline_inference/test_bagel_text2img.py -m "advanced_model" --run-level "advanced_model"

Result

=========================================================================== test session starts ===========================================================================
platform linux -- Python 3.12.13, pytest-9.0.3, pluggy-1.6.0 -- /usr/bin/python3.12
cachedir: .pytest_cache
rootdir: /proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni
configfile: pyproject.toml
plugins: forked-1.6.0, timeout-2.4.0, hydra-core-1.3.2, asyncio-1.3.0, rerunfailures-16.1, mock-3.15.1, shard-0.1.2, anyio-4.13.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items                                                                                                                                                         
Running 2 items in this shard: tests/e2e/offline_inference/test_bagel_text2img.py::test_bagel_text2img_shared_memory_connector, tests/e2e/offline_inference/test_bagel_text2img.py::test_bagel_text2img_mooncake_connector

tests/e2e/offline_inference/test_bagel_text2img.py::test_bagel_text2img_shared_memory_connector PASSED                                                              [ 50%]
tests/e2e/offline_inference/test_bagel_text2img.py::test_bagel_text2img_mooncake_connector PASSED                                                                   [100%]

============================================================================ warnings summary =============================================================================
vllm_omni/version.py:55
  /proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
   --> vLLM-Omni version 0.18.1.dev411+gd49356e4c.d20260503
   --> vLLM version 0.20.0
  This will likely cause compatibility issues.
    warn_if_misaligned_vllm_version()

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

../../../../../usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:365: 14 warnings
  /usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:365: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
--- Running Summary
=============================================================== 2 passed, 17 warnings in 180.57s (0:03:00) ================================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

Signed-off-by: princepride <wangzhipeng628@gmail.com>
@princepride requested a review from hsliuustc0106 as a code owner May 3, 2026 06:56

@princepride (Collaborator Author):

Related: #3306. @alex-jw-brooks PTAL; I think this fix can resolve the bug.

@princepride added the merge-test label (to trigger buildkite merge test CI) May 3, 2026
@@ -622,7 +668,8 @@ def _free_request(self, request: Request, delay_free_blocks: bool = False) -> di
)
else:
Collaborator (review comment on the lines above):

Related: _replace_session_with_streaming_update in omni_scheduler_mixin.py resets num_computed_tokens = 0 but does not reset num_output_placeholders. If this helper is ever called on a request that went through that path, it would return a negative value. Probably worth resetting num_output_placeholders = 0 there too for consistency.
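To illustrate the concern, here is a toy sketch with hypothetical counters and reset functions (not the actual omni_scheduler_mixin.py code): resetting only num_computed_tokens while a placeholder is still outstanding drives the confirmed count negative, whereas resetting both counters keeps it at zero.

```python
from dataclasses import dataclass


@dataclass
class RequestState:
    """Hypothetical stand-in for the two request counters discussed above."""

    num_computed_tokens: int
    num_output_placeholders: int = 0


def confirmed_tokens(request: RequestState) -> int:
    return request.num_computed_tokens - request.num_output_placeholders


def reset_counters_partial(request: RequestState) -> None:
    # Models the path described in the review: only num_computed_tokens is reset.
    request.num_computed_tokens = 0


def reset_counters_full(request: RequestState) -> None:
    # Suggested change: reset both counters together so their difference stays >= 0.
    request.num_computed_tokens = 0
    request.num_output_placeholders = 0


req = RequestState(num_computed_tokens=50, num_output_placeholders=1)
reset_counters_partial(req)
print(confirmed_tokens(req))  # -1: a negative "confirmed" length

req = RequestState(num_computed_tokens=50, num_output_placeholders=1)
reset_counters_full(req)
print(confirmed_tokens(req))  # 0
```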

@hsliuustc0106 (Collaborator) left a review:

BLOCKER scan:

| Category | Result |
|---|---|
| Correctness | PASS |
| Reliability/Safety | PASS |
| Breaking Changes | PASS |
| Test Coverage | PASS |
| Documentation | PASS |
| Security | PASS |

Non-blocking suggestions:

  1. Missing regression test: This is a bug fix for AsyncScheduler's KV transfer issue. Consider adding a regression test that specifically validates the fix, e.g., a test that fails with the buggy behavior (reading uninitialized KV) but passes with the fix. The existing Bagel text2img tests validate that the output is correct, but they don't explicitly test the KV transfer edge case.

  2. Behavior change in _update_request_with_output: You've switched from AsyncScheduler's implementation to SyncScheduler's implementation for token appending. This is intentional (to avoid the eager cache_blocks call), but it's worth documenting this deviation from the parent class behavior for future maintainers.

  3. Consider adding a helper method: The pattern confirmed_computed = self._get_confirmed_num_computed_tokens(request) is repeated 3 times. It could be extracted into a local helper for cleaner code, though this is minor.
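One possible shape for the regression test in suggestion 1, written as a self-contained pytest-style function against a toy model of the async placeholder protocol. SimRequest, schedule_decode_step, and confirm_decode_step are hypothetical, and the model assumes each confirmed step makes exactly one more KV slot valid. A real test would drive OmniARScheduler itself; this sketch only pins down the invariant that the transferred seq_len must not exceed the KV actually written.

```python
from dataclasses import dataclass


@dataclass
class SimRequest:
    """Toy request: scheduler counters plus a ground-truth count of written KV slots."""

    num_computed_tokens: int
    num_output_placeholders: int = 0
    num_kv_written: int = 0


def schedule_decode_step(req: SimRequest) -> None:
    # Async scheduling: counters advance before the GPU has run the step.
    req.num_computed_tokens += 1
    req.num_output_placeholders += 1


def confirm_decode_step(req: SimRequest) -> None:
    # GPU output arrives: the placeholder is resolved and its KV slot is written.
    req.num_output_placeholders -= 1
    req.num_kv_written += 1


def confirmed(req: SimRequest) -> int:
    return req.num_computed_tokens - req.num_output_placeholders


def test_kv_transfer_never_reads_unwritten_slots():
    req = SimRequest(num_computed_tokens=16, num_kv_written=16)
    for _ in range(8):
        schedule_decode_step(req)
        # The old seq_len (num_computed_tokens) runs ahead of the written KV here.
        assert req.num_computed_tokens > req.num_kv_written
        # The confirmed seq_len stays within the written KV.
        assert confirmed(req) <= req.num_kv_written
        confirm_decode_step(req)
    assert confirmed(req) == req.num_kv_written == 24
```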

The fix looks correct and well-documented.

Signed-off-by: princepride <wangzhipeng628@gmail.com>
@princepride (Collaborator Author):

@natureofnature now the img2img task produces a picture full of noise:

python3 examples/offline_inference/bagel/end2end.py \
        --prompts "Change the grass color to red" \
        --modality img2img --step 15 \
        --image-path 2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg
[image]

@princepride (Collaborator Author), following up:


Resolved!
[image]

@hsliuustc0106 added the ready label (to trigger buildkite CI) May 3, 2026
@hsliuustc0106 (Collaborator) left a review:

LGTM, but I think we need a more concise design doc for multi-stage model serving.

@hsliuustc0106 (Collaborator):

cc @tzhouam @alex-jw-brooks @amy-why-3459

@hsliuustc0106 merged commit e50a066 into vllm-project:main May 3, 2026 (8 checks passed)
@alex-jw-brooks (Contributor) commented May 3, 2026:

Thanks @princepride! LGTM; this is very similar to what I had been thinking as well. I'll rebase the other PR on main later today.

@hsliuustc0106 I agree. I think we need that, plus guidance on how to tune your own config for your device if you aren't using the same setup as the deploy configs (ideally with some utilities to do that automatically). I can help with some of these after the 0.20.0 release if others don't pick them up first.

sphinxkkkbc pushed a commit to sphinxkkkbc/vllm-omni that referenced this pull request May 4, 2026

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: sphinxkkkbc <binchengkang8@gmail.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

Labels: merge-test (label to trigger buildkite merge test CI), ready (label to trigger buildkite CI)
