[CI] Stabilize test_cpu_offloading by waiting for async offload before cache reset by AndreasKaratzas · Pull Request #37335 · vllm-project/vllm

AndreasKaratzas · 2026-03-17T18:35:11Z

test_cpu_offloading[TRITON_ATTN-48] was failing intermittently because reset_prefix_cache() was called while the async GPU-to-CPU offload was still in progress. In that state the call returns False, and the test silently ignored the return value. As a result, the GPU prefix cache was never actually cleared, no new CPU stored events were produced, and assert subscriber.get_new_cpu_stored_events() failed on an empty list.
Add _wait_for_prefix_cache_reset() that retries with a timeout until blocks are freed and the reset succeeds.
Use longer timeouts on ROCm where async offloads are slower under CI load.
Pass max_num_seqs=1 on ROCm to reduce batch variance.
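
The retry-with-timeout idea behind _wait_for_prefix_cache_reset() can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function signature, the default timeout and poll interval, and the FakeEngine stand-in are all assumptions made for the example. The only behavior taken from the description above is that reset_prefix_cache() returns False while an offload is still pending.

```python
import time
from typing import Callable


def wait_for_prefix_cache_reset(
    reset_fn: Callable[[], bool],
    timeout_s: float = 10.0,       # assumed default, not taken from the PR
    poll_interval_s: float = 0.1,  # assumed default, not taken from the PR
) -> None:
    """Call reset_fn repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        # A False return means the reset was refused (e.g. offload pending).
        if reset_fn():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"prefix cache reset did not succeed within {timeout_s}s"
            )
        time.sleep(poll_interval_s)


class FakeEngine:
    """Hypothetical stand-in: the reset fails while an 'offload' is pending."""

    def __init__(self, pending_polls: int) -> None:
        self.pending_polls = pending_polls
        self.calls = 0

    def reset_prefix_cache(self) -> bool:
        self.calls += 1
        if self.pending_polls > 0:
            self.pending_polls -= 1
            return False
        return True
```

With this shape, a test would call wait_for_prefix_cache_reset(engine.reset_prefix_cache, timeout_s=...) instead of calling reset_prefix_cache() once and discarding the result, and could pass a larger timeout_s on platforms where offloads are slower. Raising on timeout makes a stuck reset fail loudly at the reset step rather than later at the event assertion.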

Test plan

pytest -s -v tests/v1/kv_offload/test_cpu_offloading.py

gemini-code-assist

Code Review

This pull request aims to stabilize the test_cpu_offloading test, which was intermittently failing on ROCm due to a race condition with asynchronous GPU-to-CPU offloading. The changes introduce a robust waiting mechanism, _wait_for_prefix_cache_reset, that polls until reset_prefix_cache() succeeds, ensuring the prefix cache is cleared only after the async offload is in a ready state. Additionally, timeouts have been increased for ROCm to accommodate slower operations under CI load, and a hard timeout has been added to event collection to prevent test hangs. The logic appears sound and the changes should effectively address the reported test flakiness. I found no high or critical issues in this pull request.

tjtanaa

LGTM

[ROCm][CI] Stabilize test_cpu_offloading by waiting for async offload…

bb95453

… before cache reset Signed-off-by: Andreas Karatzas <akaratza@amd.com>

AndreasKaratzas marked this pull request as ready for review March 17, 2026 18:35

AndreasKaratzas requested review from ApostaC and orozery as code owners March 17, 2026 18:35

AndreasKaratzas mentioned this pull request Mar 17, 2026

[CI Failure]: mi325_1: V1 Test others #31631

Closed

3 tasks

gemini-code-assist bot reviewed Mar 17, 2026

AndreasKaratzas changed the title ~~[ROCm][CI] Stabilize test_cpu_offloading by waiting for async offload before cache reset~~ [CI] Stabilize test_cpu_offloading by waiting for async offload before cache reset Mar 17, 2026

mergify bot added the v1 label Mar 17, 2026

tjtanaa approved these changes Mar 18, 2026

tjtanaa added the rocm (Related to AMD ROCm) and ready (ONLY add when PR is ready to merge/full CI is needed) labels Mar 18, 2026

github-project-automation bot added this to AMD Mar 18, 2026

github-project-automation bot moved this to Todo in AMD Mar 18, 2026

tjtanaa enabled auto-merge (squash) March 18, 2026 04:05

tjtanaa merged commit ce2ef42 into vllm-project:main Mar 18, 2026
18 of 19 checks passed

github-project-automation bot moved this from Todo to Done in AMD Mar 18, 2026

AndreasKaratzas deleted the akaratza_stabilize_cpu_offload branch March 18, 2026 05:29

wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026

[CI] Stabilize test_cpu_offloading by waiting for async offload befor…

99f12a9

…e cache reset (vllm-project#37335) Signed-off-by: Andreas Karatzas <akaratza@amd.com>

fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026

[CI] Stabilize test_cpu_offloading by waiting for async offload befor…

df0b3b0

…e cache reset (vllm-project#37335) Signed-off-by: Andreas Karatzas <akaratza@amd.com>

SouthWest7 pushed a commit to SouthWest7/vllm that referenced this pull request Mar 27, 2026

[CI] Stabilize test_cpu_offloading by waiting for async offload befor…

d053597

…e cache reset (vllm-project#37335) Signed-off-by: Andreas Karatzas <akaratza@amd.com>

khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026

[CI] Stabilize test_cpu_offloading by waiting for async offload befor…

1433eff

…e cache reset (vllm-project#37335) Signed-off-by: Andreas Karatzas <akaratza@amd.com>

JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026

[CI] Stabilize test_cpu_offloading by waiting for async offload befor…

804ae94

…e cache reset (vllm-project#37335) Signed-off-by: Andreas Karatzas <akaratza@amd.com>

liuchenbing2026 pushed a commit to liuchenbing2026/vllm that referenced this pull request Apr 4, 2026

[CI] Stabilize test_cpu_offloading by waiting for async offload befor…

badb227

…e cache reset (vllm-project#37335) Signed-off-by: Andreas Karatzas <akaratza@amd.com>

tjtanaa merged 1 commit into vllm-project:main from ROCm:akaratza_stabilize_cpu_offload
