[CI] Stabilize test_cpu_offloading by waiting for async offload before cache reset #37335

Merged
tjtanaa merged 1 commit into vllm-project:main from ROCm:akaratza_stabilize_cpu_offload
Mar 18, 2026

Conversation

@AndreasKaratzas
Collaborator

@AndreasKaratzas AndreasKaratzas commented Mar 17, 2026

  • test_cpu_offloading[TRITON_ATTN-48] was intermittently failing because reset_prefix_cache() was called while the async GPU-to-CPU offload was still in progress. In that state the call returned False, and the test silently ignored the return value. As a result, the GPU prefix cache was never actually cleared, no new CPU stored events were produced, and assert subscriber.get_new_cpu_stored_events() failed on an empty list.

  • Add _wait_for_prefix_cache_reset(), which retries the reset until the blocks are freed and the reset succeeds, giving up after a timeout.

  • Use longer timeouts on ROCm where async offloads are slower under CI load.

  • Pass max_num_seqs=1 on ROCm to reduce batch variance.
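
The retry-until-success idea in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the helper added in this PR: the function name, the default timeout and poll interval, and the FakeEngine stand-in are all assumptions made for the example. The only behaviour taken from the description is that reset_prefix_cache() returns False while an async offload is still in flight.

```python
import time
from typing import Callable


def wait_for_prefix_cache_reset(
    reset_fn: Callable[[], bool],
    timeout_s: float = 10.0,        # hypothetical default, not taken from the PR
    poll_interval_s: float = 0.1,   # hypothetical default, not taken from the PR
) -> None:
    """Call reset_fn repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        # A False return means the reset was refused (e.g. offload still running),
        # so it must be checked rather than ignored.
        if reset_fn():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"prefix cache reset did not succeed within {timeout_s}s"
            )
        time.sleep(poll_interval_s)


class FakeEngine:
    """Hypothetical stand-in: the reset fails while an offload is 'in flight'."""

    def __init__(self, busy_calls: int) -> None:
        self.busy_calls = busy_calls
        self.reset_attempts = 0

    def reset_prefix_cache(self) -> bool:
        self.reset_attempts += 1
        if self.busy_calls > 0:
            self.busy_calls -= 1
            return False
        return True
```

In a test, the caller would pass the engine's bound method, for example wait_for_prefix_cache_reset(engine.reset_prefix_cache, timeout_s=...), and choose a larger timeout_s on platforms where offloads complete more slowly. Raising on timeout keeps a stuck offload from turning into a confusing assertion failure later in the test.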

Test plan

  • pytest -s -v tests/v1/kv_offload/test_cpu_offloading.py

cc @kenroche

… before cache reset

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@AndreasKaratzas AndreasKaratzas marked this pull request as ready for review March 17, 2026 18:35
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to stabilize the test_cpu_offloading test, which was intermittently failing on ROCm due to a race condition with asynchronous GPU-to-CPU offloading. The changes introduce a robust waiting mechanism, _wait_for_prefix_cache_reset, that polls until reset_prefix_cache() succeeds, ensuring the prefix cache is cleared only after the async offload is in a ready state. Additionally, timeouts have been increased for ROCm to accommodate slower operations under CI load, and a hard timeout has been added to event collection to prevent test hangs. The logic appears sound and the changes should effectively address the reported test flakiness. I found no high or critical issues in this pull request.
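
For readers unfamiliar with the pattern, a hard timeout on event collection might look something like the sketch below. It is an assumption-laden illustration only: the real test reads events from a subscriber object, whereas this example uses a standard-library queue.Queue as a stand-in, and collect_events and its parameters are hypothetical names.

```python
import queue
import time


def collect_events(
    event_queue: queue.Queue,
    idle_timeout_s: float = 0.2,   # hypothetical: stop once no event arrives for this long
    hard_timeout_s: float = 5.0,   # hypothetical: absolute upper bound on total wait
) -> list:
    """Drain events until the queue goes idle or the hard deadline passes."""
    events = []
    deadline = time.monotonic() + hard_timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Hard deadline reached: return what we have instead of hanging.
            break
        try:
            # Never block longer than the time left before the hard deadline.
            events.append(event_queue.get(timeout=min(idle_timeout_s, remaining)))
        except queue.Empty:
            break
    return events
```

The idle timeout alone is not enough when events keep trickling in, which is why the sketch also caps the total wait with a deadline computed once up front.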

@AndreasKaratzas AndreasKaratzas changed the title [ROCm][CI] Stabilize test_cpu_offloading by waiting for async offload before cache reset [CI] Stabilize test_cpu_offloading by waiting for async offload before cache reset Mar 17, 2026
@mergify mergify bot added the v1 label Mar 17, 2026
Collaborator

@tjtanaa tjtanaa left a comment

LGTM

@tjtanaa tjtanaa added the rocm (Related to AMD ROCm) and ready (ONLY add when PR is ready to merge/full CI is needed) labels Mar 18, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 18, 2026
@tjtanaa tjtanaa enabled auto-merge (squash) March 18, 2026 04:05
@tjtanaa tjtanaa merged commit ce2ef42 into vllm-project:main Mar 18, 2026
18 of 19 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Mar 18, 2026
@AndreasKaratzas AndreasKaratzas deleted the akaratza_stabilize_cpu_offload branch March 18, 2026 05:29
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
…e cache reset (vllm-project#37335)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
…e cache reset (vllm-project#37335)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
SouthWest7 pushed a commit to SouthWest7/vllm that referenced this pull request Mar 27, 2026
…e cache reset (vllm-project#37335)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
…e cache reset (vllm-project#37335)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
…e cache reset (vllm-project#37335)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
…e cache reset (vllm-project#37335)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026
…e cache reset (vllm-project#37335)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
…e cache reset (vllm-project#37335)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>
liuchenbing2026 pushed a commit to liuchenbing2026/vllm that referenced this pull request Apr 4, 2026
…e cache reset (vllm-project#37335)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm), v1

Projects

Status: Done

2 participants