
[CI] Add persistent cache mounts and fix test download paths#36951

Open
AndreasKaratzas wants to merge 5 commits into vllm-project:main from ROCm:akaratza_revamp_test_cache


@AndreasKaratzas
Collaborator

@AndreasKaratzas AndreasKaratzas commented Mar 13, 2026

  • Adds persistent cache volume mounts for CI test containers: MODELSCOPE_CACHE, VLLM_TEST_CACHE, VLLM_CACHE_ROOT, and VLLM_MEDIA_CACHE.
  • Routes all test data downloads (dummy models, GSM8K datasets, tiktoken data, Prithvi GeoTIFFs) through VLLM_TEST_CACHE instead of /tmp or scattered locations.
  • Fixes hardcoded HF cache path in test_extraction.py to use HF_HOME env var.
  • Fixes snapshot_download bypass in test_token_in_token_out.py by removing explicit cache_dir that skipped HF cache.
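
The routing described in the list above can be pictured as a small helper that resolves a per-purpose subdirectory under VLLM_TEST_CACHE. This is a minimal sketch, not the code in this PR: the helper name `resolve_test_cache_dir` and the fallback to the system temp directory when the variable is unset are assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path


def resolve_test_cache_dir(subdir: str) -> Path:
    """Return (and create) a cache subdirectory for test downloads.

    Hypothetical helper: both the name and the fallback are assumptions.
    """
    root = os.environ.get("VLLM_TEST_CACHE")
    if not root:
        # Assumed fallback when no persistent cache volume is mounted.
        root = os.path.join(tempfile.gettempdir(), "vllm_test_cache")
    path = Path(root) / subdir
    path.mkdir(parents=True, exist_ok=True)
    return path
```

With a layout like this, a test would ask for `resolve_test_cache_dir("gsm8k")` or `resolve_test_cache_dir("tiktoken")` and never hardcode /tmp itself.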

Changes

| File | Change |
| --- | --- |
| .buildkite/scripts/hardware_ci/run-amd-test.sh | Add cache volume mounts + env vars for CI containers |
| tests/conftest.py | Route dummy model creation through VLLM_TEST_CACHE |
| tests/evals/gsm8k/gsm8k_eval.py | Cache GSM8K downloads to VLLM_TEST_CACHE/gsm8k/ |
| tests/evals/gpt_oss/test_gpqa_correctness.py | Cache tiktoken data to VLLM_TEST_CACHE/tiktoken/ |
| tests/plugins/.../prithvi_processor.py | Cache URL-fetched GeoTIFFs to VLLM_TEST_CACHE/prithvi/ |
| tests/entrypoints/openai/test_token_in_token_out.py | Use default HF cache instead of /tmp |
| tests/v1/kv_connector/.../test_extraction.py | Use HF_HOME env var instead of hardcoded path |
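
For the HF_HOME change listed above, one way to avoid a hardcoded path is to derive the hub cache location from the environment. The sketch below is an illustration under assumptions, not the code from test_extraction.py: the function name is hypothetical, and it deliberately ignores the HF_HUB_CACHE and XDG_CACHE_HOME overrides that huggingface_hub also honors.

```python
import os
from pathlib import Path


def hf_hub_cache_dir() -> Path:
    """Resolve the Hugging Face hub cache directory from HF_HOME.

    Simplified on purpose: HF_HUB_CACHE and XDG_CACHE_HOME are not consulted.
    """
    # HF_HOME falls back to ~/.cache/huggingface when it is not set.
    hf_home = os.environ.get("HF_HOME") or os.path.join(
        "~", ".cache", "huggingface"
    )
    # Downloaded model snapshots live in the "hub" subdirectory.
    return Path(hf_home).expanduser() / "hub"
```

In real test code, reading the resolved location from huggingface_hub itself is probably more robust than re-deriving it.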

Note: The VLLM_MEDIA_CACHE feature code (env registration + MediaConnector caching) is in a separate PR: #37123

cc @kenroche

…URLs

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces persistent caching for various assets in CI to improve performance, which is a great enhancement. The changes correctly utilize environment variables to configure cache paths. My review focuses on potential race conditions in the new caching logic. I've identified two instances where concurrent writes to the cache could lead to corrupted files, one in production code which is critical, and a similar one in test code. I've provided suggestions to make the file writing atomic and prevent these race conditions.

Comment on lines +140 to +145
    def _put_cached_bytes(self, url: str, data: bytes) -> None:
        """Store downloaded bytes in the cache."""
        if not self._media_cache_dir:
            return
        cache_path = self._media_cache_path(url)
        cache_path.write_bytes(data)
Contributor

critical

The current implementation of _put_cached_bytes has a race condition. If multiple processes or threads attempt to download and cache the same URL concurrently, they could write to the same cache file simultaneously, leading to a corrupted file. This can cause subsequent requests to fail when reading the corrupted cache entry.

To fix this, you should write the downloaded data to a temporary file within the cache directory and then perform an atomic rename to the final cache path. This ensures that readers will only ever see a complete file.

Note: The suggested code requires importing the tempfile module at the top of the file.

    def _put_cached_bytes(self, url: str, data: bytes) -> None:
        """Store downloaded bytes in the cache."""
        if not self._media_cache_dir:
            return
        cache_path = self._media_cache_path(url)
        # To prevent race conditions, write to a temporary file and then atomically rename.
        with tempfile.NamedTemporaryFile(mode="wb", dir=self._media_cache_dir, delete=False) as tmp_file:
            tmp_file.write(data)
            tmp_path = tmp_file.name
        try:
            os.rename(tmp_path, cache_path)
        except OSError:
            # Another process might have already written the file.
            os.remove(tmp_path)
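
As a standalone variant of the suggestion above, the sketch below uses `os.replace` instead of `os.rename`. On Windows, `os.rename` raises if the destination already exists, whereas `os.replace` overwrites it on every platform. The helper name `atomic_write_bytes` is hypothetical and not part of this PR.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(dest: Path, data: bytes) -> None:
    """Write data to dest so that readers never observe a partial file."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the destination so the final rename
    # stays on one filesystem (a cross-device rename would fail).
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
        # Publish the complete file in one step; an existing dest is replaced.
        os.replace(tmp_name, dest)
    except BaseException:
        # Do not leave a stray temp file behind if anything went wrong.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

If two writers race on the same URL, the last `os.replace` wins, but since both wrote complete files the cache entry is valid either way.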

Collaborator Author

Done :)

Comment on lines +128 to +131
    resp = urllib.request.urlopen(file_path)
    with open(cached_path, "wb") as f:
        f.write(resp.read())
    path = cached_path
Contributor

high

There's a potential race condition here. If multiple tests running in parallel attempt to download and cache the same URL, they could write to cached_path simultaneously, resulting in a corrupted file. To ensure atomicity, it's safer to write the downloaded content to a temporary file and then atomically rename it to the final destination.

Suggested change
    - resp = urllib.request.urlopen(file_path)
    - with open(cached_path, "wb") as f:
    -     f.write(resp.read())
    - path = cached_path
    + resp = urllib.request.urlopen(file_path)
    + # To prevent race conditions, write to a temporary file and then atomically rename.
    + with tempfile.NamedTemporaryFile(mode="wb", dir=cache_dir, delete=False) as tmp_file:
    +     tmp_file.write(resp.read())
    +     tmp_path = tmp_file.name
    + try:
    +     os.rename(tmp_path, cached_path)
    + except OSError:
    +     # Another process might have already written the file.
    +     os.remove(tmp_path)
    + path = cached_path

Collaborator Author

Done :)

@AndreasKaratzas added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Mar 13, 2026
…URLs

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Member

@DarkLight1337 DarkLight1337 left a comment

I think we should avoid modifying the main vLLM code with a media cache. We already have fixtures such as image_urls which pre-download media files.

@AndreasKaratzas
Collaborator Author

AndreasKaratzas commented Mar 13, 2026

@DarkLight1337 You're right that image_urls / LocalAssetServer already handles a lot of the test media through pre-downloaded assets. That said, I'd like to keep the VLLM_MEDIA_CACHE piece as a separate discussion because I think it still has value. Right now if a user sends the same image URL in 10 different requests, MediaConnector downloads it 10 times. URL-level caching is a pretty natural optimization, it's opt-in (disabled by default, zero behavior change), and the implementation is minimal (~20 lines). It mirrors what get_vllm_public_assets already does for S3 assets, just generalized to arbitrary URLs.

In other words, this feature can only be useful if you define the env var before tests.
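
To make the opt-in behaviour described above concrete, here is a rough sketch of URL-level caching keyed by a hash of the URL. It is not the MediaConnector code from the separate PR: the class name `UrlByteCache`, the SHA-256 file naming, and the injected `fetch` callable are all assumptions, and only the `VLLM_MEDIA_CACHE` variable name comes from this discussion.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional


class UrlByteCache:
    """Opt-in on-disk cache for bytes fetched from URLs (illustrative only)."""

    def __init__(self, cache_dir: Optional[str] = None) -> None:
        # Disabled unless a directory is passed or VLLM_MEDIA_CACHE is set.
        cache_dir = cache_dir or os.environ.get("VLLM_MEDIA_CACHE")
        self._dir = Path(cache_dir) if cache_dir else None
        if self._dir is not None:
            self._dir.mkdir(parents=True, exist_ok=True)

    def _path(self, url: str) -> Path:
        # Hash the URL so any URL maps to a safe file name (assumed scheme).
        return self._dir / hashlib.sha256(url.encode("utf-8")).hexdigest()

    def get(self, url: str, fetch: Callable[[str], bytes]) -> bytes:
        if self._dir is None:
            # Cache disabled: every call downloads, as it does today.
            return fetch(url)
        path = self._path(url)
        if path.exists():
            return path.read_bytes()
        data = fetch(url)
        # Write to a temp file first, then publish it atomically.
        fd, tmp_name = tempfile.mkstemp(dir=self._dir)
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
        os.replace(tmp_name, path)
        return data
```

With the cache enabled, ten requests for the same URL would call `fetch` once; with it disabled, the object is a pass-through.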

@DarkLight1337
Member

I see your point. But it should be done as a separate RFC / feature request

@AndreasKaratzas
Collaborator Author

AndreasKaratzas commented Mar 14, 2026

> I see your point. But it should be done as a separate RFC / feature request

@DarkLight1337 I can make it ROCm-specific. I just thought that upstream would also benefit from this. Essentially, this is a completely optional data path that is only enabled if you set that env var; otherwise execution is exactly as it is right now. It also really helps if there are network issues on a machine, because everything is under a specific cache path that can be stored on NFS. If there is any other recommendation in that direction, let me know. I'm certainly open to refactoring this.

@DarkLight1337
Member

I think this requires a broader discussion, so please open an RFC for the media download cache specifically, and I'll tag relevant people.

@AndreasKaratzas
Collaborator Author

> I think this requires a broader discussion so please open a RFC for media download cache specifically, and I'll tag relevant people.

Certainly :) #37075

@AndreasKaratzas changed the title from "[CI] Add persistent cache mounts for all CI test downloads and media URLs" to "[CI] Add persistent cache mounts and fix test download paths" on Mar 15, 2026
@AndreasKaratzas added the rocm label (Related to AMD ROCm) on Mar 15, 2026
@github-project-automation bot moved this to Todo in AMD on Mar 15, 2026
@mergify

mergify bot commented Mar 17, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AndreasKaratzas.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 17, 2026

Labels

ci/build, gpt-oss (Related to GPT-OSS models), kv-connector, multi-modality (Related to multi-modality (#4194)), needs-rebase, ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm), v1

Projects

Status: Todo
Status: To Triage

2 participants