[CI] Add persistent cache mounts and fix test download paths #36951
AndreasKaratzas wants to merge 5 commits into vllm-project:main from
Conversation
…URLs Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request introduces persistent caching for various assets in CI to improve performance, which is a great enhancement. The changes correctly utilize environment variables to configure cache paths. My review focuses on potential race conditions in the new caching logic. I've identified two instances where concurrent writes to the cache could lead to corrupted files, one in production code which is critical, and a similar one in test code. I've provided suggestions to make the file writing atomic and prevent these race conditions.
vllm/multimodal/media/connector.py
Outdated
```python
def _put_cached_bytes(self, url: str, data: bytes) -> None:
    """Store downloaded bytes in the cache."""
    if not self._media_cache_dir:
        return
    cache_path = self._media_cache_path(url)
    cache_path.write_bytes(data)
```
The current implementation of _put_cached_bytes has a race condition. If multiple processes or threads attempt to download and cache the same URL concurrently, they could write to the same cache file simultaneously, leading to a corrupted file. This can cause subsequent requests to fail when reading the corrupted cache entry.
To fix this, you should write the downloaded data to a temporary file within the cache directory and then perform an atomic rename to the final cache path. This ensures that readers will only ever see a complete file.
Note: The suggested code requires the tempfile module (and os, if not already imported) at the top of the file.
```python
def _put_cached_bytes(self, url: str, data: bytes) -> None:
    """Store downloaded bytes in the cache."""
    if not self._media_cache_dir:
        return
    cache_path = self._media_cache_path(url)
    # To prevent race conditions, write to a temporary file and then atomically rename.
    with tempfile.NamedTemporaryFile(mode="wb", dir=self._media_cache_dir, delete=False) as tmp_file:
        tmp_file.write(data)
        tmp_path = tmp_file.name
    try:
        os.rename(tmp_path, cache_path)
    except OSError:
        # Another process might have already written the file.
        os.remove(tmp_path)
```

```python
resp = urllib.request.urlopen(file_path)
with open(cached_path, "wb") as f:
    f.write(resp.read())
path = cached_path
```
There's a potential race condition here. If multiple tests running in parallel attempt to download and cache the same URL, they could write to cached_path simultaneously, resulting in a corrupted file. To ensure atomicity, it's safer to write the downloaded content to a temporary file and then atomically rename it to the final destination.
```diff
  resp = urllib.request.urlopen(file_path)
- with open(cached_path, "wb") as f:
-     f.write(resp.read())
+ # To prevent race conditions, write to a temporary file and then atomically rename.
+ with tempfile.NamedTemporaryFile(mode="wb", dir=cache_dir, delete=False) as tmp_file:
+     tmp_file.write(resp.read())
+     tmp_path = tmp_file.name
+ try:
+     os.rename(tmp_path, cached_path)
+ except OSError:
+     # Another process might have already written the file.
+     os.remove(tmp_path)
  path = cached_path
```
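For reference, the write-then-rename pattern both suggestions rely on can be exercised standalone. The sketch below is illustrative, not code from the PR (the helper name `atomic_write` is invented here); it uses `os.replace` rather than `os.rename`, since `os.replace` atomically overwrites an existing target on both POSIX and Windows:

```python
import os
import tempfile


def atomic_write(cache_path: str, data: bytes) -> None:
    """Write data to cache_path so readers never observe a partial file."""
    cache_dir = os.path.dirname(cache_path) or "."
    # Create the temp file in the same directory as the target so the
    # rename stays on one filesystem (cross-device renames are not atomic).
    with tempfile.NamedTemporaryFile(mode="wb", dir=cache_dir, delete=False) as tmp:
        tmp.write(data)
        tmp_path = tmp.name
    # os.replace atomically swaps the file into place, overwriting any
    # existing entry, on both POSIX and Windows.
    os.replace(tmp_path, cache_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "entry.bin")
        atomic_write(target, b"hello")
        print(open(target, "rb").read())  # b'hello'
```

With this variant, a concurrent writer that loses the race simply overwrites the cache entry with identical bytes, so no cleanup branch is needed.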
…URLs Signed-off-by: Andreas Karatzas <akaratza@amd.com>
DarkLight1337 left a comment
I think we should avoid modifying the main vLLM code with a media cache. We already have fixtures such as image_urls which pre-download media files.
@DarkLight1337 You're right that … In other words, this feature can only be useful if you define the env var before tests.
I see your point. But it should be done as a separate RFC / feature request.
@DarkLight1337 I can make it ROCm-specific; I just thought that upstream would also benefit from this. Essentially this is a completely optional data path that is only taken if you set that env var; otherwise execution is unchanged. And it really helps when a machine has network issues, because everything lives under a specific cache path that can be stored on NFS. If there is any other recommendation for that path, let me know. I'm certainly open to refactoring this.
I think this requires a broader discussion, so please open an RFC for the media download cache specifically, and I'll tag relevant people.
Certainly :) #37075
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
This pull request has merge conflicts that must be resolved before it can be merged.
- Add persistent cache mounts for `MODELSCOPE_CACHE`, `VLLM_TEST_CACHE`, `VLLM_CACHE_ROOT`, and `VLLM_MEDIA_CACHE`.
- Use `VLLM_TEST_CACHE` instead of `/tmp` or scattered locations.
- Update `test_extraction.py` to use the `HF_HOME` env var.
- Fix the `snapshot_download` bypass in `test_token_in_token_out.py` by removing the explicit `cache_dir` that skipped the HF cache.

Changes

| File | Cache path / change |
| --- | --- |
| `.buildkite/scripts/hardware_ci/run-amd-test.sh` | |
| `tests/conftest.py` | `VLLM_TEST_CACHE` |
| `tests/evals/gsm8k/gsm8k_eval.py` | `VLLM_TEST_CACHE/gsm8k/` |
| `tests/evals/gpt_oss/test_gpqa_correctness.py` | `VLLM_TEST_CACHE/tiktoken/` |
| `tests/plugins/.../prithvi_processor.py` | `VLLM_TEST_CACHE/prithvi/` |
| `tests/entrypoints/openai/test_token_in_token_out.py` | `/tmp` |
| `tests/v1/kv_connector/.../test_extraction.py` | `HF_HOME` env var instead of hardcoded path |

Note: The `VLLM_MEDIA_CACHE` feature code (env registration + `MediaConnector` caching) is in a separate PR: #37123

cc @kenroche
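To illustrate how a CI job might wire these variables to one persistent mount, here is a hedged sketch. The variable names come from the PR summary; the directory layout and the `CACHE_ROOT` name are assumptions for illustration, not the PR's actual layout:

```shell
# Root every test cache under one persistent mount (NFS-backed in real CI).
# CACHE_ROOT defaults to a throwaway temp dir so this sketch runs anywhere.
CACHE_ROOT="${CACHE_ROOT:-$(mktemp -d)}"
export VLLM_TEST_CACHE="${CACHE_ROOT}/vllm-test"
export VLLM_CACHE_ROOT="${CACHE_ROOT}/vllm"
export MODELSCOPE_CACHE="${CACHE_ROOT}/modelscope"
export HF_HOME="${CACHE_ROOT}/huggingface"
# VLLM_MEDIA_CACHE is only registered by the follow-up PR (#37123), so it
# is intentionally omitted here.
mkdir -p "$VLLM_TEST_CACHE" "$VLLM_CACHE_ROOT" "$MODELSCOPE_CACHE" "$HF_HOME"
```

Because the env vars must be set before the tests start, this would run at the top of the CI script, ahead of any pytest invocation.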