[Bugfix] Fall back to buffered I/O if O_DIRECT fails in secondary tier fs #44016

Closed
micah-wil wants to merge 2 commits into vllm-project:main from ROCm:micah/kv-offload-odirect

Conversation

@micah-wil
Contributor

@micah-wil micah-wil commented May 29, 2026

This issue was exposed by the V1 Core + KV + Metrics test group in AMD CI:

pytest -v -s tests/v1/kv_offload/test_fs_tier.py
============================================================= short test summary info ==============================================================
FAILED tests/v1/kv_offload/test_fs_tier.py::test_store_creates_file_and_lookup_succeeds - assert False
FAILED tests/v1/kv_offload/test_fs_tier.py::test_multiple_jobs_tracked_independently - AssertionError: assert False is True
FAILED tests/v1/kv_offload/test_fs_tier.py::test_multi_block_job_partial_failure - assert False
FAILED tests/v1/kv_offload/test_fs_tier.py::test_store_load_data_integrity - assert False
===================================================== 4 failed, 5 passed, 17 warnings in 2.93s =====================================================

(e.g. build 8978)

The test_fs_tier.py test was added recently in #41735, which implements a file system secondary tier for multi-tier offload. The implementation relies on O_DIRECT in the store and load methods, but has no fallback if O_DIRECT fails. This PR adds one, because O_DIRECT access fails in AMD CI: the os.open and os.write calls fail with EINVAL when O_DIRECT is enabled. This isn't inherently a ROCm-specific issue, but it is sensitive to things like the Linux kernel version: https://stackoverflow.com/questions/41257656/what-does-o-direct-really-mean.
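
For readers who want the shape of such a fallback, here is a minimal sketch. It is not the diff from this PR: the helper names `_write_all` and `store_with_fallback` are hypothetical, and it assumes that EINVAL from either os.open or os.write is the only error that should trigger the buffered retry.

```python
import errno
import os

# O_DIRECT is not defined on every platform, so default to 0 (no extra flag).
_O_DIRECT = getattr(os, "O_DIRECT", 0)


def _write_all(path, data, extra_flags):
    """Hypothetical helper: write all of `data` to `path`, truncating it first."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)


def store_with_fallback(path, data):
    """Hypothetical store: try O_DIRECT first, retry buffered on EINVAL.

    Returns True if the O_DIRECT write succeeded, False if the buffered
    fallback was used.
    """
    if _O_DIRECT:
        try:
            _write_all(path, data, _O_DIRECT)
            return True
        except OSError as exc:
            # Only EINVAL (unsupported or misaligned direct I/O) triggers the
            # fallback; any other error is re-raised unchanged.
            if exc.errno != errno.EINVAL:
                raise
    _write_all(path, data, 0)
    return False
```

Because the retry reopens the file with O_TRUNC, any partial data left by a failed direct write is discarded before the buffered write starts.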

With the fallback in place, I am seeing the following when running the test:

========================================================== 9 passed, 17 warnings in 3.22s ==========================================================

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@mergify mergify Bot added the v1 label May 29, 2026
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@micah-wil micah-wil changed the title from "Fall back to buffered I/O if O_DIRECT fails in kv_transfer" to "[Bugfix] Fall back to buffered I/O if O_DIRECT fails in secondary tier fs" May 29, 2026
@mergify mergify Bot added the bug Something isn't working label May 29, 2026
@micah-wil micah-wil marked this pull request as ready for review May 29, 2026 21:31
@micah-wil micah-wil requested review from ApostaC and orozery as code owners May 29, 2026 21:31
@AndreasKaratzas AndreasKaratzas added the ready ONLY add when PR is ready to merge/full CI is needed label May 29, 2026
@AndreasKaratzas
Member

cc @orozery @ApostaC LGTM but you guys are the owners so I don't want to stamp it.

Collaborator

@orozery orozery left a comment

@AndreasKaratzas I think this is a dup of #43689.
Can you verify it solves the issue on the AMD CI?
See the discussion there: I don't want to use a fallback, but instead make sure the buffers are aligned.
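
As a rough illustration of the alignment approach, and not the code in #43689, the sketch below copies a payload into a page-aligned buffer whose length is rounded up to a block multiple. The names `ALIGNMENT`, `round_up`, and `aligned_buffer` are hypothetical, and the 4096-byte block size is an assumption; the real requirement depends on the file system and device.

```python
import mmap

# Assumed alignment; the actual O_DIRECT requirement varies by file system and device.
ALIGNMENT = 4096


def round_up(n, alignment=ALIGNMENT):
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


def aligned_buffer(payload, alignment=ALIGNMENT):
    """Hypothetical helper: copy payload into a zero-padded, page-aligned buffer."""
    # Allocate at least one block, even for an empty payload.
    size = max(round_up(len(payload), alignment), alignment)
    # Anonymous mmap regions start on a page boundary and are zero-filled.
    buf = mmap.mmap(-1, size)
    buf[: len(payload)] = payload
    return buf
```

The padded buffer is longer than the payload, so a caller would need to record the true payload length separately in order to trim the padding on load.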

@micah-wil
Contributor Author

Hey @orozery thanks for pointing that out, I didn't see that PR. Looks like pytest -s tests/v1/kv_offload/test_fs_tier.py is passing with #43689, so I think we can close this one if you don't want to add a fallback.

@micah-wil micah-wil closed this Jun 2, 2026

Labels

bug (Something isn't working), ready (ONLY add when PR is ready to merge/full CI is needed), v1

3 participants