Skip max_device_cache_size setter when BAR1 memory isn't present on the GPUs in the system #814

Merged
rapids-bot[bot] merged 4 commits into rapidsai:branch-25.10 from ahoyle-nvidia:ah_skip_test_dgxspark
Sep 9, 2025

Conversation

@ahoyle-nvidia
Contributor

Over the past months, DGX Spark users have reported multiple issues involving this specific file. This PR addresses them by skipping the max_device_cache_size (cuFileDriverSetMaxCacheSize) setter, based on an examination of the output of nvidia-smi.
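As a rough illustration of the kind of check described above, the sketch below parses the text of `nvidia-smi -q -d MEMORY` and reports whether every GPU lists a positive BAR1 total. It is not the code from this PR: the function names are hypothetical, and the assumed output layout (a `BAR1 Memory Usage` header followed by a `Total : <n> MiB` line, with a non-numeric value such as `N/A` when BAR1 is absent) may differ between driver versions.

```python
import re
import subprocess


def bar1_totals_mib(smi_text):
    """Return one entry per 'BAR1 Memory Usage' section found in smi_text.

    Each entry is the total in MiB as an int, or None when the value is not
    numeric (assumed here to mean BAR1 is unavailable, e.g. 'N/A').
    """
    totals = []
    in_bar1 = False
    for raw in smi_text.splitlines():
        line = raw.strip()
        if line == "BAR1 Memory Usage":
            in_bar1 = True
            continue
        # Only the first 'Total' line after a BAR1 header is considered.
        if in_bar1 and line.startswith("Total"):
            match = re.search(r":\s*(\d+)\s*MiB", line)
            totals.append(int(match.group(1)) if match else None)
            in_bar1 = False
    return totals


def has_bar1_memory(smi_text):
    """True only if at least one GPU is listed and all report BAR1 > 0 MiB."""
    totals = bar1_totals_mib(smi_text)
    return bool(totals) and all(t is not None and t > 0 for t in totals)


def query_nvidia_smi():
    """Run nvidia-smi and return its memory report as text (needs a GPU host)."""
    result = subprocess.run(
        ["nvidia-smi", "-q", "-d", "MEMORY"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A test could then call `has_bar1_memory(query_nvidia_smi())` and skip the setter when it returns `False`. Keeping the parser separate from the subprocess call makes it easy to exercise with canned text on machines without a GPU.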

@ahoyle-nvidia ahoyle-nvidia requested a review from a team as a code owner September 8, 2025 08:26
@copy-pr-bot

copy-pr-bot bot commented Sep 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ahoyle-nvidia ahoyle-nvidia marked this pull request as draft September 8, 2025 08:29
@kingcrimsontianyu kingcrimsontianyu added the labels non-breaking (Introduces a non-breaking change), bug (Something isn't working), and python (Affects the Python API of KvikIO) Sep 8, 2025
@kingcrimsontianyu
Contributor

/ok to test f0e108a

@kingcrimsontianyu
Contributor

Code needs to go through the linter. I'll fix this.

@kingcrimsontianyu
Contributor

/ok to test 91e1913

@ahoyle-nvidia ahoyle-nvidia marked this pull request as ready for review September 8, 2025 16:09
@kingcrimsontianyu
Contributor

/ok to test ad60204

@kingcrimsontianyu
Contributor

Further note: In the cuFile shipped with CUDA 13, a call to cuFileDriverSetMaxCacheSize on platforms without proper BAR memory (such as DGX Spark) fails by design. Coincidentally, the earlier PR #754, which addresses a cuFile absence issue on ARM with CUDA < 12.2, happens to avoid the test failure by catching the exceptions raised from the shim. This PR instead checks for BAR memory proactively, which makes the behavior more explicit.
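The difference between the two approaches mentioned here can be sketched as follows. This is only an illustration under stated assumptions: `setter` stands in for whatever callable wraps cuFileDriverSetMaxCacheSize, both helper names are hypothetical, and `RuntimeError` is an assumed exception type rather than the one the shim actually raises.

```python
def set_cache_size_tolerant(setter, size):
    """Reactive style: attempt the call and swallow the failure.

    Returns True if the setter succeeded, False if it raised.
    RuntimeError is an assumption made for this sketch.
    """
    try:
        setter(size)
    except RuntimeError:
        return False
    return True


def set_cache_size_if_supported(setter, size, bar1_available):
    """Proactive style: consult a capability flag before calling.

    Returns True if the setter was called, False if it was skipped.
    The setter is never invoked when bar1_available is False.
    """
    if not bar1_available:
        return False
    setter(size)
    return True
```

The proactive form documents why the call is skipped and never invokes the failing path, whereas the reactive form can also hide unrelated errors that happen to raise the same exception type.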

Member

@madsbk madsbk left a comment

Looks good

@kingcrimsontianyu
Contributor

/merge

@rapids-bot rapids-bot bot merged commit f4e022e into rapidsai:branch-25.10 Sep 9, 2025
72 checks passed

Labels

bug (Something isn't working), non-breaking (Introduces a non-breaking change), python (Affects the Python API of KvikIO)

3 participants