Use custom plugin to download test data early #7169
Merged
rapids-bot[bot] merged 3 commits into rapidsai:branch-25.10 on Sep 3, 2025
Downloading data when using pytest-xdist can lead to races where different workers see different sized datasets. We work around this problem by defining a custom plugin that runs before workers start and downloads all datasets.
test_naive_bayes.py failures
csadorf reviewed on Sep 3, 2025
csadorf (Contributor) approved these changes on Sep 3, 2025:
Approving since this will resolve the immediate problem. We should keep an eye out for a more long-term solution, ideally one that fixes the root cause within the scikit-learn code base.
We should reference scikit-learn/scikit-learn#32095 within the code at appropriate spots.
jameslamb approved these changes on Sep 3, 2025
KyleFromNVIDIA approved these changes on Sep 3, 2025
/merge
Running this in CI as I can't reproduce the failure locally.
The problem we were seeing in the naive Bayes tests was that some test functions saw only a subset of the dataset. As a result, some classes had no samples at all in the data those tests saw. The cause is some kind of race in the downloading and processing of the data, which occurs because we run more than one pytest-xdist worker.
We work around this problem by defining a custom plugin that runs before workers start and downloads all datasets.
The downside of this approach is that we need to manually list the datasets that get "pre-downloaded". I think that is OK because we don't add new datasets frequently, but it could be improved.
An upside is that we only download each dataset once, not once per worker as we were doing so far.
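A minimal sketch of what such a plugin could look like, not the code merged in this PR. It assumes the controller process runs `pytest_configure` before pytest-xdist spawns its workers, and it relies on pytest-xdist setting `config.workerinput` only in worker processes. The class name `DownloadDataPlugin` and the `fetchers` argument are hypothetical.

```python
from types import SimpleNamespace


def is_xdist_worker(config):
    """Return True when running inside a pytest-xdist worker process.

    pytest-xdist attaches a ``workerinput`` attribute to the config object
    in workers only; the controller (or a run without xdist) lacks it.
    """
    return hasattr(config, "workerinput")


class DownloadDataPlugin:
    """Hypothetical plugin that fetches every listed dataset exactly once."""

    def __init__(self, fetchers):
        # Manually maintained list of zero-argument callables, one per
        # dataset (an assumption made for this sketch).
        self.fetchers = list(fetchers)
        self.downloaded = []

    def pytest_configure(self, config):
        # Workers skip the download: the controller is assumed to have
        # populated the shared cache before they were started.
        if is_xdist_worker(config):
            return
        for fetch in self.fetchers:
            fetch()
            self.downloaded.append(getattr(fetch, "__name__", repr(fetch)))


if __name__ == "__main__":
    # Simulate the controller with a bare stand-in for pytest's config.
    plugin = DownloadDataPlugin([lambda: None])
    plugin.pytest_configure(SimpleNamespace())
    print(len(plugin.downloaded))
```

Here the manually maintained dataset list is the `fetchers` argument, where each entry would download one dataset into the cache shared by all workers. Because workers return early, each dataset is fetched once rather than once per worker.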
More discussion and details are in scikit-learn/scikit-learn#32095. It may be possible to fix this at the level of scikit-learn, which would be great, as it would mean we can remove this plugin again.
xref #7152