
Make dask.dataframe optional#1439

Merged
rapids-bot[bot] merged 5 commits into rapidsai:branch-25.04 from rjzamora:relax-dataframe-dependency
Feb 4, 2025

Conversation

@rjzamora
Member

@rjzamora rjzamora commented Feb 4, 2025

Dask-CUDA currently requires that dask.dataframe be imported in a few places. We only do this to patch in explicit-comms shuffling and to register various dispatch functions. There is no fundamental reason that we need dask.dataframe to be installed if the user is not actually using dask.dataframe/dask_cudf in their workflow.

This PR essentially adds exception handling for "automatic" dask.dataframe imports (when dask_cuda is imported).
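As a minimal sketch of the guarded-import pattern described above (the function and module names here are illustrative, not dask-cuda's actual code): attempt the optional import, record availability, and skip the dataframe-specific registration when the dependency is missing.

```python
import importlib


def optional_import(module_name):
    """Return (module, available) instead of failing when the import is missing."""
    try:
        return importlib.import_module(module_name), True
    except ImportError:
        return None, False


def maybe_register_dataframe_hooks(available):
    """Hypothetical helper: only register dispatch hooks when the dep exists."""
    if not available:
        return False
    # ... patch in explicit-comms shuffling / dispatch functions here ...
    return True


# Exercise both paths with modules that certainly do / do not exist.
_, have_missing = optional_import("definitely_not_a_real_module_xyz")
_, have_sys = optional_import("sys")
```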

@rjzamora rjzamora added 2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 4, 2025
@rjzamora rjzamora self-assigned this Feb 4, 2025
@copy-pr-bot

copy-pr-bot bot commented Feb 4, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the python python code needed label Feb 4, 2025
@TomAugspurger
Contributor

How should we test this? I think we can either define a new test environment with the minimal dependencies and update all the test imports to use importorskip, or add a test that configures the imports so that dask.dataframe can't be imported. Something like this accomplishes the latter:

diff --git a/dask_cuda/tests/test_local_cuda_cluster.py b/dask_cuda/tests/test_local_cuda_cluster.py
index b144d11..ac0d10e 100644
--- a/dask_cuda/tests/test_local_cuda_cluster.py
+++ b/dask_cuda/tests/test_local_cuda_cluster.py
@@ -580,3 +580,12 @@ def test_death_timeout_raises():
             dashboard_address=":0",
         ):
             pass
+
+
+def test_without_dask_dataframe(monkeypatch):
+    for k in list(sys.modules):
+        if k.startswith("dask.dataframe"):
+            monkeypatch.setitem(sys.modules, k, None)
+    monkeypatch.delitem(sys.modules, "dask_cuda")
+
+    import dask_cuda  # noqa: F401

Sticking None into sys.modules makes Python raise a ModuleNotFoundError when the module is imported. That test fails on main, but passes on this branch.

The downside of this test is that a change in the implementation could break it. I've tried to make it a bit more robust by overriding every package under dask.dataframe, but it's a risk.
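As a standalone illustration of the mechanism the proposed test relies on (using the stdlib json module as a stand-in for dask.dataframe): a None entry in sys.modules makes a subsequent import raise ModuleNotFoundError.

```python
import sys

# Block the module by mapping its name to None, as the test suggestion does.
saved = sys.modules.pop("json", None)
sys.modules["json"] = None
try:
    import json  # noqa: F401

    blocked = False
except ImportError:  # ModuleNotFoundError is a subclass of ImportError
    blocked = True
finally:
    # Restore sys.modules so later imports are unaffected.
    if saved is not None:
        sys.modules["json"] = saved
    else:
        del sys.modules["json"]
```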

@pentschev
Member

we add a test that configures the imports so that dask.dataframe can't be imported

This probably makes more sense, given the cost of launching a separate environment only to test this behavior. Historically, I think we avoided testing such complex cases because it's hard to ensure the global changes are also reverted appropriately. To overcome that, writing the test as you proposed makes sense, but I'd prefer that we launch the test in a new process, as we do with some other tests; that way it becomes less likely that we leak some of those global changes into tests that follow.
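A sketch of that process-isolation idea (a plain subprocess here, not dask-cuda's actual test helper; json and textwrap stand in for dask.dataframe and the package under test): the sys.modules tampering happens in a child interpreter, so nothing leaks into the parent test session.

```python
import subprocess
import sys

# Child script: block an optional module, then verify that importing it
# fails while the rest of the interpreter keeps working.
CHILD = """
import sys
sys.modules["json"] = None
try:
    import json
except ImportError:
    import textwrap  # the "package under test" still imports fine
    sys.exit(0)
sys.exit(1)
"""

# Exit code 0 means the blocked import raised as expected in isolation.
result = subprocess.run([sys.executable, "-c", CHILD])
exit_code = result.returncode
```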

@TomAugspurger
Contributor

I like the subprocess suggestion in this case.

Review comment on the new test code (context lines shown):

assert not p.exitcode


def _test_dask_cuda_import(monkeypatch):
Member Author


@TomAugspurger @pentschev - I used a combination of your suggestions (thanks!)

Let me know what you think.

Contributor

@TomAugspurger TomAugspurger Feb 4, 2025


I think this is good. The only behavior I'm a little worried about is what happens to a pickled monkeypatch, since we're sending it through the mp.Process. It's probably fine, but the docs suggest that you can use pytest.MonkeyPatch.context() if the fixture isn't available, so maybe I'd suggest that:

with pytest.MonkeyPatch.context() as monkeypatch:
    ...

The only question is whether we'd want to test any functionality (like creating a LocalCUDACluster) beyond the imports.
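For reference, a small self-contained sketch of pytest.MonkeyPatch.context() outside the fixture (again using json as the stand-in module): the sys.modules change is reverted automatically when the context exits.

```python
import json  # ensure a real entry exists in sys.modules first
import sys

import pytest

with pytest.MonkeyPatch.context() as monkeypatch:
    # Inside the context, the module is blocked.
    monkeypatch.setitem(sys.modules, "json", None)
    blocked_inside = sys.modules["json"] is None

# On exit, MonkeyPatch restores the original module object.
restored = sys.modules["json"] is json
```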

@rjzamora rjzamora marked this pull request as ready for review February 4, 2025 14:48
@rjzamora rjzamora requested a review from a team as a code owner February 4, 2025 14:48
@TomAugspurger
Contributor

Looks good to me. I've confirmed locally that the test fails if the imports aren't inside try/except blocks.

Member

@pentschev pentschev left a comment


LGTM, thanks @rjzamora !

@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 2 - In Progress Currently a work in progress labels Feb 4, 2025
@rjzamora
Member Author

rjzamora commented Feb 4, 2025

/merge

@rapids-bot rapids-bot bot merged commit 4b86af2 into rapidsai:branch-25.04 Feb 4, 2025
32 checks passed
@rjzamora rjzamora deleted the relax-dataframe-dependency branch February 4, 2025 18:39