Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stricter check for query planning. #107

Merged
merged 5 commits into from
Jul 1, 2024

Conversation

ayushdg
Copy link
Collaborator

@ayushdg ayushdg commented Jun 12, 2024

Description

Importing newer versions of dask (tested with 2024.5) set dataframe.query-planning to None instead of True or False, but the None case defaults to using query planning. This PR extends the check to None to raise relevant errors w/ newer versions of dask.

Usage

N/A

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@ayushdg
Copy link
Collaborator Author

ayushdg commented Jun 12, 2024

cc: @rjzamora If you could take a look

Copy link
Contributor

@rjzamora rjzamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the goal here to raise an error in any case that the user does not explicitly opt out of query-planning? This means all curator users will see this error unless they explicitly set ther config to false ahead of time.

Comment on lines 31 to 32
else:
dask.config.set({"dataframe.query-planning": False})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like there is any point of this else statement, because the config would already need to be False.

@ayushdg
Copy link
Collaborator Author

ayushdg commented Jun 12, 2024

Is the goal here to raise an error in any case that the user does not explicitly opt out of query-planning? This means all curator users will see this error unless they explicitly set ther config to false ahead of time.

The intended goal is to raise this error whenever Query-Planning is enabled and dask is going down that route. In newer versions of dask, I've noticed that this value is being set to None on importing dask.dataframe, but still going down the query planning route.
I now realize that it could be None in the case where dask.dataframe is not imported at all. I'll think about this a bit more, but curious if you see better ways of checking whether query planning is truly enabled.

@rjzamora
Copy link
Contributor

I now realize that it could be None in the case where dask.dataframe is not imported at all. I'll think about this a bit more, but curious if you see better ways of checking whether query planning is truly enabled.

Right - I believe you have stumbled upon the primary reason we rapids-24.04 was pinned to such an "old" version of dask: There is currently no reliable way (that I know of) to check if the user has already imported dask.dataframe before dask_cudf is imported :/

That said, it may be possible for us to add a canonical "switch" in rapids-dask-dependency.

@@ -16,11 +16,17 @@

# Disable query planning if possible
# https://github.com/NVIDIA/NeMo-Curator/issues/73
if dask.config.get("dataframe.query-planning") is True:
QUERY_PLANNING = dask.config.get("dataframe.query-planning")
if QUERY_PLANNING is True or QUERY_PLANNING is None:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking a bit more about this, it might make sense to replace the None check for the config with something like "dask_expr" in sys.modules.
We can then followup with a better approach based on the resolution of upstream dask discussions.

nemo_curator/__init__.py Outdated Show resolved Hide resolved
ayushdg and others added 2 commits June 28, 2024 15:38
Co-authored-by: Richard (Rick) Zamora <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
@ayushdg ayushdg requested a review from rjzamora July 1, 2024 18:53
@ayushdg ayushdg requested review from VibhuJawa and rjzamora and removed request for rjzamora July 1, 2024 19:08
Copy link
Collaborator

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ayushdg ayushdg merged commit 0ea5efe into NVIDIA:main Jul 1, 2024
3 checks passed
sarahyurick pushed a commit to sarahyurick/NeMo-Curator that referenced this pull request Jul 23, 2024
* Stricter query planning checks with newer versions of dask

Signed-off-by: Ayush Dattagupta <[email protected]>

* Add checks to tests/__init__

Signed-off-by: Ayush Dattagupta <[email protected]>

* Check sys.modules to ensure dask-expr is not enabled

Signed-off-by: Ayush Dattagupta <[email protected]>

* Search for "dask_expr" in sys modules

Co-authored-by: Richard (Rick) Zamora <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>

* use dask_expr instead of dask-expr

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>
Co-authored-by: Richard (Rick) Zamora <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants