Stricter check for query planning. #107
Signed-off-by: Ayush Dattagupta <[email protected]>
cc: @rjzamora, if you could take a look.
Is the goal here to raise an error in any case where the user does not explicitly opt out of query planning? This means all curator users will see this error unless they explicitly set their config to False ahead of time.
else:
    dask.config.set({"dataframe.query-planning": False})
It doesn't seem like there is any point to this else statement, because the config would already need to be False.
The intended goal is to raise this error whenever |
Right - I believe you have stumbled upon the primary reason rapids-24.04 was pinned to such an "old" version of dask: There is currently no reliable way (that I know of) to check if the user has already imported That said, it may be possible for us to add a canonical "switch" in |
nemo_curator/__init__.py
@@ -16,11 +16,17 @@

# Disable query planning if possible
# https://github.com/NVIDIA/NeMo-Curator/issues/73
if dask.config.get("dataframe.query-planning") is True:
QUERY_PLANNING = dask.config.get("dataframe.query-planning")
if QUERY_PLANNING is True or QUERY_PLANNING is None:
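For readers following the hunk, the stricter condition can be sketched as a pair of standalone functions. This is only an illustration, not the code in nemo_curator/__init__.py: the function names, the RuntimeError type, and the error message are hypothetical, and the config value is passed in as an argument so the sketch runs without dask installed.

```python
from typing import Optional


def query_planning_enabled(config_value: Optional[bool]) -> bool:
    """Return True when query planning would be active for this config value.

    Per the PR description, newer dask (tested with 2024.5) reports None when
    the option is unset, and the None case defaults to using query planning.
    """
    return config_value is True or config_value is None


def check_query_planning(config_value: Optional[bool]) -> None:
    """Raise unless the user explicitly opted out (hypothetical helper)."""
    if query_planning_enabled(config_value):
        # The exception type and message are assumptions for illustration.
        raise RuntimeError(
            "Query planning appears to be enabled; set "
            "'dataframe.query-planning' to False before importing dask.dataframe."
        )
```

In a real module the argument would presumably come from dask.config.get("dataframe.query-planning"), as in the hunk above. Only an explicit False passes the check, which is the behavior the first review comment asks about.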
Thinking a bit more about this, it might make sense to replace the None check for the config with something like "dask_expr" in sys.modules.
We can then follow up with a better approach based on the resolution of upstream dask discussions.
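A minimal sketch of that suggestion, assuming the None check is swapped for a look at sys.modules. The helper names are hypothetical, and the combined condition is one reading of the comment rather than the merged code. The modules mapping is injectable so the logic can be exercised without importing dask.

```python
import sys
from typing import Mapping, Optional


def dask_expr_imported(modules: Optional[Mapping[str, object]] = None) -> bool:
    """Return True if a module named dask_expr has already been imported."""
    modules = sys.modules if modules is None else modules
    return "dask_expr" in modules


def should_raise(
    config_value: Optional[bool],
    modules: Optional[Mapping[str, object]] = None,
) -> bool:
    # Assumed condition: the config is explicitly True, or dask_expr is
    # already loaded (which suggests query planning is in use).
    return config_value is True or dask_expr_imported(modules)
```

With this form, an unset (None) config no longer triggers the error on its own; the error fires only if dask_expr has actually been loaded by the time the check runs.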
Co-authored-by: Richard (Rick) Zamora <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]>
LGTM
* Stricter query planning checks with newer versions of dask
  Signed-off-by: Ayush Dattagupta <[email protected]>
* Add checks to tests/__init__
  Signed-off-by: Ayush Dattagupta <[email protected]>
* Check sys.modules to ensure dask-expr is not enabled
  Signed-off-by: Ayush Dattagupta <[email protected]>
* Search for "dask_expr" in sys modules
  Co-authored-by: Richard (Rick) Zamora <[email protected]>
  Signed-off-by: Ayush Dattagupta <[email protected]>
* use dask_expr instead of dask-expr
  Signed-off-by: Ayush Dattagupta <[email protected]>
---------
Signed-off-by: Ayush Dattagupta <[email protected]>
Co-authored-by: Richard (Rick) Zamora <[email protected]>
Description
Importing newer versions of dask (tested with 2024.5) sets dataframe.query-planning to None instead of True or False, but the None case defaults to using query planning. This PR extends the check to None so that the relevant errors are raised with newer versions of dask.
Usage
N/A
Checklist