Switch to Arrow DataFusion SQL parser #788

charlesbluca · 2022-09-21T20:41:07Z

This PR is the culmination of work on the `datafusion-sql-planner` branch to switch from our Java-based Calcite SQL parser to Arrow DataFusion's Rust-based SQL parser.

* Add basic predicate-pushdown optimization (#433)
* basic predicate-pushdown support
* remove explicit Dispatch class
* use _Frame.fillna
* cleanup comments
* test coverage
* improve test coverage
* add xfail test for dt accessor in predicate and fix test_show.py
* fix some naming issues
* add config and use assert_eq
* add logging events when predicate-pushdown bails
* move bail logic earlier in function
* address easier code review comments
* typo fix
* fix creation_info access bug
* convert any expression to DNF
* csv test coverage
* include IN coverage
* improve test rigor
* address code review
* skip parquet tests when deps are not installed
* fix bug
* add pyarrow dep to cluster workers
* roll back test skipping changes
  Co-authored-by: Charles Blackmon-Luca
* Add workflow to keep datafusion dev branch up to date (#440)
* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral
* Updates for test_filter
* more of test_filter.py working with the exception of some date pytests
* Updates to dates and parsing dates like postgresql does
* Update gpuCI `RAPIDS_VER` to `22.06` (#434)
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Bump black to 22.3.0 (#443)
* Check for ucx-py nightlies when updating gpuCI (#441)
* Simplify gpuCI updating workflow
* Add check for cuML nightly version
* Refactored to adjust for better type management
* Refactor schema and statements
* update types
* fix syntax issues and renamed function name calls
* Add handling for newer `prompt_toolkit` versions in cmd tests (#447)
* Add handling for newer prompt-toolkit version
* Place compatibility code in _compat
* Fix version for gha-find-replace (#446)
* Improved error handling and code clean up
* move pieces of logical.rs to separate files to ensure code readability
* left join working
* Update versions of Java dependencies (#445)
* Update versions for java dependencies with cves
* Rerun tests
* Update jackson databind version (#449)
* Update versions for java dependencies with cves
* Rerun tests
* update jackson-databind dependency
* Disable SQL server functionality (#448)
* Disable SQL server functionality
* Update docs/source/server.rst
  Co-authored-by: Ayush Dattagupta
* Disable server at lowest possible level
* Skip all server tests
* Add tests to ensure server is disabled
* Fix CVE fix test
  Co-authored-by: Ayush Dattagupta
* Update dask pinnings for release (#450)
* Add Java source code to source distribution (#451)
* Bump `httpclient` dependency (#453)
* Revert "Disable SQL server functionality (#448)" This reverts commit 37a3a61.
* Bump httpclient version
* Unpin Dask/distributed versions (#452)
* Unpin dask/distributed post release
* Remove dask/distributed version ceiling
* Add jsonschema to ci testing (#454)
* Add jsonschema to ci env
* Fix typo in config schema
* Switch tests from `pd.testing.assert_frame_equal` to `dd.assert_eq` (#365)
* Start moving tests to dd.assert_eq
* Use assert_eq in datetime filter test
* Resolve most resulting test failures
* Resolve remaining test failures
* Convert over tests
* Convert more tests
* Consolidate select limit cpu/gpu test
* Remove remaining assert_series_equal
* Remove explicit cudf imports from many tests
* Resolve rex test failures
* Remove some additional compute calls
* Consolidate sorting tests with getfixturevalue
* Fix failed join test
* Remove breakpoint
* Use custom assert_eq function for tests
* Resolve test failures / seg faults
* Remove unnecessary testing utils
* Resolve local test failures
* Generalize RAND test
* Avoid closing client if using independent cluster
* Fix failures on Windows
* Resolve black failures
* Make random test variables more clear
* First basic working checkpoint for group by
* Set max pin on antlr4-python-runtime (#456)
* Set max pin on antlr4-python-runtime due to incompatibilities with fugue_sql
* update comment on antlr max pin version
* Updates to style
* stage pre-commit changes for upstream merge
* Fix black failures
* Updates to Rust formatting
* Fix rust lint and clippy
* Remove jar building step which is no longer needed
* Remove Java from github workflows matrix
* Removes jar and Java references from test.yml
* Update Release workflow to remove references to Java
* Update rust.yml to remove references from linux-build-lib
* Add pre-commit.sh file to provide pre-commit support for Rust in a convenient script
* Removed overlooked jdk references
* cargo clippy auto fixes
* Address all Rust clippy warnings
* Include setuptools-rust in conda build recipe
* Include setuptools-rust in conda build recipe, in host and run
* Adjustments for conda build, committing for others to help with error and see it occurring in CI
* Include sql.yaml in package files
* Include pyarrow in run section of conda build to ensure tests pass
* include setuptools-rust in host and run of conda since removing caused errors
* to_string() method had been removed in rust and not removed here, caused conda run_test.py to fail when this line was hit
* Replace commented out tests with pytest.skip and bump version of pyarrow to 7.0.0
* Fix setup.py syntax issue introduced on last commit by find/replace
* Rename Datafusion -> DataFusion and Apache DataFusion -> Arrow DataFusion
* Fix docs build environment
* Include Rust compiler in docs environment
* Bump Rust compiler version to 1.59
* Ok, well readthedocs didn't like that
* Store libdask_planner.so and retrieve it between github workflows
* Cache the Rust library binary
* Remove Cargo.lock from git
* Remove unused datafusion-expr crate
* Build datafusion at each test step instead of caching binaries
* Remove maven and jar cache steps from test-upstream.yaml
* Removed dangling 'build' workflow step reference
* Lowered PyArrow version to 6.0.1 since cudf has a hard requirement on that version for the version of cudf we are using
* Add Rust build step to test in dask cluster
* Install setuptools-rust for pip to use for bare requirements import
* Include pyarrow 6.0.1 via conda as a bare minimum dependency
* Remove cudf dependency for python 3.9 which is causing build issues on windows
* Address documentation from review
* Install Rust as readthedocs post_create_environment step
* Run rust install non-interactively
* Run rust install non-interactively
* Rust isn't available in PyPi so remove that dependency
* Append ~/.cargo/bin to the PATH
* Print out some environment information for debugging
* Print out some environment information for debugging
* More - Increase verbosity
* More - Increase verbosity
* More - Increase verbosity
* Switch RTD over to use Conda instead of Pip since having issues with Rust and pip
* Try to use mamba for building docs environment
* Partial review suggestion address, checking CI still works
* Skip mistakenly enabled tests
* Use DataFusion master branch, and fix syntax issues related to the version bump
* More updates after bumping DataFusion version to master
* Use actions-rs in github workflows debug flag for setup.py
* Remove setuptools-rust from conda
* Use re-exported Rust types for BuiltinScalarFunction
* Move python imports to TYPE_CHECKING section where applicable
* Address review concerns and remove pre-commit.sh file
* Pin to a specific github rev for DataFusion

Co-authored-by: Richard (Rick) Zamora
Co-authored-by: Charles Blackmon-Luca
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ayush Dattagupta

* bump DataFusion version
* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral
* Updates for test_filter
* more of test_filter.py working with the exception of some date pytests
* Add workflow to keep datafusion dev branch up to date (#440)
* Include setuptools-rust in conda build recipe, in host and run
* Remove PyArrow dependency
* rebase with datafusion-sql-planner
* refactor changes that were inadvertent during rebase
* timestamp with local time zone
* Include RelDataType work
* Include RelDataType work
* Introduced SqlTypeName Enum in Rust and mappings for Python
* impl PyExpr.getIndex()
* add getRowType() for logical.rs
* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes
* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict
* linter changes, why did that work on my local pre-commit??
* linter changes, why did that work on my local pre-commit??
* Convert final strs to SqlTypeName Enum
* removed a few print statements
* Temporarily disable conda run_test.py script since it uses features not yet implemented
* expose fromString method for SqlTypeName to use Enums instead of strings for type checking
* expanded SqlTypeName from_string() support
* accept INT as INTEGER
* Remove print statements
* Default to UTC if tz is None
* Delegate timezone handling to the arrow library
* Updates from review

Co-authored-by: Charles Blackmon-Luca

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral
* Updates for test_filter
* more of test_filter.py working with the exception of some date pytests
* Add workflow to keep datafusion dev branch up to date (#440)
* Include setuptools-rust in conda build recipe, in host and run
* Remove PyArrow dependency
* rebase with datafusion-sql-planner
* refactor changes that were inadvertent during rebase
* timestamp with local time zone
* Include RelDataType work
* Include RelDataType work
* Introduced SqlTypeName Enum in Rust and mappings for Python
* impl PyExpr.getIndex()
* add getRowType() for logical.rs
* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes
* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict
* linter changes, why did that work on my local pre-commit??
* linter changes, why did that work on my local pre-commit??
* Convert final strs to SqlTypeName Enum
* removed a few print statements
* commit to share with colleague
* updates
* Temporarily disable conda run_test.py script since it uses features not yet implemented
* formatting after upstream merge
* expose fromString method for SqlTypeName to use Enums instead of strings for type checking
* expanded SqlTypeName from_string() support
* accept INT as INTEGER
* checkpoint
* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls
* skip test that uses create statement for gpuci
* uncommented pytests
* uncommented pytests
* code cleanup for review
* code cleanup for review
* Enabled more pytest that work now
* Enabled more pytest that work now
* Disable 2 pytest that are causing gpuCI issues. They will be addressed in a follow up PR
* Mark just the GPU tests as skipped

Co-authored-by: Charles Blackmon-Luca

* Minor code cleanup in row_type()
* remove unwrap

* helper code for getting column name from expression
* Update dask_planner/src/expression.rs
  Co-authored-by: Jeremy Dyer
* Update dask_planner/src/expression.rs
  Co-authored-by: Jeremy Dyer
* fix build
* Improve error handling

Co-authored-by: Jeremy Dyer

* Update exceptions that are thrown
* Remove Java error regex formatting logic. Rust messages will be presented already formatted from Rust itself
* Removed lingering test that was still trying to test out Java specific error messages
* Update dask_planner/src/sql.rs
  Co-authored-by: Andy Grove
* clean up logical_relational_algebra function

Co-authored-by: Andy Grove

sync: main to datafusion-sql-planner

* Switch back to architectured builds
* Make sure to upload correct files
* Update python versions in build config
* Modify triggering conditions for builds
* Add Rust-specific files to paths

…` statements (#747)

* Generalize CREATE and PREDICT MODEL to accept non-native SELECT statements
* Refactor parse_create_model
* Restrict nested input queries to SELECT statements
* Add tests
* Use expect_one_of_keywords to expand supported input queries
* Skip test on independent cluster

* use correct schema in table provider
* add check for catalog
* Update dask_planner/src/sql.rs

Co-authored-by: GALI PREM SAGAR

* initial updates
* Update installation.rst
* Update installation.rst

Co-authored-by: GALI PREM SAGAR

* use correct schema in table provider
* add check for catalog
* Add support for switching schemas

sync: main to datafusion-sql-planner

* jupyter lab fix
* disable_highlighting
* multiple things
* needs_local_scope
* Update ipython.py
* Resolve style errors

Co-authored-by: Charles Blackmon-Luca

[DF] Remove PyPI release workflow

sync: main to datafusion-sql-planner

* Support complex queries with multiple distinct aggregates
* format
* fixes
* remove debug logging
* add another test
* support GROUP BY with COUNT and COUNT DISTINCT
* bug fix and a hack
* save experiment with stripping qualifiers and removing hash from names
* save progress
* clippy and fmt
* use trace logging and update test to use alias
* remove strip_qualifier code
* bug fix
* skip failing tests
* lint
* fix merge issue
* revert changes in Cargo.lock
* refactor to reduce duplicate code
* Update _distinct_agg_expr to support AggregateUDF

Co-authored-by: Ayush Dattagupta
Co-authored-by: Charles Blackmon-Luca

* save progress
* partial fix
* Add accessor functions for time64 and timestamp values
* Add handling for time,timestamp literals
* Re-add log env_log removed during merge
* Remove debug/commented code
* Un-xfail query 32
* Implement support for scalar Decimal128

Co-authored-by: Ayush Dattagupta
Co-authored-by: Charles Blackmon-Luca

* Uncomment test_group_by_filtered
* Enable filtered aggs, implement get_filter_expr
* Compute filter columns in _collect_aggregations
* Resolve test failures
* Add back in aggregate function assertion

Co-authored-by: GALI PREM SAGAR

[DF] Fix regressions in EliminateAggDistinct, run cargo test in CI (#782)

* Revert to earlier version of rule and skip failing tests
* add Rust workflow
* ci
* ci

* Revert to earlier version of rule and skip failing tests
* add Rust workflow
* ci
* ci
* fix remaining regressions in optimizer rule
* fmt
* unignore tests

* regr_count
* add regr_syy and regr_sxx
* regr_syy and regr_sxx functionality
* remove covars
* Update test_groupby.py
* split test_stats_aggregation into regr and covar
* format fix
* add gpu param
* Update tests/integration/test_groupby.py
  Co-authored-by: Ayush Dattagupta
* format fix

Co-authored-by: Ayush Dattagupta

* Revert to earlier version of rule and skip failing tests
* add Rust workflow
* ci
* ci
* fix remaining regressions in optimizer rule
* fmt
* save progress
* unignore tests
* fix regression from agg with filter changes
* save progress
* code cleanup
* logging
* lint

Co-authored-by: GALI PREM SAGAR

codecov-commenter · 2022-09-21T20:55:07Z

Codecov Report

Merging #788 (033d2f4) into main (442b871) will decrease coverage by 12.93%.
The diff coverage is 76.37%.

@@             Coverage Diff             @@
##             main     #788       +/-   ##
===========================================
- Coverage   88.45%   75.52%   -12.94%     
===========================================
  Files          69       73        +4     
  Lines        3517     3689      +172     
  Branches      711      769       +58     
===========================================
- Hits         3111     2786      -325     
- Misses        318      769      +451     
- Partials       88      134       +46
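
The percentages in this coverage diff can be rechecked from the raw counts. The sketch below is not Codecov's implementation: it assumes coverage is hits divided by lines and that displayed values are floored to two decimals, a rule that happens to reproduce the figures in this report. The helper name `floor2` is made up for illustration.

```python
import math


def floor2(value: float) -> float:
    """Floor to two decimals (assumed display rule, not confirmed by Codecov)."""
    return math.floor(value * 100) / 100


# Raw counts copied from the coverage diff.
main = {"hits": 3111, "misses": 318, "partials": 88, "lines": 3517}
pr = {"hits": 2786, "misses": 769, "partials": 134, "lines": 3689}

# Sanity check: every line is a hit, a miss, or a partial.
for report in (main, pr):
    assert report["hits"] + report["misses"] + report["partials"] == report["lines"]

main_cov = 100 * main["hits"] / main["lines"]
pr_cov = 100 * pr["hits"] / pr["lines"]

print(floor2(main_cov))           # coverage on main
print(floor2(pr_cov))             # coverage with this PR merged
print(floor2(pr_cov - main_cov))  # signed change
```

Under this assumption, flooring the signed difference gives -12.94 (the table value), while flooring its absolute value gives the 12.93 quoted in the summary sentence, which is one reading consistent with both numbers.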

| Impacted Files | Coverage Δ | |
|---|---|---|
| dask_sql/physical/rex/core/literal.py | `46.00% <44.59%> (-49.46%)` | ⬇️ |
| dask_sql/physical/rex/core/input_ref.py | `80.00% <50.00%> (ø)` | |
| dask_sql/physical/rex/core/call.py | `81.75% <50.44%> (-16.35%)` | ⬇️ |
| dask_sql/physical/rex/core/subquery.py | `57.14% <57.14%> (ø)` | |
| dask_sql/physical/rel/custom/alter.py | `37.03% <66.66%> (-55.56%)` | ⬇️ |
| dask_sql/physical/rel/custom/show_models.py | `53.33% <66.66%> (-26.67%)` | ⬇️ |
| dask_sql/physical/rex/base.py | `81.81% <66.66%> (+4.04%)` | ⬆️ |
| dask_sql/physical/rel/logical/explain.py | `70.00% <70.00%> (ø)` | |
| dask_sql/physical/rel/logical/subquery_alias.py | `70.00% <70.00%> (ø)` | |
| dask_sql/physical/rel/convert.py | `86.95% <75.00%> (-0.55%)` | ⬇️ |

... and 58 more

jdye64 and others added 30 commits March 2, 2022 15:23

First pass at datafusion parsing

12da2f1

updates

c544794

updates

7596144

updates

1b17a6a

DaskSchema implementation for Python in Rust

c5dbb96

updated mappings so that Python types map to PyArrow types which is the type also used by Datafusion Statements which are logical plans

5249808

Add ability to add columns to an existing DaskTable

b20704e

Add ability to tables to be added to the DaskSchema

dc98d7c

Completion of _get_ral() function in dask-sql. Still does not actually compute yet

64d2688

Finished converting base class and DaskRelDataType and DaskRelDataTypeField

5cf8eb0

Can make a very simple pass of a projection on a TableScan operation query work now

5b4d169

updates

02c257c

Allow for the rough registration of Schemas to the DaskSQLContext

13cc875

pytest test_context.py working/checkpoint

ca410e4

all unit tests passing/checkpoint

1eced89

checkpoint

41ffe94

Update on test_select.py

45a568b

Refactor setup.py

c66f0a3

Refactored Rust code to traverse the AST SQL parse tree

172d3cc

merge updates

55bf4c2

Bump DataFusion version (#494)

b695daa

* bump DataFusion version
* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

Minor code cleanup in row_type() (#504)

672821f

* Minor code cleanup in row_type()
* remove unwrap

Bump Rust version to 1.60 from 1.59 (#508)

5932f35

add support for expr_to_field for Expr::Sort expressions (#515)

346be12

reduce crate dependencies (#516)

70d7fb8

charlesbluca and others added 23 commits September 14, 2022 10:14

Merge pull request #757 from dask-contrib/merge-upstream-main

87a6681

sync: main to datafusion-sql-planner

Upgrade pyo (#762)

7ec9812

Merge branch 'datafusion-sql-planner' into merge-upstream-main

5da63a9

Merge pull request #764 from dask-contrib/merge-upstream-main

862b901

sync: main to datafusion-sql-planner

[DF] Switch back to architectured builds (#765)

0577f8e

* Switch back to architectured builds
* Make sure to upload correct files
* Update python versions in build config
* Modify triggering conditions for builds
* Add Rust-specific files to paths

Remove python constraint (#766)

6676574

Use DataFusion 12.0.0 (#767)

312b01b

[DF] Use correct schema in TableProvider (#769)

fc1507b

* use correct schema in table provider
* add check for catalog
* Update dask_planner/src/sql.rs

Co-authored-by: GALI PREM SAGAR

Update docs (#768)

a8241b8

* initial updates
* Update installation.rst
* Update installation.rst

Co-authored-by: GALI PREM SAGAR

[DF] Add support for switching schema in DaskSqlContext (#770)

4a0fbc5

* use correct schema in table provider
* add check for catalog
* Add support for switching schemas

Merge pull request #775 from dask-contrib/main

e683bab

sync: main to datafusion-sql-planner

c.ipython_magic fix for Jupyter Lab (#772)

3cab31c

* jupyter lab fix
* disable_highlighting
* multiple things
* needs_local_scope
* Update ipython.py
* Resolve style errors

Co-authored-by: Charles Blackmon-Luca

Remove PyPI release workflow

150e374

Merge pull request #776 from dask-contrib/remove-pypi-release

13db840

[DF] Remove PyPI release workflow

Merge pull request #778 from dask-contrib/main

509a091

sync: main to datafusion-sql-planner

[DF] Fix regressions in EliminateAggDistinct, run cargo test in CI (#782)

0570b4b

* Revert to earlier version of rule and skip failing tests
* add Rust workflow
* ci
* ci

[DF] Fix remaining regressions in optimizer rule (#784)

3c000b6

* Revert to earlier version of rule and skip failing tests
* add Rust workflow
* ci
* ci
* fix remaining regressions in optimizer rule
* fmt
* unignore tests

charlesbluca changed the title ~~Switch to DataFusion SQL parser~~ Switch to Arrow DataFusion SQL parser Sep 21, 2022

ayushdg approved these changes Sep 21, 2022

galipremsagar approved these changes Sep 21, 2022

charlesbluca merged commit 28910e0 into main Sep 21, 2022

charlesbluca deleted the datafusion-sql-planner branch September 21, 2022 20:57

charlesbluca mentioned this pull request Oct 11, 2022

Use correct env file for dask-sql builds rapidsai/dask-build-environment#49

Merged
