Switch to Arrow DataFusion SQL parser #788
Merged
…he type also used by Datafusion Statements which are logical plans
* Add basic predicate-pushdown optimization (#433) * basic predicate-pushdown support * remove explict Dispatch class * use _Frame.fillna * cleanup comments * test coverage * improve test coverage * add xfail test for dt accessor in predicate and fix test_show.py * fix some naming issues * add config and use assert_eq * add logging events when predicate-pushdown bails * move bail logic earlier in function * address easier code review comments * typo fix * fix creation_info access bug * convert any expression to DNF * csv test coverage * include IN coverage * improve test rigor * address code review * skip parquet tests when deps are not installed * fix bug * add pyarrow dep to cluster workers * roll back test skipping changes Co-authored-by: Charles Blackmon-Luca <[email protected]> * Add workflow to keep datafusion dev branch up to date (#440) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Updates to dates and parsing dates like postgresql does * Update gpuCI `RAPIDS_VER` to `22.06` (#434) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * Bump black to 22.3.0 (#443) * Check for ucx-py nightlies when updating gpuCI (#441) * Simplify gpuCI updating workflow * Add check for cuML nightly version * Refactored to adjust for better type management * Refactor schema and statements * update types * fix syntax issues and renamed function name calls * Add handling for newer `prompt_toolkit` versions in cmd tests (#447) * Add handling for newer prompt-toolkit version * Place compatibility code in _compat * Fix version for gha-find-replace (#446) * Improved error handling and code clean up * move pieces of logical.rs to seperated files to ensure code readability * left join working * Update versions of Java dependencies (#445) * Update versions for java dependencies with cves * Rerun tests * Update jackson 
databind version (#449) * Update versions for java dependencies with cves * Rerun tests * update jackson-databind dependency * Disable SQL server functionality (#448) * Disable SQL server functionality * Update docs/source/server.rst Co-authored-by: Ayush Dattagupta <[email protected]> * Disable server at lowest possible level * Skip all server tests * Add tests to ensure server is disabled * Fix CVE fix test Co-authored-by: Ayush Dattagupta <[email protected]> * Update dask pinnings for release (#450) * Add Java source code to source distribution (#451) * Bump `httpclient` dependency (#453) * Revert "Disable SQL server functionality (#448)" This reverts commit 37a3a61. * Bump httpclient version * Unpin Dask/distributed versions (#452) * Unpin dask/distributed post release * Remove dask/distributed version ceiling * Add jsonschema to ci testing (#454) * Add jsonschema to ci env * Fix typo in config schema * Switch tests from `pd.testing.assert_frame_equal` to `dd.assert_eq` (#365) * Start moving tests to dd.assert_eq * Use assert_eq in datetime filter test * Resolve most resulting test failures * Resolve remaining test failures * Convert over tests * Convert more tests * Consolidate select limit cpu/gpu test * Remove remaining assert_series_equal * Remove explicit cudf imports from many tests * Resolve rex test failures * Remove some additional compute calls * Consolidate sorting tests with getfixturevalue * Fix failed join test * Remove breakpoint * Use custom assert_eq function for tests * Resolve test failures / seg faults * Remove unnecessary testing utils * Resolve local test failures * Generalize RAND test * Avoid closing client if using independent cluster * Fix failures on Windows * Resolve black failures * Make random test variables more clear * First basic working checkpoint for group by * Set max pin on antlr4-python-runtime (#456) * Set max pin on antlr4-python-runtime due to incompatibilities with fugue_sql * update comment on antlr max pin version * 
Updates to style * stage pre-commit changes for upstream merge * Fix black failures * Updates to Rust formatting * Fix rust lint and clippy * Remove jar building step which is no longer needed * Remove Java from github workflows matrix * Removes jar and Java references from test.yml * Update Release workflow to remove references to Java * Update rust.yml to remove references from linux-build-lib * Add pre-commit.sh file to provide pre-commit support for Rust in a convenient script * Removed overlooked jdk references * cargo clippy auto fixes * Address all Rust clippy warnings * Include setuptools-rust in conda build recipie * Include setuptools-rust in conda build recipie, in host and run * Adjustments for conda build, committing for others to help with error and see it occurring in CI * Include sql.yaml in package files * Include pyarrow in run section of conda build to ensure tests pass * include setuptools-rust in host and run of conda since removing caused errors * to_string() method had been removed in rust and not removed here, caused conda run_test.py to fail when this line was hit * Replace commented out tests with pytest.skip and bump version of pyarrow to 7.0.0 * Fix setup.py syntax issue introduced on last commit by find/replace * Rename Datafusion -> DataFusion and Apache DataFusion -> Arrow DataFusion * Fix docs build environment * Include Rust compiler in docs environment * Bump Rust compiler version to 1.59 * Ok, well readthedocs didn't like that * Store libdask_planner.so and retrieve it between github workflows * Cache the Rust library binary * Remove Cargo.lock from git * Remove unused datafusion-expr crate * Build datafusion at each test step instead of caching binaries * Remove maven and jar cache steps from test-upstream.yaml * Removed dangling 'build' workflow step reference * Lowered PyArrow version to 6.0.1 since cudf has a hard requirement on that version for the version of cudf we are using * Add Rust build step to test in dask cluster * 
Install setuptools-rust for pip to use for bare requirements import * Include pyarrow 6.0.1 via conda as a bare minimum dependency * Remove cudf dependency for python 3.9 which is causing build issues on windows * Address documentation from review * Install Rust as readthedocs post_create_environment step * Run rust install non-interactively * Run rust install non-interactively * Rust isn't available in PyPi so remove that dependency * Append ~/.cargo/bin to the PATH * Print out some environment information for debugging * Print out some environment information for debugging * More - Increase verbosity * More - Increase verbosity * More - Increase verbosity * Switch RTD over to use Conda instead of Pip since having issues with Rust and pip * Try to use mamba for building docs environment * Partial review suggestion address, checking CI still works * Skip mistakenly enabled tests * Use DataFusion master branch, and fix syntax issues related to the version bump * More updates after bumping DataFusion version to master * Use actions-rs in github workflows debug flag for setup.py * Remove setuptools-rust from conda * Use re-exported Rust types for BuiltinScalarFunction * Move python imports to TYPE_CHECKING section where applicable * Address review concerns and remove pre-commit.sh file * Pin to a specific github rev for DataFusion Co-authored-by: Richard (Rick) Zamora <[email protected]> Co-authored-by: Charles Blackmon-Luca <[email protected]> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Ayush Dattagupta <[email protected]>
* bump DataFusion version * remove unnecessary downcasts and use separate structs for TableSource and TableProvider
* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipie, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were inadvertent during rebase * timestamp with loglca time zone * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? * Convert final strs to SqlTypeName Enum * removed a few print statements * Temporarily disable conda run_test.py script since it uses features not yet implemented * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * Remove print statements * Default to UTC if tz is None * Delegate timezone handling to the arrow library * Updates from review Co-authored-by: Charles Blackmon-Luca <[email protected]>
* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipie, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were inadvertent during rebase * timestamp with loglca time zone * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? * Convert final strs to SqlTypeName Enum * removed a few print statements * commit to share with colleague * updates * Temporarily disable conda run_test.py script since it uses features not yet implemented * formatting after upstream merge * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * checkpoint * Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls * skip test that uses create statement for gpuci * uncommented pytests * uncommented pytests * code cleanup for review * code cleanup for review * Enabled more pytest that work now * Enabled more pytest that work now * Disable 2 pytest that are causing gpuCI issues. They will be address in a follow up PR * Mark just the GPU tests as skipped Co-authored-by: Charles Blackmon-Luca <[email protected]>
* Minor code cleanup in row_type() * remove unwrap
* helper code for getting column name from expression * Update dask_planner/src/expression.rs Co-authored-by: Jeremy Dyer <[email protected]> * Update dask_planner/src/expression.rs Co-authored-by: Jeremy Dyer <[email protected]> * fix build * Improve error handling Co-authored-by: Jeremy Dyer <[email protected]>
* Update exceptions that are thrown * Remove Java error regex formatting logic. Rust messages will be presented already formatted from Rust itself * Removed lingering test that was still trying to test out Java specific error messages * Update dask_planner/src/sql.rs Co-authored-by: Andy Grove <[email protected]> * clean up logical_relational_algebra function Co-authored-by: Andy Grove <[email protected]>
sync: main to datafusion-sql-planner
sync: main to datafusion-sql-planner
* Switch back to architectured builds * Make sure to upload correct files * Update python versions in build config * Modify triggering conditions for builds * Add Rust-specific files to paths
…` statements (#747) * Generalize CREATE and PREDICT MODEL to accept non-native SELECT statements * Refactor parse_create_model * Restrict nested input queries to SELECT statements * Add tests * Use expect_one_of_keywords to expand supported input queries * Skip test on independent cluster
* use correct schema in table provider * add check for catalog * Update dask_planner/src/sql.rs Co-authored-by: GALI PREM SAGAR <[email protected]>
* initial updates * Update installation.rst * Update installation.rst Co-authored-by: GALI PREM SAGAR <[email protected]>
* use correct schema in table provider * add check for catalog * Add support for switching schemas
sync: main to datafusion-sql-planner
* jupyter lab fix * disable_highlighting * multiple things * needs_local_scope * Update ipython.py * Resolve style errors Co-authored-by: Charles Blackmon-Luca <[email protected]>
[DF] Remove PyPI release workflow
sync: main to datafusion-sql-planner
* Support complex queries with multiple distinct aggregates * format * fixes * remove debug logging * add another test * support GROUP BY with COUNT and COUNT DISTINCT * bug fix and a hack * save experiment with stripping qualifiers and removing hash from names * save progess * clippy and fmt * use trace logging and update test to use alias * remove strip_qualifier code * bug fix * skip failing tests * lint * fix merge issue * revert changes in Cargo.lock * refactor to reduce duplicate code * Update _distinct_agg_expr to support AggregateUDF Co-authored-by: Ayush Dattagupta <[email protected]> Co-authored-by: Charles Blackmon-Luca <[email protected]>
* save progress * partial fix * Add accessor functions for time64 and timestamp values * Add handling for time,timestamp literals * Re-add log env_log removed during merge * Remove debug/commented code * Un-xfail query 32 * Implement support for scalar Decimal128 Co-authored-by: Ayush Dattagupta <[email protected]> Co-authored-by: Charles Blackmon-Luca <[email protected]>
* Uncomment test_group_by_filtered * Enable filtered aggs, implement get_filter_expr * Compute filter columns in _collect_aggregations * Resolve test failures * Add back in aggregate function assertion Co-authored-by: GALI PREM SAGAR <[email protected]>
#782) * Revert to earlier version of rule and skip failing tests * add Rust workflow * ci * ci
* Revert to earlier version of rule and skip failing tests * add Rust workflow * ci * ci * fix remaining regressions in optimizer rule * fmt * unignore tests
* regr_count * add regr_syy and regr_sxx * regr_syy and regr_sxx functionality * remove covars * Update test_groupby.py * split test_stats_aggregation into regr and covar * format fix * add gpu param * Update tests/integration/test_groupby.py Co-authored-by: Ayush Dattagupta <[email protected]> * format fix Co-authored-by: Ayush Dattagupta <[email protected]>
* Revert to earlier version of rule and skip failing tests * add Rust workflow * ci * ci * fix remaining regressions in optimizer rule * fmt * save progress * unignore tests * fix regression from agg with filter changes * save progress * code cleanup * logging * lint Co-authored-by: GALI PREM SAGAR <[email protected]>
ayushdg approved these changes on Sep 21, 2022
galipremsagar approved these changes on Sep 21, 2022
Codecov Report
@@             Coverage Diff             @@
##             main     #788       +/-   ##
===========================================
- Coverage   88.45%   75.52%   -12.94%
===========================================
  Files          69       73        +4
  Lines        3517     3689      +172
  Branches      711      769       +58
===========================================
- Hits         3111     2786      -325
- Misses        318      769      +451
- Partials       88      134       +46
This PR reflects the culmination of work in `datafusion-sql-planner` to switch from our Java-based Calcite SQL parser to Arrow DataFusion's Rust-based SQL parser.