
Return TableProviderFilterPushDown::Exact when Parquet Pushdown Enabled #12135

Merged · 24 commits · Sep 17, 2024

Conversation

itsjunetime (Contributor)

Which issue does this PR close?

I think this should close #4028

Rationale for this change

As far as I can tell, this follows the recommendations stated in #4028. The only part that I'm confused about is specifically the last requirement for this API: "The predicate is fully pushed down by ParquetExec (not all predicates are supported)".

To fulfill this requirement, this PR pulls in code that used to reside in FilterCandidateBuilder that recurses through an expr and checks if any column:

  • Is not contained inside the provided schemas (as it must then be projected), OR
  • Contains a nested datatype, such as a struct or a list.

If any column in the given expression meets either of these two conditions, we assume the filter can't be fully pushed down and return Inexact from this API (as well as, of course, when the format isn't parquet or pushdown is not enabled).
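
To make the check concrete, here is a minimal, self-contained sketch of the recursion described above. The types (`DataType`, `Expr`) and the function name are simplified, hypothetical stand-ins for illustration, not DataFusion's actual API:

```rust
// Hypothetical, simplified sketch of the eligibility check described above.
// `DataType`, `Expr`, and `can_push_down_exactly` are illustrative stand-ins,
// not DataFusion's real types.

enum DataType {
    Int64,
    Struct, // stand-in for any nested type (struct, list, ...)
}

enum Expr {
    Column(String),
    Literal(i64),
    BinaryOp { left: Box<Expr>, right: Box<Expr> },
}

/// True only if every column the expression touches exists in `schema`
/// and is not a nested type.
fn can_push_down_exactly(expr: &Expr, schema: &[(&str, DataType)]) -> bool {
    match expr {
        Expr::Column(name) => match schema.iter().find(|(n, _)| *n == name.as_str()) {
            None => false,                        // column not in file schema (projected)
            Some((_, DataType::Struct)) => false, // nested datatype
            Some(_) => true,
        },
        Expr::Literal(_) => true,
        // Every column referenced anywhere in the tree must qualify.
        Expr::BinaryOp { left, right } => {
            can_push_down_exactly(left, schema) && can_push_down_exactly(right, schema)
        }
    }
}

fn main() {
    let schema = [("a", DataType::Int64), ("s", DataType::Struct)];
    let gt = |col: &str| Expr::BinaryOp {
        left: Box::new(Expr::Column(col.to_string())),
        right: Box::new(Expr::Literal(5)),
    };
    assert!(can_push_down_exactly(&gt("a"), &schema));  // flat, known column
    assert!(!can_push_down_exactly(&gt("s"), &schema)); // nested column
    assert!(!can_push_down_exactly(&gt("b"), &schema)); // unknown column
    println!("eligibility checks passed");
}
```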

I'm a little worried about this because I was under the impression that there were more complex requirements governing whether an expression could be pushed down (specifically for compound expressions - e.g. AND/OR exprs have rules that single-column exprs don't have to worry about). Those rules don't seem to be reflected in any code I could find, so I might just be being a bit paranoid. I'm not certain and would appreciate feedback on that point specifically :)

What changes are included in this PR?

Code is mostly just moved around to facilitate the changes described above.

Are these changes tested?

Yes - I've added a few tests to make sure all the changes I'm aware of are now tested.

Are there any user-facing changes?

No API changes, but I think the affected ListingTable API is public-facing, so this will be a slight behavioral change.
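
For context, the decision this PR changes can be sketched as follows. The enum and function here are hypothetical stand-ins that only mirror the shape of `TableProviderFilterPushDown` (`Exact` means the source fully applies the filter and no `FilterExec` is needed above it; `Inexact` means the planner keeps a `FilterExec` to re-check rows):

```rust
// Hedged sketch: these names mirror the shape of the TableProvider pushdown
// verdict (one value per filter) but are illustrative, not the real API.
#[derive(Debug, PartialEq)]
enum FilterPushDown {
    Exact,   // the source applies the filter fully; no FilterExec needed above
    Inexact, // the source may return extra rows; a FilterExec re-checks them
}

/// One verdict per filter: Exact only when pushdown is enabled AND the
/// filter passed an eligibility check (flat, known columns only).
fn supports_filters(pushdown_enabled: bool, filter_eligible: &[bool]) -> Vec<FilterPushDown> {
    filter_eligible
        .iter()
        .map(|eligible| {
            if pushdown_enabled && *eligible {
                FilterPushDown::Exact
            } else {
                FilterPushDown::Inexact
            }
        })
        .collect()
}

fn main() {
    // With pushdown enabled, eligible filters become Exact ...
    assert_eq!(
        supports_filters(true, &[true, false]),
        vec![FilterPushDown::Exact, FilterPushDown::Inexact]
    );
    // ... with it disabled, everything stays Inexact.
    assert_eq!(supports_filters(false, &[true]), vec![FilterPushDown::Inexact]);
    println!("pushdown verdicts as expected");
}
```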

@github-actions bot added the `physical-expr`, `core`, `common`, and `proto` labels (Aug 23, 2024)
@alamb (Contributor) left a comment

Thank you @itsjunetime -- this is very nice

I left some ideas for how to potentially improve this PR

Note for other reviewers: I found https://github.com/apache/datafusion/pull/12135/files?w=1 to be easier to understand

Testing

In order to merge this PR I think we need an end-to-end test that shows:

  1. An explain query against a parquet file (e.g. `EXPLAIN SELECT ... FROM foo.parquet WHERE col = '5'`) showing that there is indeed a FilterExec above the ParquetExec
  2. Show that when pushdown filters are enabled (e.g. `set datafusion.execution.parquet.pushdown_filters = true`) there is no more FilterExec

I would suggest adding to this slt file:
https://github.com/apache/datafusion/blob/e4be013064943786c9915bbc79c18ee82106340a/datafusion/sqllogictest/test_files/parquet.slt

Here are the instructions for sqllogictest: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest

It would also be really great if we could make a PR that added coverage as described in #12115, but I don't think that is strictly required

Avoid #[cfg(..)] if possible

I also think we should consider avoiding the #[cfg(...)] checks by extending the FileFormat trait (I commented inline), but we could also do that as a follow-on PR

Suggestion: Split into 2 PRs

If you pull the cleanups (e.g. improving use of flags, macro->function, etc) into their own PR, we could get them reviewed and merged faster. I realize it takes more effort on the author's (your) part to create multiple PRs, but it optimizes for what is often more limited (reviewer bandwidth)

Discussion about expression pushdown

I'm worried about this a little bit because I was under the impression that there were more complex requirements which defined whether or not an expression could be pushed down (specifically with regard to more complex expressions - e.g. AND/OR exprs have more complex rules that single-column exprs don't have to worry about)

DataFusion already handles the rules for when a predicate can be pushed down between various types of plan nodes (e.g. with grouping, joins, multiple tables, etc) in the PushDownFilter pass

By the time it gets to a TableProvider the predicates have already been broken up

I double-checked, and ArrowPredicates take arbitrary expressions; internally they can evaluate complex expressions such as AND/OR

cc @XiangpengHao, @Dandandan, @thinkharderdev and @tustvold who I think may be interested in this feature as well
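
The point above about a single pushed-down predicate carrying AND/OR structure internally can be sketched like this (simplified stand-in types, not the real ArrowPredicate API):

```rust
// Illustrative sketch only (simplified stand-in types, not the real
// ArrowPredicate API): a single pushed-down predicate can itself contain
// AND/OR structure and be evaluated row by row.
enum Pred {
    Gt { col: usize, val: i64 },
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
}

fn eval(p: &Pred, row: &[i64]) -> bool {
    match p {
        Pred::Gt { col, val } => row[*col] > *val,
        Pred::And(a, b) => eval(a, row) && eval(b, row),
        Pred::Or(a, b) => eval(a, row) || eval(b, row),
    }
}

fn main() {
    // (a > 2 AND b > 0) OR a > 10, evaluated over row [a, b]
    let pred = Pred::Or(
        Box::new(Pred::And(
            Box::new(Pred::Gt { col: 0, val: 2 }),
            Box::new(Pred::Gt { col: 1, val: 0 }),
        )),
        Box::new(Pred::Gt { col: 0, val: 10 }),
    );
    assert!(eval(&pred, &[3, 1]));  // left conjunct matches
    assert!(!eval(&pred, &[1, 1])); // neither branch matches
    println!("predicate with AND/OR evaluated correctly");
}
```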

@itsjunetime (Contributor, Author)

Drive-by fixes have been extracted into their own PR, #12240. Still working on the other recommendations.

@itsjunetime (Contributor, Author)

Rebased to make it mergeable

@github-actions bot removed the `common` label (Sep 3, 2024)
@itsjunetime (Contributor, Author)

Another rebasing to clean out the drive-by cleanups after merging of the other PR

@alamb (Contributor) left a comment

Thank you @itsjunetime -- this PR looks really nice, both in structure and comments.

I think it is important to write some end-to-end tests (e.g. slt tests) that show explain plans with these filters enabled / disabled correctly -- and I did so in #12362

It turns out that when I ran this PR's changes against the tests in #12362, the test output didn't change. I am taking a look to see if I can figure out why that is.

@alamb previously approved these changes Sep 6, 2024
@alamb (Contributor) left a comment

It turns out that when I ran this PR's changes against the tests in #12362, the test output didn't change. I am taking a look to see if I can figure out why that is.

I looked into this some more and found out it is because the setting (enable_pushdown) is inherited when the table is created -- so just setting it in the session isn't enough; it needs to be set when the table is created.

I will adjust my tests accordingly. Thus I think this PR is ready to go from my perspective (we can add the tests as a follow-on).
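
The inherited-at-creation behavior described above can be illustrated with a toy sketch (all names here are hypothetical stand-ins, not the real DataFusion config types): the table snapshots the flag when it is created, so flipping the session flag afterwards has no effect on it.

```rust
// Toy model of the behavior found above (all names hypothetical): the table
// copies the session's pushdown_filters flag when it is created, so changing
// the session flag afterwards does not affect an already-created table.
struct SessionConfig {
    pushdown_filters: bool,
}

struct TableOptionsSketch {
    pushdown_filters: bool, // snapshotted at creation time
}

impl TableOptionsSketch {
    fn create(session: &SessionConfig) -> Self {
        Self {
            pushdown_filters: session.pushdown_filters,
        }
    }
}

fn main() {
    let mut session = SessionConfig { pushdown_filters: false };
    let table = TableOptionsSketch::create(&session);

    // Enabling the flag only in the session, after creation, is not enough:
    session.pushdown_filters = true;
    assert!(session.pushdown_filters);
    assert!(!table.pushdown_filters);

    // It must be set before the table is created:
    let table2 = TableOptionsSketch::create(&session);
    assert!(table2.pushdown_filters);
    println!("inherited-at-creation behavior demonstrated");
}
```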

@github-actions bot added the `sqllogictest` label (Sep 6, 2024)
@alamb (Contributor) commented Sep 6, 2024

Thanks again @itsjunetime -- very nice work.

I took the liberty of merging up from main to run all the latest tests and ensure CI is passing. Once the CI is passing I plan to merge this PR.

Thanks again 🙏

@alamb self-requested a review (September 6, 2024, 20:06)
@alamb (Contributor) commented Sep 6, 2024

🤔 sorry to flip flop on this, but I wrote some more tests in #12362 and when I run them on this branch I see an internal error

To reproduce:

  1. Check out this branch
  2. Cherry-pick the new tests from #12362 ("Minor: Add tests for using FilterExec when parquet was pushed down"): `git cherry-pick bbe30e0`
  3. Run the test and update the output: `cargo test --test sqllogictests -- parquet_filter_pushdown --complete`
  4. Observe the diff: `git diff`

This results in an internal error like the following (when the file didn't have the column):

 # When filter pushdown *is* enabled, ParquetExec can filter exactly,
 # not just metadata, so we expect to see no FilterExec
 # once https://github.com/apache/datafusion/issues/4028 is fixed
-query T
+query error DataFusion error: Arrow error: External error: External: Compute error: Error evaluating filter predicate: Internal\("PhysicalExpr Column references column 'b' at index 0 \(zero\-based\) but input schema only has 0 columns: \[\]"\)
 select a from t_pushdown where b > 2 ORDER BY a;

@alamb dismissed their stale review (September 6, 2024, 20:39): "Found some errors with tests"

@alamb (Contributor) commented Sep 10, 2024

To reproduce:

Sorry that I forgot to post this, but the tests have since been merged to main, so you should be able to pick them up by merging this branch up from main, if you haven't already done so

@alamb (Contributor) commented Sep 11, 2024

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it clearer which PRs are waiting for feedback and which are not.

Please mark it as ready for review when it is ready for another look

@alamb marked this pull request as draft (September 11, 2024, 15:05)
@itsjunetime (Contributor, Author)

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it clearer which PRs are waiting for feedback and which are not.

Please mark it as ready for review when it is ready for another look

Sounds good, thanks - I'm checking out the failing tests now and will also handle rebasing once I'm done

@alamb (Contributor) left a comment

I am just writing up some notes I have from a discussion I just had with @itsjunetime (mostly so I don't forget this)

We traced the issue to the fact that after returning TableProviderFilterPushDown::Exact there is now a difference between the columns used by the rest of the query (columns above the filter) and the columns needed for evaluating the filters

The SchemaAdapter is now removing columns necessary for evaluating the filter from the batches passed to the filter, which is causing the error

The code that removes the columns is the map_partial_batch added by @jeffreyssmith2nd in #10716 (which also applies any necessary type casting)

As I understand it, @itsjunetime is going to look into how to modify / replace map_partial_batch so that it does not remove columns the filter needs, even when they aren't necessary for the rest of the query

I also plan to improve some of the documentation based on our debugging today
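
A toy illustration of the column-stripping bug discussed above and the intended fix (all names hypothetical; the real code lives in the SchemaAdapter / map_partial_batch path, which operates on Arrow record batches rather than column-name lists):

```rust
// Simplified illustration of the bug discussed above (all names hypothetical):
// if the schema mapper projects batches down to only the columns the query
// SELECTs, a filter referencing a different column can no longer be evaluated.
// The fix is to keep filter-only columns in the mapped batch.
fn map_batch<'a>(
    batch_cols: &[&'a str],
    select_cols: &[&str],
    filter_cols: &[&str],
    keep_filter_cols: bool,
) -> Vec<&'a str> {
    batch_cols
        .iter()
        .filter(|c| {
            select_cols.contains(*c) || (keep_filter_cols && filter_cols.contains(*c))
        })
        .copied()
        .collect()
}

fn main() {
    // `SELECT a FROM t WHERE b > 2`: the query projects [a], the filter needs [b].
    let batch = ["a", "b"];
    // Buggy behavior: `b` is stripped, so the filter cannot find its column.
    assert_eq!(map_batch(&batch, &["a"], &["b"], false), vec!["a"]);
    // Fixed behavior: `b` survives for the filter to evaluate.
    assert_eq!(map_batch(&batch, &["a"], &["b"], true), vec!["a", "b"]);
    println!("filter columns preserved in mapped batch");
}
```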

@github-actions bot added the `logical-expr` and `optimizer` labels (Sep 13, 2024)
@itsjunetime (Contributor, Author)

Alright, I've pushed changes to pipe through the information about the schema to the necessary places (specifically, the SchemaMapper) so that the tests added by @alamb don't error anymore. I also added a lot of documentation in the latest commit to explain what I now understand and how I came to understand it.

@itsjunetime marked this pull request as ready for review (September 13, 2024, 21:18)
@alamb (Contributor) left a comment

Thank you @itsjunetime -- this now looks really nice to me. Thank you for sticking with this

I have a few minor comment suggestions but I think we could do that as a follow-on PR as well

I noticed there is a CI failure https://github.com/apache/datafusion/actions/runs/10856219950/job/30130449796?pr=12135

This is likely due to using some feature that is not available in Rust 1.77 (the current MSRV). You can probably fix it by using some less nice syntax. However, it might also just start passing once @comphead increases the minimum supported rust version to 1.78 in #12398

cc @Dandandan and @thinkharderdev as I think you are also users of this filter pushdown code. This should make the plans more efficient by avoiding redundant FilterExecs (and potentially also producing fewer columns)

04)------FilterExec: b@1 > 2, projection=[a@0]
05)--------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=2
06)----------ParquetExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]}, projection=[a, b], predicate=b@1 > 2, pruning_predicate=CASE WHEN b_null_count@1 = b_row_count@2 THEN false ELSE b_max@0 > 2 END, required_guarantees=[]
03)----ParquetExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]}, projection=[a], predicate=b@1 > 2, pruning_predicate=CASE WHEN b_null_count@1 = b_row_count@2 THEN false ELSE b_max@0 > 2 END, required_guarantees=[]
so-beautiful

@alamb (Contributor) commented Sep 16, 2024

I plan to merge this tomorrow to give people time to comment / respond if desired

Thanks again

url.object_store()
} else {
    let Some(object_store_url) =
        self.table_paths.first().map(ListingTableUrl::object_store)
👍

@alamb (Contributor) commented Sep 17, 2024

🚀

@alamb merged commit be42f3d into apache:main (Sep 17, 2024)
24 checks passed