[SPARK-44493][SQL] Translate catalyst expression into partial datasource filter #43769

wangyum · 2023-11-12T15:54:57Z

What changes were proposed in this pull request?

SPARK-22548 fixed incorrect nested AND expression pushed down to JDBC data source. But Parquet/ORC data sources do not need that fix because they always keep the Filter node.

For example

CREATE TABLE parquet_table (name string, theid int) using parquet;
EXPLAIN SELECT * FROM parquet_table WHERE (THEID > 0 AND TRIM(NAME) = 'mary') OR (NAME = 'fred');

-- Before SPARK-22548
== Physical Plan ==
*Filter (((THEID#21 > 0) && (trim(NAME#20) = mary)) || (NAME#20 = fred))
+- *FileScan parquet default.parquet_table[name#20,theid#21] PushedFilters: [Or(GreaterThan(theid,0),EqualTo(name,fred))]

-- After SPARK-22548(Actually we can push down [Or(GreaterThan(theid,0),EqualTo(name,fred))] to data source)
== Physical Plan ==
*Filter (((THEID#21 > 0) && (trim(NAME#20) = mary)) || (NAME#20 = fred))
+- *FileScan parquet default.parquet_table[name#20,theid#21] PushedFilters: []

This PR adds a function parameter(named canPartialPushDown) to DataSourceStrategy.translateFilter. The caller side can set it to true if caller side always keep the Filter node to push down more filters.

Why are the changes needed?

Pushdown more filters to improve query performance. This is a very common case and can significantly improve query performance. For example:

1. id IN (1677950,1468757,2059706,2053742,2047936,5387,2046791) OR (id=2351460 and very_heavy_udf(very_large_column, 'key') = 'EXPC')
2. (id IN (4279154) AND very_heavy_udf(very_large_column, 'key') IS NOT NULL) OR
(id = 3914 AND very_heavy_udf(very_large_column, 'sid') = 'p.2380')

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test and benchmark test:

Before	After

Was this patch authored or co-authored using generative AI tooling?

No.

wangyum · 2023-11-12T15:55:51Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala

cc @dongjoon-hyun @viirya

dongjoon-hyun

Thank you for pinging me, @wangyum .

This is not a correctness patch but the PR description might look misleading. Could you avoid a word correct in your PR description?

Also, cc @huaxingao too

dongjoon-hyun · 2023-11-12T22:15:19Z

Also, cc @cloud-fan from #19776, too

wangyum · 2023-11-13T01:39:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala

Before:

== Physical Plan == *(1) Project [NAME#7, THEID#8] +- *(1) Filter (((THEID#8 > 0) AND (trim(NAME#7, None) = mary)) OR (THEID#8 > 10)) +- BatchScan json file:/private/var/folders/j0/b7tktmwj1pl591yd6ysvv5sm0000gq/T/spark-ea31f369-4c0e-442b-941d-bbde3be2f41b[NAME#7, THEID#8] JsonScan DataFilters: [(((THEID#8 > 0) AND (trim(NAME#7, None) = mary)) OR (THEID#8 > 10))], Format: json, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/j0/b7tktmwj1pl591yd6ysvv5sm0000gq/T/spark-ea..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<NAME:string,THEID:int> RuntimeFilters: []

After:

== Physical Plan == *(1) Project [NAME#7, THEID#8] +- *(1) Filter (((THEID#8 > 0) AND (trim(NAME#7, None) = mary)) OR (THEID#8 > 10)) +- BatchScan json file:/private/var/folders/j0/b7tktmwj1pl591yd6ysvv5sm0000gq/T/spark-aaa18d17-865a-45d0-9bc1-e8e90855add5[NAME#7, THEID#8] JsonScan DataFilters: [(((THEID#8 > 0) AND (trim(NAME#7, None) = mary)) OR (THEID#8 > 10))], Format: json, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/j0/b7tktmwj1pl591yd6ysvv5sm0000gq/T/spark-aa..., PartitionFilters: [], PushedFilters: [Or(GreaterThan(THEID,0),GreaterThan(THEID,10))], ReadSchema: struct<NAME:string,THEID:int> RuntimeFilters: []

What is the different?

PushedFilters is not empty:

PushedFilters: [Or(GreaterThan(THEID,0),GreaterThan(THEID,10))]

I feel we can partially push down for each data sources.(contains JDBC database).
Please tell me the exceptions, if not possible.

These are two things.

wangyum · 2023-11-13T01:54:11Z

sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala

+  /**
+   * Returns a set with all the filters present in the physical plan.
+   */
+  def getPhysicalFilters(df: DataFrame): ExpressionSet = {
+    ExpressionSet(
+      df.queryExecution.executedPlan.collect {
+        case execution.FilterExec(f, _) => splitConjunctivePredicates(f)
+      }.flatten)
+  }
+
+  /**
+   * Returns a resolved expression for `str` in the context of `df`.
+   */
+  def resolve(df: DataFrame, str: String): Expression = {
+    df.select(expr(str)).queryExecution.analyzed.expressions.head.children.head
+  }


Moved from FileSourceStrategySuite.scala

beliefer

partially push down due to the parquet or orc partitions?

beliefer · 2023-11-13T01:56:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala

What is the different?

wangyum · 2023-11-13T13:31:44Z

cc @dongjoon-hyun Already fixed PR description.

github-actions · 2024-02-22T00:18:05Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the SQL label Nov 12, 2023

wangyum commented Nov 12, 2023

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala Outdated

Copy link

Member Author

wangyum Nov 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @dongjoon-hyun @viirya

dongjoon-hyun reviewed Nov 12, 2023

View reviewed changes

wangyum commented Nov 13, 2023

View reviewed changes

fix

c3b41e2

wangyum force-pushed the SPARK-44493-dev branch from 8d91a3c to c3b41e2 Compare November 13, 2023 01:53

wangyum commented Nov 13, 2023

View reviewed changes

beliefer reviewed Nov 13, 2023

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala Outdated

Copy link

Contributor

beliefer Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the different?

github-actions bot added the Stale label Feb 22, 2024

github-actions bot closed this Feb 23, 2024

[SPARK-44493][SQL] Translate catalyst expression into partial datasource filter #43769

[SPARK-44493][SQL] Translate catalyst expression into partial datasource filter #43769

Uh oh!

Conversation

wangyum commented Nov 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun left a comment

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun commented Nov 12, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

beliefer left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wangyum commented Nov 13, 2023

Uh oh!

github-actions bot commented Feb 22, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

wangyum commented Nov 12, 2023 •

edited

Loading