Skip to content

Conversation

@wangyum
Copy link
Member

@wangyum wangyum commented Nov 12, 2023

What changes were proposed in this pull request?

SPARK-22548 fixed incorrect nested AND expression pushed down to JDBC data source. But Parquet/ORC data sources do not need that fix because they always keep the Filter node.

For example

CREATE TABLE parquet_table (name string, theid int) using parquet;
EXPLAIN SELECT * FROM parquet_table WHERE (THEID > 0 AND TRIM(NAME) = 'mary') OR (NAME = 'fred');
-- Before SPARK-22548
== Physical Plan ==
*Filter (((THEID#21 > 0) && (trim(NAME#20) = mary)) || (NAME#20 = fred))
+- *FileScan parquet default.parquet_table[name#20,theid#21] PushedFilters: [Or(GreaterThan(theid,0),EqualTo(name,fred))]

-- After SPARK-22548(Actually we can push down [Or(GreaterThan(theid,0),EqualTo(name,fred))] to data source)
== Physical Plan ==
*Filter (((THEID#21 > 0) && (trim(NAME#20) = mary)) || (NAME#20 = fred))
+- *FileScan parquet default.parquet_table[name#20,theid#21] PushedFilters: []

This PR adds a function parameter(named canPartialPushDown) to DataSourceStrategy.translateFilter. The caller side can set it to true if caller side always keep the Filter node to push down more filters.

Why are the changes needed?

Pushdown more filters to improve query performance. This is a very common case and can significantly improve query performance. For example:

1. id IN (1677950,1468757,2059706,2053742,2047936,5387,2046791) OR (id=2351460 and very_heavy_udf(very_large_column, 'key') = 'EXPC')
2. (id IN (4279154) AND very_heavy_udf(very_large_column, 'key') IS NOT NULL) OR
(id = 3914 AND very_heavy_udf(very_large_column, 'sid') = 'p.2380')

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test and benchmark test:

Before After
image image
image image

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Nov 12, 2023
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pinging me, @wangyum .

This is not a correctness patch but the PR description might look misleading. Could you avoid a word correct in your PR description?

Also, cc @huaxingao too

@dongjoon-hyun
Copy link
Member

Also, cc @cloud-fan from #19776, too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before:

== Physical Plan ==
*(1) Project [NAME#7, THEID#8]
+- *(1) Filter (((THEID#8 > 0) AND (trim(NAME#7, None) = mary)) OR (THEID#8 > 10))
   +- BatchScan json file:/private/var/folders/j0/b7tktmwj1pl591yd6ysvv5sm0000gq/T/spark-ea31f369-4c0e-442b-941d-bbde3be2f41b[NAME#7, THEID#8] JsonScan DataFilters: [(((THEID#8 > 0) AND (trim(NAME#7, None) = mary)) OR (THEID#8 > 10))], Format: json, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/j0/b7tktmwj1pl591yd6ysvv5sm0000gq/T/spark-ea..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<NAME:string,THEID:int> RuntimeFilters: []

After:

== Physical Plan ==
*(1) Project [NAME#7, THEID#8]
+- *(1) Filter (((THEID#8 > 0) AND (trim(NAME#7, None) = mary)) OR (THEID#8 > 10))
   +- BatchScan json file:/private/var/folders/j0/b7tktmwj1pl591yd6ysvv5sm0000gq/T/spark-aaa18d17-865a-45d0-9bc1-e8e90855add5[NAME#7, THEID#8] JsonScan DataFilters: [(((THEID#8 > 0) AND (trim(NAME#7, None) = mary)) OR (THEID#8 > 10))], Format: json, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/j0/b7tktmwj1pl591yd6ysvv5sm0000gq/T/spark-aa..., PartitionFilters: [], PushedFilters: [Or(GreaterThan(THEID,0),GreaterThan(THEID,10))], ReadSchema: struct<NAME:string,THEID:int> RuntimeFilters: []

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the different?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PushedFilters is not empty:

PushedFilters: [Or(GreaterThan(THEID,0),GreaterThan(THEID,10))]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we can partially push down for each data sources.(contains JDBC database).
Please tell me the exceptions, if not possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are two things.

Comment on lines +258 to +273
/**
* Returns a set with all the filters present in the physical plan.
*/
def getPhysicalFilters(df: DataFrame): ExpressionSet = {
ExpressionSet(
df.queryExecution.executedPlan.collect {
case execution.FilterExec(f, _) => splitConjunctivePredicates(f)
}.flatten)
}

/**
* Returns a resolved expression for `str` in the context of `df`.
*/
def resolve(df: DataFrame, str: String): Expression = {
df.select(expr(str)).queryExecution.analyzed.expressions.head.children.head
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@beliefer beliefer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partially push down due to the parquet or orc partitions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the different?

@wangyum
Copy link
Member Author

wangyum commented Nov 13, 2023

cc @dongjoon-hyun Already fixed PR description.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 22, 2024
@github-actions github-actions bot closed this Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants