Reduce duplicate Projection when PARQUET_PUSHDOWN_FILTERS is on. #4398

Open
Ted-Jiang opened this issue Nov 28, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@Ted-Jiang
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

❯ explain  select l_orderkey from lineitem where l_extendedprice='2618.76';
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                      |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: lineitem.l_orderkey                                                                                                                                                                                                                                           |
|               |   Filter: CAST(lineitem.l_extendedprice AS Utf8) = Utf8("2618.76")                                                                                                                                                                                                        |
|               |     TableScan: lineitem projection=[l_orderkey, l_extendedprice], partial_filters=[CAST(lineitem.l_extendedprice AS Utf8) = Utf8("2618.76")]                                                                                                                              |
| physical_plan | ProjectionExec: expr=[l_orderkey@0 as l_orderkey]                                                                                                                                                                                                                         |
|               |   CoalesceBatchesExec: target_batch_size=4096                                                                                                                                                                                                                             |
|               |     FilterExec: CAST(l_extendedprice@1 AS Utf8) = 2618.76                                                                                                                                                                                                                 |
|               |       RepartitionExec: partitioning=RoundRobinBatch(16)                                                                                                                                                                                                                   |
|               |         ParquetExec: limit=None, partitions=[*.parquet], predicate=CAST(l_extendedprice AS Utf8) = Utf8("2618.76"), projection=[l_orderkey, l_extendedprice] |
|               |                                                                                                                                                                                                                                                                           |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.017 seconds.

This query involves two columns:
1. The selected column, l_orderkey
2. The filter column, l_extendedprice

IMO, this situation, where the filter column is not among the selected columns, is a perfect fit for row_filter_push_down. The filter column only needs to be visited once, to build the row selection; it does not need to be decoded again for the output.
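To make the idea concrete, here is a minimal sketch of a scan that applies the predicate as a row filter. Everything in it is a stand-in invented for illustration: `RowGroup`, `build_row_selection`, `decode_selected`, and `scan_with_row_filter` are hypothetical names, not DataFusion or parquet APIs, and the predicate compares floats directly rather than casting to Utf8 as the plan above does.

```rust
/// Toy stand-in for a Parquet row group: two columns stored separately.
/// (Hypothetical type for illustration; not a DataFusion or parquet API.)
struct RowGroup {
    l_orderkey: Vec<i64>,
    l_extendedprice: Vec<f64>,
}

/// Step 1: decode only the filter column and turn the predicate into a
/// row selection (the indices of the matching rows).
fn build_row_selection(filter_col: &[f64], predicate: impl Fn(f64) -> bool) -> Vec<usize> {
    filter_col
        .iter()
        .enumerate()
        .filter(|(_, v)| predicate(**v))
        .map(|(i, _)| i)
        .collect()
}

/// Step 2: decode the projected column only at the selected rows.
fn decode_selected(col: &[i64], selection: &[usize]) -> Vec<i64> {
    selection.iter().map(|&i| col[i]).collect()
}

/// Scan with the predicate pushed down: the filter column is read once to
/// build the selection and never becomes part of the output.
fn scan_with_row_filter(rg: &RowGroup) -> Vec<i64> {
    let selection = build_row_selection(&rg.l_extendedprice, |p| p == 2618.76);
    decode_selected(&rg.l_orderkey, &selection)
}
```

In this sketch the output holds only `l_orderkey` values, so nothing above the scan would need `l_extendedprice` any more.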

Describe the solution you'd like

Maybe we can change the physical plan to

| physical_plan | ProjectionExec: expr=[l_orderkey@0 as l_orderkey]                                                                                                                                                                                                                         |
|               |   CoalesceBatchesExec: target_batch_size=4096                                                                                                                                                                                                                             |
|               |       RepartitionExec: partitioning=RoundRobinBatch(16)                                                                                                                                                                                                                   |
|               |         ParquetExec: limit=None, partitions=[*.parquet], predicate=CAST(l_extendedprice AS Utf8) = Utf8("2618.76"), projection=[l_orderkey] |
|               |


Ted-Jiang added the enhancement label Nov 28, 2022
@Ted-Jiang
Member Author

@alamb @tustvold @thinkharderdev I look forward to hearing your opinions 😄

@tustvold
Contributor

I think this relates to #4020, which appears to have been closed?!

@thinkharderdev
Contributor

So if I understand correctly, you are saying we should push down the predicate so that we only use it to generate a row selection for the other columns (i.e. we do not buffer the predicate column in memory after we have generated the selection)?

@thinkharderdev
Contributor

Would this be accomplished by simply making the filter pushdown exact? Or is there something else required?

@tustvold
Contributor

Ah yes, I misread the ticket. #4028 may resolve this, although it might need an additional optimisation pass to remove the now-redundant projection.

@Ted-Jiang
Member Author

Ted-Jiang commented Nov 28, 2022

> we do not buffer the predicate column in memory after we have generated the selection

Yes, just as you mentioned, it is an exact filter.

> Would this be accomplished by simply making the filter pushdown exact?

Sounds like a reasonable solution; it seems this part is missing from the row filter.

@alamb
Contributor

alamb commented Nov 28, 2022

If the parquet pushdown filtering is indeed exact, returning exact sounds like a good idea to me. RowGroup pruning and page index pruning are definitely not exact.
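The distinction between exact and inexact pushdown can be sketched with a toy model. The `Pushdown` enum and the functions below are hypothetical and written only for this illustration; they loosely mirror the idea behind DataFusion's `TableProviderFilterPushDown`, but are not its API. Row groups are modelled as plain vectors, and the predicate is a simple equality check.

```rust
/// How completely a scan applies a pushed-down predicate
/// (hypothetical enum for this sketch).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Pushdown {
    /// The scan may still return non-matching rows; keep the filter above it.
    Inexact,
    /// The scan returns only matching rows; the filter above it is redundant.
    Exact,
}

/// Statistics-based pruning: skip a row group only when its min/max range
/// proves that no row can equal `target`. Surviving groups are returned whole.
fn prune_by_stats(groups: &[Vec<i64>], target: i64) -> Vec<i64> {
    groups
        .iter()
        .filter(|g| match (g.iter().min(), g.iter().max()) {
            (Some(&lo), Some(&hi)) => lo <= target && target <= hi,
            _ => false,
        })
        .flatten()
        .copied()
        .collect()
}

/// Row-level filtering: evaluate the predicate on every row.
fn filter_rows(groups: &[Vec<i64>], target: i64) -> Vec<i64> {
    groups.iter().flatten().copied().filter(|&v| v == target).collect()
}

/// Decide whether a filter must remain above the scan.
fn needs_filter_above(p: Pushdown) -> bool {
    p == Pushdown::Inexact
}
```

With groups `[1, 5, 9]`, `[10, 20]`, and `[5, 5]` and a target of 5, the statistics-based version drops only the second group and still returns 1 and 9, which is why that kind of pruning cannot be reported as exact.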

@thinkharderdev
Contributor

> Ah yes, I misread the ticket. #4028 may resolve this, although it might need an additional optimisation pass to remove the now-redundant projection.

I think it should work as is. As long as ProjectionPushdown runs after FilterPushdown, the projection should be eliminated entirely, I think.
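The ordering argument can be illustrated with a toy optimizer. This is only a sketch under stated assumptions: `Plan`, `push_down_filters`, `push_down_projection`, and `issue_plan` are hypothetical names and are not DataFusion's `LogicalPlan` or its optimizer rules, and the filter pushdown here simply assumes the scan applies the predicate exactly.

```rust
/// Minimal logical plan for this sketch (hypothetical, not DataFusion's LogicalPlan).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Projection { cols: Vec<String>, input: Box<Plan> },
    Filter { col: String, input: Box<Plan> },
    Scan { projection: Vec<String>, pushed_filter: Option<String> },
}

/// Filter pushdown, assuming the scan applies the predicate exactly:
/// the Filter node is removed and its predicate column is recorded on the scan.
fn push_down_filters(plan: Plan) -> Plan {
    match plan {
        Plan::Projection { cols, input } => Plan::Projection {
            cols,
            input: Box::new(push_down_filters(*input)),
        },
        Plan::Filter { col, input } => match push_down_filters(*input) {
            Plan::Scan { projection, .. } => Plan::Scan { projection, pushed_filter: Some(col) },
            other => Plan::Filter { col, input: Box::new(other) },
        },
        scan => scan,
    }
}

/// Projection pushdown: collect the columns that operators above the scan
/// still need, narrow the scan to those, and drop a Projection node that
/// no longer changes anything.
fn push_down_projection(plan: Plan, required: Vec<String>) -> Plan {
    match plan {
        Plan::Projection { cols, input } => {
            let child = push_down_projection(*input, cols.clone());
            let redundant =
                matches!(&child, Plan::Scan { projection, .. } if *projection == cols);
            if redundant {
                child
            } else {
                Plan::Projection { cols, input: Box::new(child) }
            }
        }
        Plan::Filter { col, input } => {
            let mut needed = required;
            if !needed.contains(&col) {
                needed.push(col.clone());
            }
            Plan::Filter { col, input: Box::new(push_down_projection(*input, needed)) }
        }
        Plan::Scan { pushed_filter, .. } => Plan::Scan { projection: required, pushed_filter },
    }
}

/// The logical plan from the issue: Projection over Filter over TableScan.
fn issue_plan() -> Plan {
    Plan::Projection {
        cols: vec!["l_orderkey".to_string()],
        input: Box::new(Plan::Filter {
            col: "l_extendedprice".to_string(),
            input: Box::new(Plan::Scan {
                projection: vec!["l_orderkey".to_string(), "l_extendedprice".to_string()],
                pushed_filter: None,
            }),
        }),
    }
}
```

Calling `push_down_projection(push_down_filters(issue_plan()), vec![])` leaves a single scan that projects only `l_orderkey`. Running projection pushdown alone keeps the Filter node, so the scan must still produce `l_extendedprice`.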
