Move nested union optimization from plan builder to logical optimizer #7695

Merged
alamb merged 8 commits into apache:main from add_union_optimization
Oct 10, 2023

Conversation

@maruschin maruschin commented Sep 29, 2023

Which issue does this PR close?

Closes #7481.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the optimizer Optimizer rules label Sep 29, 2023
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Sep 29, 2023
@maruschin maruschin force-pushed the add_union_optimization branch 3 times, most recently from 4c42dc4 to 770bf79 Compare September 29, 2023 06:24
@maruschin maruschin changed the title [WIP] Add Union optimization Add nested union optimization Sep 29, 2023
@maruschin maruschin changed the title Add nested union optimization Move nested union optimization from plan builder to logical optimizer Sep 29, 2023
@maruschin maruschin marked this pull request as ready for review September 29, 2023 09:11

@alamb alamb left a comment

Thank you @maruschin -- this is looking quite close. The basic idea and testing look very solid. I think several of the CI failures on this PR are due to #7701, so merging / rebasing with main will help.

I also double-checked that this branch produces the same plan as main:

❯ explain select * from (select 1 UNION ALL (select 2 UNION ALL select 3));
+---------------+----------------------------------------+
| plan_type     | plan                                   |
+---------------+----------------------------------------+
| logical_plan  | Union                                  |
|               |   Projection: Int64(1) AS Int64(1)     |
|               |     EmptyRelation                      |
|               |   Projection: Int64(2) AS Int64(1)     |
|               |     EmptyRelation                      |
|               |   Projection: Int64(3) AS Int64(1)     |
|               |     EmptyRelation                      |
| physical_plan | UnionExec                              |
|               |   ProjectionExec: expr=[1 as Int64(1)] |
|               |     EmptyExec: produce_one_row=true    |
|               |   ProjectionExec: expr=[2 as Int64(1)] |
|               |     EmptyExec: produce_one_row=true    |
|               |   ProjectionExec: expr=[3 as Int64(1)] |
|               |     EmptyExec: produce_one_row=true    |
|               |                                        |
+---------------+----------------------------------------+
2 rows in set. Query took 0.008 seconds.
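
The flattening visible in the plan above (one `Union` with three inputs instead of a `Union` nested inside another `Union`) can be sketched in a few lines. This is only an illustration under stated assumptions: the `Plan` enum and `flatten_unions` function are hypothetical stand-ins, not DataFusion's `LogicalPlan` or the rule added in this PR, and the sketch only covers `UNION ALL` (nested distinct unions are left as a TODO in the PR itself).

```rust
// Hypothetical toy plan type, NOT DataFusion's LogicalPlan.
// It models only what is needed to show the idea.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// A leaf input such as `SELECT 1`, identified by a label.
    Leaf(String),
    /// A UNION ALL over any number of inputs.
    Union(Vec<Plan>),
}

/// Recursively inlines every Union that is a direct input of another Union.
/// Assumption: all unions are UNION ALL, so inlining does not change results.
fn flatten_unions(plan: Plan) -> Plan {
    match plan {
        Plan::Union(inputs) => {
            let mut flat = Vec::new();
            for input in inputs {
                // Flatten the input first, then splice its children in
                // if it turned out to be a Union itself.
                match flatten_unions(input) {
                    Plan::Union(grandchildren) => flat.extend(grandchildren),
                    other => flat.push(other),
                }
            }
            Plan::Union(flat)
        }
        leaf => leaf,
    }
}
```

Applied to a plan shaped like `select 1 UNION ALL (select 2 UNION ALL select 3)`, this returns a single `Union` whose three inputs keep their original order.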

I found that a few sqllogictests seem to fail, specifically:

cargo test --test sqllogictests -- timestamp

Running "timestamps.slt"
External error: query result mismatch:
[SQL] SELECT date_trunc('hour', TIMESTAMPTZ '2000-01-01T00:00:00+00:45') as ts_irregular_offset
 UNION ALL
SELECT date_trunc('hour', TIMESTAMPTZ '2000-01-01T00:00:00+00:30') as ts_irregular_offset
 UNION ALL
SELECT date_trunc('hour', TIMESTAMPTZ '2000-01-01T00:00:00+00:15') as ts_irregular_offset
 UNION ALL
SELECT date_trunc('hour', TIMESTAMPTZ '2000-01-01T00:00:00-00:15') as ts_irregular_offset
 UNION ALL
SELECT date_trunc('hour', TIMESTAMPTZ '2000-01-01T00:00:00-00:30') as ts_irregular_offset
 UNION ALL
SELECT date_trunc('hour', TIMESTAMPTZ '2000-01-01T00:00:00-00:45') as ts_irregular_offset
[Diff] (-expected|+actual)
-   1999-12-31T23:00:00Z
-   1999-12-31T23:00:00Z
-   1999-12-31T23:00:00Z
-   2000-01-01T00:00:00Z
-   2000-01-01T00:00:00Z
-   2000-01-01T00:00:00Z
+   1999-12-31T23:00:00
+   1999-12-31T23:00:00
+   1999-12-31T23:00:00
+   2000-01-01T00:00:00
+   2000-01-01T00:00:00
+   2000-01-01T00:00:00
at test_files/timestamps.slt:1471

This looks to me like the timestamp's timezone got lost somehow: the expected values end in `Z`, but the actual ones do not 🤔

datafusion/optimizer/src/eliminate_one_union.rs (outdated, resolved)
datafusion/optimizer/src/eliminate_nested_union.rs (outdated, resolved)
@maruschin maruschin force-pushed the add_union_optimization branch 2 times, most recently from c4a11b7 to cdad0be Compare October 2, 2023 23:39
@github-actions github-actions bot added the sql SQL Planner label Oct 3, 2023
@maruschin maruschin marked this pull request as draft October 3, 2023 02:53
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Oct 3, 2023
@maruschin maruschin force-pushed the add_union_optimization branch 8 times, most recently from d8024ab to e874a08 Compare October 6, 2023 04:53
@maruschin maruschin marked this pull request as ready for review October 6, 2023 04:59
@maruschin maruschin force-pushed the add_union_optimization branch 2 times, most recently from 6258d57 to e59e54c Compare October 6, 2023 06:32
plan: &LogicalPlan,
_config: &dyn OptimizerConfig,
) -> Result<Option<LogicalPlan>> {
// TODO: Add optimization for nested distinct unions.

At some point (maybe as a TODO) we should probably also remove unions that have only a single child, since they are basically a no-op/pass-through. I am not sure whether this should be a separate optimizer rule or be done in this rule.

I believe the EliminateOneUnion rule, also added in this PR, handles the single-child union case.
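
For illustration, the single-child case discussed here can be sketched as follows. This is a minimal sketch, not the `EliminateOneUnion` implementation from this PR: the `Plan` enum and the free function `eliminate_one_union` are hypothetical, and the `Option` return value only loosely mirrors the `Result<Option<LogicalPlan>>` shape seen in the snippet above (`None` meaning "rule did not apply").

```rust
// Hypothetical toy plan type, NOT DataFusion's LogicalPlan.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// A leaf input, identified by a label.
    Leaf(String),
    /// A union over any number of inputs.
    Union(Vec<Plan>),
}

/// Returns `Some(child)` when `plan` is a Union with exactly one input,
/// since such a Union is a pass-through. Returns `None` when the rule
/// does not apply and the plan should be left unchanged.
fn eliminate_one_union(plan: &Plan) -> Option<Plan> {
    match plan {
        Plan::Union(inputs) if inputs.len() == 1 => Some(inputs[0].clone()),
        _ => None,
    }
}
```

Keeping this as its own small rule, separate from the flattening rule, means each rewrite can be tested in isolation.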

@alamb alamb left a comment

Thank you very much @maruschin -- this is a really nice piece of work. It is well tested and the code is very clean.

Thank you @crepererum and @jackwener for the reviews

alamb commented Oct 10, 2023

I took the liberty of merging this branch up from main and fixing clippy, as I had it all checked out already.

I also wrote some end-to-end tests in #7787.

@alamb alamb merged commit 704e034 into apache:main Oct 10, 2023
22 checks passed

alamb commented Oct 10, 2023

Thanks again @maruschin

@maruschin maruschin deleted the add_union_optimization branch October 11, 2023 00:56
devinjdangelo pushed a commit to devinjdangelo/arrow-datafusion that referenced this pull request Oct 11, 2023
…apache#7695)

* Add naive implementation of eliminate_nested_union

* Remove union optimization from LogicalPlanBuilder::union

* Fix propagate_union_children_different_schema test

* Add implementation of eliminate_one_union

* Simplified eliminate_nested_union test

* Fix

* clippy

---------

Co-authored-by: Evgeny Maruschenko <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Optimize nested unions
5 participants