ARROW-11366: [Datafusion] Implement constant folding for boolean literal expressions #9309

houqp · 2021-01-24T23:38:13Z

No description provided.

github-actions · 2021-01-24T23:38:34Z

https://issues.apache.org/jira/browse/ARROW-11366

houqp · 2021-01-24T23:52:37Z

rust/rustfmt.toml

drive by config fix

codecov-io · 2021-01-25T00:39:42Z

Codecov Report

Merging #9309 (4318278) into master (88e9eb8) will increase coverage by 0.12%.
The diff coverage is 95.21%.

@@            Coverage Diff             @@
##           master    #9309      +/-   ##
==========================================
+ Coverage   82.27%   82.40%   +0.12%     
==========================================
  Files         234      235       +1     
  Lines       54594    55094     +500     
==========================================
+ Hits        44919    45398     +479     
- Misses       9675     9696      +21

Impacted Files	Coverage Δ
rust/datafusion/src/logical_plan/plan.rs	`82.45% <82.14%> (-0.04%)`	⬇️
rust/datafusion/src/optimizer/constant_folding.rs	`95.86% <95.86%> (ø)`
rust/datafusion/src/execution/context.rs	`90.17% <100.00%> (+0.08%)`	⬆️
rust/datafusion/src/sql/planner.rs	`83.23% <100.00%> (+0.07%)`	⬆️
rust/datafusion/src/optimizer/utils.rs	`52.39% <0.00%> (+0.36%)`	⬆️
rust/datafusion/src/logical_plan/expr.rs	`80.66% <0.00%> (+0.47%)`	⬆️

Dandandan · 2021-01-26T07:19:26Z

rust/datafusion/src/optimizer/boolean_comparison.rs

I think we could make this optimizer a bit more generic (not necessary in this PR) by separating the recursion from the pattern matching.

This is more the "optimizer framework" I had in mind in the comment on the roadmap
@alamb @jorgecarleitao @vertexclique .

A common strategy (used by Spark) for rule / replacement based optimizers is to have a loop that just does something like this:

let mut changed = true; while changed { (logical_plan, changed) = apply_optimizations(rules, logical_plan); }

A rule could be something like (returning Some on replaced output) and doesn't need the boilerplate of recursion for every rule.

    // Optimizer can work both on expr and logical plans, default returns `None`
    impl OptimizerRule for BooleanComparison {
        fn optimize_expr(&mut self, e: &Expr) -> Option<Expr> {
            match e {
                Expr::BinaryExpr { left, op, right } => {
                    let left = optimize_expr(left);
                    let right = optimize_expr(right);
                    match op {
                        Operator::Eq => match (&left, &right) {
                            (Expr::Literal(ScalarValue::Boolean(b)), _) => match b {
                                Some(true) => Some(right),
                                Some(false) | None => Some(Expr::Not(Box::new(right))),
                            },
                            (_, Expr::Literal(ScalarValue::Boolean(b))) => match b {
                                Some(true) => Some(left),
                                Some(false) | None => Some(Expr::Not(Box::new(left))),
                            },
                            _ => None,
                        },
                        Operator::NotEq => match (&left, &right) {
                            (Expr::Literal(ScalarValue::Boolean(b)), _) => match b {
                                Some(false) | None => Some(right),
                                Some(true) => Some(Expr::Not(Box::new(right))),
                            },
                            (_, Expr::Literal(ScalarValue::Boolean(b))) => match b {
                                Some(false) | None => Some(left),
                                Some(true) => Some(Expr::Not(Box::new(left))),
                            },
                            _ => None,
                        },
                        _ => None,
                    }
                }
                // any other expression is left untouched by this rule
                _ => None,
            }
        }
    }
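To make the idea of keeping recursion out of the rule concrete, here is a minimal, self-contained sketch. The `Expr` enum, the `Rule` alias, and the `rewrite_once` / `optimize` helpers are all hypothetical stand-ins invented for illustration; they are not DataFusion types, and the rule assumes its operand is boolean typed.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(bool),
    Not(Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
}

/// A rule looks at one node only and returns `Some(replacement)` or `None`.
type Rule = fn(&Expr) -> Option<Expr>;

/// Hypothetical rule: `x = true` becomes `x`, `x = false` becomes `NOT x`.
/// Assumes `x` is boolean typed. Note that it contains no recursion.
fn fold_bool_eq(e: &Expr) -> Option<Expr> {
    match e {
        Expr::Eq(l, r) => match (l.as_ref(), r.as_ref()) {
            (Expr::Literal(true), other) | (other, Expr::Literal(true)) => Some(other.clone()),
            (Expr::Literal(false), other) | (other, Expr::Literal(false)) => {
                Some(Expr::Not(Box::new(other.clone())))
            }
            _ => None,
        },
        _ => None,
    }
}

/// The driver owns the traversal: rebuild children first, then offer the
/// rebuilt node to every rule and record whether anything fired.
fn rewrite_once(e: &Expr, rules: &[Rule], changed: &mut bool) -> Expr {
    let rebuilt = match e {
        Expr::Not(inner) => Expr::Not(Box::new(rewrite_once(inner, rules, changed))),
        Expr::Eq(l, r) => Expr::Eq(
            Box::new(rewrite_once(l, rules, changed)),
            Box::new(rewrite_once(r, rules, changed)),
        ),
        leaf => leaf.clone(),
    };
    for rule in rules {
        if let Some(replacement) = rule(&rebuilt) {
            *changed = true;
            return replacement;
        }
    }
    rebuilt
}

/// Repeat full passes until a pass makes no replacement.
fn optimize(mut e: Expr, rules: &[Rule]) -> Expr {
    loop {
        let mut changed = false;
        e = rewrite_once(&e, rules, &mut changed);
        if !changed {
            return e;
        }
    }
}
```

With this split, `(a = true) = false` is reduced to `NOT a` by calling `optimize(expr, &[fold_bool_eq])`, and adding a second rule only means adding another small function to the slice.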

@Dandandan -- I think this PR as written is quite efficient and doesn't need a convergence loop as you suggest (which I think ends up potentially being quite inefficient if many rewrites are required) -- it already does a depth first traversal of the tree, simplifying on the way up.

I think convergence loops might be best used if we have several rules that can each potentially make changes that would unlock additional optimizations of the others

For example, if you had two different optimization functions like optimization_A and optimization_B but parts of optimization_A wouldn't be applied unless you ran optimization_B. In that case a loop like the following would let you take full advantage of that

let mut changed = true; while changed { let exprA = optimization_A(expr); let exprB = optimization_B(exprA); changed = expr != exprB; expr = exprB }
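A runnable sketch of that convergence loop, under stated assumptions: `Expr`, `map_bottom_up`, `optimization_a` (removes double negation) and `optimization_b` (turns `x = false` into `NOT x`, assuming `x` is boolean) are hypothetical toys chosen so that A can only fire after B has run.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(bool),
    Not(Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
}

/// Apply `f` to every node, children first.
fn map_bottom_up(e: &Expr, f: &dyn Fn(Expr) -> Expr) -> Expr {
    let rebuilt = match e {
        Expr::Not(inner) => Expr::Not(Box::new(map_bottom_up(inner, f))),
        Expr::Eq(l, r) => Expr::Eq(
            Box::new(map_bottom_up(l, f)),
            Box::new(map_bottom_up(r, f)),
        ),
        leaf => leaf.clone(),
    };
    f(rebuilt)
}

/// Hypothetical optimization A: `NOT NOT x` becomes `x`.
fn optimization_a(e: &Expr) -> Expr {
    map_bottom_up(e, &|node: Expr| match node {
        Expr::Not(inner) => match *inner {
            Expr::Not(x) => *x,
            other => Expr::Not(Box::new(other)),
        },
        other => other,
    })
}

/// Hypothetical optimization B: `x = false` becomes `NOT x` (x assumed boolean).
fn optimization_b(e: &Expr) -> Expr {
    map_bottom_up(e, &|node: Expr| match node {
        Expr::Eq(l, r) => match (*l, *r) {
            (x, Expr::Literal(false)) | (Expr::Literal(false), x) => Expr::Not(Box::new(x)),
            (l, r) => Expr::Eq(Box::new(l), Box::new(r)),
        },
        other => other,
    })
}

/// Run A then B repeatedly until a full round leaves the expression unchanged.
fn optimize(mut expr: Expr) -> Expr {
    loop {
        let expr_a = optimization_a(&expr);
        let expr_b = optimization_b(&expr_a);
        if expr_b == expr {
            return expr;
        }
        expr = expr_b;
    }
}
```

For `NOT (x = false)`, a single A-then-B round stops at `NOT NOT x`, because the double negation only appears after B has run; the loop gives A a second chance and ends at `x`.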

As you can probably guess given PRs like #9278 my preference to avoid repeating the structure walking logic is via a Visitor. Perhaps after this PR is merged, I can take a shot at rewriting it using a general ExprRewriter type pattern.
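For readers unfamiliar with the pattern, a rewriter-style visitor could look roughly like the sketch below. The trait name matches the one mentioned above, but the `mutate` method, the `rewrite` driver, the toy `Expr`, and `BoolEqFolder` are assumptions made up for this illustration, not the API that was later proposed for DataFusion.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(bool),
    Not(Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
}

/// Hypothetical rewriter trait: implementors only say what to do with one node.
trait ExprRewriter {
    /// Called once per node, after the node's children have been rewritten.
    fn mutate(&mut self, e: Expr) -> Expr;
}

impl Expr {
    /// The structure walking logic lives here, once, for every rewriter.
    fn rewrite<R: ExprRewriter>(self, r: &mut R) -> Expr {
        let rebuilt = match self {
            Expr::Not(inner) => Expr::Not(Box::new((*inner).rewrite(r))),
            Expr::Eq(l, rhs) => Expr::Eq(
                Box::new((*l).rewrite(r)),
                Box::new((*rhs).rewrite(r)),
            ),
            leaf => leaf,
        };
        r.mutate(rebuilt)
    }
}

/// Example rewriter: folds `x = true` into `x` (x assumed boolean) and
/// counts how many times it did so.
struct BoolEqFolder {
    rewrites: usize,
}

impl ExprRewriter for BoolEqFolder {
    fn mutate(&mut self, e: Expr) -> Expr {
        match e {
            Expr::Eq(l, r) => match (*l, *r) {
                (x, Expr::Literal(true)) | (Expr::Literal(true), x) => {
                    self.rewrites += 1;
                    x
                }
                (l, r) => Expr::Eq(Box::new(l), Box::new(r)),
            },
            other => other,
        }
    }
}
```

Because the rewriter is a value with state, it can carry things like a schema or counters through the walk without threading extra arguments through every recursive call.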

👍 Yes, agreed, this doesn't need a loop at the moment as there is no interaction yet with other rules. But I guess once you add them you will need it if you want to combine it with other rules. Using expr != exprB for changed might be a good way to start. The main thing I want to stress is that at some point we don't want the recursion itself in a rule, but in a more general "optimization framework".

The optimization loop itself can be written in such a way it does both top down and bottom up replacements, applying the same rule while the optimization "generates" a replacement for the node, so it doesn't need multiple traversals for cases like you mention. The concept is here in polars
https://github.com/ritchie46/polars/blob/master/polars/polars-lazy/src/logical_plan/optimizer/mod.rs#L69
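One way to read the single-traversal idea is sketched below. This is not the polars code linked above; `Expr`, `simplify`, and `rewrite` are hypothetical, and the sketch assumes every replacement is strictly smaller than the node it replaces, which is what guarantees termination here.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(bool),
    Not(Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
}

/// Hypothetical non-recursive rule: `NOT NOT x` becomes `x`, and
/// `x = true` becomes `x` (x assumed boolean).
fn simplify(e: &Expr) -> Option<Expr> {
    match e {
        Expr::Not(inner) => match inner.as_ref() {
            Expr::Not(x) => Some(x.as_ref().clone()),
            _ => None,
        },
        Expr::Eq(l, r) => match (l.as_ref(), r.as_ref()) {
            (x, Expr::Literal(true)) | (Expr::Literal(true), x) => Some(x.clone()),
            _ => None,
        },
        _ => None,
    }
}

/// One traversal that applies the rule on the way down and on the way up.
fn rewrite(e: Expr, rule: fn(&Expr) -> Option<Expr>) -> Expr {
    // Top-down: keep replacing this node for as long as the rule fires.
    let mut node = e;
    while let Some(next) = rule(&node) {
        node = next;
    }
    // Recurse into whatever children are left.
    let rebuilt = match node {
        Expr::Not(inner) => Expr::Not(Box::new(rewrite(*inner, rule))),
        Expr::Eq(l, r) => Expr::Eq(
            Box::new(rewrite(*l, rule)),
            Box::new(rewrite(*r, rule)),
        ),
        leaf => leaf,
    };
    // Bottom-up: simplified children may have unlocked the rule for this node.
    match rule(&rebuilt) {
        Some(next) => rewrite(next, rule),
        None => rebuilt,
    }
}
```

`a = NOT NOT true` shows the bottom-up half: nothing fires at the root until the right child has collapsed to `true`, after which the root becomes `a` without a second pass over the whole tree.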

And sounds like a perfect candidate to try for ExprRewriter 👍

Yeah, totally agree with both of you. It was really tempting to introduce an optimization framework while writing all that tree traversal boilerplate, but I decided it's probably better done as a separate refactoring so we can get a better feeling for how the new framework affects all the existing rules. The way we are currently managing optimization rules is definitely too raw ;)

I will take a look at the Expression Visitor pattern as well, since that's something that already exists in the code base.

alamb

This is great @houqp -- thank you! I went over it fairly carefully, and I have some structural / naming suggestions, but I think the logic is sound and this could also be merged as is.

alamb · 2021-01-26T11:20:31Z

rust/datafusion/src/optimizer/boolean_comparison.rs

I think you could also apply optimize_expr to then and else_expr:

Suggested change

-                        .map(|(when, then)| (Box::new(optimize_expr(when)), then.clone()))
-                        .collect(),
-                    else_expr: else_expr.clone(),
+                        .map(|(when, then)| (Box::new(optimize_expr(when)), optimize_expr(then)))
+                        .collect(),
+                    else_expr: optimize_expr(else_expr),
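A self-contained sketch of what recursing into all three parts of a CASE expression might look like. The `Expr` enum below is a simplified, hypothetical stand-in (the field names mirror the snippet above, the rest is assumed), and the folding rule deliberately ignores column types, so it is only valid when every operand is boolean.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(bool),
    Eq(Box<Expr>, Box<Expr>),
    Case {
        when_then_expr: Vec<(Box<Expr>, Box<Expr>)>,
        else_expr: Option<Box<Expr>>,
    },
}

/// Fold `x = true` into `x` everywhere, including WHEN, THEN and ELSE branches.
/// Assumes every `x` is boolean typed; no schema check is done in this sketch.
fn optimize_expr(e: &Expr) -> Expr {
    match e {
        Expr::Eq(l, r) => match (optimize_expr(l), optimize_expr(r)) {
            (x, Expr::Literal(true)) | (Expr::Literal(true), x) => x,
            (l, r) => Expr::Eq(Box::new(l), Box::new(r)),
        },
        Expr::Case { when_then_expr, else_expr } => Expr::Case {
            when_then_expr: when_then_expr
                .iter()
                .map(|(when, then)| {
                    (Box::new(optimize_expr(when)), Box::new(optimize_expr(then)))
                })
                .collect(),
            else_expr: else_expr.as_ref().map(|e| Box::new(optimize_expr(e))),
        },
        other => other.clone(),
    }
}
```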

I have been going back and forth on this one. So basically what I am trying to avoid is

CASE WHEN true THEN (col1 = true) ELSE 1 END;

Being optimized into:

CASE WHEN true THEN col1 ELSE 1 END;

This is not a valid optimization when the col1 column is not typed as boolean. Although we currently don't support comparison between boolean and non-boolean types in DataFusion, I am expecting us to support it in the future. Am I overthinking this?

OK, I think what I can do is check the column type against the plan schema and skip the optimization for the col = true case when the col type is not boolean. col = false is always safe to optimize into !col. What do you think?

That is a good point @houqp -- I think it applies to all expressions (not just CASE statements). You are in essence worried about changing the type of an expression from boolean to something else which is a good thing to worry about!

For columns that have types other than boolean, I would expect an expression like col1 = true to be eventually rewritten to CAST(col1, boolean) = true in which case the optimization to CAST(col1, boolean) is correct.

Upon reading this code a bit more, I can't recall exactly when cast's (coercions) are inserted.

I think skipping the boolean rewrite when col is not a boolean is the right thing to do -- and if that type of comparison can happen we need to update the rewrite rules for BinaryExpr, NotExpr, etc
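The schema check being discussed can be illustrated with a small sketch. Everything here is hypothetical: `DataType`, `Schema` (a plain map from column name to type), `is_boolean`, and `fold_eq_true` are toys for illustration, and the sketch assumes literals, `NOT`, and `=` always produce booleans.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Boolean,
    Int64,
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(bool),
    Not(Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
}

/// Toy schema: column name to type.
type Schema = HashMap<String, DataType>;

/// Assumption of this sketch: only columns can be non-boolean.
fn is_boolean(e: &Expr, schema: &Schema) -> bool {
    match e {
        Expr::Column(name) => schema.get(name) == Some(&DataType::Boolean),
        Expr::Literal(_) | Expr::Not(_) | Expr::Eq(_, _) => true,
    }
}

/// Fold `x = true` into `x` only when `x` is known to be boolean;
/// otherwise return the expression unchanged so its type is preserved.
fn fold_eq_true(e: &Expr, schema: &Schema) -> Expr {
    match e {
        Expr::Eq(l, r) => match (l.as_ref(), r.as_ref()) {
            (x, Expr::Literal(true)) | (Expr::Literal(true), x) if is_boolean(x, schema) => {
                x.clone()
            }
            _ => e.clone(),
        },
        _ => e.clone(),
    }
}
```

With a schema of `flag: Boolean, n: Int64`, `flag = true` folds to `flag` while `n = true` is left alone, so the result type of the expression never changes from boolean to something else.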

OK, I have updated the code to apply the folding optimization to all plan nodes and expressions wherever applicable. We currently don't do the CAST rewrite, so that needs to be handled in a separate ticket/PR to get full support for applying binary operators to expressions with different types. This PR now assumes automatic casting will be added in the future and skips the optimization for expressions that are not of boolean type.

alamb · 2021-01-26T11:23:09Z

rust/datafusion/src/optimizer/boolean_comparison.rs

I think adding a test for rewriting expressions in a non-filter plan would be valuable (e.g. make a join plan or something)

rust/datafusion/src/optimizer/boolean_comparison.rs

alamb · 2021-01-26T11:33:59Z

rust/datafusion/src/optimizer/boolean_comparison.rs

@Dandandan -- I think this PR as written is quite efficient and doesn't need a convergence loop as you suggest (which I think ends up potentially being quite inefficient if many rewrites are required) -- it already does a depth first traversal of the tree, simplifying on the way up.

I think convergence loops might be best used if we have several rules that can each potentially make changes that would unlock additional optimizations of the others

For example, if you had two different optimization functions like optimization_A and optimization_B but parts of optimization_A wouldn't be applied unless you ran optimization_B. In that case a loop like the following would let you take full advantage of that

let mut changed = true; while changed { let exprA = optimization_A(expr); let exprB = optimization_B(exprA); changed = expr != exprB; expr = exprB }

alamb · 2021-01-26T11:35:38Z

rust/datafusion/src/optimizer/boolean_comparison.rs

As you can probably guess given PRs like #9278 my preference to avoid repeating the structure walking logic is via a Visitor. Perhaps after this PR is merged, I can take a shot at rewriting it using a general ExprRewriter type pattern.

rust/datafusion/src/optimizer/constant_folding.rs

houqp · 2021-02-13T23:47:05Z

@Dandandan @alamb ready for another round of review. Given this PR has grown to 900 lines, I think it would be better to do the new ExprRewriter refactor in a separate PR to ease the review process. We will be able to get a better idea of how the new framework works by applying it to multiple optimization rules in the same patch set.

alamb · 2021-02-14T12:09:53Z

Thanks @houqp -- I ran out of time today to review this PR, but I plan to review it tomorrow. I agree that an ExprRewriter refactor would be better done in a separate PR.

alamb

@houqp I started reviewing this PR but for some reason it seems to include many more changes than just your boolean literal rewrites. Is there any chance you can rebase it against current apache/master?

dev/archery/archery/benchmark/compare.py

rust/arrow/src/array/array.rs

houqp · 2021-02-15T21:02:04Z

@alamb rebased PR on latest master.

rust/datafusion/src/optimizer/constant_folding.rs

alamb

Thanks @houqp -- I just went through this pretty carefully and I think it looks like a great foundation to begin with. I had some possible code cleanup suggestions, but I don't see any reason not to merge this in and clean it up as we improve the DataFusion optimizer framework more

Really nice work on the tests as well. 👍

rust/datafusion/src/optimizer/constant_folding.rs

houqp · 2021-02-15T21:25:12Z

let me push up another commit to incorporate the clean up suggestions

houqp · 2021-02-15T22:10:28Z

all feedback addressed, thanks for the suggestion @alamb , tests are a lot easier to read and maintain now compared to what I started with :)

alamb

Looking even better than before @houqp . Thanks so much! Unless anyone else has comments I will plan to merge this tomorrow.

cc @andygrove and @Dandandan (this is the start of such a cool feature)

alamb · 2021-02-16T20:08:09Z

rust/datafusion/src/optimizer/constant_folding.rs

+            when_then_expr,
+            else_expr,
+        } => {
+            // recurse into CASE WHEN condition expressions


alamb · 2021-02-16T20:09:29Z

rust/datafusion/src/optimizer/constant_folding.rs

+        let schema = expr_test_schema();
+
+        // x = null is always null
+        assert_eq!(


These tests are so much nicer to read ❤️
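The "x = null is always null" case from the test excerpt above follows from SQL three-valued logic, and it is why a null boolean literal cannot simply be treated like false. A minimal sketch, using a hypothetical `Expr` whose boolean literal is an `Option<bool>` (with `None` standing for NULL) and assuming `x` is boolean typed:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    /// `None` models a NULL boolean literal.
    Literal(Option<bool>),
    Not(Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
}

/// Fold an equality whose operands have already been simplified.
fn fold_eq(l: Expr, r: Expr) -> Expr {
    match (l, r) {
        // Comparing anything with NULL yields NULL, never true or false.
        (_, Expr::Literal(None)) | (Expr::Literal(None), _) => Expr::Literal(None),
        // `x = true` is just `x`.
        (x, Expr::Literal(Some(true))) | (Expr::Literal(Some(true)), x) => x,
        // `x = false` is `NOT x`.
        (x, Expr::Literal(Some(false))) | (Expr::Literal(Some(false)), x) => {
            Expr::Not(Box::new(x))
        }
        // Nothing to fold.
        (l, r) => Expr::Eq(Box::new(l), Box::new(r)),
    }
}
```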

Dandandan

LGTM. Thanks @houqp !!

alamb · 2021-02-17T16:42:15Z

Thanks again @houqp -- really nice job

github-actions bot added Component: Rust - DataFusion Component: Rust labels Jan 24, 2021

houqp force-pushed the qp_boolean branch from 67aba8d to dfe841d Compare January 24, 2021 23:52

houqp force-pushed the qp_boolean branch from dfe841d to 5656232 Compare January 25, 2021 00:18

houqp marked this pull request as draft January 25, 2021 00:38

houqp force-pushed the qp_boolean branch from 5656232 to 26559ed Compare January 25, 2021 01:03

houqp marked this pull request as ready for review January 25, 2021 01:09

houqp force-pushed the qp_boolean branch from 26559ed to 474e14a Compare January 25, 2021 01:31

kszucs force-pushed the master branch from d238279 to 437c8c9 Compare January 25, 2021 21:53

houqp force-pushed the qp_boolean branch from 474e14a to 325fa4d Compare January 26, 2021 04:31

Dandandan reviewed Jan 26, 2021

alamb approved these changes Jan 26, 2021

houqp marked this pull request as draft January 29, 2021 06:04

houqp force-pushed the qp_boolean branch from 274e741 to 6c81fe7 Compare February 7, 2021 08:18

Dandandan reviewed Feb 7, 2021

rust/datafusion/src/optimizer/constant_folding.rs (outdated)

Dandandan reviewed Feb 7, 2021

rust/datafusion/src/optimizer/constant_folding.rs (outdated)

Dandandan reviewed Feb 7, 2021

rust/datafusion/src/optimizer/constant_folding.rs (outdated)

houqp force-pushed the qp_boolean branch 2 times, most recently from a781829 to 4318278 Compare February 13, 2021 22:54

houqp marked this pull request as ready for review February 13, 2021 23:33

jorgecarleitao force-pushed the master branch 2 times, most recently from d4608a9 to 356c300 Compare February 14, 2021 12:09

alamb reviewed Feb 15, 2021

dev/archery/archery/benchmark/compare.py (outdated)

rust/arrow/src/array/array.rs (outdated)

houqp added 2 commits February 15, 2021 13:00

ARROW-11366: [Datafusion] support boolean literal in comparison expre…

2b5aa24

…ssion

add expression tess

dded02c

houqp added 6 commits February 15, 2021 13:00

rename to constant folding

53fe8b0

ignore nonboolean expressions

4e4c514

optimize !!expr to expr

91cdc0c

handle null comparision

f49bd12

optimize then and else_expr branches for case expression

00baa00

recursive into all logical plan nodes and expression types

1e57370

houqp force-pushed the qp_boolean branch from 4318278 to 1e57370 Compare February 15, 2021 21:01

alamb reviewed Feb 15, 2021

rust/datafusion/src/optimizer/constant_folding.rs (outdated)

address review feedback

c68794a

alamb approved these changes Feb 15, 2021

alamb changed the title ~~ARROW-11366: [Datafusion] support boolean literal in comparison expression~~ ARROW-11366: [Datafusion] Implement constant folding for boolean literal expressions Feb 15, 2021

houqp added 2 commits February 15, 2021 13:55

simplify tests

8e9605a

optimize case expression when base expression is specified

874955b

alamb approved these changes Feb 16, 2021

Dandandan approved these changes Feb 16, 2021

alamb closed this in bca7d2f Feb 17, 2021

alamb mentioned this pull request Feb 22, 2021

ARROW-11710: [Rust][DataFusion] Implement ExpressionRewriter #9545

Closed

asfimport mentioned this pull request Feb 17, 2021

[Rust][DataFusion] Add Constant Folding / Support boolean literal in equality expression #27260

Closed
