ARROW-11366: [Datafusion] Implement constant folding for boolean literal expressions #9309
rust/rustfmt.toml
drive by config fix
Codecov Report
@@ Coverage Diff @@
## master #9309 +/- ##
==========================================
+ Coverage 82.27% 82.40% +0.12%
==========================================
Files 234 235 +1
Lines 54594 55094 +500
==========================================
+ Hits 44919 45398 +479
- Misses 9675 9696 +21
I think we could make this optimizer a bit more generic (not necessary in this PR) to split recursion / pattern match
This is the more "optimizer framework" I had in mind in the comment on the roadmap
@alamb @jorgecarleitao @vertexclique .
A common strategy (used by Spark) for rule / replacement based optimizers is to have a loop that just does something like this:
let mut changed = true;
while changed {
    (logical_plan, changed) = apply_optimizations(rules, logical_plan);
}
A rule could be something like this (returning Some on replaced output) and doesn't need the boilerplate of recursion for every rule.
// Optimizer can work both on expr and logical plans, default returns `None`
impl OptimizerRule for BooleanComparison {
    fn optimize_expr(&mut self, e: &Expr) -> Option<Expr> {
        // Recursion into children is left to the optimizer framework; the rule only inspects this node.
        // Note: as written, a NULL boolean literal is handled like `false`.
        match e {
            Expr::BinaryExpr { left, op, right } => match op {
                Operator::Eq => match (left.as_ref(), right.as_ref()) {
                    (Expr::Literal(ScalarValue::Boolean(b)), right) => match b {
                        Some(true) => Some(right.clone()),
                        Some(false) | None => Some(Expr::Not(Box::new(right.clone()))),
                    },
                    (left, Expr::Literal(ScalarValue::Boolean(b))) => match b {
                        Some(true) => Some(left.clone()),
                        Some(false) | None => Some(Expr::Not(Box::new(left.clone()))),
                    },
                    _ => None,
                },
                Operator::NotEq => match (left.as_ref(), right.as_ref()) {
                    (Expr::Literal(ScalarValue::Boolean(b)), right) => match b {
                        Some(false) | None => Some(right.clone()),
                        Some(true) => Some(Expr::Not(Box::new(right.clone()))),
                    },
                    (left, Expr::Literal(ScalarValue::Boolean(b))) => match b {
                        Some(false) | None => Some(left.clone()),
                        Some(true) => Some(Expr::Not(Box::new(left.clone()))),
                    },
                    _ => None,
                },
                _ => None,
            },
            _ => None,
        }
    }
}
@Dandandan -- I think this PR as written is quite efficient and doesn't need a convergence loop as you suggest (which I think ends up potentially being quite inefficient if many rewrites are required) -- it already does a depth first traversal of the tree, simplifying on the way up.
I think convergence loops might be best used if we have several rules that can each potentially make changes that would unlock additional optimizations of the others
For example, suppose you had two different optimization functions, optimization_A and optimization_B, but parts of optimization_A wouldn't be applied unless you ran optimization_B. In that case a loop like the following would let you take full advantage of that:
let mut changed = true;
while changed {
    let exprA = optimization_A(&expr);
    let exprB = optimization_B(&exprA);
    changed = expr != exprB;
    expr = exprB;
}
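As a rough illustration of the convergence loop described above, here is a minimal, self-contained sketch. Everything in it is hypothetical: the toy `Expr` enum is not DataFusion's `Expr`, and `remove_double_not` and `fold_bool_eq` merely stand in for optimization_A and optimization_B. It shows a case where rule A only fires after rule B has produced a `NOT NOT x` shape.

```rust
// Toy expression type for illustration only (not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Col(String),
    Not(Box<Expr>),
    // `expr = <boolean literal>`
    EqLit(Box<Expr>, bool),
}

// Hypothetical rule A: rewrites a top-level `NOT NOT x` into `x`.
fn remove_double_not(e: &Expr) -> Expr {
    match e {
        Expr::Not(inner) => match inner.as_ref() {
            Expr::Not(x) => x.as_ref().clone(),
            _ => e.clone(),
        },
        _ => e.clone(),
    }
}

// Hypothetical rule B: rewrites `x = true` into `x` and `x = false` into `NOT x`,
// descending through `NOT` nodes. Assumes `x` is boolean-typed.
fn fold_bool_eq(e: &Expr) -> Expr {
    match e {
        Expr::EqLit(x, true) => x.as_ref().clone(),
        Expr::EqLit(x, false) => Expr::Not(x.clone()),
        Expr::Not(x) => Expr::Not(Box::new(fold_bool_eq(x))),
        _ => e.clone(),
    }
}

// Runs A then B repeatedly until a full pass leaves the expression unchanged.
// Returns the final expression and the number of passes that were needed.
fn optimize_to_fixpoint(mut expr: Expr) -> (Expr, usize) {
    let mut passes = 0;
    loop {
        passes += 1;
        let next = fold_bool_eq(&remove_double_not(&expr));
        if next == expr {
            return (expr, passes);
        }
        expr = next;
    }
}
```

For `NOT (x = false)`, the first pass only lets rule B fire (yielding `NOT NOT x`), the second pass lets rule A fire (yielding `x`), and a third pass confirms nothing changed. The extra passes are the cost being weighed against a single bottom-up traversal.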
As you can probably guess given PRs like #9278 my preference to avoid repeating the structure walking logic is via a Visitor. Perhaps after this PR is merged, I can take a shot at rewriting it using a general ExprRewriter type pattern.
👍 Yes, agreed this ATM doesn't need a loop as there is no interaction yet with other rules. But I guess once you'll add them you will need it if you want to combine it with other rules. Using expr != exprB for changed might be a good way to start. The main thing I want to stress is that at some point we don't want the recursion itself in a rule, but in a more general "optimization framework".
The optimization loop itself can be written in such a way it does both top down and bottom up replacements, applying the same rule while the optimization "generates" a replacement for the node, so it doesn't need multiple traversals for cases like you mention. The concept is here in polars
https://github.com/ritchie46/polars/blob/master/polars/polars-lazy/src/logical_plan/optimizer/mod.rs#L69
And sounds like a perfect candidate to try for ExprRewriter 👍
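To make the "recursion lives in the framework, not in the rule" idea concrete, here is a small hedged sketch. The `Rule` trait, the toy `Expr` enum, and `rewrite_bottom_up` are all made up for illustration; they are not the DataFusion or polars APIs. All columns are assumed to be boolean-typed so the rewrites stay valid.

```rust
// Toy expression type for illustration only (not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Boolean(bool),
    Not(Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
}

// Hypothetical rule interface: inspect a single node, return `Some` on replacement.
trait Rule {
    fn rewrite(&self, expr: &Expr) -> Option<Expr>;
}

struct BooleanComparison;

impl Rule for BooleanComparison {
    fn rewrite(&self, expr: &Expr) -> Option<Expr> {
        match expr {
            // `x = true` becomes `x`, `x = false` becomes `NOT x` (either operand order).
            Expr::Eq(l, r) => match (l.as_ref(), r.as_ref()) {
                (Expr::Boolean(b), other) | (other, Expr::Boolean(b)) => Some(if *b {
                    other.clone()
                } else {
                    Expr::Not(Box::new(other.clone()))
                }),
                _ => None,
            },
            // `NOT <literal>` becomes the negated literal.
            Expr::Not(inner) => match inner.as_ref() {
                Expr::Boolean(b) => Some(Expr::Boolean(!*b)),
                _ => None,
            },
            _ => None,
        }
    }
}

// The "framework" part: rewrites children first, then keeps applying the rule
// to the current node for as long as it produces a replacement.
fn rewrite_bottom_up(expr: &Expr, rule: &dyn Rule) -> Expr {
    let mut node = match expr {
        Expr::Not(inner) => Expr::Not(Box::new(rewrite_bottom_up(inner, rule))),
        Expr::Eq(l, r) => Expr::Eq(
            Box::new(rewrite_bottom_up(l, rule)),
            Box::new(rewrite_bottom_up(r, rule)),
        ),
        leaf => leaf.clone(),
    };
    while let Some(replacement) = rule.rewrite(&node) {
        node = replacement;
    }
    node
}
```

Because the rule is re-applied at each node until it returns `None`, an expression such as `(x = true) = false` collapses to `NOT x` in one traversal, without the rule itself containing any tree-walking code.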
Yeah, totally agree with both of you. I was really tempted to introduce an optimization framework while writing all that tree traversal boilerplate code, but decided it's probably better done as a separate refactoring so we can get a better feel for how the new framework would affect all the existing rules. The way we are currently managing optimization rules is definitely too raw ;)
I will take a look at the Expression Visitor pattern as well, since that's something that already exists in the code base.
alamb
left a comment
This is great @houqp -- thank you! I went over it fairly carefully, and I have some structural / naming suggestions, but I think the logic is sound and this could also be merged as is.
I think you could also apply optimize_expr to then and else_expr:
Current:
    .map(|(when, then)| (Box::new(optimize_expr(when)), then.clone()))
    .collect(),
    else_expr: else_expr.clone(),
Suggested:
    .map(|(when, then)| (Box::new(optimize_expr(when)), optimize_expr(then)))
    .collect(),
    else_expr: optimize_expr(else_expr),
I have been going back and forth on this one. So basically what I am trying to avoid is
CASE
  WHEN true THEN (col1 = true)
  ELSE 1
END;
being optimized into:
CASE
  WHEN true THEN col1
  ELSE 1
END;
This is not a valid optimization when the col1 column is not typed as boolean. Although we currently don't support comparison between boolean and non-boolean types in DataFusion, I am expecting us to support it in the future. Am I overthinking this?
OK, I think what I can do is check the column type based on the plan schema and skip the optimization for the col = true case when the col type is not boolean. col = false is always safe to optimize into !col. What do you think?
That is a good point @houqp -- I think it applies to all expressions (not just CASE statements). You are in essence worried about changing the type of an expression from boolean to something else which is a good thing to worry about!
For columns that have types other than boolean, I would expect an expression like col1 = true to be eventually rewritten to CAST(col1, boolean) = true in which case the optimization to CAST(col1, boolean) is correct.
Upon reading this code a bit more, I can't recall exactly when casts (coercions) are inserted.
I think skipping the boolean rewrite when col is not a boolean is the right thing to do -- and if that type of comparison can happen we need to update the rewrite rules for BinaryExpr, NotExpr, etc.
OK, I have updated the code to apply the folding optimization to all plan nodes and expressions wherever applicable. We don't do the CAST rewrite at the moment, so that needs to be handled in a separate ticket/PR to get full support for applying binary operators on expressions with different types. This PR now assumes automatic casting will be added in the future and skips the optimization for expressions that are not of boolean type.
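A hedged sketch of what "skip the optimization for expressions that are not of boolean type" could look like is below. It is not the code from this PR: the `Expr` and `DataType` enums, the `Schema` alias, and the `is_boolean` and `fold_eq` helpers are all made-up stand-ins, and a plain `HashMap` plays the role of the plan schema.

```rust
use std::collections::HashMap;

// Toy types for illustration only (not DataFusion's `DataType`, `Expr`, or schema).
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Boolean,
    Int64,
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Boolean(bool),
    Not(Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
}

type Schema = HashMap<String, DataType>;

// Hypothetical helper: does this expression evaluate to a boolean under `schema`?
fn is_boolean(expr: &Expr, schema: &Schema) -> bool {
    match expr {
        Expr::Column(name) => schema.get(name) == Some(&DataType::Boolean),
        Expr::Boolean(_) | Expr::Not(_) | Expr::Eq(_, _) => true,
    }
}

// Folds `x = true` into `x` and `x = false` into `NOT x`, but only when `x`
// is boolean-typed; otherwise the comparison is left untouched.
fn fold_eq(expr: &Expr, schema: &Schema) -> Expr {
    match expr {
        Expr::Eq(l, r) => {
            // Simplify the operands first (depth-first, rewriting on the way up).
            let l = fold_eq(l, schema);
            let r = fold_eq(r, schema);
            let folded = match (&l, &r) {
                (Expr::Boolean(b), other) | (other, Expr::Boolean(b))
                    if is_boolean(other, schema) =>
                {
                    Some(if *b {
                        other.clone()
                    } else {
                        Expr::Not(Box::new(other.clone()))
                    })
                }
                _ => None,
            };
            folded.unwrap_or_else(|| Expr::Eq(Box::new(l), Box::new(r)))
        }
        Expr::Not(inner) => Expr::Not(Box::new(fold_eq(inner, schema))),
        other => other.clone(),
    }
}
```

With a schema containing a boolean column `flag` and an integer column `n`, `flag = true` folds to `flag`, while `n = true` is returned unchanged so that a later coercion pass (if one is added) can still see the comparison.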
I think adding a test for rewriting expressions in a non-filter plan would be valuable (e.g. make a join plan or something)
a781829 to
4318278
Compare
@Dandandan @alamb ready for another round of review. Given this PR has grown into 900 lines, I think it would be better to work on a new
d4608a9 to
356c300
Compare
Thanks @houqp -- I ran out of time today to review this PR, but I plan to review it tomorrow. I agree that a
alamb
left a comment
@houqp I started reviewing this PR but for some reason it seems to include many more changes than just your boolean literal rewrites. Is there any chance you can rebase it against current apache/master?
@alamb rebased PR on latest master.
alamb
left a comment
Thanks @houqp -- I just went through this pretty carefully and I think it looks like a great foundation to begin with. I had some possible code cleanup suggestions, but I don't see any reason not to merge this in and clean it up as we improve the DataFusion optimizer framework more
Really nice work on the tests as well. 👍
let me push up another commit to incorporate the clean up suggestions
all feedback addressed, thanks for the suggestion @alamb , tests are a lot easier to read and maintain now compared to what I started with :)
alamb
left a comment
Looking even better than before @houqp . Thanks so much! Unless anyone else has comments I will plan to merge this tomorrow.
cc @andygrove and @Dandandan (this is the start of such a cool feature)
    when_then_expr,
    else_expr,
} => {
    // recurse into CASE WHEN condition expressions
👍
let schema = expr_test_schema();

// x = null is always null
assert_eq!(
These tests are so much nicer to read ❤️
Dandandan
left a comment
LGTM. Thanks @houqp !!
Thanks again @houqp -- really nice job