Skip to content

Conversation

@houqp
Copy link
Member

@houqp houqp commented Jan 24, 2021

No description provided.

@github-actions
Copy link

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive by config fix

@houqp houqp marked this pull request as draft January 25, 2021 00:38
@codecov-io
Copy link

codecov-io commented Jan 25, 2021

Codecov Report

Merging #9309 (4318278) into master (88e9eb8) will increase coverage by 0.12%.
The diff coverage is 95.21%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9309      +/-   ##
==========================================
+ Coverage   82.27%   82.40%   +0.12%     
==========================================
  Files         234      235       +1     
  Lines       54594    55094     +500     
==========================================
+ Hits        44919    45398     +479     
- Misses       9675     9696      +21     
Impacted Files Coverage Δ
rust/datafusion/src/logical_plan/plan.rs 82.45% <82.14%> (-0.04%) ⬇️
rust/datafusion/src/optimizer/constant_folding.rs 95.86% <95.86%> (ø)
rust/datafusion/src/execution/context.rs 90.17% <100.00%> (+0.08%) ⬆️
rust/datafusion/src/sql/planner.rs 83.23% <100.00%> (+0.07%) ⬆️
rust/datafusion/src/optimizer/utils.rs 52.39% <0.00%> (+0.36%) ⬆️
rust/datafusion/src/logical_plan/expr.rs 80.66% <0.00%> (+0.47%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 88e9eb8...4318278. Read the comment docs.

Copy link
Contributor

@Dandandan Dandandan Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could make this optimizer a bit more generic (not necessary in this PR) to split recursion / pattern match

This is the more "optimizer framework" I had in mind in the comment on the roadmap
@alamb @jorgecarleitao @vertexclique .

A common strategy (used by Spark) for rule / replacement based to have a loop that just does something like this:

let changed = False;

while !changed {
    (logical_plan, changed)  = apply_optimizations(rules, logical_plan);
}

A rule could be something like (returning Some on replaced output) and doesn't need the boilerplate of recursion for every rule.

// Optimizer can work both on expr and logical plans, default returns `None`
impl OptimizerRule for BooleanComparison {
   fn optimize_expr(&mut self, plan: &Expr) -> Option<Expr> {
       match e {
             Expr::BinaryExpr { left, op, right } => {
            let left = optimize_expr(left);
            let right = optimize_expr(right);
            match op {
                Operator::Eq => match (&left, &right) {
                    (Expr::Literal(ScalarValue::Boolean(b)), _) => match b {
                        Some(true) => Some(right),
                        Some(false) | None => Some(Expr::Not(Box::new(right))),
                    },
                    (_, Expr::Literal(ScalarValue::Boolean(b))) => match b {
                        Some(true) => Some(left),
                        Some(false) | None => Some(Expr::Not(Box::new(left))),
                    },
                    _ => None,
                },
                Operator::NotEq => match (&left, &right) {
                    (Expr::Literal(ScalarValue::Boolean(b)), _) => match b {
                        Some(false) | None => Some(right),
                        Some(true) => Some(Expr::Not(Box::new(right))),
                    },
                    (_, Expr::Literal(ScalarValue::Boolean(b))) => match b {
                        Some(false) | None => Some(left),
                        Some(true) => Some(Expr::Not(Box::new(left))),
                    },
                    _ => None,
                },
                _ => None
            }
        }
       }
   }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dandandan -- I think this PR as written is quite efficient and doesn't need a convergence loop as you suggest (which I think ends up potentially being quite inefficient if many rewrites are required) -- it already does a depth first traversal of the tree, simplifying on the way up.

I think convergence loops might be best used if we have several rules that can each potentially make changes that would unlock additional optimizations of the others

For example, if you had two different optimization functions like optimization_A and optimization_B but parts of optimization_A wouldn't be applied unless you ran optimization_B. In that case a loop like the following would let you take full advantage of that

while !changed {
  let exprA = optimization_A(expr);
  let exprB = optimization_B(exprA);
  changed = expr != exprB;
  expr = exprB
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can probably guess given PRs like #9278 my preference to avoid repeating the structure walking logic is via a Visitor. Perhaps after this PR is merged, I can take a shot at rewriting it using a general ExprRewriter type pattern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Yes, agreed this ATM doesn't need a loop as there is no interaction yet with other rules. But I guess once you'll add them you will need it if you want to combine it with other rules. Using expr != exprB for changed might be a good way to start. The main thing I want to stress is that at some point we don't want the recursion itself in a rule, but in a more general "optimization framework".

The optimization loop itself can be written in such a way it does both top down and bottom up replacements, applying the same rule while the optimization "generates" a replacement for the node, so it doesn't need multiple traversals for cases like you mention. The concept is here in polars
https://github.com/ritchie46/polars/blob/master/polars/polars-lazy/src/logical_plan/optimizer/mod.rs#L69

And sounds like a perfect candidate to try for ExprRewriter 👍

Copy link
Member Author

@houqp houqp Jan 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, totally, agree with both of you, i was really tempting to introduce an optimization framework while writing those tree traversal boilerplate code, but decided it's probably better be done as a separate refactoring so we can get a better feeling of how the new framework can affect all the existing rules. the way we are currently managing optimization rules is definitely too raw ;)

I will take a look at the Expression Visitor pattern as well since that's something already exists in the code base.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great @houqp -- thank you! I went over it fairly carefully, and I have some structural / naming suggestions, but I think the logic is sound and this could also be merged as is.

Comment on lines 138 to 140
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could also apply optimize_expr to then and else_expr:

Suggested change
.map(|(when, then)| (Box::new(optimize_expr(when)), then.clone()))
.collect(),
else_expr: else_expr.clone(),
.map(|(when, then)| (Box::new(optimize_expr(when)), optimize_expr(then))
.collect(),
else_expr: optimize_expr(else_expr),

Copy link
Member Author

@houqp houqp Jan 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been going back and forth on this one. So basically what I am trying to avoid is

CASE
    WHEN true THEN (col1 = true)
    ELSE 1
END; 

Being optimized into:

CASE
    WHEN true THEN col1
    ELSE 1
END; 

This is not a valid optimization when col1 column is not typed as boolean. Although we currently don't support comparison between boolean type and none boolean type in datafusion, i am expecting us to support it in the future. Am I overthinking this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, i think what I can do is to check for column type based on plan schema and skip optimization for col = true case when col type is not boolean. col = false is always safe to be optimized into !col. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point @houqp -- I think it applies to all expressions (not just CASE statements). You are in essence worried about changing the type of an expression from boolean to something else which is a good thing to worry about!

For columns that have types other than boolean, I would expect an expression like col1 = true to be eventually rewritten to CAST(col1, boolean) = true in which case the optimization to CAST(col1, boolean) is correct.

Upon reading this code a bit more, I can't recall exactly when cast's (coercions) are inserted.

I think skipping the boolean rewrite when col is not a boolean is the right thing to do -- and if that type of comparison can happen we need to update the rewrite rules for BinaryExpr, NotExpr, etc

Copy link
Member Author

@houqp houqp Feb 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I have updated the code to apply the folding optimization to all plan nodes and expressions wherever applicable. We currently doesn't do the CAST rewrite at the moment, so that needs to be handled in a separate ticket/PR to get full support for applying binary operators on expressions with different types. This PR now assumes automatic casting will be added in the future and skips optimization for expressions that are not in boolean type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding a test for rewriting expressions in a non-filter plan would be valuable (e.g. make a join plan or something)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dandandan -- I think this PR as written is quite efficient and doesn't need a convergence loop as you suggest (which I think ends up potentially being quite inefficient if many rewrites are required) -- it already does a depth first traversal of the tree, simplifying on the way up.

I think convergence loops might be best used if we have several rules that can each potentially make changes that would unlock additional optimizations of the others

For example, if you had two different optimization functions like optimization_A and optimization_B but parts of optimization_A wouldn't be applied unless you ran optimization_B. In that case a loop like the following would let you take full advantage of that

while !changed {
  let exprA = optimization_A(expr);
  let exprB = optimization_B(exprA);
  changed = expr != exprB;
  expr = exprB
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can probably guess given PRs like #9278 my preference to avoid repeating the structure walking logic is via a Visitor. Perhaps after this PR is merged, I can take a shot at rewriting it using a general ExprRewriter type pattern.

@houqp houqp marked this pull request as draft January 29, 2021 06:04
@houqp houqp force-pushed the qp_boolean branch 2 times, most recently from a781829 to 4318278 Compare February 13, 2021 22:54
@houqp houqp marked this pull request as ready for review February 13, 2021 23:33
@houqp
Copy link
Member Author

houqp commented Feb 13, 2021

@Dandandan @alamb ready for another round of review. Given this PR has grown into 900 lines, I think it would be better to work on a new ExprRewriter refactor only PR to ease the review process. We will be able to get a better idea of how the new framework works by applying it to multiple optimization rules in the same patch set.

@jorgecarleitao jorgecarleitao force-pushed the master branch 2 times, most recently from d4608a9 to 356c300 Compare February 14, 2021 12:09
@alamb
Copy link
Contributor

alamb commented Feb 14, 2021

Thanks @houqp -- I ran out of time today to review this PR, but I plan to review it tomorrow. I agree that a ExprRewriter refactor would be better done in a separate PR

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@houqp I started reviewing this PR but for some reason it seems to include many more changes than just your boolean literal rewrites. Is there any chance you can rebase it against current apache/master?

@houqp
Copy link
Member Author

houqp commented Feb 15, 2021

@alamb rebased PR on latest master.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @houqp -- I just went through this pretty carefully and I think it looks like a great foundation to begin with. I had some possible code cleanup suggestions, but I don't see any reason not to merge this in and clean it up as we improve the DataFusion optimizer framework more

Really nice work on the tests as well. 👍

@alamb alamb changed the title ARROW-11366: [Datafusion] support boolean literal in comparison expression ARROW-11366: [Datafusion] Implement constant folding for boolean literal expressions Feb 15, 2021
@houqp
Copy link
Member Author

houqp commented Feb 15, 2021

let me push up another commit to incorporate the clean up suggestions

@houqp
Copy link
Member Author

houqp commented Feb 15, 2021

all feedback addressed, thanks for the suggestion @alamb , tests are a lot easier to read and maintain now compared to what I started with :)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking even better than before @houqp . Thanks so much! Unless anyone else has comments I will plan to merge this tomorrow.

cc @andygrove and @Dandandan (this is the start of such a cool feature)

when_then_expr,
else_expr,
} => {
// recurse into CASE WHEN condition expressions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

let schema = expr_test_schema();

// x = null is always null
assert_eq!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are so much nicer to read ❤️

Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @houqp !!

@alamb alamb closed this in bca7d2f Feb 17, 2021
@alamb
Copy link
Contributor

alamb commented Feb 17, 2021

Thanks again @houqp -- really nice job

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants