feat: add guarantees to simplification #7467

wjones127 · 2023-09-03T21:56:39Z

Which issue does this PR close?

Closes #6171.

Rationale for this change

When scanning files, we sometimes have statistics about min and max values and null counts. We can translate those into statements about the possible values of each column, and then use those to simplify expressions and predicates.

What changes are included in this PR?

This PR:

Adds a new NullableInterval type, which is essentially a pair of Intervals: one representing the valid values and a boolean interval representing the validity.
Adds a new method ExprSimplifier::simplify_with_guarantees(), which is similar to ExprSimplifier::simplify() except that it accepts pairs of column expressions and NullableIntervals to allow for even more simplification. Right now, this handles IS (NOT) NULL, BETWEEN, inequalities, plain column references, and InList.
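To make the idea concrete, here is a minimal, self-contained sketch of how a min/max/null guarantee can decide a predicate such as `x > 5`. It is a toy model, not the DataFusion code: `Guarantee`, `Simplified`, and `simplify_gt` are hypothetical names, and the interval is restricted to closed `i64` bounds.

```rust
/// Toy stand-in for the PR's `NullableInterval`, restricted to `i64` and
/// closed bounds. This is NOT the DataFusion type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Guarantee {
    /// Every value is null.
    Null,
    /// Values may be null; non-null values lie in `[min, max]`.
    MaybeNull { min: i64, max: i64 },
    /// Values are never null and lie in `[min, max]`.
    NotNull { min: i64, max: i64 },
}

/// Result of trying to simplify `column > literal` (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Simplified {
    /// The predicate evaluates to this constant for every row.
    Const(bool),
    /// The predicate evaluates to NULL for every row.
    Null,
    /// Nothing could be proven; keep the original predicate.
    Unchanged,
}

/// Simplify `column > literal` given a guarantee about `column`.
pub fn simplify_gt(guarantee: Guarantee, literal: i64) -> Simplified {
    match guarantee {
        // Comparing NULL with anything yields NULL.
        Guarantee::Null => Simplified::Null,
        // The smallest possible value already exceeds the literal.
        Guarantee::NotNull { min, .. } if min > literal => Simplified::Const(true),
        // Even the largest possible value does not exceed the literal.
        Guarantee::NotNull { max, .. } if max <= literal => Simplified::Const(false),
        // Overlapping ranges, or a possibly-null column (whose result would
        // be "TRUE or NULL" / "FALSE or NULL"), are left alone in this sketch.
        _ => Simplified::Unchanged,
    }
}
```

For example, with a guarantee of `NotNull { min: 10, max: 20 }` the predicate `x > 5` folds to a constant `true`, while `NotNull { min: 1, max: 10 }` leaves it unchanged because the range straddles the literal.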

Are these changes tested?

Most of the new tests reside in guarantees.rs.

Are there any user-facing changes?

This does not change existing APIs, only adds new ones.

wjones127 · 2023-09-11T00:17:37Z

datafusion/physical-expr/src/intervals/interval_aritmetic.rs

+pub struct NullableInterval {
+    /// The interval for the values
+    pub values: Interval,
+    /// A boolean interval representing whether the value is certainly valid
+    /// (not null), certainly null, or has an unknown validity. This takes
+    /// precedence over the values in the interval: if this field is equal to
+    /// [Interval::CERTAINLY_FALSE], then the interval is certainly null
+    /// regardless of what `values` is.
+    pub is_valid: Interval,
+}


Hi @metesynnada. Your idea of using a pair of intervals has worked quite well. Since it's a new struct, this shouldn't impact the performance of your existing cp_solver code.

I ended up not moving the Interval struct into datafusion-common. First, datafusion-optimizer already depends on datafusion-physical-expr, so I didn't need to move it. Second, it has some dependencies within this crate that make it hard to move. I think if we wanted to, we might be able to move it to datafusion-expr, but I'd rather leave that to a different PR.

cc @ozankabak

alamb

This is really neat @wjones127 -- thank you 🦾

I went through this code and the tests thoroughly and I think it is really nice. While I have some suggestions on code structure that I think can simplify it substantially, I also think this PR could be merged as is. This PR's code is both well commented and well tested.

Also, I think this could potentially be used to finally unify the pruning predicate code with the other interval analyses as described here: #5535. Basically, it would simplify the expression given the statistics for parquet row groups, and if the expression evaluated to a constant we could filter out the row group (as well as potentially skip the filter entirely).

alamb · 2023-09-11T18:11:51Z

datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs

@@ -149,6 +153,76 @@ impl<S: SimplifyInfo> ExprSimplifier<S> {

        expr.rewrite(&mut expr_rewrite)
    }
+
+    /// Input guarantees and simplify the expression.


alamb · 2023-09-11T18:12:52Z

datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs

+    ///    .with_schema(schema);
+    /// let simplifier = ExprSimplifier::new(context);
+    ///
+    /// // Expression: (x >= 3) AND (y + 2 < 10) AND (z > 5)


this is really cool @wjones127

alamb · 2023-09-11T18:14:56Z

datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs

+    /// // z > 5.
+    /// assert_eq!(output, expr_z);
+    /// ```
+    pub fn simplify_with_guarantees<'a>(


Another potential API for this might be to store the guarantees on the simplifier -- like

let expr = ExprSimplifier::new(context)
    .with_guarantees(guarantees)
    .simplify()?

The downside is that the guarantees would have to be owned (aka a Vec)

So I think this API is fine, I just wanted to mention the possibility

The downside is that the guarantees would have to be owned (aka a Vec)

That doesn't seem too bad, I think. My imagined use case is that we re-use the same simplifier with different guarantees but the same predicate. Something like:

let mut simplifier = ExprSimplifier::new(context);
for row_group in file {
    let guarantees = get_guarantees(row_group.statistics);
    simplifier = simplifier.with_guarantees(guarantees);
    let group_predicate = simplifier.simplify(predicate);
    // Do something with the predicate
}

So my main concern is that it's performant if handled in a loop like that. I think it should be.

Switched to this API.
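The reuse pattern discussed in this thread can be sketched as follows. This is a toy model under stated assumptions, not DataFusion's `ExprSimplifier`: `ToySimplifier`, `RowGroupStats`, `simplify_ge`, and `row_groups_to_scan` are hypothetical, and the predicate is fixed to `column >= threshold` on one integer column. It only illustrates that a by-value `with_guarantees` lets one simplifier be rebound to fresh, owned guarantees on each loop iteration.

```rust
/// Hypothetical per-row-group statistics for one integer column.
#[derive(Debug, Clone, Copy)]
pub struct RowGroupStats {
    pub min: i64,
    pub max: i64,
}

/// Toy simplifier that owns its guarantees as `(column, (min, max))` pairs.
#[derive(Debug, Default)]
pub struct ToySimplifier {
    guarantees: Vec<(String, (i64, i64))>,
}

impl ToySimplifier {
    pub fn new() -> Self {
        Self::default()
    }

    /// Builder-style setter: consumes the simplifier, replaces the owned
    /// guarantees, and hands the simplifier back.
    pub fn with_guarantees(mut self, guarantees: Vec<(String, (i64, i64))>) -> Self {
        self.guarantees = guarantees;
        self
    }

    /// Returns `Some(constant)` if `column >= threshold` is decided by the
    /// guarantees, or `None` if the predicate must be kept.
    pub fn simplify_ge(&self, column: &str, threshold: i64) -> Option<bool> {
        let (min, max) = self
            .guarantees
            .iter()
            .find(|(name, _)| name.as_str() == column)
            .map(|(_, range)| *range)?;
        if min >= threshold {
            Some(true)
        } else if max < threshold {
            Some(false)
        } else {
            None
        }
    }
}

/// Indices of row groups that still need scanning for `x >= threshold`.
pub fn row_groups_to_scan(stats: &[RowGroupStats], threshold: i64) -> Vec<usize> {
    let mut simplifier = ToySimplifier::new();
    let mut keep = Vec::new();
    for (index, group) in stats.iter().enumerate() {
        // Rebind the same simplifier with this row group's guarantees.
        simplifier =
            simplifier.with_guarantees(vec![("x".to_string(), (group.min, group.max))]);
        // Only a predicate proven false lets us skip the row group.
        if simplifier.simplify_ge("x", threshold) != Some(false) {
            keep.push(index);
        }
    }
    keep
}
```

In this sketch the per-iteration cost is one small `Vec` allocation for the guarantees, which matches the expectation in the thread that the loop should stay cheap.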

alamb · 2023-09-11T18:58:28Z

datafusion/physical-expr/src/intervals/interval_aritmetic.rs

+    /// precedence over the values in the interval: if this field is equal to
+    /// [Interval::CERTAINLY_FALSE], then the interval is certainly null
+    /// regardless of what `values` is.
+    pub is_valid: Interval,


I got confused about this enumeration for a while, as it seems to use a boolean interval to represent one of three states. It took me a while to grok that CERTAINLY_TRUE means that the interval is not null, though the concept is very well documented ❤️

I think this representation also might be more error prone as an invalid state can be encoded (for example, what happens if is_valid contains ScalarValue::Float32, or what if the code uses values when is_valid is CERTAINLY_FALSE)

Perhaps we can use an enum and let the Rust compiler and type system ensure we always have valid states for NullableInterval, with something like:

pub enum NullableInterval {
    /// The value is always null in this interval
    Null,
    /// The value may or may not be null in this interval. If it is non null its value is within
    /// the specified values interval
    MaybeNull { values: Interval },
    /// The value is definitely not null in this interval and is within values
    NotNull { values: Interval },
}

I think that might make the intent of some of the code above clearer as well, rather than checking for nullness using tests for CERTAINLY_FALSE.

Now that you say it, that enum seems like the obvious right choice. I'll try that out and see how it simplifies things.

I made this change, and it's generally clearer. I did make the Null variant store a datatype, otherwise I found we could get errors in the ConstEvaluator.

Looks good to me!
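One way to picture why the Null variant benefits from carrying a data type: when a guarantee pins a column to a single value, the rewriter can substitute a literal, and a null literal still needs a type. The sketch below is a toy illustration only. `ToyNullableInterval`, `DataType`, `Scalar`, and `single_value` here are hypothetical stand-ins limited to `i64` ranges, not the types in the PR.

```rust
/// Hypothetical data types for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    Int64,
    Boolean,
}

/// Hypothetical typed scalar; `None` inside a variant is a typed NULL.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Scalar {
    Int64(Option<i64>),
    Boolean(Option<bool>),
}

/// Toy three-state guarantee. The non-null variants are restricted to
/// closed `i64` ranges to keep the example short.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ToyNullableInterval {
    /// Always null; the type is kept so a typed NULL literal can be built.
    Null { datatype: DataType },
    /// Possibly null; non-null values lie in `[min, max]`.
    MaybeNull { min: i64, max: i64 },
    /// Never null; values lie in `[min, max]`.
    NotNull { min: i64, max: i64 },
}

impl ToyNullableInterval {
    /// If the guarantee pins the column to exactly one value (including
    /// "always NULL"), return that value as a typed literal.
    pub fn single_value(&self) -> Option<Scalar> {
        match *self {
            ToyNullableInterval::Null { datatype: DataType::Int64 } => {
                Some(Scalar::Int64(None))
            }
            ToyNullableInterval::Null { datatype: DataType::Boolean } => {
                Some(Scalar::Boolean(None))
            }
            ToyNullableInterval::NotNull { min, max } if min == max => {
                Some(Scalar::Int64(Some(min)))
            }
            // A range of values, or "maybe null", is not a single value.
            _ => None,
        }
    }
}
```

Without the `datatype` field, the two `Null` arms could not tell which typed NULL to produce, which is the kind of ambiguity a constant evaluator tends to reject.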


alamb · 2023-09-11T19:04:15Z

datafusion/optimizer/src/simplify_expressions/guarantees.rs

+        match &expr {
+            Expr::IsNull(inner) => {
+                if let Some(interval) = self.intervals.get(inner.as_ref()) {
+                    if interval.is_valid == Interval::CERTAINLY_FALSE {


Stylistically, I think I would find this easier to read if it were in a method like

if interval.always_null() {
    ...
} else if interval.always_not_null() {
    ..
} else {
    ...
}

If you use the enum proposal below for NullableInterval, this would naturally become a match statement like

match self.intervals.get(inner.as_ref()) {
    Some(NullableInterval::Null) => Ok(lit(true)),
    Some(NullableInterval::NotNull { .. }) => Ok(lit(false)),
    _ => Ok(expr),
}

Which I think may express the intent of the code more concisely and clearly.
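A self-contained version of that match, covering both IS NULL and IS NOT NULL, might look like the sketch below. The `Expr` and `Nullness` types and the `rewrite_null_check` function are simplified stand-ins invented for this example, not DataFusion's types, and only direct column references are looked up.

```rust
use std::collections::HashMap;

/// Minimal expression tree for this sketch (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Literal(bool),
    IsNull(Box<Expr>),
    IsNotNull(Box<Expr>),
}

/// What is known about a column's nullness (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Nullness {
    Null,
    MaybeNull,
    NotNull,
}

/// Look up the guarantee for a direct column reference, if there is one.
fn lookup(inner: &Expr, guarantees: &HashMap<String, Nullness>) -> Option<Nullness> {
    match inner {
        Expr::Column(name) => guarantees.get(name).copied(),
        _ => None,
    }
}

/// Rewrite `IS NULL` / `IS NOT NULL` to a literal when the guarantee
/// decides it; otherwise return the expression unchanged.
pub fn rewrite_null_check(expr: Expr, guarantees: &HashMap<String, Nullness>) -> Expr {
    // Pair the known nullness with a flag telling whether the check is negated.
    let known = match &expr {
        Expr::IsNull(inner) => lookup(inner, guarantees).map(|n| (n, false)),
        Expr::IsNotNull(inner) => lookup(inner, guarantees).map(|n| (n, true)),
        _ => None,
    };
    match known {
        // Always null: IS NULL is true, IS NOT NULL is false.
        Some((Nullness::Null, negated)) => Expr::Literal(!negated),
        // Never null: IS NULL is false, IS NOT NULL is true.
        Some((Nullness::NotNull, negated)) => Expr::Literal(negated),
        // Unknown or possibly null: nothing can be proven.
        _ => expr,
    }
}
```

Because the three states are enum variants, the "possibly null" and "no guarantee" cases fall into the same catch-all arm and there is no invalid state to defend against.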

alamb · 2023-09-11T20:08:04Z

BTW I think the CI failure is due to #7523

alamb · 2023-09-11T20:11:14Z

I also filed #7526 for some improvements I thought of while reviewing this PR

wjones127 · 2023-09-12T06:06:43Z

Also, I think this could potentially be used to finally unify the pruning predicate code with the other interval analyses as described here

I'll be applying this in Lance first, but I will come back later and integrate this with Parquet scanning. We'll want the Parquet scanning piece in delta-rs.

metesynnada

PR looks great.

metesynnada · 2023-09-12T06:53:07Z

datafusion/physical-expr/src/intervals/interval_aritmetic.rs

+    /// });
+    ///
+    /// ```
+    pub fn apply_operator(&self, op: &Operator, rhs: &Self) -> Result<Self> {


metesynnada · 2023-09-12T06:57:16Z

datafusion/optimizer/src/simplify_expressions/guarantees.rs

+use datafusion_physical_expr::intervals::{Interval, IntervalBound, NullableInterval};
+
+/// Rewrite expressions to incorporate guarantees.
+pub(crate) struct GuaranteeRewriter<'a> {


Can you write a docstring for GuaranteeRewriter?

berkaysynnada · 2023-09-12T07:44:19Z

datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs

@@ -129,6 +139,7 @@ impl<S: SimplifyInfo> ExprSimplifier<S> {
        expr.rewrite(&mut const_evaluator)?
            .rewrite(&mut simplifier)?
            .rewrite(&mut or_in_list_simplifier)?
+            .rewrite(&mut guarantee_rewriter)?
            // run both passes twice to try an minimize simplifications that we missed
            .rewrite(&mut const_evaluator)?
            .rewrite(&mut simplifier)


What do you think of such a loop to cover every simplification case and make it easier to accommodate future simplifications, or would it be unnecessary?

loop {
    let original_expr = expr.clone();
    expr = expr
        .rewrite(&mut const_evaluator)?
        .rewrite(&mut simplifier)?
        .rewrite(&mut or_in_list_simplifier)?
        .rewrite(&mut guarantee_rewriter)?;
    if expr == original_expr {
        break;
    }
}

This is a neat idea. I think we should try it in a follow-on PR.

Also, if we did this I would suggest adding a limit on the number of loops (to avoid a "ping-pong" infinite loop where passes rewrite an expression back and forth).
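The loop-with-a-limit idea can be sketched on a toy expression type. Everything here is hypothetical and unrelated to DataFusion's rewriter traits: `Expr`, the two passes, and `simplify_to_fixed_point` exist only to show the control flow, namely running all passes until nothing changes or until an iteration cap is hit.

```rust
/// Minimal expression tree for this sketch (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Lit(bool),
    Col(String),
    Not(Box<Expr>),
}

/// Pass 1: remove one outermost double negation, `NOT NOT e` -> `e`.
pub fn strip_double_not(expr: Expr) -> Expr {
    match expr {
        Expr::Not(inner) => match *inner {
            Expr::Not(e) => *e,
            other => Expr::Not(Box::new(other)),
        },
        other => other,
    }
}

/// Pass 2: fold `NOT <literal>` into a literal.
pub fn fold_not_literal(expr: Expr) -> Expr {
    match expr {
        Expr::Not(inner) => match *inner {
            Expr::Lit(b) => Expr::Lit(!b),
            other => Expr::Not(Box::new(other)),
        },
        other => other,
    }
}

/// Run all passes repeatedly until the expression stops changing or
/// `max_iterations` rounds have run. The returned flag is `true` only if
/// a full round left the expression unchanged (a fixed point was confirmed).
pub fn simplify_to_fixed_point(
    mut expr: Expr,
    passes: &[fn(Expr) -> Expr],
    max_iterations: usize,
) -> (Expr, bool) {
    for _ in 0..max_iterations {
        let original = expr.clone();
        for pass in passes {
            expr = pass(expr);
        }
        if expr == original {
            return (expr, true);
        }
    }
    // The cap was reached; passes may be rewriting back and forth.
    (expr, false)
}
```

With five nested `NOT`s around `true`, the two passes need two rounds to reach `false` and a third round to confirm that nothing changes. A pass that flips a literal on every call never converges, and the cap is what ends the loop in that case.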

berkaysynnada · 2023-09-12T08:23:22Z

datafusion/optimizer/src/simplify_expressions/guarantees.rs

+
+            Expr::BinaryExpr(BinaryExpr { left, op, right }) => {
+                // Check if this is a comparison
+                match op {


You may prefer this function:

if !op.is_comparison_operator() {
    return Ok(expr);
}

Thanks! Much better now :)

alamb

❤️

alamb · 2023-09-12T18:52:45Z

datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs

+    /// assert_eq!(output, expr_z);
+    /// ```
+    pub fn with_guarantees(mut self, guarantees: Vec<(Expr, NullableInterval)>) -> Self {
+        self.guarantees = guarantees;


alamb · 2023-09-12T20:11:36Z

datafusion/optimizer/src/simplify_expressions/guarantees.rs

+
+                    let contains = expr_interval.contains(*interval)?;
+
+                    if contains.is_certainly_true() {


I really like how easy this is to read now


alamb · 2023-09-13T16:42:31Z

I took the liberty of merging up from main and fixing the clippy and doc errors on this branch

wjones127 · 2023-09-13T18:09:46Z

Thanks @alamb!

feat: add guarantees to simplifcation

44d1b48

wjones127 changed the title ~~feat: add guarantees to simplifcation~~ feat: add guarantees to simplification Sep 3, 2023

github-actions bot added physical-expr Physical Expressions optimizer Optimizer rules labels Sep 3, 2023

wjones127 added 4 commits September 3, 2023 21:50

null and comparison support

4c1c3a9

add support for literal expressions

2134f2f

implement inlist guarantee use

caa738f

test the outer function

ff7ed70

github-actions bot removed the physical-expr Physical Expressions label Sep 4, 2023

docs

a6b57e3

wjones127 force-pushed the 6171-simplify-with-guarantee branch from 45ae442 to a6b57e3 Compare September 5, 2023 03:10

wjones127 added 2 commits September 10, 2023 16:40

refactor to use intervals

a78f837

add high-level test

011f176

github-actions bot added the physical-expr Physical Expressions label Sep 10, 2023

cleanup

4bd9b60

wjones127 added the enhancement New feature or request label Sep 11, 2023

fix test to be false or null, not true

16d78c6

wjones127 force-pushed the 6171-simplify-with-guarantee branch from a9aeecf to 16d78c6 Compare September 11, 2023 02:41

wjones127 marked this pull request as ready for review September 11, 2023 03:27

wjones127 requested a review from alamb September 11, 2023 03:27

alamb reviewed Sep 11, 2023

alamb approved these changes Sep 11, 2023

alamb mentioned this pull request Sep 11, 2023

Minor: Add comments and clearer constructors to Interval #7526

Merged

wjones127 added 2 commits September 11, 2023 22:38

refactor: change NullableInterval to an enum

bffb137

refactor: use a builder-like API

e4427a3

metesynnada reviewed Sep 12, 2023

metesynnada approved these changes Sep 12, 2023

berkaysynnada reviewed Sep 12, 2023

alamb approved these changes Sep 12, 2023

wjones127 and others added 4 commits September 12, 2023 15:11

pr feedback

f4e8680

Merge remote-tracking branch 'apache/main' into 6171-simplify-with-gu…

a28d5eb

…arantee

Fix clippy

b50df80

fix doc links

2452957

wjones127 merged commit 8946f8b into main Sep 13, 2023
42 checks passed

wjones127 deleted the 6171-simplify-with-guarantee branch September 13, 2023 18:09

alamb mentioned this pull request Nov 20, 2023

Refactor Interval Arithmetic Updates #8276

Merged

alamb mentioned this pull request Dec 16, 2023

Add LiteralGuarantee on columns to extract conditions required for PhysicalExpr expressions to evaluate to true #8437

Merged

jayzhan211 mentioned this pull request May 11, 2024

Apply guarantee rewriter to sql workflow #10456

Open
