Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add guarantees to simplification #7467

Merged
merged 16 commits into from
Sep 13, 2023
Merged

Conversation

wjones127
Copy link
Member

@wjones127 wjones127 commented Sep 3, 2023

Which issue does this PR close?

Closes #6171.

Rationale for this change

When scanning files we sometimes have statistics about min and max values and null counts. We can translate those into statements about the possible values of each column, and then use those two simplify expressions and predicates.

What changes are included in this PR?

This PR:

  • Adds a new NullableInterval type, which is essentially a pair of Intervals, one representing the valid values and a boolean interval representing the validity.
  • Adds a new method ExprSimplifier::simplify_with_guarantees(), which is similar to ExprSimplifier::simplify() except it allows passing pairs of column expressions and NullableIntervals to allow for even more simplification. Right now, this handles, IS (NOT) NULL, BETWEEN, inequalities, plain column references, and InList.

Are these changes tested?

Most of the new tests reside in guarantees.rs.

Are there any user-facing changes?

This does not change existing APIs, only adds new ones.

@wjones127 wjones127 changed the title feat: add guarantees to simplifcation feat: add guarantees to simplification Sep 3, 2023
@github-actions github-actions bot added physical-expr Physical Expressions optimizer Optimizer rules labels Sep 3, 2023
@github-actions github-actions bot removed the physical-expr Physical Expressions label Sep 4, 2023
@github-actions github-actions bot added the physical-expr Physical Expressions label Sep 10, 2023
@wjones127 wjones127 added the enhancement New feature or request label Sep 11, 2023
Comment on lines 761 to 770
pub struct NullableInterval {
/// The interval for the values
pub values: Interval,
/// A boolean interval representing whether the value is certainly valid
/// (not null), certainly null, or has an unknown validity. This takes
/// precedence over the values in the interval: if this field is equal to
/// [Interval::CERTAINLY_FALSE], then the interval is certainly null
/// regardless of what `values` is.
pub is_valid: Interval,
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @metesynnada. You're idea of using a pair of intervals has worked quite well. Since it's a new struct, this shouldn't impact the performance of your existing cp_solver code.

I ended up not moving the Interval struct into datafusion-common. First, datafusion-optimizer already depends on datafusion-physical-epxr, so I didn't need to move it. Plus it has some dependencies within this crate that make it not easy to move. I think if we wanted to, we might be able to move it to datafusion-expr, but I'd rather leave that to a different PR.

cc @ozankabak

@wjones127 wjones127 marked this pull request as ready for review September 11, 2023 03:27
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really neat @wjones127 -- thank you 🦾

I went through this code and the tests thoroughly and I think it is really nice. While I have some suggestions on code structure that I think can simplify it substantially I also think this PR could be merged as is. This PR's code is both well commented, and well tested.

Also, I think this could potentially be used to finally unify the pruning predicate code with the other interval analyses as described here: #5535 (basically it would simplify the expression given the statistics for parquet row groups and if the expression evaluated to a constant we could filter out the row group (as well as potentially skip the filter entirely)

@@ -149,6 +153,76 @@ impl<S: SimplifyInfo> ExprSimplifier<S> {

expr.rewrite(&mut expr_rewrite)
}

/// Input guarantees and simplify the expression.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

/// .with_schema(schema);
/// let simplifier = ExprSimplifier::new(context);
///
/// // Expression: (x >= 3) AND (y + 2 < 10) AND (z > 5)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really cool @wjones127

/// // z > 5.
/// assert_eq!(output, expr_z);
/// ```
pub fn simplify_with_guarantees<'a>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another potential API for this might be to store the guarantees on the simplifier -- like

let expr = ExprSimplifier::new(context)
  .with_guarantees(guarantees)
  .simplify()?

The downside is that the guarantees would have to be owned (aka a Vec)

So I think this API is fine, I just wanted to mention the possibility

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside is that the guarantees would have to be owned (aka a Vec)

That doesn't seem to bad, I think. My imagined use case is that we re-use the same simplifier with different guarantees but the same predicate. Something like:

let mut simplifier = ExprSimplifier::new(context);
for row_group in file {
    let guarantees = get_guarantees(row_groups.statistics);
    simplifier = simplifier.with_guarantees(guarantees);
    let group_predicate = simplifier.simplify(predicate);
    // Do something with the predicate
}   

So my main concern is that it's performant if handled in a loop like that. I think it should be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to this API.

/// precedence over the values in the interval: if this field is equal to
/// [Interval::CERTAINLY_FALSE], then the interval is certainly null
/// regardless of what `values` is.
pub is_valid: Interval,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got confused about this enumeration for a while as it seems to use a boolean interval to represent one of three states. It took me a while to grok that CERTAINLY_TRUE mean that the the interval was not null, though the concept is very well documented ❤️

I think this representation also might be more error prone as an invalid state can be encoded (for example, what happens if is_valid contains ScalarValue::Float32, or what if the code uses values when is_valid is CERTAINLY_FALSE)

Perhaps we can use an enum and let the rust compiler and type system ensure we always have valid states for NullableInterval with something like :

pub enum NullableInterval {
  /// The value is always null in this interval
  Null,
  /// The value may or may not be null in this interval. If it is non null its value is within
  /// the specified values interval
  MaybeNull { values : Interval }, 
  /// The value is definitely not null in this interval and is within values
  NotNull { vaules: Interval },
}

I think that might make the intent of some of the code above clearer as well, rather than checking for null ness using tests for CERTAINLY_FALSE

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you say it, that enum seems like the obvious right choice. I'll try that out and see how it simplifies things.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change, and it's generally clearer. I did make the Null variant store a datatype, otherwise I found we could get errors in the ConstEvaluator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

match &expr {
Expr::IsNull(inner) => {
if let Some(interval) = self.intervals.get(inner.as_ref()) {
if interval.is_valid == Interval::CERTAINLY_FALSE {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stylistically I think I would find this easier to read if it was in a method like

if interval.always_null() {
...
} else if interval.alway_not_null() {
..
} else { 
...
}

If you use the enum proposal below for NullableInteval this would naturally become a match statement like

match self.intervals.get(inner.as_ref()) {
  Some(NullableInterval::Null) => Ok(lit(true)),
  Some(NullableInterval::NotNull{..}) => Ok(lit(false)),
  _ => Ok(expr)
}

Which I think may express the intent of the code more concisely and clearly

@alamb
Copy link
Contributor

alamb commented Sep 11, 2023

BTW I think the CI failure is due to #7523

@alamb
Copy link
Contributor

alamb commented Sep 11, 2023

I also filed #7526 for some improvements I thought of while reviewing this PR

@wjones127
Copy link
Member Author

Also, I think this could potentially be used to finally unify the pruning predicate code with the other interval analyses as described here

I'll be applying this in Lance first, but I will come back later and integrate this with Parquet scanning. We'll want the Parquet scanning piece in delta-rs.

Copy link
Contributor

@metesynnada metesynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR looks great.

/// });
///
/// ```
pub fn apply_operator(&self, op: &Operator, rhs: &Self) -> Result<Self> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

use datafusion_physical_expr::intervals::{Interval, IntervalBound, NullableInterval};

/// Rewrite expressions to incorporate guarantees.
pub(crate) struct GuaranteeRewriter<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you write a docstring for GuaranteeRewriter?

@@ -129,6 +139,7 @@ impl<S: SimplifyInfo> ExprSimplifier<S> {
expr.rewrite(&mut const_evaluator)?
.rewrite(&mut simplifier)?
.rewrite(&mut or_in_list_simplifier)?
.rewrite(&mut guarantee_rewriter)?
// run both passes twice to try an minimize simplifications that we missed
.rewrite(&mut const_evaluator)?
.rewrite(&mut simplifier)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of such a loop to cover every simplification case and make it easier to accommodate future simplifications, or would it be unnecessary?

        loop {
            let original_expr = expr.clone();
            expr = expr
                .rewrite(&mut const_evaluator)?
                .rewrite(&mut simplifier)?
                .rewrite(&mut or_in_list_simplifier)?
                .rewrite(&mut guarantee_rewriter)?;

            if expr == original_expr {
                break;
            }
        }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a neat idea. I think we should try it in a follow on PR.

ALso, If we did this I would also suggest adding a limit on the number of loops (to avoid a "ping-poing" infinite loop where passes rewrite an expression back and forth)


Expr::BinaryExpr(BinaryExpr { left, op, right }) => {
// Check if this is a comparison
match op {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may prefer this function:

        if !op.is_comparison_operator() {
                   return Ok(expr);
               }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Much better now :)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

/// assert_eq!(output, expr_z);
/// ```
pub fn with_guarantees(mut self, guarantees: Vec<(Expr, NullableInterval)>) -> Self {
self.guarantees = guarantees;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


let contains = expr_interval.contains(*interval)?;

if contains.is_certainly_true() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like how easy this is to read now

@alamb
Copy link
Contributor

alamb commented Sep 13, 2023

I took the liberty of merging up from main and fixing the clippy and doc errors on this branch

@wjones127
Copy link
Member Author

Thanks @alamb!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request optimizer Optimizer rules physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Expr SimplifyWithGuarantee
5 participants