
Implement Expr SimplifyWithGuarantee #6171

Closed
Tracked by #5923
wjones127 opened this issue Apr 30, 2023 · 8 comments · Fixed by #7467
Labels
enhancement New feature or request

Comments

@wjones127 (Member) commented Apr 30, 2023

Is your feature request related to a problem or challenge?

We are starting to look at more advanced methods for filter pushdown, and I was considering porting SimplifyWithGuarantee. The critical functionality we need is the ability to evaluate a predicate against some statistics and get back the residual expression. For example, given the predicate x = 1 AND y < 2:

  • file 1 with stats 0 <= x <= 20 and 0 < y <= 1 => residual filter x = 1 (y < 2 is always satisfied) => scan this file with x = 1 filter
  • file 2 with stats 3 < y <= 10 => residual filter false => don't scan this file since it will never satisfy the predicate

Describe the solution you'd like

I think a straightforward port of that function would be useful, but if there is a design that integrates better with existing functionality, I'm open to other designs.

/// Given a guarantee expression and a predicate expression, simplify the predicate expression.
///
/// # Example
///
/// This is useful, for example, when filtering data that has statistics. If the
/// statistics tell you `x > 2` (the guarantee), and you want to filter with
/// `x > 3 and y < 0`, then you can simplify the predicate to `y < 0`. Alternatively,
/// if the predicate is `x < 1 and y < 0`, then you know directly from the
/// statistics that the predicate will always be false, so your filter can
/// immediately return an empty result.
///
/// ```
/// use datafusion_expr::{lit, col, Expr};
///
/// let guarantee = col("x").gt(lit(2));
///
/// let predicate = col("x").gt(lit(3)).and(col("y").lt(lit(0)));
/// assert_eq!(predicate.simplify_with_guarantee(&guarantee), col("y").lt(lit(0)));
///
/// let predicate = col("x").lt(lit(1)).and(col("y").lt(lit(0)));
/// assert_eq!(predicate.simplify_with_guarantee(&guarantee), lit(false));
/// ```
pub fn simplify_with_guarantee(&self, guarantee: &Expr) -> Expr {
    todo!()
}
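
For concreteness, here is a sketch of how the file-statistics scenario above might play out with the proposed method. This is illustrative only: `simplify_with_guarantee` does not exist yet, and expressing per-file statistics as a conjunction of comparison expressions is just one possible way the guarantee could be supplied.

```rust
use datafusion_expr::{col, lit};

// Predicate from the example above: x = 1 AND y < 2
let predicate = col("x").eq(lit(1)).and(col("y").lt(lit(2)));

// File 1 stats: 0 <= x <= 20 and 0 < y <= 1
let file1_guarantee = col("x").gt_eq(lit(0))
    .and(col("x").lt_eq(lit(20)))
    .and(col("y").gt(lit(0)))
    .and(col("y").lt_eq(lit(1)));
// Expected residual: x = 1 (the `y < 2` conjunct is always satisfied)
assert_eq!(
    predicate.simplify_with_guarantee(&file1_guarantee),
    col("x").eq(lit(1))
);

// File 2 stats: 3 < y <= 10
let file2_guarantee = col("y").gt(lit(3)).and(col("y").lt_eq(lit(10)));
// Expected residual: false, so the file can be skipped entirely
assert_eq!(
    predicate.simplify_with_guarantee(&file2_guarantee),
    lit(false)
);
```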

Describe alternatives you've considered

It seems like the current solutions with PruningPredicate don't give you the residual expression.

Additional context

This is related to #5830

@alamb (Contributor) commented May 1, 2023

I think it would be really nice to integrate this idea with the range analysis that we already have in DataFusion - see #5535 (I think @ozankabak and @mustafasrepo are thinking about this).

It would be amazing to incorporate bounds analysis into expr simplification, though I would like to request we don't use yet another representation of ranges / bounds.

@ozankabak (Contributor)

This is totally doable with the interval library; we will be happy to help. BTW @alamb, the interval/range unification is coming soon -- the interval library needs a couple more features, and then we should be able to retire the old code.

@wjones127 (Member, Author)

Would the interval library support non-integer data types, such as strings, booleans, dates, and timestamps? When I was looking recently, it seemed to mostly support integers.

@ozankabak (Contributor)

It will support all integral types (booleans, integers, etc.), floats, and dates/timestamps. Most of that support is already there. You cannot do arithmetic with strings, so we haven't focused on those yet. But you can certainly analyze inequalities etc., and the code structure is amenable to that.
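
To illustrate the point about strings: even without arithmetic, min/max bounds are enough to decide some inequalities purely by ordering. A minimal standalone sketch (the `StringBounds` type here is hypothetical, not part of the interval library):

```rust
/// Hypothetical min/max statistics for a string column.
struct StringBounds {
    min: String,
    max: String,
}

impl StringBounds {
    /// Decide `col < value` from the bounds alone, using ordering only.
    /// Returns Some(true)/Some(false) when the bounds settle it, None otherwise.
    fn lt(&self, value: &str) -> Option<bool> {
        if self.max.as_str() < value {
            Some(true) // every value in the column is below `value`
        } else if self.min.as_str() >= value {
            Some(false) // no value in the column can be below `value`
        } else {
            None // bounds straddle `value`; keep the filter as a residual
        }
    }
}

fn main() {
    let bounds = StringBounds { min: "apple".into(), max: "grape".into() };
    assert_eq!(bounds.lt("kiwi"), Some(true));      // always true
    assert_eq!(bounds.lt("aardvark"), Some(false)); // always false
    assert_eq!(bounds.lt("banana"), None);          // undecided from bounds
}
```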

@wjones127 (Member, Author)

> It would be amazing to incorporate bounds analysis into expr simplification, though I would like to request we don't use yet another representation of ranges / bounds.

I'm taking a look at this now. Two issues I see:

  1. The Interval bounds don't include any information about nullability. I'd like to simplify expressions like X IS NOT NULL to true or false if null statistics support that simplification.
  2. The interval library operates on physical expressions, while the simplification operates on logical expressions.

I'm working on a PR right now that will include a struct that is parallel to Interval for logical expressions that includes the null information. Once I have that working I'll see if it's worth consolidating that with the existing Interval struct.
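
A rough sketch of what such a parallel structure could look like; the name `GuaranteeBounds` and its fields are hypothetical, not the types the eventual PR uses:

```rust
use datafusion_common::ScalarValue;

/// Hypothetical: bounds on a column for logical-expression simplification,
/// tracking nullability alongside the value range.
struct GuaranteeBounds {
    /// Inclusive lower bound on the value, if known.
    lower: Option<ScalarValue>,
    /// Inclusive upper bound on the value, if known.
    upper: Option<ScalarValue>,
    /// Whether the column may contain nulls.
    maybe_null: bool,
    /// Whether the column may contain non-null values.
    maybe_not_null: bool,
}

impl GuaranteeBounds {
    /// What `x IS NOT NULL` simplifies to under this guarantee:
    /// Some(true)/Some(false) if decidable, None to leave the expression unchanged.
    fn is_not_null(&self) -> Option<bool> {
        match (self.maybe_null, self.maybe_not_null) {
            (false, true) => Some(true),  // never null
            (true, false) => Some(false), // always null
            _ => None,                    // may or may not be null
        }
    }
}
```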

@ozankabak (Contributor)

I will also think about how we can add nullability support to the interval library without resulting in large changes or performance impacts. Moving it outside of physical-expr as a general purpose library is fairly trivial.

Let's exchange ideas so we can support your use case with as little code/functionality duplication as possible.

@wjones127 (Member, Author)

> I will also think about how we can add nullability support to the interval library without resulting in large changes or performance impacts. Moving it outside of physical-expr as a general purpose library is fairly trivial.

Could I extract the Interval struct into datafusion_common, add the nullability field, and then leave a note in the cp_solver module that nullability isn't handled yet?

Here is the null tracking definition I created:

https://github.com/apache/arrow-datafusion/blob/a6b57e38eb00da2a6c5396dca0b5f1772578ac78/datafusion/optimizer/src/simplify_expressions/guarantees.rs#L71-L84

Does that seem good to you?

@metesynnada (Contributor)

Based on your requirements, it appears that you only need two intervals in a struct: one interval for range analysis, and another boolean interval for null status. A null interval of (false, false) would indicate that the value is never null, (false, true) would indicate that it may be null, and (true, true) would indicate that it is always null.

Therefore, there is no need to modify the interval library. Instead, you can create your own logic while still using the cp_solver's helper functions. I also agree with your suggestion to move the interval library into datafusion_common.
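
A minimal sketch of that encoding, pairing a value interval with a boolean interval for null status (both types here are illustrative placeholders, not the interval library's actual types):

```rust
/// Placeholder for a value range; the real code would reuse the existing interval type.
struct ValueInterval {
    lower: i64,
    upper: i64,
}

/// Boolean interval encoding null status, as described above:
///   (false, false) => never null
///   (false, true)  => may be null
///   (true,  true)  => always null
struct NullInterval {
    lower: bool,
    upper: bool,
}

/// The two intervals packaged together for one column.
struct ColumnGuarantee {
    values: ValueInterval,
    nulls: NullInterval,
}

fn main() {
    // A column known to lie in [0, 20] and to never be null.
    let g = ColumnGuarantee {
        values: ValueInterval { lower: 0, upper: 20 },
        nulls: NullInterval { lower: false, upper: false },
    };
    // With a (false, false) null interval, `x IS NOT NULL` is always true.
    assert!(!g.nulls.lower && !g.nulls.upper);
    assert!(g.values.lower <= g.values.upper);
}
```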
