
Implement Expr SimplifyWithGuarantee #6171

Closed
Tracked by #5923
wjones127 opened this issue Apr 30, 2023 · 8 comments · Fixed by #7467
Labels
enhancement New feature or request

Comments

@wjones127 (Member) commented Apr 30, 2023

Is your feature request related to a problem or challenge?

We are starting to look at more advanced methods for filter pushdown, and I was considering porting SimplifyWithGuarantee. The critical functionality we need is the ability to evaluate a predicate against some statistics and get back the residual expression. For example, given the predicate x = 1 AND y < 2:

  • file 1 with stats 0 <= x <= 20 and 0 < y <= 1 => residual filter x = 1 (y < 2 is always satisfied) => scan this file with x = 1 filter
  • file 2 with stats 3 < y <= 10 => residual filter false => don't scan this file since it will never satisfy the predicate

Describe the solution you'd like

I think a straightforward port of that function would be useful, but if there is a design that integrates better with existing functionality, I'm open to other designs.

/// Given a guarantee expression and a predicate expression, simplify the predicate expression.
///
/// # Example
///
/// This is useful, for example, when filtering data that has statistics. If the
/// statistics tell you `x > 2` (the guarantee), and you want to filter with
/// `x > 3 and y < 0`, then you can simplify the predicate to `y < 0`. Alternatively,
/// if the predicate is `x < 1 and y < 0`, then you know directly from the
/// statistics that the predicate will always be false, so your filter can
/// immediately return an empty result.
///
/// ```
/// use datafusion_expr::{lit, col, Expr};
///
/// let guarantee = col("x").gt(lit(2));
///
/// let predicate = col("x").gt(lit(3)).and(col("y").lt(lit(0)));
/// assert_eq!(predicate.simplify_with_guarantee(&guarantee), col("y").lt(lit(0)));
///
/// let predicate = col("x").lt(lit(1)).and(col("y").lt(lit(0)));
/// assert_eq!(predicate.simplify_with_guarantee(&guarantee), lit(false));
/// ```
pub fn simplify_with_guarantee(&self, guarantee: &Expr) -> Expr {
    todo!()
}
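
For concreteness, here is a sketch of how the file-statistics scenario above might play out with the proposed method. This is illustrative only: `simplify_with_guarantee` does not exist yet, and expressing per-file statistics as a conjunction of comparison expressions is just one possible way the guarantee could be supplied.

```rust
use datafusion_expr::{col, lit};

// Predicate from the example above: x = 1 AND y < 2
let predicate = col("x").eq(lit(1)).and(col("y").lt(lit(2)));

// File 1 stats: 0 <= x <= 20 and 0 < y <= 1
let file1_guarantee = col("x").gt_eq(lit(0))
    .and(col("x").lt_eq(lit(20)))
    .and(col("y").gt(lit(0)))
    .and(col("y").lt_eq(lit(1)));
// Expected residual: x = 1 (the `y < 2` conjunct is always satisfied)
assert_eq!(
    predicate.simplify_with_guarantee(&file1_guarantee),
    col("x").eq(lit(1))
);

// File 2 stats: 3 < y <= 10
let file2_guarantee = col("y").gt(lit(3)).and(col("y").lt_eq(lit(10)));
// Expected residual: false, so the file can be skipped entirely
assert_eq!(
    predicate.simplify_with_guarantee(&file2_guarantee),
    lit(false)
);
```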

Describe alternatives you've considered

It seems like the current solutions with PruningPredicate don't give you the residual expression.

Additional context

This is related to #5830

@alamb (Contributor) commented May 1, 2023

I think it would be really nice to integrate this idea with the range analysis that we already have in DataFusion - see #5535 (I think @ozankabak and @mustafasrepo are thinking about this).

It would be amazing to incorporate bounds analysis into expr simplification, though I would like to request we don't use yet another representation of ranges / bounds.

@ozankabak (Contributor)

This is totally doable with the interval library; we will be happy to help. BTW @alamb, the interval/range unification is coming soon -- the interval library needs a couple more features, and then we should be able to retire the old code.

@wjones127 (Member, Author)

Would the interval library support non-integer data types, such as strings, booleans, dates, and timestamps? When I was looking recently, it seemed to mostly support integers.

@ozankabak (Contributor)

It will support all integral types (booleans, integers, etc.), floats, and dates/timestamps. Most of that support is already there. You cannot do arithmetic with strings, so we haven't focused on those yet. But you can certainly analyze inequalities etc., and the code structure is amenable to that.
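
To illustrate the point about strings: even without arithmetic, min/max bounds are enough to decide some inequalities purely by ordering. A minimal standalone sketch (the `StringBounds` type here is hypothetical, not part of the interval library):

```rust
/// Hypothetical min/max statistics for a string column.
struct StringBounds {
    min: String,
    max: String,
}

impl StringBounds {
    /// Decide `col < value` from the bounds alone, using ordering only.
    /// Returns Some(true)/Some(false) when the bounds settle it, None otherwise.
    fn lt(&self, value: &str) -> Option<bool> {
        if self.max.as_str() < value {
            Some(true) // every value in the column is below `value`
        } else if self.min.as_str() >= value {
            Some(false) // no value in the column can be below `value`
        } else {
            None // bounds straddle `value`; keep the filter as a residual
        }
    }
}

fn main() {
    let bounds = StringBounds { min: "apple".into(), max: "grape".into() };
    assert_eq!(bounds.lt("kiwi"), Some(true));      // always true
    assert_eq!(bounds.lt("aardvark"), Some(false)); // always false
    assert_eq!(bounds.lt("banana"), None);          // undecided from bounds
}
```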

@wjones127 (Member, Author)

> It would be amazing to incorporate bounds analysis into expr simplification, though I would like to request we don't use yet another representation of ranges / bounds.

I'm taking a look at this now. Two issues I see:

  1. The Interval bounds don't include any information about nullability. I'd like to simplify expressions like X IS NOT NULL to true or false if null statistics support that simplification.
  2. The interval library operates on physical expressions, while the simplification operates on logical expressions.

I'm working on a PR right now that will include a struct that is parallel to Interval for logical expressions that includes the null information. Once I have that working I'll see if it's worth consolidating that with the existing Interval struct.
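
A rough sketch of what such a parallel structure could look like; the name `GuaranteeBounds` and its fields are hypothetical, not the types the eventual PR uses:

```rust
use datafusion_common::ScalarValue;

/// Hypothetical: bounds on a column for logical-expression simplification,
/// tracking nullability alongside the value range.
struct GuaranteeBounds {
    /// Inclusive lower bound on the value, if known.
    lower: Option<ScalarValue>,
    /// Inclusive upper bound on the value, if known.
    upper: Option<ScalarValue>,
    /// Whether the column may contain nulls.
    maybe_null: bool,
    /// Whether the column may contain non-null values.
    maybe_not_null: bool,
}

impl GuaranteeBounds {
    /// What `x IS NOT NULL` simplifies to under this guarantee:
    /// Some(true)/Some(false) if decidable, None to leave the expression unchanged.
    fn is_not_null(&self) -> Option<bool> {
        match (self.maybe_null, self.maybe_not_null) {
            (false, true) => Some(true),  // never null
            (true, false) => Some(false), // always null
            _ => None,                    // may or may not be null
        }
    }
}
```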

@ozankabak (Contributor)

I will also think about how we can add nullability support to the interval library without resulting in large changes or performance impacts. Moving it outside of physical-expr as a general purpose library is fairly trivial.

Let's exchange ideas so we can support your use case with as little code/functionality duplication as possible.

@wjones127 (Member, Author)

> I will also think about how we can add nullability support to the interval library without resulting in large changes or performance impacts. Moving it outside of physical-expr as a general purpose library is fairly trivial.

Could I extract the Interval struct into datafusion_common, add the nullability field, and then leave a note in the cp_solver module that nullability isn't handled yet?

Here is the null tracking definition I created:

https://github.com/apache/arrow-datafusion/blob/a6b57e38eb00da2a6c5396dca0b5f1772578ac78/datafusion/optimizer/src/simplify_expressions/guarantees.rs#L71-L84

Does that seem good to you?

@metesynnada (Contributor)

Based on your requirements, it appears that you only need two intervals in a struct: one interval for range analysis, and another boolean interval for null status. A null interval of (false, false) would indicate that the value is never null, (false, true) would indicate that it may be null, and (true, true) would indicate that it is always null.

Therefore, there is no need to modify the interval library. Instead, you can create your own logic while still using the cp_solver's helper functions. I also agree with your suggestion to move the interval library into datafusion_common.
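
A minimal sketch of that encoding, pairing a value interval with a boolean interval for null status (both types here are illustrative placeholders, not the interval library's actual types):

```rust
/// Placeholder for a value range; the real code would reuse the existing interval type.
struct ValueInterval {
    lower: i64,
    upper: i64,
}

/// Boolean interval encoding null status, as described above:
///   (false, false) => never null
///   (false, true)  => may be null
///   (true,  true)  => always null
struct NullInterval {
    lower: bool,
    upper: bool,
}

/// The two intervals packaged together for one column.
struct ColumnGuarantee {
    values: ValueInterval,
    nulls: NullInterval,
}

fn main() {
    // A column known to lie in [0, 20] and to never be null.
    let g = ColumnGuarantee {
        values: ValueInterval { lower: 0, upper: 20 },
        nulls: NullInterval { lower: false, upper: false },
    };
    // With a (false, false) null interval, `x IS NOT NULL` is always true.
    assert!(!g.nulls.lower && !g.nulls.upper);
    assert!(g.values.lower <= g.values.upper);
}
```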
