Implement Expr SimplifyWithGuarantee #6171
Comments
I think it would be really nice to integrate this idea with the range analysis that we already have in DataFusion; see #5535 (I think @ozankabak and @mustafasrepo are thinking about this). It would be amazing to incorporate bounds analysis into expr simplification, though I would like to request that we don't add yet another representation of ranges / bounds.
This is totally doable with the interval library, and we will be happy to help. BTW @alamb, the interval/range unification is coming soon: the interval library needs a couple more features, and then we should be able to retire the old code.
Would the interval library support non-integer data types, such as strings, booleans, dates, and timestamps? When I looked recently, it seemed to mostly support integers.
It will support all integral types (booleans, integers, etc.), floats, and dates/timestamps. Most of that support is already there. You cannot do arithmetic with strings, so we haven't focused on those yet. But you can certainly analyze inequalities and similar comparisons on them, and the code structure is amenable to that.
I'm taking a look at this right now. Two issues I see right now:
I'm working on a PR right now that will include a struct that is parallel to
I will also think about how we can add nullability support to the interval library without resulting in large changes or performance impacts. Moving it outside of physical-expr as a general purpose library is fairly trivial. Let's exchange ideas so we can support your use case with as little code/functionality duplication as possible.
Could I extract the Here is the null tracking definition I created: Does that seem good to you?
Based on your requirements, it appears that you only need two intervals in a struct. One interval would be for range analysis, and the other would be a boolean interval for null status. By having (false, false) as the input, you could indicate that the value is never null. (false, true) would indicate that it may be null, and (true, true) would indicate that it is always null. Therefore, there is no need to modify the interval library. Instead, you can create your own logic while still using the
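The three-state null encoding described in the comment above can be sketched as a small boolean interval. This is only an illustration under stated assumptions: `BoolInterval` and `NullStatus` are hypothetical names invented here, not types from the DataFusion interval library, and the interval is read as bounds on the answer to "is this value null?", with `false < true`.

```rust
/// Closed interval over booleans, ordered so that `false < true`.
/// Hypothetical type for illustration; NOT the DataFusion interval API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BoolInterval {
    lower: bool,
    upper: bool,
}

/// Readable view of the three valid "is null" intervals.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NullStatus {
    NeverNull,  // (false, false)
    MaybeNull,  // (false, true)
    AlwaysNull, // (true, true)
}

impl BoolInterval {
    /// Builds an interval, rejecting the empty range (true, false).
    pub fn new(lower: bool, upper: bool) -> Option<Self> {
        if lower && !upper {
            None
        } else {
            Some(Self { lower, upper })
        }
    }

    /// Interprets the interval as a null status.
    pub fn null_status(self) -> NullStatus {
        match (self.lower, self.upper) {
            (false, false) => NullStatus::NeverNull,
            (true, true) => NullStatus::AlwaysNull,
            // Only (false, true) reaches this arm, because `new`
            // never constructs (true, false).
            _ => NullStatus::MaybeNull,
        }
    }
}
```

A struct pairing one such interval with a value-range interval would then carry both pieces of information without changing the range representation itself.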
Is your feature request related to a problem or challenge?
We are starting to look for more advanced methods for filter pushdown. I was starting to think of porting SimplifyWithGuarantee. The critical functionality we are looking for is being able to evaluate a predicate against some statistics and get the residual expression. For example, if I have the predicate `x = 1 AND y < 2`:
- `0 <= x <= 20` and `0 < y <= 1` => residual filter `x = 1` (`y < 2` is always satisfied) => scan this file with `x = 1` filter
- `3 < y <= 10` => residual filter `false` => don't scan this file since it will never satisfy the predicate
Describe the solution you'd like
I think a straightforward port of that function would be useful, but if there is a design that integrates better with existing functionality, I'm open to other designs.
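To make the requested behavior concrete, here is a minimal sketch of computing a residual filter for the `x = 1 AND y < 2` example. Everything in it is an assumption made for illustration: `Expr`, `Bounds`, and `simplify_with_guarantee` are toy stand-ins, not DataFusion types or an existing API; only `=`, `<`, and `AND` over `i64` columns are handled; and columns are assumed to be non-null.

```rust
use std::collections::HashMap;

/// Known bounds for one column; each end may be open or closed.
/// Hypothetical type, not DataFusion's statistics or interval type.
#[derive(Debug, Clone, Copy)]
pub struct Bounds {
    pub lower: i64,
    pub lower_open: bool,
    pub upper: i64,
    pub upper_open: bool,
}

/// Toy predicate language (assumption: non-null i64 columns only).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Literal(bool),
    Eq(String, i64), // column = constant
    Lt(String, i64), // column < constant
    And(Box<Expr>, Box<Expr>),
}

/// Returns the residual predicate after applying the column bounds.
pub fn simplify_with_guarantee(expr: &Expr, stats: &HashMap<String, Bounds>) -> Expr {
    match expr {
        Expr::Literal(_) => expr.clone(),
        Expr::Eq(col, c) => match stats.get(col) {
            Some(b) => {
                let below = *c < b.lower || (*c == b.lower && b.lower_open);
                let above = *c > b.upper || (*c == b.upper && b.upper_open);
                if below || above {
                    // The constant lies outside the bounds.
                    Expr::Literal(false)
                } else if b.lower == b.upper {
                    // The column holds exactly one value, and it is `c`.
                    Expr::Literal(true)
                } else {
                    expr.clone()
                }
            }
            None => expr.clone(),
        },
        Expr::Lt(col, c) => match stats.get(col) {
            // Every possible value is below the constant.
            Some(b) if b.upper < *c || (b.upper == *c && b.upper_open) => Expr::Literal(true),
            // Every possible value is at or above the constant.
            Some(b) if b.lower >= *c => Expr::Literal(false),
            _ => expr.clone(),
        },
        Expr::And(l, r) => {
            match (simplify_with_guarantee(l, stats), simplify_with_guarantee(r, stats)) {
                (Expr::Literal(false), _) | (_, Expr::Literal(false)) => Expr::Literal(false),
                (Expr::Literal(true), other) | (other, Expr::Literal(true)) => other,
                (l, r) => Expr::And(Box::new(l), Box::new(r)),
            }
        }
    }
}
```

With bounds `0 <= x <= 20` and `0 < y <= 1`, the sketch reduces the predicate to `x = 1`; with `3 < y <= 10`, it reduces to the literal `false`, so the file can be skipped. A real port would also need to account for nulls, since a possibly-null `y` means `y < 2` cannot be folded to `true`.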
Describe alternatives you've considered
It seems like the current solutions with `PruningPredicate` don't give you the residual expression.
Additional context
This is related to #5830