Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate 3 Range Analysis / Interval implementations (cost model, pruning predicates, interval analysis) #5535

Closed
alamb opened this issue Mar 9, 2023 · 3 comments · Fixed by #6982
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Mar 9, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

At the time of writing , DataFusion has three partially complete range/interval analysis implementations:

Range analysis used in cardinality estimation

https://docs.rs/datafusion-physical-expr/18.0.0/datafusion_physical_expr/trait.PhysicalExpr.html#method.analyze
https://docs.rs/datafusion-physical-expr/18.0.0/datafusion_physical_expr/struct.AnalysisContext.html
(cc @isidentical)

Pruning Predicate

https://docs.rs/datafusion/18.0.0/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html
(creates an expr and then evaluates it against statistics)

Intervals

https://github.com/apache/arrow-datafusion/blob/6a8c4a60114c7132de6d2fcd6d44111cf40b3df9/datafusion/physical-expr/src/intervals/mod.rs#L24
Introduced by @metesynnada and @ozankabak in #5322

The modules are all structured a differently / have different implementations but what they are computing (interval / range analysis) is basically the same and they all operate on PhysicalExprs.

Since there are three implementations, as we add new PhysicalExprs either the work must be triplicated or else (more likely) the implementations will keep diverging in functionality.

In addition it is hard to increase the scope of coverage when the logic is in different places

Describe the solution you'd like
I would like a single implementation of range analysis used for all three uses in DataFusion. The one introduced by @metesynnada and @ozankabak in #5322 seems to have the best documentation and theoretical foundation, but the other two are more functionally complete.

I don't have a preference for which approach is taken, other than that we end up with a single implementation that is easier to maintain

Describe alternatives you've considered

Additional context

I believe @ozankabak said they may have plans in this area (#5322 (comment)) but I am not sure

cc @mingmwang

@ozankabak
Copy link
Contributor

Thank you for opening an issue to track this. We are actively working on extending interval support and will unify the existing code along the way.

FYI, our first step will be supporting more data types (e.g. timestamps, floating point). Then, we will add support for more operations (multiplication and division). There are other remaining steps after these, but at this point we can probably unify the code before adding the other features.

@metesynnada
Copy link
Contributor

@alamb I started to work on this, do you have any suggestions on PruningPredicate?

@alamb
Copy link
Contributor Author

alamb commented Apr 5, 2023

@alamb I started to work on this, do you have any suggestions on PruningPredicate?

I would personally suggest breaking this project into a few parts:

  1. consolidating the first two (not the pruning predicate).
  2. consolidating pruning predicate

Doing PruningPredicate needs some thought -- perhaps an incremental approach like figure out how it will work for simple predicates (a < 5) or whatever, and get the basics working.

Then we can do a series of PRs to port over the rest of the expressions (and tests)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants