Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PhysicalExpr Orderings with Range Information #10504

Merged
merged 17 commits into from
May 17, 2024

Conversation

berkaysynnada
Copy link
Contributor

@berkaysynnada berkaysynnada commented May 14, 2024

Which issue does this PR close?

Closes #9832.

Rationale for this change

There exists a bug in CastExpr orderings, as detailed in the issue. To solve this problem, we need to append DataType information of children in the get_ordering method of PhysicalExpr's. However, this problem can be solved in a more clever way offering more functionality.

I propose to extend get_ordering method to get_properties such that it can also carry the range information of expressions, which could have an important meaning for expressions when investigating the order information.

Having range information would help ScalarFunctionExpr's order calculations since many of them have monotonicity pattern on some defined intervals. I have given an example of it for ABS() function. CastExpr bug is also solved since the range carries the datatype information.

What changes are included in this PR?

  1. Function monotonicity in scalar functions should not be set via a boolean vector, wrt the effect of the parameter on function monotonicity. Instead, it should be hardcoded in the function bodies, as most functions have unique monotonicity patterns and need to decide order based on their input properties. (I have updated the existing monotonicity calculations and add some more, but there are still TODO's)
  2. To be able to use ScalarFunctionExpr's in interval arithmetic, we will have also evaluate_bounds() and propagate_constraints() methods in ScalarUDF.
  3. Minor updates in interval_arithmetic.rs
  • Storing unbounded unsigned integers as [0 - Null]
  • Or support for boolean intervals
  • Arithmetic negation of intervals

Are these changes tested?

Yes, in order.slt

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels May 14, 2024
@alamb
Copy link
Contributor

alamb commented May 14, 2024

I haven't had a chance to review this PR yet @berkaysynnada -- I wonder if you have seen the API in #10117 from @tinfoil-knight

datafusion/expr/src/udf.rs Outdated Show resolved Hide resolved
@berkaysynnada
Copy link
Contributor Author

I haven't had a chance to review this PR yet @berkaysynnada -- I wonder if you have seen the API in #10117 from @tinfoil-knight

Yes, I have. It is becoming a nicer and more useful API for the current version of monotonicity. However, I believe we will eventually need to carry this range data along with the expressions to be able to perform deeper analyses and optimizations.

My initial motivation was just to solve the CastExpr bug, but then I realized that this bug is part of a larger need. I am, of course, open to iterating over this PR.

@ozankabak
Copy link
Contributor

ozankabak commented May 15, 2024

That PR and this are orthogonal. The work in this PR will simply inherit/benefit from the refactor in the other PR.

@berkaysynnada kindly reminded me that the type undergoing the refactor disappears in this extended formulation. In that case this PR may supersede the refactor one.

I will review this one in detail tomorrow and expand further.

("a".to_string(), "a".to_string()),
("b".to_string(), "b".to_string()),
("c".to_string(), "c".to_string()),
];
Copy link
Contributor Author

@berkaysynnada berkaysynnada May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A test bug shows up, fixed here.

@alamb
Copy link
Contributor

alamb commented May 15, 2024

@berkaysynnada kindly reminded me that the type undergoing the refactor disappears in this extended formulation. In that case this PR may supersede the refactor one.

Indeed, though I still see

    /// Calculates the [`SortProperties`] of this function based on its children's properties.
    fn monotonicity(&self, _input: &[ExprProperties]) -> Result<SortProperties> {
        Ok(SortProperties::Unordered)
    }

In this PR

@alamb
Copy link
Contributor

alamb commented May 15, 2024

Having range information would help ScalarFunctionExpr's order calculations since many of them have monotonicity pattern on some defined intervals. I have given an example of it for ABS() function.

This is a pretty clever idea. I think as long as we ensure that it is easy to implement the common patterns of montonically increasing and montonically decreasing adding a more general API is great

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this carefully and sent two commits improving on it. This is a fantastic PR that hits two birds in one stone:

  • It fixes the CAST bug,
  • It enables us to perform more optimizations w.r.t. monotonicity by utilizing range information (and obviating the need for a type like FuncMonotonicity in the process).

I only have a few asks and then we can merge this:

  1. The monotonicity definition example in advanced_udf.rs seems to be removed. If possible, let's add it back in to serve as an example to users.
  2. Similar comment for function_factory.rs.
  3. If all the inputs of a function are Singleton, should the output SortProperties also be Singleton? If so, maybe we can do this check once before calling individual monotonicity functions.

@alamb, it'd be good if you could take a quick look before we merge (but not necessary as I reviewed carefully). FYI, defining an "all-increasing" function becomes a one-liner in this approach, so your point is incorporated in this design as well.

@alamb
Copy link
Contributor

alamb commented May 17, 2024

@alamb, it'd be good if you could take a quick look before we merge (but not necessary as I reviewed carefully). FYI, defining an "all-increasing" function becomes a one-liner in this approach, so your point is incorporated in this design as well.

Sounds awesome -- I will try and do so tomorrow but if I don't get a chance feel free to merge it and I'll review afterwards. Thank you @ozankabak and @berkaysynnada

@berkaysynnada
Copy link
Contributor Author

3. If all the inputs of a function are Singleton, should the output SortProperties also be Singleton? If so, maybe we can do this check once before calling individual monotonicity functions.

For the most functions, it is so, but there could be edge cases like some time dependent functions or somehow depending randomly generated values. We have discussed it and decided to keep it as it is. I have addressed other minor issues, and this PR is ready to be merged.

@ozankabak
Copy link
Contributor

I will go ahead and merge this and we will fix any issues with a quick follow-on PR in case @alamb discovers any when he has time to take a look.

@ozankabak ozankabak merged commit d2fb05e into apache:main May 17, 2024
23 checks passed
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @berkaysynnada -- I looked at the API a little bit more today, and I think we may want to consider renaming ScalarUDF::monotonicity to better reflect what it does

The overall idea of allowing scalar UDFs to be part of the interval /boundary analysis is really nice


/// Calculates the [`SortProperties`] of this function based on its
/// children's properties.
fn monotonicity(&self, _inputs: &[ExprProperties]) -> Result<SortProperties> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we release this PR, I wonder if we should call this function sort_properties or calculate_sort, rather than monotonicty

Calling it monotonicty may cause upgrade pain as the signature (types) changed from

    /// This function specifies monotonicity behaviors for User defined scalar functions.
    fn monotonicity(&self) -> Result<Option<FuncMonotonicity>> {
        Ok(None)
    }

It also seems like monotonicity doesn't accurately reflect what it does any more, though I may misunderstand

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. I can rename it as calculate_order if you think it is explanatory enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just order() or output_ordering() (to mirror ExecutionPlan::output_ordering)?

// specific language governing permissions and limitations
// under the License.

use arrow::datatypes::DataType;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see many (any?) tests for this new code

Specifically, I am imagining tests like "if we broke / introduced a bug in one of these implementations would a test fail"

Did I miss them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. These functions were not tested with the old API, and they haven't been tested now either. Should we add unit tests for them in this file, or would it be better to cover them in the .slt tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally I would recommend .slt tests (that show sorts/not sorts for exmaple), but I am not sure if you have sufficient bandwidth to do so. Maybe unit tests would be best.

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* Self review

* Fix null interval accumulation

* Refactor monotonicity

* Ignore failing tests

* Initial impl

* Ready for review

* Update properties.rs

* Update configs.md

Update configs.md

* cargo doc

* Add abs test

* Update properties.rs

* Update udf.rs

* Review Part 1

* Review Part 2

* Minor

---------

Co-authored-by: Mehmet Ozan Kabak <[email protected]>
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect Order Preserving Cases for Cast Expressions
3 participants