Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add bounds for unary math scalar functions #11584

Merged
merged 18 commits into from
Jul 24, 2024

Conversation

tshauck
Copy link
Contributor

@tshauck tshauck commented Jul 21, 2024

Which issue does this PR close?

Closes #11583 11583

Rationale for this change

Many math unary functions aren't unbounded, but because the macro doesn't include a way to modify this on a per function basis the trait default (of unbounded) is used.

What changes are included in this PR?

  1. Change the make_math_unary_udf macro to take a bounds function
  2. Implement a function for math functions where the bounds are known a priori.

Are these changes tested?

Yes, see unittests.

Are there any user-facing changes?

No

@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate and removed core Core DataFusion crate labels Jul 21, 2024
@tshauck tshauck marked this pull request as ready for review July 21, 2024 17:12
pub(super) fn unbounded(_input: &[&Interval]) -> crate::Result<Interval> {
// We cannot assume the input datatype is the same of output type.
Interval::make_unbounded(&DataType::Null)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better here returning not_impl_err? The current version may cause type mismatch errors and debugging would require more time.

Copy link
Contributor Author

@tshauck tshauck Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would cause an error when trying to use these functions? The update macro is going to call whatever is passed as $EVALUATE_BOUNDS. Right now this implementation matches the default of the ScalarUDFImpl trait.

I think an alternative is to make something like Interval::make_symmetric_inf. I'm using unbounded here somewhat loosely, but probably in most of these cases it should be (-inf, inf) which could be given a type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went ahead and did that here: cf2b023

@@ -332,6 +332,46 @@ impl Interval {
Ok(Self::new(unbounded_endpoint.clone(), unbounded_endpoint))
}

/// Creates an interval between -∞ to ∞.
pub fn make_infinity_interval(data_type: &DataType) -> Result<Self> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use ScalarValue::Float(None) for infinite bounds by convention. Even if you give Float(Some(f::INF)) or Float(Some(f::NaN)), they are converted to Float(None)'s during interval creation. We had given such a decision to have unique representation of unboundedness. You can check the details here:

macro_rules! handle_float_intervals {

I recommend here to use make_unbounded API with floating types to not break this convention.

}

/// Create an interval from 0 to infinity.
pub fn make_non_negative_infinity_interval(data_type: &DataType) -> Result<Self> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same suggestion applies here as well

@berkaysynnada
Copy link
Contributor

Thanks @tshauck. This is a nice step towards having comprehensive interval analysis. I have left a small suggestion. Once that is addressed, the PR will be good to go 🚀

@ozankabak
Copy link
Contributor

Apart from @berkaysynnada's outstanding suggestions the only thing I see is that we would need to use two different π values, depending on whether it is in the lower bound or the upper bound of the resulting interval.

If π will be in the upper bound, we should use the smallest floating point number that is greater than the mathematical (precise) value of π. Let's call this pi_upper. Conversely, if π will be in the lower bound, we should use the largest floating point number that is less than the mathematical (precise) value of π.

Since π is not exactly representable, the pi you get from Rust is actually either pi_upper or pi_lower (you need to check). You can obtain the other one by using the nextafter function (let's just be careful to do this only once and use it as a constant everywhere).

Apart from this detail that is addressable by a small fix, this seems to be a great PR. Thank you.

@tshauck
Copy link
Contributor Author

tshauck commented Jul 23, 2024

Thanks to both of you for the thoughtful feedback.

@berkaysynnada, I updated infinity to use unbounded in 231a717.

@ozankabak, for handling π, do you mean libm::nextafter? There's also a .next_up, but unfortunately, it's in nightly. Did I overlook the actual function, or if not, any opinion on bringing in libm as dependency vs something w/i datafusion, albeit less robust?

@ozankabak
Copy link
Contributor

IIRC nextafter is the IEEE standard name for this function, and Rust's next_up does something similar (in the upward direction). If it is in nightly, we probably should just create constants with these values explicitly given for each float type.

@tshauck
Copy link
Contributor Author

tshauck commented Jul 24, 2024

Thanks @ozankabak, would you please have a look at the latest changes and lmk what you think?

@berkaysynnada
Copy link
Contributor

I checked the new version for both unbounded creation and rounded PI usage. Everything looks good and seems ready to merge. Thanks @tshauck!

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good to me -- thanks for this great work @tshauck!

@tshauck
Copy link
Contributor Author

tshauck commented Jul 24, 2024

Thanks! -- it looks like signum was changed in the meantime, so removing that for now as it doesn't use make_math_unary_udf anymore. The rebase seems ok locally, but I'll check to see if things don't go through for w/e reason here.

@alamb
Copy link
Contributor

alamb commented Jul 24, 2024

I took a quick look and this looks good to me. Thank you @tshauck @berkaysynnada and @ozankabak 🚀

@alamb alamb merged commit 5901df5 into apache:main Jul 24, 2024
24 checks passed
@tshauck tshauck deleted the add-unary-udf-bounds branch July 24, 2024 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement evaluate_bounds for math unary functions
4 participants