Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: clarify behavior of is_null, is_not_null, is_nan, is_finite, and is_infinite for nulls #285

Merged
merged 2 commits into from
Sep 2, 2022

Conversation

thisisnic
Copy link
Contributor

This PR adds the option to control whether calling is_null and is_not_null on a nan value returns true/false, and also adds the option to control is calling is_nan on a null value returns true/false/null.

@@ -62,6 +62,9 @@ scalar_functions:
description: Whether a value is null.
impls:
- args:
- name: nan_is_null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind elaborating on the motivation for adding this option? I'm wondering why an is_nan function wouldn't suffice.

Copy link
Contributor Author

@thisisnic thisisnic Aug 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different systems have different definitions of whether NaN is equivalent to NULL or not.

This is something we encountered when working on the Acero compute functions; https://issues.apache.org/jira/browse/ARROW-12959 and https://issues.apache.org/jira/browse/ARROW-12960.

If we wanted is_null() to return TRUE for NaN values and TRUE for NULL, then I don't think mapping it to is_nan() would achieve this, as that would return FALSE for NULL. Or would it return NULL for NULL? Either way, I guess having this explicitly as an option removes that ambiguity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason producers can't handle this kind of transform?

Most systems treat nan and null as distinct. As far as I know, it's only R and Pandas that support treating these two values equivalently in some contexts but not all.

Ideally we can avoid propagating the mistake of treating these two values as optionally equivalent as part of the spec.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for @cpcloud. I don't see why these shenanigans couldn't be modeled without having this option. Let's not embed arbitrary and questionable special cases into Substrait core.

@@ -78,9 +84,15 @@ scalar_functions:
description: Whether a value is not a number.
impls:
- args:
- name: value_for_null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, what's the motivation to allow NULL to take on different logical meaning?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming there is some function that assigns a default value to null values, it's also not primitive. I'm too lazy to look up if it does, but if not it probably should; I'll define it as unwrap_or(T?, T) -> T for now (borrowing the name from Rust). In that case, this function is just is_nan(x), unwrap_or(is_nan(x), true), or unwrap_or(is_nan(x), false) for the three options respectively, with the added benefit that value_for_null can now be any expression.

extensions/functions_comparison.yaml Show resolved Hide resolved
Comment on lines 73 to 93
description: Whether a value is not null.
impls:
- args:
- name: nan_is_null
options: [ TRUE, FALSE ]
required: false
- value: any1
return: BOOLEAN
nullability: DECLARED_OUTPUT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is rather redundant. It's just not(is_null(x)). What's the reason for having both?

@@ -78,9 +84,15 @@ scalar_functions:
description: Whether a value is not a number.
impls:
- args:
- name: value_for_null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming there is some function that assigns a default value to null values, it's also not primitive. I'm too lazy to look up if it does, but if not it probably should; I'll define it as unwrap_or(T?, T) -> T for now (borrowing the name from Rust). In that case, this function is just is_nan(x), unwrap_or(is_nan(x), true), or unwrap_or(is_nan(x), false) for the three options respectively, with the added benefit that value_for_null can now be any expression.

@@ -62,6 +62,9 @@ scalar_functions:
description: Whether a value is null.
impls:
- args:
- name: nan_is_null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for @cpcloud. I don't see why these shenanigans couldn't be modeled without having this option. Let's not embed arbitrary and questionable special cases into Substrait core.

@thisisnic
Copy link
Contributor Author

OK, cool, thanks for taking a look; will close this PR

@thisisnic
Copy link
Contributor Author

thisisnic commented Aug 16, 2022

@cpcloud @jvanstraten In a separate conversation, @saulpw suggested that perhaps we could either update the function descriptions to clarify this difference and/or add a note to the types page to clarify it there. Would be good to get your thoughts on this - I'm thinking that if we don't want to add it to the description via my latest changes to this PR (I'm leaning towards not merging this given a lot of DBs don't operate this way anyway), it would still be good to add a few words to the docs to be explicit about it.

jvanstraten
jvanstraten previously approved these changes Aug 16, 2022
Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't argue with making sure things are abundantly clear :)

@ianmcook
Copy link
Contributor

Thanks @thisisnic for doing this and thanks @cpcloud and @jvanstraten for the feedback. When @thisisnic and I initially discussed this idea of adding options for whether or not to treat NaN values as nulls, it seemed like a nice convenience that would simplify some producer-consumer integrations (such as the dplyr-Acero integration). But I understand the rationale for keeping this kind of thing out of Substrait for the sake of simplicity and consistency.

@ianmcook
Copy link
Contributor

ianmcook commented Aug 29, 2022

@jvanstraten could you confirm that you would expect the return values of these expressions to be as follows (assuming the input has floating point type)?

  • is_null(NaN)false
  • is_not_null(NaN)true
  • is_nan(null)false
  • is_finite(null)false (I'm uncertain about this one)
  • is_infinite(null)false

This PR makes explicit the first two. Should we also make explicit the last three?

@thisisnic
Copy link
Contributor Author

This PR makes explicit the first two. Should we also make explicit the last three?

I think so. Happy to add those in here once we've confirmed those are indeed our expected results.

@jvanstraten
Copy link
Contributor

jvanstraten commented Aug 30, 2022

Actually I would expect is_nan, is_finite, and is_infinite to have MIRROR nullability behavior and return null when given null.

The way I conceptualize a nullable type is as a sum type (aka tagged union) of the value null and the non-nullable base type. NaNs and infinities, on the other hand, are part of float types (thanks IEEE); they are part of the number. So a nullable float isn't a union(null, +inf, -inf, NaN, finite), it's union(null, float), where float := union(+inf, -inf, NaN, finite). is_nan/is_finite/is_infinite have nothing to do with nullability and thus should IMO be defined for floats, so they are simply not defined for the value null. So they should fail in that case, which customarily means returning null. Returning false is the next best thing and I won't fight it if that's how these things are historically implemented, but it's not what I would expect. If I would want to write something like is_finite by your definition I would intuitively write something like coalesce(is_finite(x), false).

Are they already defined in Substrait or in a PR somewhere?

ETA: I would also expect is_not_null(x) to be exactly the same as not(is_null(x)).

@ianmcook
Copy link
Contributor

ianmcook commented Aug 31, 2022

Thanks @jvanstraten. I think your rationale for is_nan, is_finite, and is_infinite returning null when given null makes sense and is the right way to do it. My prior thinking about this was influenced by R, in which is.nan(NA_real_) returns FALSE. But I think your rationale makes more sense than R's behavior.

@thisisnic does the above make sense to you too? If so, could you please modify the entries for is_nan, is_finite, and is_infinite to explicitly specify nullability: MIRROR? It looks like there is an example of this in functions_arithmetic.yaml for reference, and this MIRROR nullability handling behavior is explained here. And if it seems important to also clarify in the descriptions of these functions that null input yields null output, then maybe do that too? Thank you!

@jvanstraten jvanstraten changed the title feat: add options for null and nan checking behaviour feat: improve behavior documentation for is_null, is_not_null, is_nan, is_finite, is_ Sep 2, 2022
@jvanstraten jvanstraten changed the title feat: improve behavior documentation for is_null, is_not_null, is_nan, is_finite, is_ feat: improve behavior documentation for is_null, is_not_null, is_nan, is_finite, is_infinite Sep 2, 2022
@jvanstraten jvanstraten changed the title feat: improve behavior documentation for is_null, is_not_null, is_nan, is_finite, is_infinite feat: clarify behavior of is_null, is_not_null, is_nan, is_finite, and is_infinite for nulls Sep 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants