feat: clarify behavior of is_null, is_not_null, is_nan, is_finite, and is_infinite for nulls #285

thisisnic · 2022-08-10T12:13:23Z

This PR adds the option to control whether calling is_null and is_not_null on a nan value returns true/false, and also adds the option to control is calling is_nan on a null value returns true/false/null.

cpcloud · 2022-08-12T14:47:33Z

extensions/functions_comparison.yaml

@@ -62,6 +62,9 @@ scalar_functions:
    description: Whether a value is null.
    impls:
      - args:
+          - name: nan_is_null


Would you mind elaborating on the motivation for adding this option? I'm wondering why an is_nan function wouldn't suffice.

Different systems have different definitions of whether NaN is equivalent to NULL or not.

This is something we encountered when working on the Acero compute functions; https://issues.apache.org/jira/browse/ARROW-12959 and https://issues.apache.org/jira/browse/ARROW-12960.

If we wanted is_null() to return TRUE for NaN values and TRUE for NULL, then I don't think mapping it to is_nan() would achieve this, as that would return FALSE for NULL. Or would it return NULL for NULL? Either way, I guess having this explicitly as an option removes that ambiguity.

Is there a reason producers can't handle this kind of transform?

Most systems treat nan and null as distinct. As far as I know, it's only R and Pandas that support treating these two values equivalently in some contexts but not all.

Ideally we can avoid propagating the mistake of treating these two values as optionally equivalent as part of the spec.

+1 for @cpcloud. I don't see why these shenanigans couldn't be modeled without having this option. Let's not embed arbitrary and questionable special cases into Substrait core.

cpcloud · 2022-08-12T20:39:58Z

extensions/functions_comparison.yaml

@@ -78,9 +84,15 @@ scalar_functions:
    description: Whether a value is not a number.
    impls:
      - args:
+          - name: value_for_null


Similarly, what's the motivation to allow NULL to take on different logical meaning?

Assuming there is some function that assigns a default value to null values, it's also not primitive. I'm too lazy to look up if it does, but if not it probably should; I'll define it as unwrap_or(T?, T) -> T for now (borrowing the name from Rust). In that case, this function is just is_nan(x), unwrap_or(is_nan(x), true), or unwrap_or(is_nan(x), false) for the three options respectively, with the added benefit that value_for_null can now be any expression.

extensions/functions_comparison.yaml

jvanstraten · 2022-08-15T16:44:57Z

extensions/functions_comparison.yaml

    description: Whether a value is not null.
    impls:
      - args:
+          - name: nan_is_null
+            options: [ TRUE, FALSE ]
+            required: false
          - value: any1
        return: BOOLEAN
        nullability: DECLARED_OUTPUT


This function is rather redundant. It's just not(is_null(x)). What's the reason for having both?

jvanstraten · 2022-08-15T16:50:14Z

extensions/functions_comparison.yaml

@@ -78,9 +84,15 @@ scalar_functions:
    description: Whether a value is not a number.
    impls:
      - args:
+          - name: value_for_null


Assuming there is some function that assigns a default value to null values, it's also not primitive. I'm too lazy to look up if it does, but if not it probably should; I'll define it as unwrap_or(T?, T) -> T for now (borrowing the name from Rust). In that case, this function is just is_nan(x), unwrap_or(is_nan(x), true), or unwrap_or(is_nan(x), false) for the three options respectively, with the added benefit that value_for_null can now be any expression.

jvanstraten · 2022-08-15T16:52:13Z

extensions/functions_comparison.yaml

@@ -62,6 +62,9 @@ scalar_functions:
    description: Whether a value is null.
    impls:
      - args:
+          - name: nan_is_null


+1 for @cpcloud. I don't see why these shenanigans couldn't be modeled without having this option. Let's not embed arbitrary and questionable special cases into Substrait core.

thisisnic · 2022-08-15T19:52:52Z

OK, cool, thanks for taking a look; will close this PR

thisisnic · 2022-08-16T11:13:57Z

@cpcloud @jvanstraten In a separate conversation, @saulpw suggested that perhaps we could either update the function descriptions to clarify this difference and/or add a note to the types page to clarify it there. Would be good to get your thoughts on this - I'm thinking that if we don't want to add it to the description via my latest changes to this PR (I'm leaning towards not merging this given a lot of DBs don't operate this way anyway), it would still be good to add a few words to the docs to be explicit about it.

jvanstraten

Can't argue with making sure things are abundantly clear :)

ianmcook · 2022-08-29T20:10:18Z

Thanks @thisisnic for doing this and thanks @cpcloud and @jvanstraten for the feedback. When @thisisnic and I initially discussed this idea of adding options for whether or not to treat NaN values as nulls, it seemed like a nice convenience that would simplify some producer-consumer integrations (such as the dplyr-Acero integration). But I understand the rationale for keeping this kind of thing out of Substrait for the sake of simplicity and consistency.

ianmcook · 2022-08-29T20:36:21Z

@jvanstraten could you confirm that you would expect the return values of these expressions to be as follows (assuming the input has floating point type)?

is_null(NaN) → false
is_not_null(NaN) → true
is_nan(null) → false
is_finite(null) → false (I'm uncertain about this one)
is_infinite(null) → false

This PR makes explicit the first two. Should we also make explicit the last three?

thisisnic · 2022-08-30T09:18:17Z

This PR makes explicit the first two. Should we also make explicit the last three?

I think so. Happy to add those in here once we've confirmed those are indeed our expected results.

jvanstraten · 2022-08-30T16:18:56Z

Actually I would expect is_nan, is_finite, and is_infinite to have MIRROR nullability behavior and return null when given null.

The way I conceptualize a nullable type is as a sum type (aka tagged union) of the value null and the non-nullable base type. NaNs and infinities, on the other hand, are part of float types (thanks IEEE); they are part of the number. So a nullable float isn't a union(null, +inf, -inf, NaN, finite), it's union(null, float), where float := union(+inf, -inf, NaN, finite). is_nan/is_finite/is_infinite have nothing to do with nullability and thus should IMO be defined for floats, so they are simply not defined for the value null. So they should fail in that case, which customarily means returning null. Returning false is the next best thing and I won't fight it if that's how these things are historically implemented, but it's not what I would expect. If I would want to write something like is_finite by your definition I would intuitively write something like coalesce(is_finite(x), false).

Are they already defined in Substrait or in a PR somewhere?

ETA: I would also expect is_not_null(x) to be exactly the same as not(is_null(x)).

ianmcook · 2022-08-31T01:07:12Z

Thanks @jvanstraten. I think your rationale for is_nan, is_finite, and is_infinite returning null when given null makes sense and is the right way to do it. My prior thinking about this was influenced by R, in which is.nan(NA_real_) returns FALSE. But I think your rationale makes more sense than R's behavior.

@thisisnic does the above make sense to you too? If so, could you please modify the entries for is_nan, is_finite, and is_infinite to explicitly specify nullability: MIRROR? It looks like there is an example of this in functions_arithmetic.yaml for reference, and this MIRROR nullability handling behavior is explained here. And if it seems important to also clarify in the descriptions of these functions that null input yields null output, then maybe do that too? Thank you!

thisisnic force-pushed the nan_and_null_options branch from 5e0801b to 57d75b5 Compare August 10, 2022 15:22

cpcloud reviewed Aug 12, 2022

View reviewed changes

jvanstraten requested changes Aug 15, 2022

View reviewed changes

thisisnic closed this Aug 15, 2022

thisisnic reopened this Aug 16, 2022

thisisnic force-pushed the nan_and_null_options branch from 57d75b5 to e22dd80 Compare August 16, 2022 11:09

jvanstraten previously approved these changes Aug 16, 2022

View reviewed changes

thisisnic dismissed jvanstraten’s stale review via a4d37ec September 1, 2022 13:59

thisisnic added 2 commits September 1, 2022 15:05

feat: add options for null and nan checking behaviour

fad8733

feat: update nullability options

928bf43

thisisnic force-pushed the nan_and_null_options branch from a4d37ec to 928bf43 Compare September 1, 2022 14:10

jvanstraten changed the title ~~feat: add options for null and nan checking behaviour~~ feat: improve behavior documentation for is_null, is_not_null, is_nan, is_finite, is_ Sep 2, 2022

jvanstraten changed the title ~~feat: improve behavior documentation for is_null, is_not_null, is_nan, is_finite, is_~~ feat: improve behavior documentation for is_null, is_not_null, is_nan, is_finite, is_infinite Sep 2, 2022

jvanstraten changed the title ~~feat: improve behavior documentation for is_null, is_not_null, is_nan, is_finite, is_infinite~~ feat: clarify behavior of is_null, is_not_null, is_nan, is_finite, and is_infinite for nulls Sep 2, 2022

jvanstraten approved these changes Sep 2, 2022

View reviewed changes

jvanstraten merged commit cb25124 into substrait-io:main Sep 2, 2022

ianmcook mentioned this pull request Sep 2, 2022

Establish intended scope of functions in extensions/*.yaml #307

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: clarify behavior of is_null, is_not_null, is_nan, is_finite, and is_infinite for nulls #285

feat: clarify behavior of is_null, is_not_null, is_nan, is_finite, and is_infinite for nulls #285

thisisnic commented Aug 10, 2022

cpcloud Aug 12, 2022

thisisnic Aug 15, 2022 •

edited

Loading

cpcloud Aug 15, 2022

jvanstraten Aug 15, 2022

cpcloud Aug 12, 2022

jvanstraten Aug 15, 2022

jvanstraten Aug 15, 2022

jvanstraten Aug 15, 2022

jvanstraten Aug 15, 2022

thisisnic commented Aug 15, 2022

thisisnic commented Aug 16, 2022 •

edited

Loading

jvanstraten left a comment

ianmcook commented Aug 29, 2022

ianmcook commented Aug 29, 2022 •

edited

Loading

thisisnic commented Aug 30, 2022

jvanstraten commented Aug 30, 2022 •

edited

Loading

ianmcook commented Aug 31, 2022 •

edited

Loading

feat: clarify behavior of is_null, is_not_null, is_nan, is_finite, and is_infinite for nulls #285

feat: clarify behavior of is_null, is_not_null, is_nan, is_finite, and is_infinite for nulls #285

Conversation

thisisnic commented Aug 10, 2022

Choose a reason for hiding this comment

thisisnic Aug 15, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

thisisnic commented Aug 15, 2022

thisisnic commented Aug 16, 2022 • edited Loading

jvanstraten left a comment

Choose a reason for hiding this comment

ianmcook commented Aug 29, 2022

ianmcook commented Aug 29, 2022 • edited Loading

thisisnic commented Aug 30, 2022

jvanstraten commented Aug 30, 2022 • edited Loading

ianmcook commented Aug 31, 2022 • edited Loading

thisisnic Aug 15, 2022 •

edited

Loading

thisisnic commented Aug 16, 2022 •

edited

Loading

ianmcook commented Aug 29, 2022 •

edited

Loading

jvanstraten commented Aug 30, 2022 •

edited

Loading

ianmcook commented Aug 31, 2022 •

edited

Loading