pairwise_comparison() outputs both NA and NaN #140

Bisaloo · 2021-11-10T20:32:32Z

Reprex

d <- tempfile()

download.file(
  "https://gist.githubusercontent.com/Bisaloo/62615a3821da71b8f542f5d5bc096327/raw/1415d04901e8168ceed2ca1f554b8799cf3ba6ac/reprex_NA_NaN_scoringutils.txt",
  d
)

df <- dget(d)

library(tidyverse)
library(scoringutils)
#> Note: The definition of the weighted interval score has slightly changed in version 0.1.5. If you want to use the old definition, use the argument `count_median_twice = TRUE` in the function `eval_forecasts()`
#> 
#> Attaching package: 'scoringutils'
#> The following object is masked from 'package:purrr':
#> 
#>     update_list

suppressWarnings({
  df %>%
    filter(n_quantiles == 23) %>%
    select(model, target_variable, horizon, location, location_name,
           forecast_date, interval_score = wis) %>%
    pairwise_comparison(
      metric = "interval_score",
      baseline = "EuroCOVIDhub-baseline",
      by = c("model", "target_variable", "forecast_date", "horizon",
             "location", "location_name"),
      summarise_by = c("model", "target_variable", "horizon", "location")
    ) %>%
    select(model, target_variable, horizon, location, rel_wis = scaled_rel_skill) %>%
    distinct()
})
#>                    model target_variable horizon location rel_wis
#> 1: EuroCOVIDhub-baseline       inc death       3       IS     NaN
#> 2: EuroCOVIDhub-ensemble       inc death       3       IS      NA
#> 3: EuroCOVIDhub-ensemble       inc death       3       IS     NaN
#> 4:               ILM-EKF       inc death       3       IS     NaN
#> 5:            MUNI-ARIMA       inc death       3       IS     NaN
#> 6:    RobertWalraven-ESG       inc death       3       IS     NaN
#> 7:       UMass-MechBayes       inc death       3       IS     NaN
#> 8:         USC-SIkJalpha       inc death       3       IS     NaN
#> 9:  epiforecasts-EpiNow2       inc death       3       IS       0

^{Created on 2021-11-10 by the reprex package (v2.0.1.9000)}

As you can see here, we have two almost identical lines for EuroCOVIDhub-ensemble excepted that one is with NA and the other with NaN. The fact that a model can have more than one row cause issues in downstream analyses in our cases.

Description of the problem

It looks like pairwise_comparison() sometimes returns NA and sometimes NaN when it cannot compute the value. This leads to confusion because both NA and NaN indicate almost the same thing but they have strange incompatibilites:

identical(NA, NaN)
#> [1] FALSE

NA + NaN
#> [1] NA

NaN + NA
#> [1] NaN

^{Created on 2021-11-10 by the reprex package (v2.0.1.9000)}

Proposed solution

NaN is rarely used and confusing (IMO). They often appear because the function doesn't control the output for errors/impossible computations. And they can cause serious issues in downstream analyses (such as in our case). Tomas Kalibera gives a good overview of the hell that is NA vs NaN:

The result of binary operations involving NA and NaN is hardware dependent (the propagation of NaN payload) - on some hardware, it actually works the way we would like - NA is returned - but on some hardware you get NaN or sometimes NA and sometimes NaN. Also there are C compiler optimizations re-ordering code, as mentioned in ?NaN. Then there are also external numerical libraries that do not distinguish NA from NaN (NA is an R concept)

So we should stick to only one of these. As far as I know, NA always propagates as NA while f(NaN) can return NA or NaN depending on f() so a conscious choice to always output NA would be much better / clearer IMO.

The text was updated successfully, but these errors were encountered:

Bisaloo · 2021-11-10T20:34:21Z

Ah, I was so slow to post this issue that there is already a fix: #139.

sbfnk · 2021-11-11T07:00:24Z

Good to have the problem laid out in detail, though - I chose to make everything NaN but it sounds like we're better of with NA.

For what it's worth the problem in our case was:

> dt <- data.table(id = c(rep(1, 3), rep(2, 3), rep(3, 3)), a = c(as.vector(replicate(3, c(NA_real_, rep(rnorm(1), 2))))))
> dt[, a := unique(na.omit(a)), by = id]
> unique(dt)
   id          a
1:  1 0.05785241
2:  2 0.05029081
3:  3 0.57708069
> dt <- data.table(id = c(rep(1, 3), rep(2, 3), rep(3, 3)), a = c(as.vector(replicate(2, c(NA_real_, rep(rnorm(1), 2)))), c(NA_real_, rep(NaN, 2))))
> dt[, a := unique(na.omit(a)), by = id]
> unique(dt)
   id           a
1:  1  0.09825976
2:  2 -1.06038560
3:  3          NA
4:  3         NaN

When saving to csv the second data table leads to a duplicated row.

nikosbosse · 2021-11-22T16:26:18Z

Hm wouldn't it be best to assign NA_real to theta in #139?

Bisaloo · 2021-11-22T16:33:48Z

It has been changed after (but directly to master): 24b1efc

nikosbosse · 2021-11-22T16:34:58Z

ah. perfect!

Bisaloo closed this as completed Nov 10, 2021

sbfnk reopened this Nov 11, 2021

sbfnk closed this as completed in 24b1efc Nov 11, 2021

nikosbosse reopened this Nov 22, 2021

nikosbosse closed this as completed Nov 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pairwise_comparison() outputs both NA and NaN #140

pairwise_comparison() outputs both NA and NaN #140

Bisaloo commented Nov 10, 2021 •

edited

Loading

Bisaloo commented Nov 10, 2021 •

edited

Loading

sbfnk commented Nov 11, 2021

nikosbosse commented Nov 22, 2021

Bisaloo commented Nov 22, 2021 •

edited

Loading

nikosbosse commented Nov 22, 2021

pairwise_comparison() outputs both NA and NaN #140

pairwise_comparison() outputs both NA and NaN #140

Comments

Bisaloo commented Nov 10, 2021 • edited Loading

Reprex

Description of the problem

Proposed solution

Bisaloo commented Nov 10, 2021 • edited Loading

sbfnk commented Nov 11, 2021

nikosbosse commented Nov 22, 2021

Bisaloo commented Nov 22, 2021 • edited Loading

nikosbosse commented Nov 22, 2021

Bisaloo commented Nov 10, 2021 •

edited

Loading

Bisaloo commented Nov 10, 2021 •

edited

Loading

Bisaloo commented Nov 22, 2021 •

edited

Loading