Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pairwise_comparison() outputs both NA and NaN #140

Closed
Bisaloo opened this issue Nov 10, 2021 · 5 comments
Closed

pairwise_comparison() outputs both NA and NaN #140

Bisaloo opened this issue Nov 10, 2021 · 5 comments

Comments

@Bisaloo
Copy link
Member

Bisaloo commented Nov 10, 2021

Reprex

d <- tempfile()

download.file(
  "https://gist.githubusercontent.com/Bisaloo/62615a3821da71b8f542f5d5bc096327/raw/1415d04901e8168ceed2ca1f554b8799cf3ba6ac/reprex_NA_NaN_scoringutils.txt",
  d
)

df <- dget(d)

library(tidyverse)
library(scoringutils)
#> Note: The definition of the weighted interval score has slightly changed in version 0.1.5. If you want to use the old definition, use the argument `count_median_twice = TRUE` in the function `eval_forecasts()`
#> 
#> Attaching package: 'scoringutils'
#> The following object is masked from 'package:purrr':
#> 
#>     update_list

suppressWarnings({
  df %>%
    filter(n_quantiles == 23) %>%
    select(model, target_variable, horizon, location, location_name,
           forecast_date, interval_score = wis) %>%
    pairwise_comparison(
      metric = "interval_score",
      baseline = "EuroCOVIDhub-baseline",
      by = c("model", "target_variable", "forecast_date", "horizon",
             "location", "location_name"),
      summarise_by = c("model", "target_variable", "horizon", "location")
    ) %>%
    select(model, target_variable, horizon, location, rel_wis = scaled_rel_skill) %>%
    distinct()
})
#>                    model target_variable horizon location rel_wis
#> 1: EuroCOVIDhub-baseline       inc death       3       IS     NaN
#> 2: EuroCOVIDhub-ensemble       inc death       3       IS      NA
#> 3: EuroCOVIDhub-ensemble       inc death       3       IS     NaN
#> 4:               ILM-EKF       inc death       3       IS     NaN
#> 5:            MUNI-ARIMA       inc death       3       IS     NaN
#> 6:    RobertWalraven-ESG       inc death       3       IS     NaN
#> 7:       UMass-MechBayes       inc death       3       IS     NaN
#> 8:         USC-SIkJalpha       inc death       3       IS     NaN
#> 9:  epiforecasts-EpiNow2       inc death       3       IS       0

Created on 2021-11-10 by the reprex package (v2.0.1.9000)

As you can see here, we have two almost identical lines for EuroCOVIDhub-ensemble excepted that one is with NA and the other with NaN. The fact that a model can have more than one row cause issues in downstream analyses in our cases.

Description of the problem

It looks like pairwise_comparison() sometimes returns NA and sometimes NaN when it cannot compute the value. This leads to confusion because both NA and NaN indicate almost the same thing but they have strange incompatibilites:

identical(NA, NaN)
#> [1] FALSE

NA + NaN
#> [1] NA

NaN + NA
#> [1] NaN

Created on 2021-11-10 by the reprex package (v2.0.1.9000)

Proposed solution

NaN is rarely used and confusing (IMO). They often appear because the function doesn't control the output for errors/impossible computations. And they can cause serious issues in downstream analyses (such as in our case). Tomas Kalibera gives a good overview of the hell that is NA vs NaN:

The result of binary operations involving NA and NaN is hardware dependent (the propagation of NaN payload) - on some hardware, it actually works the way we would like - NA is returned - but on some hardware you get NaN or sometimes NA and sometimes NaN. Also there are C compiler optimizations re-ordering code, as mentioned in ?NaN. Then there are also external numerical libraries that do not distinguish NA from NaN (NA is an R concept)

So we should stick to only one of these. As far as I know, NA always propagates as NA while f(NaN) can return NA or NaN depending on f() so a conscious choice to always output NA would be much better / clearer IMO.

@Bisaloo
Copy link
Member Author

Bisaloo commented Nov 10, 2021

Ah, I was so slow to post this issue that there is already a fix: #139.

@Bisaloo Bisaloo closed this as completed Nov 10, 2021
@sbfnk
Copy link
Contributor

sbfnk commented Nov 11, 2021

Good to have the problem laid out in detail, though - I chose to make everything NaN but it sounds like we're better of with NA.

For what it's worth the problem in our case was:

> dt <- data.table(id = c(rep(1, 3), rep(2, 3), rep(3, 3)), a = c(as.vector(replicate(3, c(NA_real_, rep(rnorm(1), 2))))))
> dt[, a := unique(na.omit(a)), by = id]
> unique(dt)
   id          a
1:  1 0.05785241
2:  2 0.05029081
3:  3 0.57708069
> dt <- data.table(id = c(rep(1, 3), rep(2, 3), rep(3, 3)), a = c(as.vector(replicate(2, c(NA_real_, rep(rnorm(1), 2)))), c(NA_real_, rep(NaN, 2))))
> dt[, a := unique(na.omit(a)), by = id]
> unique(dt)
   id           a
1:  1  0.09825976
2:  2 -1.06038560
3:  3          NA
4:  3         NaN

When saving to csv the second data table leads to a duplicated row.

@sbfnk sbfnk reopened this Nov 11, 2021
@sbfnk sbfnk closed this as completed in 24b1efc Nov 11, 2021
@nikosbosse nikosbosse reopened this Nov 22, 2021
@nikosbosse
Copy link
Contributor

Hm wouldn't it be best to assign NA_real to theta in #139?

@Bisaloo
Copy link
Member Author

Bisaloo commented Nov 22, 2021

It has been changed after (but directly to master): 24b1efc

@nikosbosse
Copy link
Contributor

ah. perfect!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants