
Supporting the InferenceData format? #85

Open
paul-buerkner opened this issue Aug 12, 2020 · 15 comments
Labels: feature (New feature or request), question (Further information is requested)

Comments

@paul-buerkner (Collaborator)

The arviz team presented their InferenceData format at StanCon this year, and I wanted to open this issue as a reminder to discuss whether posterior should support this format as well. The format design can be found at https://arviz-devs.github.io/arviz/notebooks/XarrayforArviZ.html

@jgabry (Member) commented Aug 12, 2020

I'm not familiar with the format (will check out their talk). Is your question whether we should implement an R version of the InferenceData format that arviz offers in Python?

@jgabry (Member) commented Aug 12, 2020

> Is your question whether we should implement an R version of the InferenceData format that arviz offers in Python?

I guess that's obviously the question ;)

I just took a look at the link and if I understand the format correctly, they're storing observed data, prior draws, posterior draws (and summary stats?), and posterior predictive draws, all in a single object.

To start the discussion of whether to do this, here are a few different topics to discuss:

(A) Do we like this format?

(B) If so, how should it be implemented?

For (A) my initial reaction was that it's a bad idea to store all of that in a single object, but I admit that it does seem useful as a way of keeping the various components of an analysis/workflow together. So I'm definitely open to this. (Update: I'm less concerned about storing all of this in one object now. It can easily be unpacked into multiple objects if the user wants)

For (B), to do this in R would we just use a list of objects of different types (i.e., S3)? Or use R6/reference classes? The former is obviously much simpler, but the latter is more in line with the Python implementation and safer (we could provide methods for modifying the object, instead of S3, where the object can be messed with in arbitrary ways).

@canyon289 commented Aug 15, 2020

Hey Jonah, nice to meet you at StanCon, and @paul-buerkner thank you for opening this, particularly in a public forum.

First off, I want to say you all know a lot more about R than I do, so don't weight my priors strongly. From the chat Jonah and I had, though, I think something being conflated here is an InferenceData object in R memory versus InferenceData IO from netCDF on disk. I think the latter is the most important for cross-collaboration.

To state the point above colloquially: in R memory, do whatever you want, whether it's one object or 5 or 10. But it would be cool, and most of the value would be there, if posterior could read InferenceData from netCDF, even if it's just one group at a time. That way, for example, if I did my analysis in PyMC3, NumPyro, or PyStan, and the posterior R package had some cool plot, I as a user could still load my generated InferenceData object in R and get the pretty picture.

@mjskay (Collaborator) commented Aug 15, 2020

> Colloquially stating the point above in R memory do whatever you want, whether it's one object, or 5 or 10, but would be cool and have most of the value is if Posterior could read InferenceData from netcdf, even if its just one group at a time. That way for example if I did my analysis in PyMC3, Numpyro, or pystan, if the posterior r package had some cool plot I as a user could still load my generated InferenceData object in R and get the pretty picture.

A simple version of this approach would be to read it into a named list of "draws" objects. We could possibly implement some of the generics from #39 on top of the format if they fit with it too.
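A minimal, untested sketch of what that named-list-of-draws reader could look like, assuming the ncdf4 and posterior packages; the group and variable names are hypothetical, and real InferenceData files store draws as (chain, draw, ...) so dimensions would need permuting:

```r
## Hypothetical sketch (untested): read an InferenceData netCDF file into a
## named list of posterior draws objects, one element per InferenceData group.
## ncdf4 exposes variables inside netCDF-4 groups under slash-qualified names
## such as "posterior/mu".
library(ncdf4)
library(posterior)

read_inference_data <- function(path) {
  nc <- nc_open(path)
  on.exit(nc_close(nc))
  var_names <- names(nc$var)   # e.g. "posterior/mu", "posterior_predictive/y_rep"
  group_names <- unique(dirname(var_names))
  groups <- lapply(group_names, function(g) {
    vars <- var_names[dirname(var_names) == g]
    arrays <- lapply(vars, function(v) ncvar_get(nc, v))
    names(arrays) <- basename(vars)
    ## InferenceData stores draws as (chain, draw, ...); rvar() expects the
    ## draw dimension first, so a real implementation would permute dims here.
    as_draws_rvars(lapply(arrays, rvar))
  })
  names(groups) <- group_names
  groups
}
```

This keeps each group independently loadable, which matches the "one group at a time" use case above.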

@jgabry (Member) commented Aug 15, 2020

Hey Ravin, nice meeting you too! And thanks for clarifying the distinction between the object in R memory and the IO. I agree the latter is most important in terms of cross-collaboration. There are some R packages that facilitate working with netCDF data from R that we can look into using (or learning from to do our own implementation) so that we can read in InferenceData.

> A simple version of this approach would be to read it into a named list of "draws" objects. We could possibly implement some of the generics from #39 on top of the format if they fit with it too.

Yeah good idea. This does seem like the simplest approach.

@OriolAbril commented Jan 21, 2021

Hi, I wanted to let everyone know that we are considering this as a GSoC project, and to see whether you would be interested and whether you think it is feasible within the time constraints of the project. We still have to extend and clarify the idea if we decide to go forward.

There are no guarantees that anyone will choose this idea, but if we include it we have to be able to mentor any student who chooses it. I don't know how familiar you are with GSoC; I hope everything makes sense, but in any case don't hesitate to ask anything.

@paul-buerkner (Collaborator, Author)

I cannot speak for the other developers, but I am currently not sure in which form and to what extent posterior should eventually support the InferenceData format. So we would have to decide on that first, I guess.

@mjskay (Collaborator) commented Jan 25, 2021

Looking at this: https://arviz-devs.github.io/arviz/schema/schema.html

The format that comes to mind for me is a list where each element is a draws object representing one "group", in the terminology of InferenceData. A question beyond that: do we want to create a whole class to represent this, or is it enough to stick with a simple list of draws objects?

@paul-buerkner (Collaborator, Author)

Yes, a list of draws seems appropriate. If we decide to support the format, making this a whole class would probably be preferable to enable a nice interface, even if we do not have many methods working specifically on it at the start. For example, one could write a simple subset_draws method that can select groups, variables within groups, or subset iterations across all groups.
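As a rough illustration (hypothetical code, assuming the InferenceData-like object is simply a named list of posterior draws objects), such a group-aware subsetter could be a thin wrapper around posterior's existing subset_draws():

```r
## Hypothetical sketch of a group-aware subset_draws() for an object
## represented as a named list of posterior draws objects.
library(posterior)

subset_draws_groups <- function(x, groups = NULL, ...) {
  if (!is.null(groups)) {
    x <- x[groups]  # select whole groups by name
  }
  # forward remaining arguments (variable, iteration, chain, ...) to
  # posterior::subset_draws() within each remaining group
  lapply(x, subset_draws, ...)
}

## usage (hypothetical object and group names):
## idata <- list(posterior = post_draws, posterior_predictive = pp_draws)
## subset_draws_groups(idata, groups = "posterior", iteration = 1:100)
```

A real method would also need to handle arguments (such as variable names) that apply to some groups but not others.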

@jburos commented Feb 26, 2021

It seems relevant to this conversation that in our workflow we are using InferenceData to store output from cmdstanpy as netCDF files, and we're finding it pretty easy to use them in R.

A piece of this that we have converged on is splitting the InferenceData object into pieces and writing out the result so that each variable in each group is a separate *.nc file.

Then we use the tidync package to read in arbitrary data for draws and work with them.

An example of this process looks like the following:

library(tidync)
library(tidyverse)

run_dir <- file.path('~/.runs/2021-02-26T16:28:37.885639')

## load prior predictive draws for a particular slice of the data into memory
prior_draws <- tidync(file.path(run_dir, 'draws/prior_predictive/predicted_biomarker.nc')) %>%
    hyper_filter(subject = subject == 'f1919f28') %>%
    hyper_tibble()

This yields a data frame that looks very similar to the output from tidy_draws:

> glimpse(prior_draws, width = 1)
Rows: 196,650
Columns: 5
$ predicted_biomarker <dbl>
$ biomarker_time      <dbl>
$ subject             <chr>
$ draw                <dbl>
$ chain               <dbl>

So, it may be easier to support reading in these objects than it seems at first glance.

@jburos commented Mar 1, 2021

Another option along this line is to expand support for the experimental tbl_cube data structure in R. This is very similar to the xarray.Dataset structure in Python, and it has the potential to improve performance, particularly when working with large numbers of draws for hierarchical parameters.

For example:

prior_cube <- tidync(file.path(run_dir, 'draws/prior_predictive/predicted_biomarker.nc')) %>%
    hyper_filter(subject = subject == 'f1919f28') %>%
    hyper_tbl_cube()

This yields a 3-d array of draws with defined axes:

> glimpse(prior_cube, width = 0)
List of 2
 $ mets:List of 1
  ..$ predicted_biomarker: num [1:50, 1:1000, 1:4] xx, xx, xx, xx, ...
  ..- attr(*, "transforms")=List of 4
  .. ..$ biomarker_time: tibble [50 × 6] (S3: tbl_df/tbl/data.frame)
  .. ..$ subject       : tibble [1,198 × 6] (S3: tbl_df/tbl/data.frame)
  .. ..$ draw          : tibble [1,000 × 6] (S3: tbl_df/tbl/data.frame)
  .. ..$ chain         : tibble [4 × 6] (S3: tbl_df/tbl/data.frame)
  ..- attr(*, "source")= tibble [1 × 2] (S3: tbl_df/tbl/data.frame)
  ..- attr(*, "class")= chr "tidync_data"
 $ dims:List of 4
  ..$ biomarker_time: num [1:50] 0 10 20 30 40 50 60 70 80 90 ...
  ..$ subject       : chr "f1919f28"
  ..$ draw          : num [1:1000] 0 1 2 3 4 5 6 7 8 9 ...
  ..$ chain         : num [1:4] 0 1 2 3
 - attr(*, "class")= chr "tbl_cube"

The cube has a lower memory footprint:

> library(pryr)
> object_size(prior_cube)
1.8 MB
> object_size(prior_draws)
7.87 MB

However, there is not a ready set of methods for working with these objects that is specific to inference. For example, it might be worth extending tidybayes-like summary functions to this storage format. Most of the existing utilities in R are application-specific, like the stars package.

@mike-lawrence commented Jun 28, 2021

> A piece of this that we have converged on is splitting the InferenceData object into pieces, and writing out the result so that each Variable in each group is a separate *.nc file.

Is this purely because tidync doesn't support groups? Oh, maybe not, as you're also splitting each variable out into a separate .nc file. What was the motivation for that choice?

@jburos commented Jun 28, 2021

@mike-lawrence This is a good question.

There are two reasons for this:

  1. We're storing the .nc files in the cloud, and the netCDF format doesn't support read access over a network well. Right now (I think) we have to download a file locally before reading it, and splitting the files into pieces means we don't have to download everything to read one thing. Also, it's pretty easy to open multiple .nc files and concatenate them, so this allows us to construct the posterior we would like to have.
  2. In our model-fitting process, we often generate posterior predicted values in Python using xarray after the initial model run has finished. Writing out separate .nc files for each quantity means we can add new quantities without modifying the existing ones.
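Stitching several per-variable files back together can be sketched roughly as follows (untested; the file paths are hypothetical, and it assumes the variables share draw and chain index columns to join on):

```r
## Untested sketch: re-assemble draws spread across per-variable *.nc files
## using tidync. Paths below are hypothetical; variables that share
## dimensions can be joined on the common index columns (here draw, chain).
library(tidync)
library(purrr)
library(dplyr)

files <- c(
  'draws/posterior/alpha.nc',
  'draws/posterior/beta.nc'
)

posterior_draws <- files %>%
  map(~ hyper_tibble(tidync(.x))) %>%
  reduce(full_join, by = c('draw', 'chain'))
```

Variables with extra dimensions (e.g. per-subject parameters) would need those dimensions added to the join keys, or kept in separate tibbles.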

Some remaining pain points we have not quite resolved, in case this is helpful:

  1. One issue we run into with tidync is that we cannot read from noncontiguous indices at the same time. So if I have subjects A, B, C, and D in that order, I can read ABC or BCD but not AD -- at least not in a single read command.
  2. The inability to read slices from the .nc file without downloading the full thing is the biggest issue. We are considering trying the zarr format.
  3. Also, we would definitely make use of the tbl_cube structure if it were more widely supported. This would reduce the memory footprint considerably.

@mike-lawrence commented Jun 28, 2021

> 1. We're storing the nc files in the cloud, and the netcdf format doesn't support read-access over network well.

Keep an eye out for zarr-backed netcdf4 files, which should bring better remote-access support.

> 2. In our model-fitting process, we are often generating posterior predicted values in python using xarray that run after the initial model run has finished. Writing out separate nc files for each quantity means we can add new quantities without modifying the existing ones.

Oh! NetCDF4 files aren't append-friendly? Shoot, that nixes something I had planned for generate quantities mode...

Well, in the meantime, I have a netcdf4-to-draws_rvars converter working now here.

@mjskay (Collaborator) commented Jun 28, 2021

@jburos I checked out the tbl_cube thing and it seems pretty useful! Another approach in that spirit could be to use rvar columns in a tibble. From a quick test, this gets down to almost the same size as tbl_cube, with the possible advantages of allowing multiple rvar columns (I couldn't figure out how to do multiple "measure" columns in tbl_cube?) and more readable output (the latter is also going to be a great advantage over the long-tibble-of-draws format, I think):

library(dplyr)
library(modelr)
library(tidybayes)
library(posterior)
library(rstanarm)
library(cubelyr)

m_mpg = stan_glm(mpg ~ hp * cyl, data = mtcars)

pred_df = mtcars %>%
  data_grid(cyl, hp = seq_range(hp, n = 51)) %>%
  add_predicted_draws(m_mpg)

pred_df
#> # A tibble: 612,000 x 7
#> # Groups:   cyl, hp, .row [153]
#>      cyl    hp  .row .chain .iteration .draw .prediction
#>    <dbl> <dbl> <int>  <int>      <int> <int>       <dbl>
#>  1     4    52     1     NA         NA     1        18.4
#>  2     4    52     1     NA         NA     2        23.7
#>  3     4    52     1     NA         NA     3        26.3
#>  4     4    52     1     NA         NA     4        29.6
#>  5     4    52     1     NA         NA     5        25.8
#>  6     4    52     1     NA         NA     6        33.4
#>  7     4    52     1     NA         NA     7        30.7
#>  8     4    52     1     NA         NA     8        30.2
#>  9     4    52     1     NA         NA     9        30.3
#> 10     4    52     1     NA         NA    10        23.3
#> # ... with 611,990 more rows

rvar_df = mtcars %>%
  data_grid(cyl, hp = seq_range(hp, n = 51)) %>%
  # this is a bit ugly atm but will eventually get a shortcut similar to
  # add_predicted_draws in tidybayes
  ungroup() %>%
  mutate(.prediction = rvar(posterior_predict(m_mpg, newdata = .)))

rvar_df
#> # A tibble: 153 x 3
#>      cyl    hp .prediction
#>    <dbl> <dbl>      <rvar>
#>  1     4  52      29 ± 3.3
#>  2     4  57.7    28 ± 3.3
#>  3     4  63.3    28 ± 3.2
#>  4     4  69.0    28 ± 3.2
#>  5     4  74.6    27 ± 3.2
#>  6     4  80.3    27 ± 3.1
#>  7     4  86.0    26 ± 3.2
#>  8     4  91.6    26 ± 3.2
#>  9     4  97.3    25 ± 3.2
#> 10     4 103.     25 ± 3.2
#> # ... with 143 more rows

cube_df = pred_df %>%
  ungroup() %>%
  select(-.row, -.chain, -.iteration) %>%
  as.tbl_cube(dim_names = c("cyl", "hp", ".draw"))

cube_df
#> Source: local array [612,000 x 3]
#> D: cyl [dbl, 3]
#> D: hp [dbl, 51]
#> D: .draw [int, 4000]
#> M: .prediction [dbl[,51,4000]]

pryr::object_size(pred_df)
#> 26.9 MB
pryr::object_size(rvar_df)
#> 5.16 MB
pryr::object_size(cube_df)
#> 4.91 MB

Created on 2021-06-28 by the reprex package (v2.0.0)

When posterior hits CRAN (which is imminent!), I will also be adding some convenience functions to tidybayes for manipulating rvar columns (like analogs to `spread_draws()`, `add_predicted_draws()`, etc.).
