Supporting the InferenceData format? #85
Comments
I'm not familiar with the format (I will check out their talk). Is your question whether we should implement an R version of the InferenceData format that arviz offers in Python?
I guess that's obviously the question ;) I just took a look at the link and, if I understand the format correctly, they're storing observed data, prior draws, posterior draws (and summary stats?), and posterior predictive draws, all in a single object. To start the discussion, here are a few topics: (A) Do we like this format? (B) If so, how should it be implemented?

For (A), my initial reaction was that it's a bad idea to store all of that in a single object, but I admit that it does seem useful as a way of keeping the various components of an analysis/workflow together. So I'm definitely open to this. (Update: I'm less concerned about storing all of this in one object now. It can easily be unpacked into multiple objects if the user wants.)

For (B), to do this in R, would we just use a list of objects of different types (i.e. S3)? Or use R6/reference classes? The former is obviously much simpler, but the latter is more in line with the Python implementation and safer: we could provide methods for modifying the object, whereas with S3 the object can be altered in arbitrary ways.
Hey Jonah, nice to meet you at StanCon, and @paul-buerkner thank you for opening this, particularly in a public format. First off, I want to say that you both know a lot more about R than I do, so don't weight my priors strongly. From the chat Jonah and I had, though, I think something being conflated here is an InferenceData object in R memory versus InferenceData IO from netCDF on disk. I think the latter is the most important for cross collaboration. To state the point colloquially: in R memory, do whatever you want, whether it's one object, or 5, or 10. But it would be cool, and most of the value would come from, posterior being able to read InferenceData from netCDF, even if it's just one group at a time. That way, for example, if I did my analysis in PyMC3, NumPyro, or PyStan, and the posterior R package had some cool plot, I as a user could still load my generated InferenceData object in R and get the pretty picture.
A simple version of this approach would be to read it into a named list of "draws" objects. We could possibly also implement some of the generics from #39 on top of the format, if they fit with it.
Hey Ravin, nice meeting you too! And thanks for clarifying the distinction between the object in R memory and the IO. I agree the latter is most important in terms of cross collaboration. There are some R packages that facilitate working with netCDF data from R that we can look into using (or learning from to do our own implementation) so that we can read in InferenceData.

Yeah, good idea. This does seem like the simplest approach.
Hi, I wanted to let everyone know that we are considering this as a GSoC project, and to see whether you would be interested and whether you think it is feasible within the time constraints of the project. We still have to extend and clarify the idea if we decide to go forward. There is no guarantee that any student will choose this idea, but if we include it we have to be able to mentor any student who does. I don't know how familiar you are with GSoC; I hope everything makes sense, but in any case don't hesitate to ask anything.
I cannot speak for the other developers, but I am currently not sure in which form, and to what extent, posterior should eventually support the InferenceData format. So we would have to decide on that first, I guess.
Looking at this: https://arviz-devs.github.io/arviz/schema/schema.html — the format that comes to mind for me is a list where each element is a draws object representing one "group" in the terminology of InferenceData. A question beyond that might be: do we want to create a whole class to represent this, or is it enough to stick with a simple list of draws objects?
Yes, a list of draws objects seems appropriate. If we decide to support the format, making this a whole class would probably be preferable to enable a nice interface, even if we do not have many methods working specifically on it at the start. For example, one could write a simple subset_draws method that can select groups, variables within groups, or subset iterations across all groups.
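To make the list-of-draws idea concrete, here is a minimal sketch; the constructor name `as_inference_data`, the class name, and the group names are hypothetical, not part of posterior's API:

```r
library(posterior)

# A hypothetical InferenceData-like container: a named list in which each
# element is a draws object corresponding to one InferenceData "group".
as_inference_data <- function(...) {
  groups <- list(...)
  stopifnot(all(vapply(groups, is_draws, logical(1))))
  structure(groups, class = "inference_data")
}

x <- example_draws()  # built-in example draws shipped with posterior
idata <- as_inference_data(posterior = x, prior = x)

# Subsetting a variable across all groups is then just lapply + subset_draws:
lapply(idata, subset_draws, variable = "mu")
```

A class-specific `subset_draws.inference_data` method could wrap this `lapply` pattern to provide the cross-group subsetting described above.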
Seems relevant to this conversation: in our workflow we use InferenceData to store output from cmdstanpy as netCDF files, and we find it's pretty easy to use them in R. A piece of this that we have converged on is splitting the InferenceData object into pieces and writing out the result so that each variable in each group is a separate *.nc file. Then we use the tidync package to read in arbitrary data for draws and work with them. An example of this process looks like the following:

```r
library(tidync)
library(tidyverse)

run_dir <- file.path('~/.runs/2021-02-26T16:28:37.885639')

## load prior predictive draws for a particular slice of the data into memory
prior_draws <- tidync(file.path(run_dir, 'draws/prior_predictive/predicted_biomarker.nc')) %>%
  hyper_filter(subject = subject == 'f1919f28') %>%
  hyper_tibble()
```

This yields a data frame that looks very similar to the output from tidy_draws:

```r
> glimpse(prior_draws, width = 1)
Rows: 196,650
Columns: 5
$ predicted_biomarker <dbl> …
$ biomarker_time      <dbl> …
$ subject             <chr> …
$ draw                <dbl> …
$ chain               <dbl> …
```

So, it may be easier to support reading in these objects than it seems at first glance.
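For the same per-variable file layout, a lower-level route is to read the array directly and hand it to posterior. This is an untested sketch: the path is hypothetical, it assumes the ncdf4 package, and the dimension order returned by `ncvar_get()` depends on how the file was written, so check it before relying on it:

```r
library(ncdf4)
library(posterior)

# Hypothetical per-variable file, following the layout described above
nc  <- nc_open(file.path(run_dir, 'draws/posterior/theta.nc'))
arr <- ncvar_get(nc, "theta")  # assumed here to come back as draw x chain
nc_close(nc)

# as_draws_array() expects iteration x chain x variable, so append a
# length-1 variable dimension and name it
draws <- as_draws_array(array(arr, dim = c(dim(arr), 1),
                              dimnames = list(NULL, NULL, "theta")))
```

Doing this once per file would give exactly the named-list-of-draws representation discussed earlier in the thread.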
Another option along this line is to expand support for the experimental tbl_cube data structure in R, which is very similar to the format described above. For example:

```r
prior_cube <- tidync(file.path(run_dir, 'draws/prior_predictive/predicted_biomarker.nc')) %>%
  hyper_filter(subject = subject == 'f1919f28') %>%
  hyper_tbl_cube()
```

This yields a 3-d array of draws with defined axes:

```r
> glimpse(prior_cube, width = 0)
List of 2
 $ mets:List of 1
  ..$ predicted_biomarker: num [1:50, 1:1000, 1:4] xx, xx, xx, xx, ...
  ..- attr(*, "transforms")=List of 4
  .. ..$ biomarker_time: tibble [50 × 6] (S3: tbl_df/tbl/data.frame)
  .. ..$ subject       : tibble [1,198 × 6] (S3: tbl_df/tbl/data.frame)
  .. ..$ draw          : tibble [1,000 × 6] (S3: tbl_df/tbl/data.frame)
  .. ..$ chain         : tibble [4 × 6] (S3: tbl_df/tbl/data.frame)
  ..- attr(*, "source")= tibble [1 × 2] (S3: tbl_df/tbl/data.frame)
  ..- attr(*, "class")= chr "tidync_data"
 $ dims:List of 4
  ..$ biomarker_time: num [1:50] 0 10 20 30 40 50 60 70 80 90 ...
  ..$ subject       : chr "f1919f28"
  ..$ draw          : num [1:1000] 0 1 2 3 4 5 6 7 8 9 ...
  ..$ chain         : num [1:4] 0 1 2 3
 - attr(*, "class")= chr "tbl_cube"
```

The cube has a lower memory footprint:

```r
> library(pryr)
> object_size(prior_cube)
1.8 MB
> object_size(prior_draws)
7.87 MB
```

However, there is not yet a ready set of inference-specific methods for working with these objects. For example, it might be worth extending existing draws-based tooling to cover them.
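One possible bridge between the two representations, sketched under the assumption that the measure array has dimensions (time, draw, chain) as in the glimpse above (untested; `prior_cube` is the object built in the snippet above):

```r
library(posterior)

# Put draw and chain first: rvar(..., with_chains = TRUE) treats the first
# two dimensions as (iteration, chain).
arr  <- aperm(prior_cube$mets$predicted_biomarker, c(2, 3, 1))

# An rvar vector with one element per time point, preserving chain info
pred <- rvar(arr, with_chains = TRUE)
```

This would keep the cube's compact storage while exposing posterior's summary and diagnostic methods.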
Is this purely because tidync doesn't support groups? Oh, maybe not, since you're also splitting each variable out into a separate .nc file. What was the motivation for that choice?
@mike-lawrence This is a good question. There are two reasons for this:
Some remaining pain points we have not quite resolved, in case this is helpful:
Keep an eye out for zarr-backed netcdf4 files, which should bring better remote-access support.
Oh! NetCDF4 files aren't append-friendly? Shoot, that nixes something I had planned for generated quantities mode... Well, in the meantime, I have a netcdf4-to-draws_rvars converter working now here.
@jburos I checked out the tbl_cube approach; here is a quick comparison with the tidy-data-frame and rvar representations:

```r
library(dplyr)
library(modelr)
library(tidybayes)
library(posterior)
library(rstanarm)
library(cubelyr)

m_mpg = stan_glm(mpg ~ hp * cyl, data = mtcars)

pred_df = mtcars %>%
  data_grid(cyl, hp = seq_range(hp, n = 51)) %>%
  add_predicted_draws(m_mpg)

pred_df
#> # A tibble: 612,000 x 7
#> # Groups:   cyl, hp, .row [153]
#>      cyl    hp  .row .chain .iteration .draw .prediction
#>    <dbl> <dbl> <int>  <int>      <int> <int>       <dbl>
#>  1     4    52     1     NA         NA     1        18.4
#>  2     4    52     1     NA         NA     2        23.7
#>  3     4    52     1     NA         NA     3        26.3
#>  4     4    52     1     NA         NA     4        29.6
#>  5     4    52     1     NA         NA     5        25.8
#>  6     4    52     1     NA         NA     6        33.4
#>  7     4    52     1     NA         NA     7        30.7
#>  8     4    52     1     NA         NA     8        30.2
#>  9     4    52     1     NA         NA     9        30.3
#> 10     4    52     1     NA         NA    10        23.3
#> # ... with 611,990 more rows

rvar_df = mtcars %>%
  data_grid(cyl, hp = seq_range(hp, n = 51)) %>%
  # this is a bit ugly atm but will eventually get a shortcut similar to
  # add_predicted_draws in tidybayes
  ungroup() %>%
  mutate(.prediction = rvar(posterior_predict(m_mpg, newdata = .)))

rvar_df
#> # A tibble: 153 x 3
#>      cyl    hp .prediction
#>    <dbl> <dbl>      <rvar>
#>  1     4  52        29 ± 3.3
#>  2     4  57.7      28 ± 3.3
#>  3     4  63.3      28 ± 3.2
#>  4     4  69.0      28 ± 3.2
#>  5     4  74.6      27 ± 3.2
#>  6     4  80.3      27 ± 3.1
#>  7     4  86.0      26 ± 3.2
#>  8     4  91.6      26 ± 3.2
#>  9     4  97.3      25 ± 3.2
#> 10     4 103.       25 ± 3.2
#> # ... with 143 more rows

cube_df = pred_df %>%
  ungroup() %>%
  select(-.row, -.chain, -.iteration) %>%
  as.tbl_cube(dim_names = c("cyl", "hp", ".draw"))

cube_df
#> Source: local array [612,000 x 3]
#> D: cyl [dbl, 3]
#> D: hp [dbl, 51]
#> D: .draw [int, 4000]
#> M: .prediction [dbl[,51,4000]]

pryr::object_size(pred_df)
#> 26.9 MB
pryr::object_size(rvar_df)
#> 5.16 MB
pryr::object_size(cube_df)
#> 4.91 MB
```

Created on 2021-06-28 by the reprex package (v2.0.0)

When posterior hits CRAN (which is imminent!) I will also be adding some convenience functions to tidybayes for manipulating these objects.
The arviz team presented their InferenceData format at StanCon this year, and I wanted to open this issue as a reminder to discuss whether posterior should support this format as well. The format design can be found at https://arviz-devs.github.io/arviz/notebooks/XarrayforArviZ.html