Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FR: Support for InferenceData files #166

Closed
mike-lawrence opened this issue Jun 24, 2021 · 9 comments
Closed

FR: Support for InferenceData files #166

mike-lawrence opened this issue Jun 24, 2021 · 9 comments

Comments

@mike-lawrence
Copy link

mike-lawrence commented Jun 24, 2021

I'm actually about to start work on this myself, being nearly done an R-based cmdstan-csv-to-ID converter, but wanted to make it a public issue for posterity. The InferenceData spec subclasses NetCDF4 which in turn subclasses HDF5. What's nice about this format is that it permits fast indexed access to on-disk arrays, making for lighter RAM footprint relative to the current status-quo of having to hold everything in RAM. Admittedly this only becomes pertinent for very large models, but I thought it would be worth implementing. I'm hoping (and advocating) that the fabled cmdstan IO rewrite will use ID, in which case this feature would become much more universally applicable.

@mike-lawrence
Copy link
Author

mike-lawrence commented Jun 25, 2021

@mjskay: The access scheme for ID files seems to me to be very much aligned with the rvars format, whereby each variable is stored as a multidimensional array (chain & draw as first and second dimensions, respectively) and reading can consist of slices of arbitrary shape. I can easily read everything into memory outright and store as an rvar, but I wonder if a "lazy-rvar" type might be possible whereby the values are actually pointers so that the data is read from the file only when a computation actually needs them. Any thoughts?

@mjskay
Copy link
Collaborator

mjskay commented Jun 25, 2021

I love the idea of a lazy rvar! Not sure the best way to do it... Maybe if there were a lazy array implementation for R backed by netcdf? Then we could allow use of that as the backing array. Not sure if that would be easier or harder than just implementing all the netcdf access in a subclass of rvar. Anyway, I'm open to it!

@mike-lawrence
Copy link
Author

mike-lawrence commented Jun 28, 2021

@mjskay not working on the lazy-rvars yet, simply trying to get regular ol' rvars working and hit a snag that I'm hoping you can help me with. I'm sure I'm just not understanding the format properly. So for a Stan variable that is a 1D array, we get a $ndraws \times nchains \times N$ array in the posterior, which converts to rvar fine:

a = array(rnorm(1e3*4*2),dim = c(1e3,4,2)) 
a_rvar = posterior::as_draws_rvars(a)

But when a Stan variable has more than one dimension, I'm getting an error:

b = array(rnorm(1e3*4*2*2),dim = c(1e3,4,2,2))
b_rvar = posterior::as_draws_rvars(b)
# Error: Don't know how to transform an object of class 'array' to any supported draws format.

I'm also fuzzy on how to set the dimnames. I have this clunky code:

dim_names = c('draw','chain','b.1','b.2')
dimnames(b) = list()
for(i in 1:length(dim_names)){
  dimnames(b)[[i]] = paste(dim_names[i],1:dim_counts[i],sep='.')
}

But I feel like that's the wrong way to name the last two dimensions, and surely there's a more straightforward way I'm missing.

@mjskay
Copy link
Collaborator

mjskay commented Jun 28, 2021

Hi Mike, happy to help - I think the issue you are running into comes down to the distinction between the rvar and the draws_rvars formats. For a single variable you will want to convert it to an rvar, not a draws_rvars (the latter is a collection of rvars). Should then be able to create a draws_rvars object with one or more rvars in it.

My guess is the docs aren't totally clear about this and we should clarify it - maybe in the rvar vignette or somewhere else

@mike-lawrence
Copy link
Author

mike-lawrence commented Jun 28, 2021

Oh shoot. I just now see that my FR is a dupe. I'll close this and continue the thread on the original.

@mike-lawrence
Copy link
Author

mike-lawrence commented Jun 28, 2021

@mjskay Maybe just to finish off the posterior-array-to-rvars thread here, after looking at the vignette, it seems all the rvars conversion options assume that either the chain & draw dimensions have been collapsed to a single longer dimension, or that the chain & draw dimensions are left alone and the variable's remaining dimensions are collapsed to a single longer dimension. That is, one can do:

nchains = 4
ndraws_per_chain = 1e3
rows <- 4
cols <- 3
x <- rvar(array(
  rnorm(ndraws_per_chain * nchains * rows * cols, mean = 1, sd = 1)
  , dim = c(
    ndraws_per_chain * nchains #note two dims being collapsed here
    , rows
    , cols
)))

And one can do:

a = posterior::example_draws('multi_normal')[,,3:12] #extracts the "Sigma" parameter entries
dim(a) #shows 10,4,12
dimnames(a) # 3rd dim has names that combine the dims of Sigma
ar = posterior::as_draws_rvars(a)

But it's not possible to go directly from a ndraws_per_chain X nchains X rows X cols array to an rvar/draws_rvar.

@mjskay
Copy link
Collaborator

mjskay commented Jun 28, 2021

But it's not possible to go directly from a ndraws_per_chain X nchains X rows X cols array to an rvar/draws_rvar.

Ah yes, there will be soon though - once #153 is merged you will be able to use the with_chains argument to do that. That reminds me the vignette will need an example of it too though.

@mike-lawrence
Copy link
Author

mike-lawrence commented Jun 28, 2021

Thanks! In case it helps anyone for any reason, I learned that this works too:

b = array(rnorm(1e3*4*2*2),dim = c(1e3,4,2,2)) #3+D array from elsewhere
#collapse first two dims
dim(b) = c(prod(dim(b)[1:2]),dim(b)[3:length(dim(b))]) #Yay, no copies to reshape! 
b_rvar = posterior::rvar(b)
#we've lost the variable name, but can get it back by using a named list input to as_draws_rvars:
b_draws_rvar = posterior::as_draws_rvars(list(b=b_rvar)) 
summary(b_draws_rvar) #shows "b" as the variable name with proper dimension indexing

@mjskay
Copy link
Collaborator

mjskay commented Jun 28, 2021

Two additional things that might (or might not) be helpful: (1) you can currently use the nchains argument to add chain info to rvars if the dims have already been collapsed (your example loses chain info) (2) if you use the constructor (draws_rvars()) instead of the conversion function (as_draws_rvars()) you can skip creating the list and just supply rvars as named arguments:

b = array(rnorm(1e3*4*2*2),dim = c(1e3,4,2,2))
dim(b) = c(prod(dim(b)[1:2]),dim(b)[3:length(dim(b))])
# so long as chains was the second dimension this will recover chain info correctly
b_rvar = posterior::rvar(b, nchains = 4)
# could also use as_draws_rvars(list(b = b_rvar)), though this is more direct:
b_draws_rvar = posterior::draws_rvars(b = b_rvar)
b_draws_rvar
# A draws_rvars: 1000 iterations, 4 chains, and 1 variables
$b: rvar<1000,4>[2,2] mean ± sd:
     [,1]         [,2]        
[1,]  0.0011 ± 1  -0.0034 ± 1 
[2,] -0.0228 ± 1   0.0195 ± 1 

After #153 is merged, this will become even simpler:

b = array(rnorm(1e3*4*2*2),dim = c(1e3,4,2,2))
# with_chains = TRUE will work after the rvar-conv branch is merged:
b_rvar = posterior::rvar(b, with_chains = TRUE)
b_draws_rvar = posterior::draws_rvars(b = b_rvar)
b_draws_rvar
# A draws_rvars: 1000 iterations, 4 chains, and 1 variables
$b: rvar<1000,4>[2,2] mean ± sd:
     [,1]            [,2]           
[1,]  0.0013 ± 0.99  -0.0133 ± 1.00 
[2,] -0.0054 ± 1.01   0.0099 ± 1.00 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants