-
-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FR: Support for InferenceData files #166
Comments
@mjskay: The access scheme for ID files seems to me to be very much aligned with the rvars format, whereby each variable is stored as a multidimensional array (chain & draw as first and second dimensions, respectively) and reading can consist of slices of arbitrary shape. I can easily read everything into memory outright and store as an rvar, but I wonder if a "lazy-rvar" type might be possible whereby the values are actually pointers so that the data is read from the file only when a computation actually needs them. Any thoughts? |
I love the idea of a lazy rvar! Not sure the best way to do it... Maybe if there were a lazy array implementation for R backed by netcdf? Then we could allow use of that as the backing array. Not sure if that would be easier or harder than just implementing all the netcdf access in a subclass of rvar. Anyway, I'm open to it! |
@mjskay not working on the lazy-rvars yet, simply trying to get regular ol' rvars working and hit a snag that I'm hoping you can help me with. I'm sure I'm just not understanding the format properly. So for a Stan variable that is a 1D array, we get a a = array(rnorm(1e3*4*2),dim = c(1e3,4,2))
a_rvar = posterior::as_draws_rvars(a) But when a Stan variable has more than one dimension, I'm getting an error: b = array(rnorm(1e3*4*2*2),dim = c(1e3,4,2,2))
b_rvar = posterior::as_draws_rvars(b)
# Error: Don't know how to transform an object of class 'array' to any supported draws format. I'm also fuzzy on how to set the dimnames. I have this clunky code: dim_names = c('draw','chain','b.1','b.2')
dimnames(b) = list()
for(i in 1:length(dim_names)){
dimnames(b)[[i]] = paste(dim_names[i],1:dim_counts[i],sep='.')
} But I feel like that's the wrong way to name the last two dimensions, and surely there's a more straightforward way I'm missing. |
Hi Mike, happy to help - I think the issue you are running into comes down to the distinction between the rvar and the draws_rvars formats. For a single variable you will want to convert it to an rvar, not a draws_rvars (the latter is a collection of rvars). Should then be able to create a draws_rvars object with one or more rvars in it. My guess is the docs aren't totally clear about this and we should clarify it - maybe in the rvar vignette or somewhere else |
Oh shoot. I just now see that my FR is a dupe. I'll close this and continue the thread on the original. |
@mjskay Maybe just to finish off the posterior-array-to-rvars thread here, after looking at the vignette, it seems all the rvars conversion options assume that either the chain & draw dimensions have been collapsed to a single longer dimension, or that the chain & draw dimensions are left alone and the variable's remaining dimensions are collapsed to a single longer dimension. That is, one can do: nchains = 4
ndraws_per_chain = 1e3
rows <- 4
cols <- 3
x <- rvar(array(
rnorm(ndraws_per_chain * nchains * rows * cols, mean = 1, sd = 1)
, dim = c(
ndraws_per_chain * nchains #note two dims being collapsed here
, rows
, cols
))) And one can do: a = posterior::example_draws('multi_normal')[,,3:12] #extracts the "Sigma" parameter entries
dim(a) #shows 10,4,12
dimnames(a) # 3rd dim has names that combine the dims of Sigma
ar = posterior::as_draws_rvars(a) But it's not possible to go directly from a |
Ah yes, there will be soon though - once #153 is merged you will be able to use the |
Thanks! In case it helps anyone for any reason, I learned that this works too: b = array(rnorm(1e3*4*2*2),dim = c(1e3,4,2,2)) #3+D array from elsewhere
#collapse first two dims
dim(b) = c(prod(dim(b)[1:2]),dim(b)[3:length(dim(b))]) #Yay, no copies to reshape!
b_rvar = posterior::rvar(b)
#we've lost the variable name, but can get it back by using a named list input to as_draws_rvars:
b_draws_rvar = posterior::as_draws_rvars(list(b=b_rvar))
summary(b_draws_rvar) #shows "b" as the variable name with proper dimension indexing |
Two additional things that might (or might not) be helpful: (1) you can currently use the b = array(rnorm(1e3*4*2*2),dim = c(1e3,4,2,2))
dim(b) = c(prod(dim(b)[1:2]),dim(b)[3:length(dim(b))])
# so long as chains was the second dimension this will recover chain info correctly
b_rvar = posterior::rvar(b, nchains = 4)
# could also use as_draws_rvars(list(b = b_rvar)), though this is more direct:
b_draws_rvar = posterior::draws_rvars(b = b_rvar)
b_draws_rvar
After #153 is merged, this will become even simpler: b = array(rnorm(1e3*4*2*2),dim = c(1e3,4,2,2))
# with_chains = TRUE will work after the rvar-conv branch is merged:
b_rvar = posterior::rvar(b, with_chains = TRUE)
b_draws_rvar = posterior::draws_rvars(b = b_rvar)
b_draws_rvar
|
I'm actually about to start work on this myself, being nearly done an R-based cmdstan-csv-to-ID converter, but wanted to make it a public issue for posterity. The InferenceData spec subclasses NetCDF4 which in turn subclasses HDF5. What's nice about this format is that it permits fast indexed access to on-disk arrays, making for lighter RAM footprint relative to the current status-quo of having to hold everything in RAM. Admittedly this only becomes pertinent for very large models, but I thought it would be worth implementing. I'm hoping (and advocating) that the fabled cmdstan IO rewrite will use ID, in which case this feature would become much more universally applicable.
The text was updated successfully, but these errors were encountered: