-
-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rv-like interface #8
Comments
Just adding the link to the rv website for easy reference: https://jsta.github.io/rv/ |
I am definitely in favor of supporting this format! I will need to take a look at the source code though before I can have an informed opinion about the best way to get there. |
Same! |
Yeah I forgot that it was Jouni who wrote that! He left before I started working with Andrew but Andrew has mentioned him a few times. Andrew has always wanted a more rv like implementation for working with Stan output but I don’t think he ended up using the rv package that much (I forget why but I should ask again). I’m definitely comfortable reaching out to Jouni if we want to go that route. |
Okay --- maybe as a first step, @jgabry could you reach out to Jouni and/or Andrew? Perhaps they would have insight on either lessons learned from implementing |
Yeah I can do that. I haven’t found a current email address for Jouni but I can ask Andrew if he has it. |
Thanks! |
Andrew responded fast and connected me to Jouni. I just sent an email to everyone. |
Since rv objects are just named lists, I think we could actually replace some or even all of the desired list classes (as per #4 (comment)) by rv objects, or lists of rv objects, since each rv represents only on vector/matrix/array of random variables and a posterior contains multiple of those. We need to think of how to represent chains though, but I am confident we can find an elegant solution to this. If we build rv natively into posterior this way, I would still like to have more control over it in order not to depend on a not-so-much maintained package. |
I like this idea.
Agreed. Is this worth raising in the email thread?
I was wondering what this would look like... My first instinct was that the chain information is best put inside the rv objects, but this would require changes to the rv package no? |
We may indeed need to raise the maintaining issue in the email. I would also prefer to put the chain information directly into the rv objects so that we do not have to take care of it in the list of rv objects that make up the whole posterior and can pass single rv objects to convergence diagnostics etc. Another aspect of rv is that they are represented internally as lists even if they can have a dimension assigned. This makes it potentially quite computationally inefficient to work this format for larger objects. For instance, a matrix multiplication is syntactically possible but I would guess quite inefficient (I haven't tested it yet). In any case, this is something to keep in mind. |
Hmm yeah, I see your point about the internal organization. E.g., here is the internal representation for a 16x16 matrix: > x1 = rvmatrix(rvnorm(256), nrow = 16)
> str(x1, list.len = 5)
List of 256
$ : num [1:4000] 1.415 -0.593 -0.605 -0.16 -0.233 ...
$ : num [1:4000] 0.561 -1.522 -0.757 -0.589 0.105 ...
$ : num [1:4000] -0.452 0.369 0.626 1.168 -0.923 ...
$ : num [1:4000] -0.806 -0.669 -1.443 -0.395 0.815 ...
$ : num [1:4000] 1.822 1.416 0.213 -0.433 -0.186 ...
[list output truncated]
- attr(*, "class")= chr "rv"
- attr(*, "dim")= int [1:2] 16 16 And a matrix multiply takes a little bit of time: > x2 = rvmatrix(rvnorm(256), nrow = 16)
> system.time(x1 %**% x2)
user system elapsed
0.15 0.00 0.15 (Not sure why Meanwhile you can re-organize the objects into lists of matrices by draw using the internal > x1_list = rv:::.sims.as.list(x1)
> x2_list = rv:::.sims.as.list(x2)
> str(x1_list, list.len = 5)
List of 4000
$ : num [1:16, 1:16] 1.415 0.561 -0.452 -0.806 1.822 ...
$ : num [1:16, 1:16] -0.593 -1.522 0.369 -0.669 1.416 ...
$ : num [1:16, 1:16] -0.605 -0.757 0.626 -1.443 0.213 ...
$ : num [1:16, 1:16] -0.16 -0.589 1.168 -0.395 -0.433 ...
$ : num [1:16, 1:16] -0.233 0.105 -0.923 0.815 -0.186 ...
[list output truncated] This format is more what I would have imagined as the internal storage format before looking, as it makes implementing functions very straightforward (just a map operation). E.g. a quick-and-dirty matrix multiply using this format is much faster: > system.time(purrr::map2(x1_list, x2_list, `%*%`))
user system elapsed
0.01 0.00 0.01 This also makes it easier to support more arbitrary object types by overloading the typical math operators and functions to just be applied to every list element. This might even make it easy for other people to create custom objects compatible with our rv-like interface for other classes if they wanted to (e.g. if someone wanted to use Given the simplicity of an implementation with that format, I am perhaps coming around to just rolling our own. There are other things that x = rv(rnorm, 0, 1) Or: x = rv("norm", 0, 1) Another question there (maybe out of scope for now but something to consider in the future) is whether we would want to support rv objects that represent analytical distributions. These might have draws associated with them (for operations with other RVs) but supply their known density and distribution functions when requested, which could be useful for visualizing priors, for example. This is maybe a longer-term thing, but I think there would be something useful to a class that unified both representations: for example, the dev version of tidybayes has a bunch of variants of geoms that can either take samples or named distributions and their parameters to make visualizations (see here), but unifying these into a single object type that can be put into data frames would allow both to be used with the same geom. |
Also, if we want to get really crazy with this (just brainstorming here), a kind of "out there" feature could be an Say we created an NSE function n_obs = 10
mu = rv(rnorm, 0, 1)
sigma = rv(rgamma, 1, 1)
y = rv_do(rnorm(n_obs, mu, sigma)) into the equivalent of this: # everything else the same
y = purrr::pmap(list(mu = mu, sigma = sigma), function(mu, sigma) rnorm(n_obs, mu, sigma)) In fact, then |
Thanks for looking into this! The problem with the list-by-draw representation is that summary statistics are quite inefficient to compute I would expect but I am unsure how much this is a problem as compared to the efficiency of numerical transformations. I am wondering if we need all the extra features from rv to sample from distribution etc. in the posterior package? Sure, if we reuse the existing rv, it will be shipped with it but if we build our own, I don't see too much need to support all this natively, at least not from the start. Of course, if we have an More generally, I think we run into the same problems of different formats in rv as we run into posterior. So, to me, the question is if we accidentally build a posterior package within the posterior package by not only supporting different draws representations (of which one is rv-like objects) and then again support different representation of rv objects themselves. I am not sure yet, how a good combination of the two purposes (representing posterior distributions in different common formats on the one hand, and having a nice random-variable syntax on the other) could look like. We may need to also develop a clearer of the use cases of these formats. If, for instance, rv-like objects were mostly for teaching purposes (not saying they should necessarily be) than speed might be less of an issue. |
Excellent points. I really like the idea of developing clear use cases: if we write down what operations we want to be efficient for a given use case, then we can choose a format for each use case that ensures those operations are efficient (or we might identify through that process a single format that is efficient for all the operations we care about --- does that seem likely?). Seems a nice, systematic way to avoid the package blowing up.
I think an interface to sample from distributions would be useful as it would make it easy to do things like construct posterior predictions without round-tripping back through Stan (or whatever sampler the user uses) if desired (not important for brms since it does it for you, but can be useful with custom models). However, I think an interface like |
Cool ideas and lots to think about! I’ve been super busy recently but looking forward to getting back into this when we talk Monday and after |
I've been playing with this stuff some more and have some sketches of two different potential directions for this format on the rv-like branch, currently called The It comes with
> x = rfun(rnorm)(10)
> x
rvar<4000>:
[1] 0.00950±1.01 -0.01749±0.99 0.00048±1.00 -0.00052±1.00 0.00540±0.99 -0.00353±0.99 -0.00794±1.01
[8] 0.01832±1.00 0.00723±1.00 -0.00667±1.01 The internal structure contains a single element, > str(x$draws, list.len = 5)
List of 4000
$ : num [1:10] -1.3053 0.4274 -1.5071 0.2732 -0.0349 ...
$ : num [1:10] 0.258 -0.866 0.13 1.164 1.193 ...
$ : num [1:10] 0.737 0.251 -0.172 -0.376 1.163 ...
$ : num [1:10] -0.5474 0.0898 -0.6763 -1.4545 1.37 ...
$ : num [1:10] 0.226 -0.903 0.553 -1.96 -0.337 ...
[list output truncated]
> mu = rdo(rnorm(1))
> sigma = rdo(rgamma(1, 1, 1))
> y = rdo(rnorm(10, mu, sigma))
> y
rvar<4000>:
[1] -0.0189±1.8 -0.0058±1.8 -0.0116±1.8 0.0095±1.8 -0.0033±1.7 -0.0359±1.7 -0.0511±1.7 0.0037±1.7 0.0095±1.7
[10] -0.0398±1.8 Basic math ops also work: > y + 1
rvar<4000>:
[1] 0.98±1.8 0.99±1.8 0.99±1.8 1.01±1.8 1.00±1.7 0.96±1.7 0.95±1.7 1.00±1.7 1.01±1.7 0.96±1.8
I haven't quite figured out how to get these objects to play perfectly nicely with all tibble-related things. The pivoting functions in tidyr especially are giving me a bunch of grief. I think I need to rebuild these types on top of vctrs (currently I have just been hacking with base-R approaches and the occasional stuff from vctrs).
> x = rdo(array(rnorm(2*2*2), dim = c(2,2,2)))
> x
rvar<4000>:
, , 1
[,1] [,2]
[1,] -0.0121±0.99 -0.0014±0.99
[2,] 0.0116±0.98 -0.0171±1.00
, , 2
[,1] [,2]
[1,] -0.0022±1.00 -0.0084±1.01
[2,] -0.0080±1.00 -0.0124±0.99 The alternative format I have been playing with is called Since it doesn't have a nice constructor I'll make one manually: > xa = rvar_array(array(rnorm(2*2*2*4000), dim = c(2,2,2,4000)))
> xa
rvar_array<4000>:
, , 1
[,1] [,2]
[1,] 0.0142±1.01 0.0126±0.99
[2,] 0.0059±0.98 0.0141±0.99
, , 2
[,1] [,2]
[1,] -0.0124±0.99 -0.0181±1.00
[2,] 0.0114±1.00 -0.0088±0.99 Internally we have: > str(xa$draws)
num [1:2, 1:2, 1:2, 1:4000] -0.0969 0.0888 1.2426 1.4416 1.1179 ... And indexing and such work (this would also work with > xa[1:2,1,1]
rvar_array<4000>:
, , 1
[,1]
[1,] 0.0142±1.01
[2,] 0.0059±0.98 I haven't implemented math ops yet for In terms of formats for posterior, one way forward might be to include something like the If we adopt one of these styles of interface, we might also want something like a Thoughts on something like this? I admit the implementation is pretty hackish, but I thought I'd share what I've been playing with before I spend more time on it. |
Wow, this looks so nice already! I am very exited for this feature! Just some quick thoughts/questions:
|
Thanks, this is exactly the kind of feedback I needed for the next round of prototyping!
Works for me. The computer scientist in me really liked that it was even possible to build the
The current design attempts to make this possible in a couple of ways. First, most basic math operations should I think be just as efficient. Second, by having the array format stored internally rather than subclassing from array directly, if you have an operation you want to do that is best done directly on the array you can always bypass the wrapper by acting on
Good question. I think when I was early-on prototyping something with one of the
Good question. I wasn't sure if we wanted it or not yet so I left off prototyping it before solving other stuff. I can try adding a dimension for that in the next prototyping round. Another option (maybe easier) would be to let this be handled at the level of a
Currently there's an argument to rvar / rdo (
Ah yeah. From what I'm reading of
Great! I am going to continue playing with this then. Possibly not this week though (bunch of other stuff on my plate...) |
@mjskay Thanks for working on this! I started writing comments before looking at @paul-buerkner's but he basically mentioned all my initial questions! Thanks! |
Totally get that :) Definitely cool!
There's a chance that future RStan v3 and some other package won't have iteration first, but that's certainly the standard at the moment (at least for Stan)
I'm wondering if we need chain information. It could make sense for this format to be recommended only once the user is satisfied that diagnostics are ok and they can go ahead and use the draws for interesting stuff. I'm certainly not opposed to having chain information, but maybe it's not essential.
An error seems reasonable to me.
Awesome, thanks again for working on this! |
Makes sense. Since the position of that dimension is hidden from everyone except power users by this format, one thing that might help decide is if there are existing use cases that argue for one versus the other (e.g., due to it being easier/faster to do some needed operations given one order versus another). @paul-buerkner I recall from awhile back you mentioned needing to do operations on these kinds of objects; is there a use case you have that would be easier or harder given the choice of dimension order?
This makes sense to me. I'll give it a try and see if it complicates life a lot, but if it does I won't worry about it too much. TODOs so I don't forget:
|
With regard to order of dimension I see two possibly relevant aspects: (1) Convention. This favors iteration as first dimension. (2) Efficiency. If it makes a difference at all. R stores arrays in column major order (last dimension first for higher dimensional arrays). As a result, values of parameters are consequtive in memory if they come at the later dimensions (columns in case of matrices) so I would presume storing parameters in later dimensions is, if it makes any difference, probably preferable. I am definitely out of depth here though. With regard to the chain stuff, I am not sure what would be the best approch. Storing chains as an additional dimension prevents a lot of the matrix multiplication-like operations that I would like to use arrays for (and hence the rv-like objects we decide to use them for #13). So if we would still like to use chain info, we could store them as a separete element next to |
Some updates:
> A = rdo(matrix(rnorm(6, 1:6), ncol = 3))
> A
rvar<4000>[2,3]:
[,1] [,2] [,3]
[1,] 0.97±1.00 2.99±1.00 5.00±1.01
[2,] 1.99±1.00 3.98±0.98 6.01±1.01
> B = rdo(matrix(rnorm(6, 1:6), ncol = 2))
> B
rvar<4000>[3,2]:
[,1] [,2]
[1,] 0.98±1.01 3.98±1.01
[2,] 2.02±0.99 4.99±0.99
[3,] 3.01±1.01 6.00±1.00
> A %*% B
rvar<4000>[2,2]:
[,1] [,2]
[1,] 22±7.3 49±10.7
[2,] 28±8.5 64±11.5 And it's not bad speed wise either (compare to > microbenchmark(A %*% B, rdo(A %*% B))
Unit: milliseconds
expr min lq mean median uq max neval
A %*% B 1.1267 1.2926 1.658963 1.4364 1.5539 17.2971 100
rdo(A %*% B) 109.1457 127.0384 139.782670 136.3684 145.5802 391.3957 100 This did require treating the objects as S4 (at least, setting the S4 object flag on them) as
> str(A)
rvar<4000>[2,3] 0.97±1.00 1.99±1.00 2.99±1.00 3.98±0.98 ...
> str(draws_of(A))
num [1:2, 1:3, 1:4000] 0.748 1.275 5.071 3.194 5.397 ...
- attr(*, "dimnames")=List of 3
..$ : NULL
..$ : NULL
..$ : NULL
|
This all looks really good! Please tell us when you need more feedback on something or want us to try something out. |
Very cool! (And what Paul said.) |
There has been some discussion about features in posteriordb, that I think will essentially come down to features in posterior and perhaps speciall the rv-like interface (not sure about the latter yet). Here is the link:https://discourse.mc-stan.org/t/beta-release-bayesian-posterior-database/12141/19 @mjskay what is your current status with the feature? No rush of course, but I just super excited for it! |
Ah cool! So, the basic There are two big TODOs at the moment:
|
Brilliant, thank you! Did you had some more thoughts about the positioning of the iteration dimension already (first or last)? |
I haven't had a chance to experiment with a version with the iteration dimension first, so I don't have any new thoughts on that yet. Would that be helpful? |
I am afraid I didnt understand your question. What would be helpful? My
argument for having it as the first dimension is for consistency with the
other formats. Presumably is does not matter too much efficiency wise but
that remains to be checked.
Matthew Kay <[email protected]> schrieb am Fr., 6. Dez. 2019, 21:15:
… I haven't had a chance to experiment with a version with the iteration
dimension first, so I don't have any new thoughts on that yet. Would that
be helpful?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#8>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADCW2ACKNVDQ3D2SMG7F653QXKQFNANCNFSM4I6MCLGQ>
.
|
Ah, I meant: would me experimenting with that format be helpful? I take it from your response the answer is yes :). I have waited on playing with that in the short term because I want to figure out the issues with |
I think it would be good to have a beta release of posterior soon so I am trying to clean up the remaining issues with the beta-release tag. @mjskay you indicated that you currently do not have much time to work on the rv-like feature. Does that still hold? If yes, I would like to drop this issue from the beta-release list as I don't see a problem in adding this feature later. Would you agree? If we decide to drop this issue from the beta-release, we need to figure out what happens to issue #13 in that regard. I would like to have draws arrays of arbitrary dimension for brms, ideally already in the beta/CRAN release although it is not a problem if it does not end up there. Last time, we spoke we agreed on that the rv-like interface could provide us with such arrays quite naturally. So the question would be whether to have an rv-independent implementation of multi-dimensional arrays or to post-pone #13 as well. |
Good question! My time remains pretty limited (currently changing a class over to online instruction and preparing another one). Probably won't have dedicated time until the end of semester (late April). I just took a look to see if the tibble/dplyr support was working now with recent changes in the dev versions of vctrs and dplyr. I think it's closer to working and will be easier to get right. I had hoped there would be a quick fix for the outstanding problems with tibble integration, but I expect I'll still need some dedicated time to get everything I want working working on that front. What's your timeline from beta to CRAN? If a CRAN submission is not going to happen until May anyway I could see merging an experimental version of the rvar interface into the master branch by the end of this week (as that would not be much work, compared to getting it integrated into tibbles properly). Then it can at least be in the beta. The main concern I have with pushing a version to CRAN before tibble integration works is I don't want to have to re-architect the structure of the rvar type much after it goes to CRAN, and I can see that potentially being needed to properly support tibbles. However, I suspect the part of the interface you would need for brms would not change much. So if the planned CRAN timeline is further out (May or later) we could consider rolling it into the beta. If the planned CRAN timeline is sooner we might want to wait until the next version. To your last question, either way I don't think we'll need an additional interface for draws of arbitrary dimension. |
Thanks, that makes sense! I am not too worried about having a CRAN release soon but having posterior in a state in which it is releasable in principle. This is important as posterior already serves as the backend of two important packages, cmdstanr by @jgabry and posteriordb by @MansMeg. If one of these packages have to be released, so does posterior. I am not sure about the schedule for these projects but we should make sure posterior is in a good shape regardless. Accordingly, I think it makes sense to postpone merging the rv-interface until you have the time to make sure it works as expected. If that means it won't make it into beta/first CRAN release that this is totally fine. The rv-like interface has some potential implications not only for #13 but also for some of the generics, we want to move to posterior (see #39). I will comment on this separately in the corresponding issue. |
Alright! Finally getting back to this. Lots of updates on the tibbles workThe basic x = rdo(rnorm(4, mean = 1:4))
df = tibble(y = letters[1:4], x)
df
compatibility with {ggdist} and {distributional}I was able to update the library(tidyverse)
library(ggdist) # install the dev branch
df %>%
ggplot(aes(y = y, dist = x)) +
stat_dist_halfeye() Or any of the other stat_dist... geoms. We also now have library(distributional)
tibble(
x = list(dist_normal(0,1), rvar(rnorm(4000, 0.5, 0.5))),
which = c("prior", "posterior")
) %>%
ggplot(aes(y = "", dist = x, color = which, fill = which)) +
stat_dist_slab(alpha = 0.5) (as an aside, it would be awesome if eventually brms / rstanarm could output their priors in the {distributional} format so plots like this would be easier to make --- see also mjskay/ggdist#21). internal format changes
expectations
x
mean(x)
This was intended to mimic base-R E(x) # could have mean(x) do this as well
What do folks think? That would also make the interface consistent with the {distributional} package, which provides a similar vector format for analytical distributions. Container typeI have started a basic implementation of a container type (with lots of d = draws_rvars(mu = rvar(rnorm(4000)), Rho = rvar(rethinking::rlkjcorr(4000, 5, 3)))
d
(I haven't implemented a print function yet, it just uses list print + the print function for rvars). As a proof of concept I have implemented basic conversion to/from dm = as_draws_matrix(d)
dm
And back --- note the dimensions gain names on the way back; this is so that character dimensions (e.g. if it were as_draws_rvars(dm)
The back-conversion will also sort dimensions and fill in missing dimensions, in case the input does not include every cell of the array: as_draws_rvars(dm[,c("Rho[3,2]", "Rho[1,1]", "Rho[2,2]")])
Although I'm not sure about the sorting aspect here (possibly dimensions should be left in whatever the input order is? That would preserve order for dimensions with string names and numeric dimensions as long as they are ordered on input...). It seems there are some corner cases to consider here. Another bit to consider is what the API for the draws_matrix(x = rnorm(10))
Whereas draws_rvars() does this: draws_rvars(x = rnorm(10))
This is because it just uses Questions / feedbackSummarizing some questions above (plus some new ones):
What do folks think? @paul-buerkner @jgabry? |
Followup on this:
I just changed the behavior so that numeric indices are sorted and character indices are not. This seems more sensible to me than sorting character indices, but this is easily changed. |
Thank you @mjskay! This is really exciting! I need to understand some details a little better, but here are some thoughts on the questions you had:
|
Yeah this is exciting, thanks for working on this! I'm in the process of moving this week so I might not have time to really dive into this until next week. Definitely feel free to keep discussing without me and I'll catch up. For now just one comment on the chains question. If there's an option include chains that doesn't come with too much overhead, that'd be fine, but I'm coming around to the idea that we don't really need chain information. The ability to work with draws as random variables is most useful after we're happy the chains are ok, so I would think we'd convert to an rvar from e.g. a draws_array after we've checked diagnostics using the draws_array. Does that make sense? |
Great, thanks both! Some TODOs for myself where there's pretty clear consensus:
Re; chains I am opening a separate issue since I think we need a bit of discussion there. |
One minor note worth raising as I implement more of d = draws_rvars(x = rnorm(1000), y = rethinking::rlkjcorr(1000, 3, 1))
d
That is 2 variables as a as_draws_matrix(d)
I don't think adjusting |
I think it is ok to have this inconsistency given that the definition of 'variable' varies between the rvar format and the others. |
I wanted to gather conversations about a potential
rv
-like interface here so as not to derail other conversations (like #4).I think a solid rv-like interface would be incredibly useful. More specifically, something that:
P()
andE()
Rv already has those down, but would be even better if it:
With all of those requirements in place, I could see tidybayes moving largely towards using tables of these rv-like objects. It would be very useful for a lot of the posterior manipulation/summarization/visualization tasks tidybayes is designed for.
If you all are interested in supporting such a format here, then the question is what's the best way to get there? The options might be:
rv
maintainer and ask if they are willing to let us take it over and make backwards-incompatible changes to it, followed by a new major release deprecating some stuff.rv
code that we want to build on into this package, come up with a new class name (to not clobber "rv") and go from there.(1) Could work if the new maintainer is not planning much with the package and if there aren't a lot of users. Currently the package has ~400 downloads/month and no revdeps on CRAN.
(2) Would be doable depending on license preferences (it is GPL-2). This could also be aided by the fact that
rv
looks to have been written by one of Andrew Gelman's former students, Jouni Kerman (@jgabry do you know him?).I would be willing to float (1) to the current
rv
maintainer, Joseph Stachelek --- I've interacted with him once or twice on twitter and github so it wouldn't be a complete cold email (unless either of you know him better). If we'd rather go for (2) or (3) it might be good to reach out to Jouni Kerman to get his thoughts (either on using his code or on things he would have done differently if he wrote the package again).The text was updated successfully, but these errors were encountered: