Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rv-like interface #8

Closed
mjskay opened this issue Oct 8, 2019 · 42 comments · Fixed by #106
Closed

rv-like interface #8

mjskay opened this issue Oct 8, 2019 · 42 comments · Fixed by #106
Assignees
Labels
feature New feature or request

Comments

@mjskay
Copy link
Collaborator

mjskay commented Oct 8, 2019

I wanted to gather conversations about a potential rv-like interface here so as not to derail other conversations (like #4).

I think a solid rv-like interface would be incredibly useful. More specifically, something that:

  • Acts as a high-level interface to "random variables" which can be vectors, matrices, arrays, etc.
  • Allows math operations on those
  • Supports nice syntax for thing like P() and E()

Rv already has those down, but would be even better if it:

  • Used a naming scheme consistent with ours
  • Supported pretty-printing in tibbles
  • Kept chain information

With all of those requirements in place, I could see tidybayes moving largely towards using tables of these rv-like objects. It would be very useful for a lot of the posterior manipulation/summarization/visualization tasks tidybayes is designed for.

If you all are interested in supporting such a format here, then the question is what's the best way to get there? The options might be:

  1. Reach out to the rv maintainer and ask if they are willing to let us take it over and make backwards-incompatible changes to it, followed by a new major release deprecating some stuff.
  2. Bring some of the existing rv code that we want to build on into this package, come up with a new class name (to not clobber "rv") and go from there.
  3. Start from scratch here.
  4. Start from scratch in a different package and add that dependency here.

(1) Could work if the new maintainer is not planning much with the package and if there aren't a lot of users. Currently the package has ~400 downloads/month and no revdeps on CRAN.
(2) Would be doable depending on license preferences (it is GPL-2). This could also be aided by the fact that rv looks to have been written by one of Andrew Gelman's former students, Jouni Kerman (@jgabry do you know him?).

I would be willing to float (1) to the current rv maintainer, Joseph Stachelek --- I've interacted with him once or twice on twitter and github so it wouldn't be a complete cold email (unless either of you know him better). If we'd rather go for (2) or (3) it might be good to reach out to Jouni Kerman to get his thoughts (either on using his code or on things he would have done differently if he wrote the package again).

@paul-buerkner
Copy link
Collaborator

Just adding the link to the rv website for easy reference: https://jsta.github.io/rv/

@paul-buerkner
Copy link
Collaborator

I am definitely in favor of supporting this format! I will need to take a look at the source code though before I can have an informed opinion about the best way to get there.

@jgabry
Copy link
Member

jgabry commented Oct 8, 2019

I am definitely in favor of supporting this format! I will need to take a look at the source code though before I can have an informed opinion about the best way to get there.

Same!

@jgabry
Copy link
Member

jgabry commented Oct 8, 2019

Yeah I forgot that it was Jouni who wrote that! He left before I started working with Andrew but Andrew has mentioned him a few times. Andrew has always wanted a more rv like implementation for working with Stan output but I don’t think he ended up using the rv package that much (I forget why but I should ask again). I’m definitely comfortable reaching out to Jouni if we want to go that route.

@mjskay
Copy link
Collaborator Author

mjskay commented Oct 8, 2019

Okay --- maybe as a first step, @jgabry could you reach out to Jouni and/or Andrew? Perhaps they would have insight on either lessons learned from implementing rv and what a good path forward for us might be. Then we can use that to help decide a path forward.

@jgabry
Copy link
Member

jgabry commented Oct 8, 2019

Yeah I can do that. I haven’t found a current email address for Jouni but I can ask Andrew if he has it.

@mjskay
Copy link
Collaborator Author

mjskay commented Oct 8, 2019

Thanks!

@jgabry
Copy link
Member

jgabry commented Oct 8, 2019

Andrew responded fast and connected me to Jouni. I just sent an email to everyone.

@paul-buerkner
Copy link
Collaborator

Since rv objects are just named lists, I think we could actually replace some or even all of the desired list classes (as per #4 (comment)) by rv objects, or lists of rv objects, since each rv represents only on vector/matrix/array of random variables and a posterior contains multiple of those.

We need to think of how to represent chains though, but I am confident we can find an elegant solution to this.

If we build rv natively into posterior this way, I would still like to have more control over it in order not to depend on a not-so-much maintained package.

@mjskay
Copy link
Collaborator Author

mjskay commented Oct 11, 2019

Since rv objects are just named lists, I think we could actually replace some or even all of the desired list classes (as per #4 (comment)) by rv objects, or lists of rv objects, since each rv represents only on vector/matrix/array of random variables and a posterior contains multiple of those.

I like this idea.

If we build rv natively into posterior this way, I would still like to have more control over it in order not to depend on a not-so-much maintained package.

Agreed. Is this worth raising in the email thread?

We need to think of how to represent chains though, but I am confident we can find an elegant solution to this.

I was wondering what this would look like... My first instinct was that the chain information is best put inside the rv objects, but this would require changes to the rv package no?

@paul-buerkner
Copy link
Collaborator

We may indeed need to raise the maintaining issue in the email.

I would also prefer to put the chain information directly into the rv objects so that we do not have to take care of it in the list of rv objects that make up the whole posterior and can pass single rv objects to convergence diagnostics etc.

Another aspect of rv is that they are represented internally as lists even if they can have a dimension assigned. This makes it potentially quite computationally inefficient to work this format for larger objects. For instance, a matrix multiplication is syntactically possible but I would guess quite inefficient (I haven't tested it yet). In any case, this is something to keep in mind.

@mjskay
Copy link
Collaborator Author

mjskay commented Oct 15, 2019

Hmm yeah, I see your point about the internal organization.

E.g., here is the internal representation for a 16x16 matrix:

> x1 = rvmatrix(rvnorm(256), nrow = 16)
> str(x1, list.len = 5)
List of 256
 $ : num [1:4000] 1.415 -0.593 -0.605 -0.16 -0.233 ...
 $ : num [1:4000] 0.561 -1.522 -0.757 -0.589 0.105 ...
 $ : num [1:4000] -0.452 0.369 0.626 1.168 -0.923 ...
 $ : num [1:4000] -0.806 -0.669 -1.443 -0.395 0.815 ...
 $ : num [1:4000] 1.822 1.416 0.213 -0.433 -0.186 ...
  [list output truncated]
 - attr(*, "class")= chr "rv"
 - attr(*, "dim")= int [1:2] 16 16

And a matrix multiply takes a little bit of time:

> x2 = rvmatrix(rvnorm(256), nrow = 16)
> system.time(x1 %**% x2)
   user  system elapsed 
   0.15    0.00    0.15 

(Not sure why %*% isn't overloaded to do the matrix multiply, maybe there's a limitation in what operators can be overloaded? Seems surprising to me)

Meanwhile you can re-organize the objects into lists of matrices by draw using the internal rv:::.sims.as.list function:

> x1_list = rv:::.sims.as.list(x1)
> x2_list = rv:::.sims.as.list(x2)
> str(x1_list, list.len = 5)
List of 4000
 $ : num [1:16, 1:16] 1.415 0.561 -0.452 -0.806 1.822 ...
 $ : num [1:16, 1:16] -0.593 -1.522 0.369 -0.669 1.416 ...
 $ : num [1:16, 1:16] -0.605 -0.757 0.626 -1.443 0.213 ...
 $ : num [1:16, 1:16] -0.16 -0.589 1.168 -0.395 -0.433 ...
 $ : num [1:16, 1:16] -0.233 0.105 -0.923 0.815 -0.186 ...
  [list output truncated]

This format is more what I would have imagined as the internal storage format before looking, as it makes implementing functions very straightforward (just a map operation). E.g. a quick-and-dirty matrix multiply using this format is much faster:

> system.time(purrr::map2(x1_list, x2_list, `%*%`))
   user  system elapsed 
   0.01    0.00    0.01 

This also makes it easier to support more arbitrary object types by overloading the typical math operators and functions to just be applied to every list element. This might even make it easy for other people to create custom objects compatible with our rv-like interface for other classes if they wanted to (e.g. if someone wanted to use rray or Matrix or some other particular array implementation we might be able to make it possible to do that without a whole lot of work).

Given the simplicity of an implementation with that format, I am perhaps coming around to just rolling our own. There are other things that rv does that I don't think are strictly necessary with some careful API design; e.g. rather than implementing an rv*** for every distribution (rvnorm, rvgamma, etc) it strikes me that we could construct a function that creates an appropriate rv object given an RNG function from base R; something like:

x = rv(rnorm, 0, 1)

Or:

x = rv("norm", 0, 1)

Another question there (maybe out of scope for now but something to consider in the future) is whether we would want to support rv objects that represent analytical distributions. These might have draws associated with them (for operations with other RVs) but supply their known density and distribution functions when requested, which could be useful for visualizing priors, for example. This is maybe a longer-term thing, but I think there would be something useful to a class that unified both representations: for example, the dev version of tidybayes has a bunch of variants of geoms that can either take samples or named distributions and their parameters to make visualizations (see here), but unifying these into a single object type that can be put into data frames would allow both to be used with the same geom.

@mjskay
Copy link
Collaborator Author

mjskay commented Oct 15, 2019

Also, if we want to get really crazy with this (just brainstorming here), a kind of "out there" feature could be an rv_do operation that automatically rewrites expressions with random variables into appropriate apply() or map() calls. This could be seen as a bit of an abuse of non-standard evaluation, but does have the nice property of allowing arbitrary functions to be used with rvs (sort of getting around the problems Jouni was talking about re: object orientation in R).

Say we created an NSE function rv_do() that takes names used in the passed-in code expression and checks the calling environment to see if those names correspond to rv objects and then did a parallel map over all used rv objects. It might turn code like this:

n_obs = 10
mu = rv(rnorm, 0, 1)
sigma = rv(rgamma, 1, 1)
y = rv_do(rnorm(n_obs, mu, sigma))

into the equivalent of this:

# everything else the same
y = purrr::pmap(list(mu = mu, sigma = sigma), function(mu, sigma) rnorm(n_obs, mu, sigma))

In fact, then mu = rv(rnorm, 0, 1) becomes equivalent to mu = rv_do(rnorm(1, 0, 1)).

@paul-buerkner
Copy link
Collaborator

Thanks for looking into this! The problem with the list-by-draw representation is that summary statistics are quite inefficient to compute I would expect but I am unsure how much this is a problem as compared to the efficiency of numerical transformations.

I am wondering if we need all the extra features from rv to sample from distribution etc. in the posterior package? Sure, if we reuse the existing rv, it will be shipped with it but if we build our own, I don't see too much need to support all this natively, at least not from the start. Of course, if we have an rv_do function that we don't need to worry about all those specifics as users can simply work with this.

More generally, I think we run into the same problems of different formats in rv as we run into posterior. So, to me, the question is if we accidentally build a posterior package within the posterior package by not only supporting different draws representations (of which one is rv-like objects) and then again support different representation of rv objects themselves. I am not sure yet, how a good combination of the two purposes (representing posterior distributions in different common formats on the one hand, and having a nice random-variable syntax on the other) could look like.

We may need to also develop a clearer of the use cases of these formats. If, for instance, rv-like objects were mostly for teaching purposes (not saying they should necessarily be) than speed might be less of an issue.

@mjskay
Copy link
Collaborator Author

mjskay commented Oct 17, 2019

Excellent points. I really like the idea of developing clear use cases: if we write down what operations we want to be efficient for a given use case, then we can choose a format for each use case that ensures those operations are efficient (or we might identify through that process a single format that is efficient for all the operations we care about --- does that seem likely?). Seems a nice, systematic way to avoid the package blowing up.

I am wondering if we need all the extra features from rv to sample from distribution etc. in the posterior package? Sure, if we reuse the existing rv, it will be shipped with it but if we build our own, I don't see too much need to support all this natively, at least not from the start. Of course, if we have an rv_do function that we don't need to worry about all those specifics as users can simply work with this.

I think an interface to sample from distributions would be useful as it would make it easy to do things like construct posterior predictions without round-tripping back through Stan (or whatever sampler the user uses) if desired (not important for brms since it does it for you, but can be useful with custom models). However, I think an interface like rv_do() (or perhaps a less esoteric form of it) should make this easy without implementing a bunch of custom functions like rv currently does.

@jgabry
Copy link
Member

jgabry commented Oct 26, 2019

Cool ideas and lots to think about! I’ve been super busy recently but looking forward to getting back into this when we talk Monday and after

@mjskay
Copy link
Collaborator Author

mjskay commented Nov 5, 2019

I've been playing with this stuff some more and have some sketches of two different potential directions for this format on the rv-like branch, currently called rvar and rvar_array (crappy names chosen for testing). I'm not proposing to use both (or even either) of these, but more using them as early explorations of what might be possible.

The rvar experimental type just uses a list of draws, so it can support distributions over arbitrary objects. This is obviously a far end of the spectrum in terms of flexibility versus performance tradeoffs --- not necessarily fast, but you can do basically anything with it.

It comes with rfun() and rdo() functions for making and manipulating objects (the names aren't great, but this is just experimental :)). It also has a printing method that summarizes numerical objects using their original structure (I'll get to that).

rfun() turns a function into something that accepts and produces rvars:

> x = rfun(rnorm)(10)
> x
rvar<4000>:
 [1]  0.00950±1.01 -0.01749±0.99  0.00048±1.00 -0.00052±1.00  0.00540±0.99 -0.00353±0.99 -0.00794±1.01
 [8]  0.01832±1.00  0.00723±1.00 -0.00667±1.01

The internal structure contains a single element, draws, that looks like this:

> str(x$draws, list.len = 5)
List of 4000
 $ : num [1:10] -1.3053 0.4274 -1.5071 0.2732 -0.0349 ...
 $ : num [1:10] 0.258 -0.866 0.13 1.164 1.193 ...
 $ : num [1:10] 0.737 0.251 -0.172 -0.376 1.163 ...
 $ : num [1:10] -0.5474 0.0898 -0.6763 -1.4545 1.37 ...
 $ : num [1:10] 0.226 -0.903 0.553 -1.96 -0.337 ...
  [list output truncated]

rdo, in contrast to rfun, wraps an expression that can contain rvars (or not) and produces rvars. Basically it turns the expression into a function, takes any variables the expression uses from the environment that are rvars, wraps the function using rfun, then evaluates it using those rvars. The upshot (from the user's perspective) is you can wrap fairly arbitrary R code with rdo and get back rvars (saves us implementing a bunch of stuff, like random number generators). Like this:

> mu = rdo(rnorm(1))
> sigma = rdo(rgamma(1, 1, 1))
> y = rdo(rnorm(10, mu, sigma))
> y
rvar<4000>:
 [1] -0.0189±1.8 -0.0058±1.8 -0.0116±1.8  0.0095±1.8 -0.0033±1.7 -0.0359±1.7 -0.0511±1.7  0.0037±1.7  0.0095±1.7
[10] -0.0398±1.8

Basic math ops also work:

> y + 1
rvar<4000>:
 [1] 0.98±1.8 0.99±1.8 0.99±1.8 1.01±1.8 1.00±1.7 0.96±1.7 0.95±1.7 1.00±1.7 1.01±1.7 0.96±1.8

rvars can also be put into tibbles:

> tibble(y)
# A tibble: 10 x 1
             y
        <rvar>
 1 -0.0189±1.8
 2 -0.0058±1.8
 3 -0.0116±1.8
 4  0.0095±1.8
 5 -0.0033±1.7
 6 -0.0359±1.7
 7 -0.0511±1.7
 8  0.0037±1.7
 9  0.0095±1.7
10 -0.0398±1.8

I haven't quite figured out how to get these objects to play perfectly nicely with all tibble-related things. The pivoting functions in tidyr especially are giving me a bunch of grief. I think I need to rebuild these types on top of vctrs (currently I have just been hacking with base-R approaches and the occasional stuff from vctrs).

rvar also works for matrices and arrays of arbitrary dimension. E.g.:

> x = rdo(array(rnorm(2*2*2), dim = c(2,2,2)))
> x
rvar<4000>:
, , 1

     [,1]         [,2]        
[1,] -0.0121±0.99 -0.0014±0.99
[2,]  0.0116±0.98 -0.0171±1.00

, , 2

     [,1]         [,2]        
[1,] -0.0022±1.00 -0.0084±1.01
[2,] -0.0080±1.00 -0.0124±0.99

The alternative format I have been playing with is called rvar_array (just for testing). I haven't implemented rfun / rdo for it, but in principle it should be possible (though with some sanity checks since rvar_array is a more strict format), and everything else works similarly (printing, sticking it in tibbles, etc). The main difference is that the internal format is just an array, which means some operations will be faster, but arbitrary object types are not supported (only numerics, logicals, and characters really --- we could probably also hack a factor-like thing onto it if desired).

Since it doesn't have a nice constructor I'll make one manually:

> xa = rvar_array(array(rnorm(2*2*2*4000), dim = c(2,2,2,4000)))
> xa
rvar_array<4000>:
, , 1

     [,1]         [,2]        
[1,]  0.0142±1.01  0.0126±0.99
[2,]  0.0059±0.98  0.0141±0.99

, , 2

     [,1]         [,2]        
[1,] -0.0124±0.99 -0.0181±1.00
[2,]  0.0114±1.00 -0.0088±0.99

Internally we have:

> str(xa$draws)
 num [1:2, 1:2, 1:2, 1:4000] -0.0969 0.0888 1.2426 1.4416 1.1179 ...

And indexing and such work (this would also work with rvar):

> xa[1:2,1,1]
rvar_array<4000>:
, , 1

     [,1]       
[1,] 0.0142±1.01
[2,] 0.0059±0.98

I haven't implemented math ops yet for rvar_array but I'm thinking I might be able to use the broadcasting stuff from rray to do it efficiently.

In terms of formats for posterior, one way forward might be to include something like the rvar_array format (with a better name) as the format for draws with arbitrary dimensions (for #13).

If we adopt one of these styles of interface, we might also want something like a draws_rvar format that is just a list of rvar-like objects (for representing all the variables in a posterior).

Thoughts on something like this? I admit the implementation is pretty hackish, but I thought I'd share what I've been playing with before I spend more time on it.

@paul-buerkner
Copy link
Collaborator

Wow, this looks so nice already! I am very exited for this feature! Just some quick thoughts/questions:

  • I think the rvar_array version is possibly preferable in the long run if we can make the interface similarily nice as the one of the list version due to the efficiency increases you mentioned. Since we will need to store multiple rvar-like objects in a list anyway to represent the full posterior, I think an array object for each element comes very natural to how we think of parameters/variables in Stan.

  • Another aspect of the rvar_array version is that all "elements" will automatically have the same length while in the list objects there would have to be some checking or automatic broadcasting.

  • Depending on the implementation, rvar_array might indeed be a proper backend for Arrays of draws of arbitrary dimension #13, but for this purpose we need to make sure it is basically as efficient as working with standard arrays and similar in flexibility

  • Currently, the draws dimension is last in the dimension list of the array version which deviates a little from the existing array-like formats. What would be your argument for this order?

  • How can we best store chain information in those objects? Provided that we do want to store them.

  • What is the interface to decide how many draws are generated by default? I see that they are 4000 in your examples but I am not sure where this is coming from?

  • One problem with vctrs and rray is that, as I understand it, you cannot built child classes on top of them that are retained by standard operations. At least this is true for rray and is bacially the reason I decided against using that package for now in the standard formats. See rray removes custom classes r-lib/rray#247

  • I will have to think more about proper function names but I am already exited about the interface in general. It looks so clean to me!

@mjskay
Copy link
Collaborator Author

mjskay commented Nov 5, 2019

Thanks, this is exactly the kind of feedback I needed for the next round of prototyping!

I think the rvar_array version is possibly preferable in the long run

Works for me. The computer scientist in me really liked that it was even possible to build the rvar version to work with any type of object, but in the end the efficiency issues with that format are not worth it.

Depending on the implementation, rvar_array might indeed be a proper backend for #13, but for this purpose we need to make sure it is basically as efficient as working with standard arrays and similar in flexibility

The current design attempts to make this possible in a couple of ways. First, most basic math operations should I think be just as efficient. Second, by having the array format stored internally rather than subclassing from array directly, if you have an operation you want to do that is best done directly on the array you can always bypass the wrapper by acting on x$draws directly. Obviously that's a potentially "dangerous" operation (if you modify x$draws to something invalid), but seems like a reasonable escape hatch. I might wrap it in a function (can't think of a good name... both draws() and draws_array() are obvious candidates but conflict with other concepts already in the package).

Currently, the draws dimension is last in the dimension list of the array version which deviates a little from the existing array-like formats. What would be your argument for this order?

Good question. I think when I was early-on prototyping something with one of the *apply functions it was spitting out stuff joined by the last dimension, so it seemed a natural enough way to implement everything for consistency. However, if having the iteration dimension be first would be better for your use cases I'm happy to try doing that for the next round.

How can we best store chain information in those objects? Provided that we do want to store them.

Good question. I wasn't sure if we wanted it or not yet so I left off prototyping it before solving other stuff. I can try adding a dimension for that in the next prototyping round. Another option (maybe easier) would be to let this be handled at the level of a draws_rvar object that could store different chains as different rvar objects for the same variable.

What is the interface to decide how many draws are generated by default? I see that they are 4000 in your examples but I am not sure where this is coming from?

Currently there's an argument to rvar / rdo (.ndraws, not a great name) with a default value, which I could also make be set by a global option (that's how rv does it). This argument is used for expressions without existing rvars; for expressions containing existing rvars the number of draws in those objects is what is used. I haven't decided what to do if someone tries to combine rvars with different numbers of iterations (probably throw an error?).

One problem with vctrs and rray is that, as I understand it, you cannot built child classes on top of them that are retained by standard operations. At least this is true for rray and is bacially the reason I decided against using that package for now in the standard formats. See r-lib/rray#247

Ah yeah. From what I'm reading of vec_proxy / vec_restore and what I've played with so far, I think you can build custom types on top of vctrs but not rray currently? Or I might have missed something...

I will have to think more about proper function names but I am already exited about the interface in general. It looks so clean to me!

Great! I am going to continue playing with this then. Possibly not this week though (bunch of other stuff on my plate...)

@jgabry
Copy link
Member

jgabry commented Nov 5, 2019

@mjskay Thanks for working on this! I started writing comments before looking at @paul-buerkner's but he basically mentioned all my initial questions! Thanks!

@jgabry
Copy link
Member

jgabry commented Nov 5, 2019

Works for me. The computer scientist in me really liked that it was even possible to build the rvar version to work with any type of object

Totally get that :) Definitely cool!

However, if having the iteration dimension be first would be better for your use cases I'm happy to try doing that for the next round.

There's a chance that future RStan v3 and some other package won't have iteration first, but that's certainly the standard at the moment (at least for Stan)

How can we best store chain information in those objects? Provided that we do want to store them.

Good question. I wasn't sure if we wanted it or not yet so I left off prototyping it before solving other stuff. I can try adding a dimension for that in the next prototyping round. Another option (maybe easier) would be to let this be handled at the level of a draws_rvar object that could store different chains as different rvar objects for the same variable.

I'm wondering if we need chain information. It could make sense for this format to be recommended only once the user is satisfied that diagnostics are ok and they can go ahead and use the draws for interesting stuff. I'm certainly not opposed to having chain information, but maybe it's not essential.

I haven't decided what to do if someone tries to combine rvars with different numbers of iterations (probably throw an error?).

An error seems reasonable to me.

Great! I am going to continue playing with this then. Possibly not this week though (bunch of other stuff on my plate...)

Awesome, thanks again for working on this!

@mjskay
Copy link
Collaborator Author

mjskay commented Nov 6, 2019

There's a chance that future RStan v3 and some other package won't have iteration first, but that's certainly the standard at the moment (at least for Stan)

Makes sense. Since the position of that dimension is hidden from everyone except power users by this format, one thing that might help decide is if there are existing use cases that argue for one versus the other (e.g., due to it being easier/faster to do some needed operations given one order versus another). @paul-buerkner I recall from awhile back you mentioned needing to do operations on these kinds of objects; is there a use case you have that would be easier or harder given the choice of dimension order?

I'm wondering if we need chain information. It could make sense for this format to be recommended only once the user is satisfied that diagnostics are ok and they can go ahead and use the draws for interesting stuff. I'm certainly not opposed to having chain information, but maybe it's not essential.

This makes sense to me. I'll give it a try and see if it complicates life a lot, but if it does I won't worry about it too much.

TODOs so I don't forget:

  • try chain stuff
  • decide on where the iteration/chain (if present) dimensions go
  • error if chain/iteration don't match when using multiple objects
  • some function to access draws (maybe draws_of()?)
  • vctrs interface
  • math interface
  • rfun / rdo for array interface
  • pick names

@paul-buerkner
Copy link
Collaborator

With regard to order of dimension I see two possibly relevant aspects: (1) Convention. This favors iteration as first dimension. (2) Efficiency. If it makes a difference at all. R stores arrays in column major order (last dimension first for higher dimensional arrays). As a result, values of parameters are consequtive in memory if they come at the later dimensions (columns in case of matrices) so I would presume storing parameters in later dimensions is, if it makes any difference, probably preferable. I am definitely out of depth here though.

With regard to the chain stuff, I am not sure what would be the best approch. Storing chains as an additional dimension prevents a lot of the matrix multiplication-like operations that I would like to use arrays for (and hence the rv-like objects we decide to use them for #13). So if we would still like to use chain info, we could store them as a separete element next to $draws in the rvar object, where we could basically store .chain, .iteration, .draws as in draws_df
Manual subsetting inside the draws element would of course invalidate this but that is not something common users should be doing. Not sure if this proposal is sensible but perhaps it helps us thinking about if/how to incorporate chains.

@mjskay
Copy link
Collaborator Author

mjskay commented Nov 12, 2019

Some updates:

  • I have re-written the interface using array-backed vctrs (or well, some parts of it). I still haven't figured out every corner of the interfaces that need to be supported by a valid "vector" in R, defined in the broad sense of what functions people tend use on vectors and the output they expect from them. There are a lot of corner cases it seems, and this format is somewhat outside the parameters typically expected by vctrs so we can't rely on it to handle all of them (especially not if we want things to be fast, as several of its operators are implemented on top of "proxy" objects which in our case are inefficient to generate, so I basically have to implement those operations myself).

  • Basic math ops work now and should be efficient, including matrix multiplication which is translated into a tensor multiply using tensorA::mul.tensor() (if folks prefer a different dependency for tensor multiplication let me know). So this works:

> A = rdo(matrix(rnorm(6, 1:6), ncol = 3))
> A
rvar<4000>[2,3]:
     [,1]      [,2]      [,3]     
[1,] 0.97±1.00 2.99±1.00 5.00±1.01
[2,] 1.99±1.00 3.98±0.98 6.01±1.01
> B = rdo(matrix(rnorm(6, 1:6), ncol = 2))
> B
rvar<4000>[3,2]:
     [,1]      [,2]     
[1,] 0.98±1.01 3.98±1.01
[2,] 2.02±0.99 4.99±0.99
[3,] 3.01±1.01 6.00±1.00
> A %*% B
rvar<4000>[2,2]:
     [,1]    [,2]   
[1,] 22±7.3  49±10.7
[2,] 28±8.5  64±11.5

And it's not bad speed wise either (compare to rdo(A %*% B), which does the naive thing of a matrix multiply within each draw):

> microbenchmark(A %*% B, rdo(A %*% B))
Unit: milliseconds
         expr      min       lq       mean   median       uq      max neval
      A %*% B   1.1267   1.2926   1.658963   1.4364   1.5539  17.2971   100
 rdo(A %*% B) 109.1457 127.0384 139.782670 136.3684 145.5802 391.3957   100

This did require treating the objects as S4 (at least, setting the S4 object flag on them) as %*% does not dispatch correctly on S3 objects. So it is a sort of mostly-S3-except-for-%*% implementation at the moment.

  • You can now use draws_of() and draws_of<-() to get/set the underlying array if you need to manipulate the array directly for some more efficient operation. Not sure of the name, but that's easily changed. E.g.: (note the abbreviated output of str() as well):
> str(A)
 rvar<4000>[2,3] 0.97±1.00 1.99±1.00 2.99±1.00 3.98±0.98 ...
> str(draws_of(A))
 num [1:2, 1:3, 1:4000] 0.748 1.275 5.071 3.194 5.397 ...
 - attr(*, "dimnames")=List of 3
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
  • Currently I am allowing operations only between objects with the same number of draws, or with ndraws = 1. The latter case is used to handle constants: if you cast a numeric to an rvar, for example, it becomes an rvar with 1 draw, which can then easily be broadcast to whatever number of draws are necessary when doing operations with other rvars.

  • I like @paul-buerkner's idea of storing chain info as a parallel attribute, I think that's probably the way to go if we do put chain info in these objects.

  • With respect to dimension order, I am continuing to prototype with draws last since that's where I started. Once I have tests and if there's a strong motivation I will try a draws-first structure to see if it's more efficient.

@paul-buerkner
Copy link
Collaborator

This all looks really good! Please tell us when you need more feedback on something or want us to try something out.

@jgabry
Copy link
Member

jgabry commented Nov 14, 2019

Very cool! (And what Paul said.)

@paul-buerkner
Copy link
Collaborator

There has been some discussion about features in posteriordb, that I think will essentially come down to features in posterior and perhaps speciall the rv-like interface (not sure about the latter yet). Here is the link:https://discourse.mc-stan.org/t/beta-release-bayesian-posterior-database/12141/19

@mjskay what is your current status with the feature? No rush of course, but I just super excited for it!

@mjskay
Copy link
Collaborator Author

mjskay commented Dec 6, 2019

Ah cool! So, the basic rvar works if you check out the rv-like branch. Over the Christmas break I am hoping to put together a little notebook (maybe a vignette) introducing the syntax and demo-ing it a bit in order to get feedback, but if you want to give it a try sooner you can check out that branch.

There are two big TODOs at the moment:

  • rvars still do not work perfectly with dplyr / tibbles. I have tried several ways of organizing the object internally and each one exhibits weird corner cases with tibbles or some tidyverse functions. You can put rvars into tibbles fine, but can't really use them with grouped tibbles or with some specific operations in dplyr / tidyr. I think this is a bigger problem that I will need to raise as an issue with the tidyverse folks to either figure out what I am doing wrong or if rvars are just stretching the concepts behind vctrs in ways that were not anticipated.
    For usage of rvars in posterior I don't think this matters much, but it's something I want to solve as I would like to be able to use rvars in tidybayes.
  • I haven't built anything like a draws_rvar_list object yet; i.e. a collection of rvars that could represent a full posterior. I assume this would be relatively straightforward: new_rvar() can take arrays of arbitrary dimension and spit out rvars. I suspect most of the functions in the package that currently apply to draws objects could get implementations for rvars and then most functions for a draws_rvar_list object could just apply themselves over all rvars in the collection.

@paul-buerkner
Copy link
Collaborator

Brilliant, thank you! Did you had some more thoughts about the positioning of the iteration dimension already (first or last)?

@mjskay
Copy link
Collaborator Author

mjskay commented Dec 6, 2019

I haven't had a chance to experiment with a version with the iteration dimension first, so I don't have any new thoughts on that yet. Would that be helpful?

@paul-buerkner
Copy link
Collaborator

paul-buerkner commented Dec 6, 2019 via email

@paul-buerkner paul-buerkner added this to the beta release milestone Dec 10, 2019
@mjskay
Copy link
Collaborator Author

mjskay commented Dec 10, 2019

Ah, I meant: would me experimenting with that format be helpful? I take it from your response the answer is yes :).

I have waited on playing with that in the short term because I want to figure out the issues with vctrs first.

@paul-buerkner
Copy link
Collaborator

I think it would be good to have a beta release of posterior soon so I am trying to clean up the remaining issues with the beta-release tag.

@mjskay you indicated that you currently do not have much time to work on the rv-like feature. Does that still hold? If yes, I would like to drop this issue from the beta-release list as I don't see a problem in adding this feature later. Would you agree?

If we decide to drop this issue from the beta-release, we need to figure out what happens to issue #13 in that regard. I would like to have draws arrays of arbitrary dimension for brms, ideally already in the beta/CRAN release although it is not a problem if it does not end up there. Last time, we spoke we agreed on that the rv-like interface could provide us with such arrays quite naturally. So the question would be whether to have an rv-independent implementation of multi-dimensional arrays or to post-pone #13 as well.

@mjskay
Copy link
Collaborator Author

mjskay commented Mar 24, 2020

Good question! My time remains pretty limited (currently changing a class over to online instruction and preparing another one). Probably won't have dedicated time until the end of semester (late April).

I just took a look to see if the tibble/dplyr support was working now with recent changes in the dev versions of vctrs and dplyr. I think it's closer to working and will be easier to get right. I had hoped there would be a quick fix for the outstanding problems with tibble integration, but I expect I'll still need some dedicated time to get everything I want working working on that front.

What's your timeline from beta to CRAN? If a CRAN submission is not going to happen until May anyway I could see merging an experimental version of the rvar interface into the master branch by the end of this week (as that would not be much work, compared to getting it integrated into tibbles properly). Then it can at least be in the beta. The main concern I have with pushing a version to CRAN before tibble integration works is I don't want to have to re-architect the structure of the rvar type much after it goes to CRAN, and I can see that potentially being needed to properly support tibbles. However, I suspect the part of the interface you would need for brms would not change much.

So if the planned CRAN timeline is further out (May or later) we could consider rolling it into the beta.

If the planned CRAN timeline is sooner we might want to wait until the next version.

To your last question, either way I don't think we'll need an additional interface for draws of arbitrary dimension.

@paul-buerkner
Copy link
Collaborator

Thanks, that makes sense! I am not too worried about having a CRAN release soon but having posterior in a state in which it is releasable in principle. This is important as posterior already serves as the backend of two important packages, cmdstanr by @jgabry and posteriordb by @MansMeg. If one of these packages have to be released, so does posterior. I am not sure about the schedule for these projects but we should make sure posterior is in a good shape regardless. Accordingly, I think it makes sense to postpone merging the rv-interface until you have the time to make sure it works as expected. If that means it won't make it into beta/first CRAN release that this is totally fine.

The rv-like interface has some potential implications not only for #13 but also for some of the generics, we want to move to posterior (see #39). I will comment on this separately in the corresponding issue.

@mjskay
Copy link
Collaborator Author

mjskay commented Jul 5, 2020

Alright! Finally getting back to this. Lots of updates on the rv-like branch if folks want to check it out. Definitely some stuff I need feedback/potential help with now. More details below + questions summarized at the end.

tibbles work

The basic rvar implementation is now fairly feature-complete. Some TODOs/stubs remain but should be marked. New stuff includes the ability to put rvars in data.frames / tibbles, which previously did not work due to issues with vctrs:

x = rdo(rnorm(4, mean = 1:4))
df = tibble(y = letters[1:4], x)
df
# A tibble: 4 x 2
  y          x
  <chr> <rvar>
1 a     0.99±1
2 b     1.98±1
3 c     2.99±1
4 d     4.00±1

compatibility with {ggdist} and {distributional}

I was able to update the ggdist::stat_dist_ family to support visualizing rvars. That means if you install the dev branch of {ggdist} you can do stuff like this:

library(tidyverse)
library(ggdist)    # install the dev branch 
df %>% 
  ggplot(aes(y = y, dist = x)) + 
  stat_dist_halfeye()

image

Or any of the other stat_dist... geoms.

We also now have cdf(), density(), and quantile() functions patterned after how distributional does this. Combining {distributional} functions for analytical distributions (like dist_normal(0,1)) with rvars we can do stuff like this:

library(distributional)
tibble(
  x = list(dist_normal(0,1), rvar(rnorm(4000, 0.5, 0.5))),
  which = c("prior", "posterior")
) %>%
  ggplot(aes(y = "", dist = x, color = which, fill = which)) +
  stat_dist_slab(alpha = 0.5)

image

(as an aside, it would be awesome if eventually brms / rstanarm could output their priors in the {distributional} format so plots like this would be easier to make --- see also mjskay/ggdist#21).

internal format changes

  • The internal format of rvars now has the draws dimension first, as requested by @paul-buerkner.

  • {rray} appears to no longer be on CRAN (since several months now) pending a major re-write, so I re-wrote things to remove that dependency. The main upshot is broadcasting is now a bit more restricted than before: rvars will do broadcasting from dimensions of size 1 up to higher dimensions. This seems to cover most cases and was simplest to implement. Presumably in the future when rray hits CRAN again we can easily modify this to use rray's more complete broadcasting implementation (which I think recycles dimensions as long as they are whole multiples).

expectations

  • E() and Pr() are implemented for doing expectations.

  • mean() and median() currently do means/medians across draws rather than means/medians of the draws. i.e. they return rvars, like this:

x
rvar<4000>[4] mean±sd:
[1] 0.99±1 2.02±1 2.99±1 3.96±1
mean(x)
rvar<4000>[1] mean±sd:
[1] 2.5±0.5

This was intended to mimic base-R mean() (e.g. mean(1:4) -> 2.5) but in retrospect I think that use case might be better served by a different function name (like rvar_mean()). Then mean() would be an alias for E(), which does this:

E(x)   # could have mean(x) do this as well
[1] 0.9935992 2.0236997 2.9876973 3.9577762

What do folks think? That would also make the interface consistent with the {distributional} package, which provides a similar vector format for analytical distributions.

Container type

I have started a basic implementation of a container type (with lots of stop("TODO: IMPLEMENT") stubs) called draws_rvars. I'm not sold on the name, but that's easily changed. The basic format currently is a named list where each element is an rvar:

d = draws_rvars(mu = rvar(rnorm(4000)), Rho = rvar(rethinking::rlkjcorr(4000, 5, 3)))
d
$mu
rvar<4000>[1] mean±sd:
[1] 0.02±1

$Rho
rvar<4000>[5,5] mean±sd:
     [,1]             [,2]             [,3]             [,4]             [,5]            
[1,]  1.00000±0.0e+00 -0.00199±3.2e-01 -0.00208±3.2e-01  0.00027±3.2e-01  0.00020±3.2e-01
[2,] -0.00199±3.2e-01  1.00000±9.1e-17 -0.00050±3.2e-01 -0.00086±3.2e-01  0.00251±3.2e-01
[3,] -0.00208±3.2e-01 -0.00050±3.2e-01  1.00000±9.9e-17 -0.00047±3.1e-01 -0.00739±3.1e-01
[4,]  0.00027±3.2e-01 -0.00086±3.2e-01 -0.00047±3.1e-01  1.00000±1.1e-16 -0.00084±3.1e-01
[5,]  0.00020±3.2e-01  0.00251±3.2e-01 -0.00739±3.1e-01 -0.00084±3.1e-01  1.00000±1.1e-16

attr(,"class")
[1] "draws_rvars" "draws"       "list" 

(I haven't implemented a print function yet, it just uses list print + the print function for rvars). As a proof of concept I have implemented basic conversion to/from draws_matrix:

dm = as_draws_matrix(d)
dm
# A draws_matrix: 4000 draws, and 26 variables
    variable
draw    mu Rho[1,1] Rho[2,1] Rho[3,1] Rho[4,1] Rho[5,1] Rho[1,2] Rho[2,2]
  1   0.52        1    0.070   -0.232    0.261   -0.647    0.070        1
  2   0.29        1   -0.245   -0.347   -0.142   -0.236   -0.245        1
  3  -0.59        1    0.480   -0.081    0.211   -0.219    0.480        1
  4  -0.51        1    0.060    0.476   -0.587    0.381    0.060        1
  5   1.26        1    0.354    0.097    0.398    0.276    0.354        1
  6   2.32        1    0.193    0.093   -0.279    0.388    0.193        1
  7  -1.06        1   -0.049    0.013    0.127   -0.104   -0.049        1
  8   0.05        1   -0.184   -0.115    0.189   -0.107   -0.184        1
  9   1.18        1    0.246    0.335   -0.141    0.524    0.246        1
  10  0.84        1   -0.284    0.122   -0.046   -0.059   -0.284        1
# ... with 3990 more draws, and 18 more variables

And back --- note the dimensions gain names on the way back; this is so that character dimensions (e.g. if it were Rho[x,y]) would get names:

as_draws_rvars(dm)
$mu
rvar<4000>[1] mean±sd:
[1] 0.02±1

$Rho
rvar<4000>[5,5] mean±sd:
  1                2                3                4                5               
1  1.00000±0.0e+00 -0.00199±3.2e-01 -0.00208±3.2e-01  0.00027±3.2e-01  0.00020±3.2e-01
2 -0.00199±3.2e-01  1.00000±9.1e-17 -0.00050±3.2e-01 -0.00086±3.2e-01  0.00251±3.2e-01
3 -0.00208±3.2e-01 -0.00050±3.2e-01  1.00000±9.9e-17 -0.00047±3.1e-01 -0.00739±3.1e-01
4  0.00027±3.2e-01 -0.00086±3.2e-01 -0.00047±3.1e-01  1.00000±1.1e-16 -0.00084±3.1e-01
5  0.00020±3.2e-01  0.00251±3.2e-01 -0.00739±3.1e-01 -0.00084±3.1e-01  1.00000±1.1e-16

attr(,"class")
[1] "draws_rvars" "draws"       "list" 

The back-conversion will also sort dimensions and fill in missing dimensions, in case the input does not include every cell of the array:

as_draws_rvars(dm[,c("Rho[3,2]", "Rho[1,1]", "Rho[2,2]")])
$Rho
rvar<4000>[3,2] mean±sd:
  1               2              
1  1.0000±0.0e+00      NA±NA     
2      NA±NA       1.0000±9.1e-17
3      NA±NA      -0.0005±3.2e-01

attr(,"class")
[1] "draws_rvars" "draws"       "list" 

Although I'm not sure about the sorting aspect here (possibly dimensions should be left in whatever the input order is? That would preserve order for dimensions with string names and numeric dimensions as long as they are ordered on input...). It seems there are some corner cases to consider here.

Another bit to consider is what the API for the draws_rvars constructor should be. Currently it expects input to already be rvars, and casts anything that isn't an rvar to one using as_rvar(). However, this yields behavior that is not consistent with the constructors of other draws types, like this:

draws_matrix(x = rnorm(10))
# A draws_matrix: 10 draws, and 1 variables
    variable
draw     x
  1   0.55
  2   0.27
  3   0.44
  4   0.14
  5  -0.55
  6  -1.47
  7   0.82
  8   0.55
  9  -0.83
  10 -1.16

Whereas draws_rvars() does this:

draws_rvars(x = rnorm(10))
$x
rvar<1>[10] mean±sd:
 [1]  0.412±NA -1.416±NA -0.311±NA  0.364±NA  1.326±NA  0.171±NA -0.525±NA -0.070±NA  2.325±NA  0.021±NA

attr(,"class")
[1] "draws_rvars" "draws"       "list"  

This is because it just uses as_rvar(), which converts numerics to constant rvars (rvars with one draw). This is the correct behavior for as_rvar as it allows casting to work correctly when doing math with rvars and numerics. However, it causes an inconsistency in this part of the API --- one option would be to have draws_rvars use rvar() instead of as_rvar() when passed objects that are not rvars... then the behavior would be more consistent with the other draws_ constructors.

Questions / feedback

Summarizing some questions above (plus some new ones):

  • Should mean() behave more like E(), with a separate function for means across draws (like rvar_mean())?
  • What should the rvar summary display be: mean±sd? median±mad? Something else?
  • What name do we want for draws_rvars?
  • Should draws_rvars pass non-rvar objects through rvar() instead of as_rvar() for consistency with other parts of the API?
  • How should we deal with character-vector versus numeric indices when converting to draws_rvars? Is sorting a good or bad idea?
  • What do we want to do with chains? Some options are: make number of chains an attribute of an rvar, make it an attribute of draws_rvars, or split up rvars from different chains into different variables inside a draws_rvars object (I'm not a fan of this last one).

What do folks think? @paul-buerkner @jgabry?

mjskay added a commit that referenced this issue Jul 6, 2020
@mjskay
Copy link
Collaborator Author

mjskay commented Jul 6, 2020

Followup on this:

How should we deal with character-vector versus numeric indices when converting to draws_rvars? Is sorting a good or bad idea?

I just changed the behavior so that numeric indices are sorted and character indices are not. This seems more sensible to me than sorting character indices, but this is easily changed.

@paul-buerkner
Copy link
Collaborator

Thank you @mjskay! This is really exciting!

I need to understand some details a little better, but here are some thoughts on the questions you had:

  • I think that indeed mean() should work the same as E().
  • I am not sure about the best default for the rvar summary, but we should have an argument, globally amendable via options, to switch between mean +- sd and median +- mad. The latter is the more robust but the former is probably what most people would expect as default and what should work in most cases (unless we are dealing with a cauchy, say).
  • draws_rvars is fine for me.
  • I cannot really add much to the rvar() vs as_rvar() discussion but I think consistency is preferable here.
  • Sorting numeric but not character indices sounds good. I would not sort characters, as users may want a certain ordering, for example, similar to how we can change the level ordering of factors.
  • I had though a bit of the chains issue, but didn't really see a natural solution. If we implement chains for rvars, I think we should store them inside rvar itself rather than in draws_rvar. That way, we can use chain information, for example, in rhat even if we just use an rvar object. The question is though, how much overhead this creates. Perhaps not too much if we just store the number of chains and then only construct the draws by chain if requested?

@jgabry
Copy link
Member

jgabry commented Jul 7, 2020

Yeah this is exciting, thanks for working on this! I'm in the process of moving this week so I might not have time to really dive into this until next week. Definitely feel free to keep discussing without me and I'll catch up.

For now just one comment on the chains question. If there's an option include chains that doesn't come with too much overhead, that'd be fine, but I'm coming around to the idea that we don't really need chain information. The ability to work with draws as random variables is most useful after we're happy the chains are ok, so I would think we'd convert to an rvar from e.g. a draws_array after we've checked diagnostics using the draws_array. Does that make sense?

@mjskay
Copy link
Collaborator Author

mjskay commented Jul 8, 2020

Great, thanks both! Some TODOs for myself where there's pretty clear consensus:

  • rename current mean() and median() to rvar_mean() and rvar_median() + implement new mean() and median()
  • add print argument for rvar summary type + options() entry for it. A simple idea for the option would be "rvar_summary" with values "mean_sd" and "median_mad", unless there are objections. Then print.rvar() would gain a summary argument that defaults to the global option unless changed.
  • make as_draws_rvars use rvar instead of as_rvar on non-rvar objects.
  • sort numeric but not character indices on conversion to draws_rvars

Re; chains I am opening a separate issue since I think we need a bit of discussion there.

@mjskay
Copy link
Collaborator Author

mjskay commented Jul 9, 2020

One minor note worth raising as I implement more of draws_rvars: because draws_rvars considers all cells in an array variable as part of a single variable, it counts variables differently from other formats:

d = draws_rvars(x = rnorm(1000), y = rethinking::rlkjcorr(1000, 3, 1))
d
# A draws_rvars: 1000 iterations, 1 chains, and 2 variables
$x: rvar<1000>[1] mean±sd:
[1] -0.067±0.94

$y: rvar<1000>[3,3] mean±sd:
     [,1]            [,2]            [,3]           
[1,]  1.0000±0.0e+00  0.0081±5.1e-01 -0.0054±4.9e-01
[2,]  0.0081±5.1e-01  1.0000±8.1e-17  0.0230±4.8e-01
[3,] -0.0054±4.9e-01  0.0230±4.8e-01  1.0000±1.2e-16

That is 2 variables as a draws_rvars but 10 as any other format:

as_draws_matrix(d)
# A draws_matrix: 1000 draws, and 10 variables
    variable
draw     x y[1,1] y[2,1] y[3,1] y[1,2] y[2,2] y[3,2] y[1,3]
  1   1.29      1   0.68 -0.039   0.68      1  0.133 -0.039
  2  -0.40      1  -0.11 -0.618  -0.11      1  0.235 -0.618
  3  -0.75      1  -0.75 -0.353  -0.75      1  0.712 -0.353
  4  -1.42      1  -0.17  0.046  -0.17      1  0.068  0.046
  5   0.56      1  -0.26  0.257  -0.26      1 -0.391  0.257
  6  -0.54      1  -0.98  0.615  -0.98      1 -0.746  0.615
  7  -0.88      1  -0.60  0.051  -0.60      1  0.506  0.051
  8   0.27      1   0.92  0.632   0.92      1  0.299  0.632
  9   0.92      1  -0.81 -0.642  -0.81      1  0.082 -0.642
  10 -1.15      1  -0.50 -0.812  -0.50      1  0.743 -0.812
# ... with 990 more draws, and 2 more variables

I don't think adjusting nvariables.draws_rvars() to report the same number as the other formats makes sense (because then it would be inconsistent with variables.draws_rvars()), but I wanted to surface this as a minor inconsistency between formats to be aware of.

@paul-buerkner
Copy link
Collaborator

I think it is ok to have this inconsistency given that the definition of 'variable' varies between the rvar format and the others.

@mjskay mjskay linked a pull request Dec 15, 2020 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants