Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Control Variates #273

Open
simonrmaskell opened this issue Feb 22, 2023 · 32 comments
Open

Control Variates #273

simonrmaskell opened this issue Feb 22, 2023 · 32 comments
Labels
feature New feature or request

Comments

@simonrmaskell
Copy link

It would be really great if posterior had a capacity to have reduced the standard errors associated with the expectations it can produce (eg mean, var and sd). Recent work [1] has shown that, if we use the score function as a control variate, we can reduce such standard errors and do so in a way that is effective in the context of constrained variables (as often considered in applications of interest in the context of, for example, estimating volatility or a frequency).

Based on initial discussion, we think the approach to adopt is to augment posterior with the capacity to accept, alongside the samples, the gradient of the log posterior of mapped parameters with respect to those mapped parameters (these mapped parameters are unconstrained but are nonlinear functions of the constrained variables). The implementation then decomposes into three components:
* Storage: we will need to use protected variables (called something like .grad_log_p_unconstrained) within the posterior object itself to pass the samples and gradients around the code such that anything we contribute is compatible with existing functionality.
* Interface: We would sensibly subtly augment the functionality of summarise_draws such that it has additional measures such that the user can use (eg summarise_draws(x, "mean_control_variates",”sd_control_variates”), where "mean_control_variates" calculates the mean using control variates and “sd_control_variates” does the same for the standard deviation).
* Estimation: writing the functions to calculate the estimates themselves in such a way that, for example, mean__control_variates fails gracefully (eg by calling mean, without control variates) when .grad_log_p_unconstrained is not populated. These functions are likely to be based heavily on a combination of the code written (in Python here: https://github.com/zhouyifan233/ControlVariates) for [1] and a pre-existing R repo (here: https://github.com/LeahPrice/ZVCV).

We would welcome advice on how to progress towards a pull-request associated with this issue.

[1] S Maskell, Y Zhou and A Mira. Control Variates for Constrained Parameters. IEEE Signal Processing Letters. Vol 29, pp. 2333-2337. 2022

@paul-buerkner
Copy link
Collaborator

I would suggest the following procedure: You post example code for one control variates function here, say for mean_control_variates. Based on this, I will built a small protoype of the described functionality in posterior, which we can then discuss before you start working on a PR that fully implements the feature. What do you think?

@mjskay
Copy link
Collaborator

mjskay commented Feb 22, 2023

Is this another example of the general problem of how to incorporate information from special variables (like weights - see #184) into summary calculations? Maybe the solution to this should also solve #184 and #105.

@paul-buerkner
Copy link
Collaborator

paul-buerkner commented Feb 22, 2023 via email

@LeahPrice
Copy link

I've done up a couple of draft mean_control_variates functions for general input and for Stan input. The latter currently only works if at least one variable is constrained and seems quite inefficient at extracting gradients, but it gives you the general idea. Both are using the simplest control variate I can think of for now (the score function, which is equivalent to using first order zero-variance control variates)

The function for general input and an example is below:

## For output that's independent of Stan

# Obtains the estimates of marginal posterior means
# "samples" is the MCMC draws and "gradients" is the gradients
# of the log unnormalised target (ideally on the unconstrained space)
mean_control_variates <- function(samples, gradients){
  
  # Getting optimal coefficients for control variates
  coefs <- lm(samples ~ gradients)$coefficients
  
  # Obtaining the estimate with control variates
  if (is.null(ncol(coefs))){
    meanCV <- mean(samples - gradients*coefs[-1])
  } else{
    meanCV <- colMeans(samples - gradients%*%coefs[-1,])
  }
  
  return(meanCV)
  
}

# An example of estimating the mean for a 4-dimensional unit Gaussian
library(mvtnorm)
samples <- rmvnorm(100,mean=rep(0,4),sigma=diag(4))
gradients <- -samples
mean_control_variates(samples,gradients)

The function for Stan input and an example is below:

## For Stan output (with a constrained variable)
library(rstan)

# Obtains the estimates of marginal posterior means
# "fit" is the Stan fit obtained from "sampling" or "stan" functions
mean_control_variates_stan <- function(fit){
  
  # Extract constrained samples
  samples <- extract(fit)
  d <- length(samples)-1
  samples <- data.frame(samples)[,-(d+1)]
  N <- nrow(samples)
  
  # Finding gradients of log unnormalised target wrt unconstrained parameters
  gradients <- matrix(NA, nrow = N, ncol = d)
  for (i in 1:N){
    samples_unconstrained <- unconstrain_pars(fit, pars=as.list(samples[i,]))
    gradients[i,] <- grad_log_prob(fit, upars = samples_unconstrained, adjust_transform = TRUE)
  }
  
  # Getting optimal coefficients for control variates
  coefs <- lm(as.matrix(samples) ~ gradients)$coefficients
  
  # Obtaining the estimate with control variates
  if (is.null(ncol(coefs))){
    meanCV <- mean(samples - gradients*coefs[-1])
  } else{
    meanCV <- colMeans(samples - gradients%*%coefs[-1,])
  }
  
  return(meanCV)
}

library(rstan)

scode <- "
parameters {
  real a;
  real<lower=0> b;
} 
model {
  a ~ normal(0, 1);
  b ~ exponential(1);
} 
"

fit <- stan(model_code = scode, iter = 1000, verbose = FALSE) 
mean_control_variates_stan(fit)

Lots of extensions to this are possible, e.g.

I'm happy to add any of this functionality as we go.

@paul-buerkner
Copy link
Collaborator

paul-buerkner commented Mar 22, 2023

I have now added the basic functionality for the control variates feature in the control_variates branch. With this you can do something like:

x <- as_draws_df(example_draws())
x$.gradient_mu <- rnorm(ndraws(x))
x$.gradient_sigma <- rnorm(ndraws(x))

print(x)

reserved_variables(x)
grads <- gradients(x)
str(grads)

summarise_draws(x, mean_control_variates, .args = list(gradients = grads))

The main internal change is that .gradient_* are now reserved variables. What is more, we can extract them with the gradients method.

The mean_control_variates function based on @LeahPrice code provides a blue print for adding other control variates summaries.

The branch lacks doc and testing so far. It is purely to provide the basis for further developing this feature.

What I have not done yet is an automatic detection of the need of gradients and extraction thereof in summarize_draws. I am not sure how feasible this is currently. And I don't know if we actually need it because we can pass the gradients via .args. The latter approach is not the most pretty perhaps but it works until we may have figured out a nicer way of doing this, perhaps following an approach that is more like dplyr::summarise as also discussed in #184 and #105.

@simonrmaskell
Copy link
Author

We (@simonrmaskell, @LeahPrice, @YifanZhou, @jswright-dstl) were talking about what we need to do in response to your kind work on integrating control variates into posterior. We have identified some functionality that we think we should add and are relatively clear on how to go about doing that. Where we are less certain is around the documentation and testing that is needed. Is there an example of something similar with documentation we can use as a calibration point? Similarly, we can come up with some tests that don’t use any specific ppl, but those tests are likely to be quite benign. We’d like to have some tests/examples of how we can use the code with more complex models and can’t see how to do that without using Stan or similar. Do you see such examples as distinct from the tests? If so, my guess is that you see that as a separate repo for the examples that uses posterior. Is that right? In that case, are you happy that the documentation for posterior points to that other repo (since we anticipate people will struggle to use the “right” gradients and really want to make it easy for them to do so)?

@paul-buerkner
Copy link
Collaborator

paul-buerkner commented Apr 3, 2023

Perhaps the documentation and tests that we use for convergence diagnostics (see https://github.com/stan-dev/posterior/blob/master/R/convergence.R for the doc and https://github.com/stan-dev/posterior/blob/master/tests/testthat/test-convergence.R) could serve as a helpful starting point. It is okay if the tests that live in posterior are relatively simple.

The more complicated tests should probably best live in another repo, as you say. I am happy with the posterior doc pointing to this other repo (or case studies, blog posts or whatever). We could also think about a vignette in posterior itself for this purpose, but this decision is likely better made at a later point, when we have both the implementaitons in posterior and your outside repo in a more concrete shape.

@LeahPrice
Copy link

Thanks @paul-buerkner and sorry I've been slow with this.

I've done a very basic implementation of sd_control_variates in my forked version of this branch. I also started adding some documentation to the control variate functions but it's very poor at this stage and I haven't started documenting the gradients function.

I'd like to improve the control variates but the improvements I have in mind would require access to the following things in the control variate functions:

  • Variable/column names in x so we know which marginal of the parameter samples we're looking at (for implementing control variates in high dimensions)
  • The parameter samples for all dimensions on the constrained space (for vectorising)
  • The parameter samples for one/all dimensions on the unconstrained space (for more advanced control variates - i.e. anything beyond just the gradients as control variates)

I suspect these changes need to be done through additions/edits/special cases in R/summarise_draws.R, and I suspect the order above is in increasing level of difficulty. It would be great to have all three but even 1-2 of these would be helpful. Are you happy for me to have a go at this in my own repo? I figured I'd check first in case this is off limits or in case you have any suggestions.

@paul-buerkner
Copy link
Collaborator

No worries. There is no rush on our side.

I don't fully understand the points (1) to (3) though. Can you please explain them a bit more?

@LeahPrice
Copy link

Sorry I think I got ahead of myself and I should have implemented some examples first.

Is there a way to get gradients automatically? In the example they're calculated separately but there's quite a bit of room for error in this process because people could get gradients for the wrong parameterisation which invalidates the control variates. We're after gradients on the unconstrained space (e.g. $\nabla_{\log\sigma}P(\log\sigma|y)$ rather than $\nabla_{\sigma}P(\sigma|y)$ ). I have some code to calculate this for an example with constraints but it involves a for loop over all of the samples, calling unconstrain_pars and grad_log_prob at each iteration. If we can use the already-calculated gradients that would be ideal.

The questions I asked earlier assumed we had the gradients mentioned above and that the samples x were on the original space (e.g. samples for $\sigma$ rather than $\log\sigma$). The questions probably don't make as much sense now but an alternate description of them is below:

  • The summarise_draws code does a loop over the dimensions of the parameter space, so our x in mean_control_variates gives the samples for only a single parameter. We don't know whether it's mu, tau, theta[1], ... or theta[8] in our example because there's no name attached to x. It would be good to know which parameter we're looking at in x, e.g. through a column name. If we can do that then we can pick just the single control variate that's most relevant to improve performance in high dimensions.
  • Taking the previous point one step further, if we change x from having samples for just a single named parameter to having samples for all parameters, then we can vectorise calculations for a massive speed-up.
  • The two points above talk about x on the original parameter space. Having access to the transformed parameters (e.g. samples for $\log\sigma$ instead of for $\sigma$) as well would allow us to implement better control variates.

Sorry for all of the questions.

@paul-buerkner
Copy link
Collaborator

Is there a way to get gradients automatically? In the example they're calculated separately but there's quite a bit of room for error in this process because people could get gradients for the wrong parameterisation which invalidates the control variates. We're after gradients on the unconstrained space (e.g. rather than ). I have some code to calculate this for an example with constraints but it involves a for loop over all of the samples, calling unconstrain_pars and grad_log_prob at each iteration. If we can use the already-calculated gradients that would be ideal.

This aspect should not be part of posterior, as it needs to be independent of Stan or any other implementation. It can merely take and use what the user says are gradient. It cannot validate if those are the right ones.

The summarise_draws code does a loop over the dimensions of the parameter space, so our x in mean_control_variates gives the samples for only a single parameter. We don't know whether it's mu, tau, theta[1], ... or theta[8] in our example because there's no name attached to x. It would be good to know which parameter we're looking at in x, e.g. through a column name. If we can do that then we can pick just the single control variate that's most relevant to improve performance in high dimensions.

Not sure how realistic this is to implement reliably. Why would we need that?

Taking the previous point one step further, if we change x from having samples for just a single named parameter to having samples for all parameters, then we can vectorise calculations for a massive speed-up.

Agreed. There should be multiple methods with the same function name (generic). One that takes a vector to work with summarize_draws on a parameter by parameter basis. And one that takes full draws objects to vectorize computation. We should offer both.

The two points above talk about x on the original parameter space. Having access to the transformed parameters (e.g. samples for log sigma instead of for sigma) as well would allow us to implement better control variates.

Not sure I understand. In what place do we need to have access to that (in terms of control variates)? I assume you don't just mean taking the logarithm of the samples within posterior, which would of course be trivial to do.

@LeahPrice
Copy link

Would it be possible to have two options: providing gradients or provide a Stan object? I'd be much more comfortable if we had a Stan-specific option available so it's more foolproof.

Not sure how realistic this is to implement reliably. Why would we need that?

The number of control variates depends on the dimension $d$ of the parameter space. There are just $d$ control variates in the simple first-order methods we've got implemented now but $k$ th order control variates use choose(d+k,d)=O(d^k) control variates so the number blows up massively with $d$. This is where knowing the dimension you're looking at comes in. If you know you're estimating the mean of say the 7th parameter then perhaps you can just use the control variates based on the 7th parameter. That'll take it down from O(d^k) to O(k) control variates.

Therefore knowing which dimension you're looking at lets you use more advanced control variates without requiring you to estimate way more coefficients. This is important because we often need higher order control variates, or at least the second order control variates, to get substantial variance reductions. E.g. if you have a Gaussian target then first order control variates are exact for the mean, second order control variates are exact for the variance, and so on for kth order control variates and the kth moment.

We should offer both.

Thanks, that's good to hear you agree about offering a vectorised option.

Not sure I understand. In what place do we need to have access to that (in terms of control variates)? I assume you don't just mean taking the logarithm of the samples within posterior, which would of course be trivial to do.

We need this information for higher order control variates. For example, second order control variates have terms that are of the form 1+TransformedSamples*gradients.

@paul-buerkner
Copy link
Collaborator

Would it be possible to have two options: providing gradients or provide a Stan object? I'd be much more comfortable if we had a Stan-specific option available so it's more foolproof.

I of course understand your concern. As discussed in our joint meeting, that (stan dependent) method would need to be implemented in a separte package, not in posterior. This separate pacakge would of course interface with all the stuff we now implement in posterior and additionally could provide more validation etc. depending on the used backend (e.g., Stan).

About the specific selection of control variates, is that something to leave up to the user? Or implement in your higher-level package that also does the validation?

Your questions seem quite advanced right now and don't raally touch the basic implementation of control variates in posterior at the moment. Do you think it could make sense to come back to these questions at a later point, once the basic implementation (of stuff we have to do anyway) is done?

@avehtari
Copy link
Collaborator

avehtari commented Aug 4, 2023

Thinking about the generic implementation in posterior (and more elaborate features can then be elsewhere), I have a question: If

x$.gradient_mu <- rnorm(ndraws(x))
x$.gradient_sigma <- rnorm(ndraws(x))

print(x)

then if we subset x, say

mutau = subset_draws(x, variable = c("mu", "tau"))

does that subset include gradients for mu and sigma? If it does, it seems that would (at least partially) solve the dimensionality issue at this point?

I think the control variates would be mostly used for a small number of quantities of interest and not automatically used for every possible parameter in the model.

@paul-buerkner
Copy link
Collaborator

I am not sure I understand your question. Can you elaborate?

@avehtari
Copy link
Collaborator

What happens if you run the following code?

x <- as_draws_df(example_draws())
x$.gradient_mu <- rnorm(ndraws(x))
x$.gradient_sigma <- rnorm(ndraws(x))

xmu <- subset_draws(x, variable="mu")

Does xmu include all gradients or only .gradient_mu? Is it possible to subset gradients?

@paul-buerkner
Copy link
Collaborator

paul-buerkner commented Aug 22, 2023

So far, all gradients are retained upon subsetting. But we could perhaps add an option to automatically subset gradients with the same name as the selected variables. What do you think?

@avehtari
Copy link
Collaborator

I think "subset gradients with the same name as the selected variables" would provide sufficient functionality related to what @LeahPrice was asking above

@LeahPrice
Copy link

Thanks @avehtari , yes that would cover one of the main things I was asking about and it'd definitely be helpful. It'd be needed to get reasonable performance when the number of MCMC iterations is less than the number of dimensions. The other two main things that would help a lot are access to the samples on the unconstrained space and access to the gradients of the log target on the unconstrained space. It seems like there's a lot of room for error if users need to manually get the latter themselves.

@avehtari
Copy link
Collaborator

The other two main things that would help a lot are access to the samples on the unconstrained space and access to the gradients of the log target on the unconstrained space. It seems like there's a lot of room for error if users need to manually get the latter themselves.

These things will be added to CmdStanR, RStan, etc. interfaces after the posterior package provides the framework-agnostic functions.

@simonrmaskell
Copy link
Author

I think the "problem" is that if neither Posterior or the Stan software components take ownership of the specifics of the interface, users will make mistakes and won't use the functionality. I think our proposed solution is therefore to progress with integrating control variates into Stan, rather than into Posterior. That seems suboptimal.

@paul-buerkner
Copy link
Collaborator

I think we have discussed this, haven't we? Posterior shall not be dependent on Stan or any other specific PPL and we currently don't intent to change this principle for control variates. Therefore, I think you thought about a separate package which builds on the posterior functionality but ensures the correct extraction from Stan or other PPLs(?) At least I remember something along those lines, I think.

@simonrmaskell
Copy link
Author

Yes. We did talk about the idea of using a separate package. However, a separate package that uses control variates already exists (thanks to efforts by @LeahPrice). Given that, it's not clear to us what the merits would be of having a second separate package and interfacing that to Posterior. That's what motivates our revised plan to migrate to integrating control variates into Stan itself.

@paul-buerkner
Copy link
Collaborator

paul-buerkner commented Aug 22, 2023

I see, sorry I forgot about your existing package in my line of argument above. So this existing package could theoretically be refactored to use the posterior implementation (once implemented) and validate the Stan input, right? Kind of as it does right now but with native posterior support. Back when we discussed this, I understood this was your plan, I think.

Or are you suggesting that given your existing package's functionality, a separate feature of this in posterior (this issue here) would not make sense unless we implemented input Stan validation directly in posterior?

@simonrmaskell
Copy link
Author

I think our view now (which I recognise may have evolved and may sensibly continue to do so) is that given the separate package can already calculate the control variates, assuming we ensure it can interface that package correctly to Stan etc, it's not clear to us what benefit we get (or a user of either package gets) from us moving some of the functionality that is already in that package into Posterior. Is it clear to you what those benefits would be?

@paul-buerkner
Copy link
Collaborator

Your package would be more integrated with the Stan universe of package via posterior, which will probably make it easier to use by a larger group of people and thus ultimately applied more. Whether this is enough of a motativation for you, I don't know of course. If you like, I am also happy to have another call with you to make sure we are all on the same page.

@avehtari
Copy link
Collaborator

I think the "problem" is that if neither Posterior or the Stan software components take ownership of the specifics of the interface, users will make mistakes and won't use the functionality.

That's why I said that specifics will be implemented in the interface the users are using. I feel like there is still some confusion on the modular structure of the Stan software.

That's what motivates our revised plan to migrate to integrating control variates into Stan itself.

It's not clear what you mean by "Stan itself". Currently, all posterior summaries are computed by various interfaces like CmdStan (in C++), RStan (it's own R code but switching to posterior package), PyStan (own Python code or ArviZ), CmdStanR (using posterior package). CmdStanR uses posterior package to handle draws and summaries. If you want CmdStanR user to use control variates, it would be natural to support the existing way to get summaries, that is using the posterior package. Thus, I recommend adding 1) some code to CmdStanR to handle gradients and constrain/unconstrain part of the control variates, 2) some code to posterior package to allow storing the control variate information along the draws, 3) some code to the posterior package or to a separate package to do the control variate computation given the draws object that includes the control variate information.

Having no control variate related support in the posterior package would mean more work in the separate package to support different posterior draw formats (data frame, array, matrix, rvars). I agree with Paul that

Your package would be more integrated with the Stan universe of package via posterior, which will probably make it easier to use by a larger group of people and thus ultimately applied more.

@simonrmaskell
Copy link
Author

Sorry to have been imprecise about "Stan itself", I meant that we currently plan to expose the control variates functionality via CmdStan with a view to CmdStanR and CmdStanPy then being able to use the functionality. If we did that, we'd need to modify some of the Stan backend to make that work. There is a design document being prepared and comments will be welcome in due course. However, I think I now see what you mean; I hadn't fully appreciated that CmdStanR uses posterior and could therefore ensure that the interfacing was correct. That would remove the issue of needing this additional package, which has been motivating us to consider the alternative approach. I'll follow up with the CmdStanR folks to understand how that could work from that end. Thank you.

@simonrmaskell
Copy link
Author

Looks like cmdstanr now has the functionality we need....

@avehtari
Copy link
Collaborator

Looks like cmdstanr now has the functionality we need....

Yes, CmdStanR has now functionality for getting gradients and handling unconstrain/constrain mapping. When I wrote above that more code would be added to CmdStanR, I meant that additional code adding the desired control variate information to the posterior object, but that requires first some support in the posterior package, which is the topic of this issue.

Would you like to have a call to clarify the role of different packages and available functionality?

@simonrmaskell
Copy link
Author

Almost certainly... Will follow up via email.

@simonrmaskell
Copy link
Author

@alecksphillips: as discussed, be good if you could pick this up

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants