-
-
Notifications
You must be signed in to change notification settings - Fork 433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rand_distr: Fix dirichlet sample method for small alpha. #1209
Conversation
FYI: A similar change was made to numpy in numpy/numpy#14924 |
Generating Dirichlet samples using the method based on samples from the gamma distribution can result in samples being nan if all the values in alpha are sufficiently small. The fix is to instead use the method based on the marginal distributions being the beta distribution (i.e. the "stick breaking" method) when all values in alpha are small.
f7873d0
to
745ace8
Compare
Thanks for the PR. At a glance this looks good but I'd prefer it have a proper review from someone besides the author (doesn't have to be myself or an existing contributor, if anyone is interested); I didn't find the time yet myself. |
For anyone who might review this: I added some more details to the description above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to move the initialization of the new algorithm for small alpha from sample
to new
. This probably requires making Dirichlet
an enum similar to this:
struct DirichletFromGamma<F> { alpha: Box<[F]> }
struct DirichletFromBeta<F> { alpha: Box<[F]>, alpha_sum_r1: Box<[F]> }
pub enum Dirichlet<F> where ... {
FromGamma(DirichletFromGamma<F>), FromBeta(DirichletFromBeta<F>)
}
Alternatively, we could use this opportunity to refactor that even more initialization can be moved out of sample
by switching to something akin to the following:
struct DirichletFromGamma<F> { samplers: Box<[Gamma<F>]> }
struct DirichletFromBeta<F> { samplers: Box<[Beta<F>]> }
The last suggestion however would increase storage by a factor of ca. 4, so it would have to be justified with benchmarks. |
@vks, thanks for the review. I haven't forgotten about this! A couple other projects moved to the top of my to-do list, but I will get back to it. |
@WarrenWeckesser can I remind you of this PR? |
@dhardy, yes you can! It has been slowly bubbling back up my "to do" list. I'll get back to it this week. |
I pushed an update that implements my interpretation of the comments made by @vks last year. There are now two structs that implement the two different methods, and This might not be what was intended, and I suspect my Rust code is not idiomatic in a lot of places, so I still view this as a rough draft that will need more iteration. |
I implemented |
Answering my own question: yes, I think making |
I pushed an update to make the implementation follow the style (more or less) of |
rand_distr/src/dirichlet.rs
Outdated
/// Dirichlet distribution that generates samples using the gamma distribution. | ||
FromGamma(DirichletFromGamma<F>), | ||
|
||
/// Dirichlet distribution that generates samples using the beta distribution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
/// Dirichlet distribution that generates samples using the beta distribution. | |
/// Dirichlet distribution that generates samples using the Beta distribution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks for the update! Besides a few nits, I think we need to change the error handling to propagate Result
s instead of unwrapping them.
I did not review the algorithms yet.
Design question: how to deal with a limitation of const generics? My update to this PR will have the same basic design, but the use of, for example,
Any recommendations? Any other ideas? |
I'm going to go with |
Absolutely not by default. It's an unstable feature and only available in nightly Rust. (In theory you could, behind a Rust features can take a long time to make it into stable. I guess we'll have
A few dozen wasted bytes, but it's fine.
Also fine. A little less memory usage, but also less data locality. |
The main change is that Dirichlet now has a const trait N for the dimension. The other significant change is to propagate errors that could occur when Beta:new or Gamma::new is called.
I have updated the pull request with significant changes, both to handle In There are now more |
The only options I can think of are panic with |
In such cases, the easiest way is to initialize the array with some default values. We could use Avoiding the indirection requires unsafe code, or a safe abstraction like |
Then when
we would need something like
correct? Another less-than-ideal alternative is to initialize the array as
Yet another alternative for using an array is to assemble the Or, I'm fine with "Just leave it as it is" or "Just do X". |
I'd forgotten (or missed) this method of constructing an array: https://doc.rust-lang.org/stable/std/vec/struct.Vec.html#impl-TryFrom%3CVec%3CT%2C%20A%3E%3E-for-%5BT%3B%20N%5D |
If I make these changes:
and I run
I have to add the bound |
Of course, just minutes after I comment, I dig a bit more and realize the cause of the issue. It is |
I pushed an update with this change. |
|
||
/// gamma_dists.try_into() failed (in theory, this should not happen). | ||
GammaArrayCreationFailed, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure about the error types here. Ideally, they should refer to invalid parameters supplied by the caller.
However, in this case they are only used internally, so I think this is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for taking a while to come back to this. It looks great, thanks!
Generating Dirichlet samples using the method based on samples from the gamma distribution can result in samples being nan if all the values in alpha are sufficiently small. The fix is to instead use the method based on the marginal distributions being the beta distribution (i.e. the "stick breaking" method) when all values in alpha are small.
More details:
Here's an example where the current method produces
nan
s:Typical output:
The method based on samples from the gamma distribution is described in the wikipedia page on the Dirichlet distribution. The problem is that when alpha is small, there is a high probability that the gamma random variate will be 0 (given that we are limited to finite precision floating point). In fact, we can predict the above result: with alpha=0.001, the probability that a gamma variate G(alpha, 1) will be less than 2.225e-308 (approximately the smallest normal 64 bit floating point value) is approximately 0.4927. For the Dirichlet distribution with vector parameter [0.001, 0.001, 0.001], the problem occurs when all three gamma variates are 0, which has probabilty (0.4927)**3 = 0.1196. So when generating 1000 samples, we expect about 120 to contain nan.
A way to avoid this problem is to switch to the less efficient method based on the marginal beta distributions of the Dirichlet distribution. This method is also described on the wikipedia page. In this PR, this method is used when all the alpha values are less than 0.1. This threshold was discussed in the NumPy PR, where it seemed like a reasonable compromise between (i) using the gamma variate method for as wide a range as possible and (ii) ensuring that the probability of generating nans is negligibly small.