Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HyperGeometric distribution gives wrong results #4366

Closed
ricardoV94 opened this issue Dec 21, 2020 · 1 comment
Closed

HyperGeometric distribution gives wrong results #4366

ricardoV94 opened this issue Dec 21, 2020 · 1 comment

Comments

@ricardoV94
Copy link
Member

ricardoV94 commented Dec 21, 2020

Reading #4108 and #4249, it seems that there was some back and forth to try to match the behavior of scipy.stats.hypergeom with that of HyperGeometric, specially to pass the default unit tests when given invalid parameters. Unfortunately the current solution misbehaves often with invalid parameters.

Here is an example:

pm.HyperGeometric.dist(N=3, k=3, n=4).logp(3).eval()  # returns array(1.38629434)
st.hypergeom(M=3, n=3, N=4).logpmf(3)  # returns nan

(n=4 is impossible since it must be smaller or equal than N)

The initial PR draft had a bound to deal with invalid parameters, but it was removed so that logp would return the same values as the scipy counterpart (otherwise we would sometimes get pymc=-inf vs scipy=nan). Unfortunately, as shown in the example above, we cannot trust the pymc3 implementation to reliably return nan for invalid parameters. Scipy seems to have some parameter validation method, but it is not clear to me if and how it is used by the logpmf method:

https://github.com/scipy/scipy/blob/624fd766e2886c458012d98e3c6ec85551ab2916/scipy/stats/_discrete_distns.py#L455-L458

Besides not being reliable, returning nan also seems to deviate from the usual pymc behavior of getting -inf for invalid configurations. In addition, the logcdf method for the HyperGeometric (see #4331) will also be incompatible with results of the scipy counterpart, which actually gives wrong results for some configurations of impossible parameters:

st.hypergeom(M=3, n=3, N=4).logpmf(np.arange(5))  # returns array([nan, nan, nan, nan, nan])
st.hypergeom(M=3, n=3, N=4).logcdf(np.arange(5))  # returns array([nan, nan, nan,  0.,  0.])

Given these issues, I suggest we depart from the scipy implementation and return to using bound to replace impossible configurations with -inf. I am willing to do a PR (or give it to @Harivallabha if he is interested). To overcome the original issue with the unit tests, we can change the check_logp method in pymc3_matches_scipy to replace scipy nan with -inf (perhaps with an optional argument, in case this conflicts with other distributions tests):
https://github.com/pymc-devs/pymc3/blob/9dbf9eabf3b18bf2529a8855c086e90249be0b5e/pymc3/tests/test_distributions.py#L543-L551

What do you think?

@Harivallabha
Copy link
Contributor

I'm not sure how scipy does its parameter validation. Please go ahead and make the PR if you have some idea as to how it can be done! :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants