HyperGeometric distribution gives wrong results #4366

ricardoV94 · 2020-12-21T16:58:20Z

Reading #4108 and #4249, it seems that there was some back and forth to try to match the behavior of scipy.stats.hypergeom with that of HyperGeometric, specially to pass the default unit tests when given invalid parameters. Unfortunately the current solution misbehaves often with invalid parameters.

Here is an example:

pm.HyperGeometric.dist(N=3, k=3, n=4).logp(3).eval()  # returns array(1.38629434)
st.hypergeom(M=3, n=3, N=4).logpmf(3)  # returns nan

(n=4 is impossible since it must be smaller or equal than N)

The initial PR draft had a bound to deal with invalid parameters, but it was removed so that logp would return the same values as the scipy counterpart (otherwise we would sometimes get pymc=-inf vs scipy=nan). Unfortunately, as shown in the example above, we cannot trust the pymc3 implementation to reliably return nan for invalid parameters. Scipy seems to have some parameter validation method, but it is not clear to me if and how it is used by the logpmf method:

https://github.com/scipy/scipy/blob/624fd766e2886c458012d98e3c6ec85551ab2916/scipy/stats/_discrete_distns.py#L455-L458

Besides not being reliable, returning nan also seems to deviate from the usual pymc behavior of getting -inf for invalid configurations. In addition, the logcdf method for the HyperGeometric (see #4331) will also be incompatible with results of the scipy counterpart, which actually gives wrong results for some configurations of impossible parameters:

st.hypergeom(M=3, n=3, N=4).logpmf(np.arange(5))  # returns array([nan, nan, nan, nan, nan])
st.hypergeom(M=3, n=3, N=4).logcdf(np.arange(5))  # returns array([nan, nan, nan,  0.,  0.])

Given these issues, I suggest we depart from the scipy implementation and return to using bound to replace impossible configurations with -inf. I am willing to do a PR (or give it to @Harivallabha if he is interested). To overcome the original issue with the unit tests, we can change the check_logp method in pymc3_matches_scipy to replace scipy nan with -inf (perhaps with an optional argument, in case this conflicts with other distributions tests):
https://github.com/pymc-devs/pymc3/blob/9dbf9eabf3b18bf2529a8855c086e90249be0b5e/pymc3/tests/test_distributions.py#L543-L551

What do you think?

The text was updated successfully, but these errors were encountered:

Harivallabha · 2020-12-21T17:07:16Z

I'm not sure how scipy does its parameter validation. Please go ahead and make the PR if you have some idea as to how it can be done! :D

ricardoV94 mentioned this issue Dec 21, 2020

Add bound to HyperGeometric logp (resolves #4366) #4367

Merged

AlexAndorra closed this as completed in 0402aab Dec 22, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HyperGeometric distribution gives wrong results #4366

HyperGeometric distribution gives wrong results #4366

ricardoV94 commented Dec 21, 2020 •

edited

Loading

Harivallabha commented Dec 21, 2020

HyperGeometric distribution gives wrong results #4366

HyperGeometric distribution gives wrong results #4366

Comments

ricardoV94 commented Dec 21, 2020 • edited Loading

Harivallabha commented Dec 21, 2020

ricardoV94 commented Dec 21, 2020 •

edited

Loading