[SPARK-11302] [MLLIB] Multivariate Gaussian Model with Covariance matrix returns incorrect answer in some cases #9293
Test build #1956 has finished for PR 9293 at commit
If you look at the test data, it's obviously constructed so that the first 2 points cluster together and the other 3 cluster together. I verified this is what Mclust gives in R as well.
Test build #44436 has finished for PR 9293 at commit
Is the old formula missing the following:
val v = pinvS * u
(v.t * v, ...)
I think using the root inverse should be cheaper and more accurate.
Yes, that's equivalent numerically, and would require the other changes above. I'm not clear why it's better to do it this way, though. It takes longer to take the square root of the eigenvalues, and then they're just multiplied back together. Otherwise it's the same number of operations here and above.
I think the evidence that it's not accurate enough is the case in the JIRA and the tests here, and also the PySpark test that is wrong at the moment.
There is an extra matrix-matrix multiplication in u * pinvS * u.t. I think the bug is in line 133, where we should use pinvS * u.t instead of pinvS * u. Could you check this solution? Some comments need updates too.
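To make the point about `u.t` concrete, here is a short derivation sketch. It assumes that `u` holds the eigenvectors of the covariance matrix as columns, that `pinvS` is the diagonal matrix of inverse square roots of the eigenvalues (the "root inverse" mentioned above), and that eigenvalues below some tolerance `tol` are zeroed out. The symbols $U$, $D$, $A$ and `tol` are notation chosen here, not taken from the patch.

```latex
\begin{aligned}
\Sigma &= U D U^{\top}, \qquad D = \operatorname{diag}(d_1, \dots, d_n), \qquad U^{\top} U = I \\
\Sigma^{+} &= U D^{+} U^{\top}, \qquad D^{+}_{ii} =
  \begin{cases} 1/d_i & d_i > \text{tol} \\ 0 & \text{otherwise} \end{cases} \\
A &= (D^{+})^{1/2} U^{\top} \;\Longrightarrow\;
  A^{\top} A = U (D^{+})^{1/2} (D^{+})^{1/2} U^{\top} = \Sigma^{+} \\
A' &= (D^{+})^{1/2} U \;\Longrightarrow\;
  A'^{\top} A' = U^{\top} D^{+} U \neq \Sigma^{+} \text{ in general} \\
(x - \mu)^{\top} \Sigma^{+} (x - \mu) &= \lVert A (x - \mu) \rVert^{2}
\end{aligned}
```

Under these assumptions, $A$ corresponds to `pinvS * u.t` and $A'$ to `pinvS * u`. The two agree when $U$ happens to be symmetric or when $D^{+}$ is a multiple of the identity, which would be consistent with the wrong answer appearing only "in some cases". Forming $U D^{+} U^{\top}$ explicitly needs one more matrix-matrix product than forming $A$.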
Adding v.t * v also puts in an extra matrix-matrix multiply. But yes I see your point that u.t alone was the likely original bug. If that fixes it, it's a simpler change and yes that does cost one less matrix multiply. Have a look at #9309
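The difference between `pinvS * u` and `pinvS * u.t` can be checked numerically with a small sketch. This is not the Spark code: plain 2x2 arrays stand in for Breeze matrices, the covariance is a made-up example built from a known eigendecomposition, and every name in `RootSigmaInverseSketch` is hypothetical.

```scala
// Illustrative sketch only: plain nested arrays stand in for Breeze matrices.
object RootSigmaInverseSketch {
  type Mat = Array[Array[Double]]

  // Dense matrix product a * b.
  def mul(a: Mat, b: Mat): Mat =
    Array.tabulate(a.length, b(0).length) { (i, j) =>
      a(i).indices.map(k => a(i)(k) * b(k)(j)).sum
    }

  // Transpose, playing the role of `.t` in the thread.
  def t(a: Mat): Mat = Array.tabulate(a(0).length, a.length)((i, j) => a(j)(i))

  // Diagonal matrix from the given entries.
  def diag(values: Double*): Mat =
    Array.tabulate(values.length, values.length)((i, j) => if (i == j) values(i) else 0.0)

  def isIdentity(a: Mat, eps: Double = 1e-9): Boolean =
    a.indices.forall(i => a(i).indices.forall(j =>
      math.abs(a(i)(j) - (if (i == j) 1.0 else 0.0)) < eps))

  // Assumed example: eigenvectors are a rotation by 30 degrees, eigenvalues 4 and 1,
  // so sigma = u * diag(4, 1) * u.t by construction.
  val theta: Double = math.Pi / 6
  val u: Mat = Array(
    Array(math.cos(theta), -math.sin(theta)),
    Array(math.sin(theta), math.cos(theta)))
  val sigma: Mat = mul(mul(u, diag(4.0, 1.0)), t(u))

  // Diagonal of 1 / sqrt(eigenvalue): the assumed meaning of pinvS in the thread.
  val pinvS: Mat = diag(1.0 / math.sqrt(4.0), 1.0 / math.sqrt(1.0))

  val fixedRoot: Mat = mul(pinvS, t(u)) // pinvS * u.t
  val buggyRoot: Mat = mul(pinvS, u)    // pinvS * u

  // If a is a valid root of the inverse, then a.t * a * sigma is the identity.
  def invertsSigma(a: Mat): Boolean = isIdentity(mul(mul(t(a), a), sigma))

  def main(args: Array[String]): Unit = {
    println(s"pinvS * u.t inverts sigma: ${invertsSigma(fixedRoot)}")
    println(s"pinvS * u   inverts sigma: ${invertsSigma(buggyRoot)}")
  }
}
```

The rotation is deliberately non-symmetric and the eigenvalues are distinct, so the transposed and untransposed forms give different results. With a symmetric eigenvector matrix or equal eigenvalues the two would coincide and the check would not distinguish them.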
…atrix returns incorrect answer in some cases Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test. Supersedes #9293 Author: Sean Owen <[email protected]> Closes #9309 from srowen/SPARK-11302.2. (cherry picked from commit 826e1e3) Signed-off-by: Xiangrui Meng <[email protected]>
Compute sigma pseudo-inverse without square root to avoid precision problems
CC @mengxr @jkbradley