-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-8674] [MLlib] Implementation of a 2 sample Kolmogorov Smirnov Test #7075
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
So this isn't to be merged in its current form? Put |
|
Hey @josepablocam can you rebase this on current master? |
|
@sryza yes, will do. |
…k relevant from 1-sample test
5c9da55 to
e0686a7
Compare
|
@sryza Do you want to make another pass? Please sign off if you think this is ready:) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call this unionedData
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be clearer to write true and false instead of isSample1 and !isSample1? I don't have a strong opinion.
Can the map functions be like .map((_, isSample1)) for tidiness or does that syntax not work here?
Finally I wonder if .union is clearer than ++ here? I don't have a strong opinion either, just am somehow used to method invocations on RDDs and ++-like syntax for Scala collections.
|
@mengxr @josepablocam oops thought it was still a WIP for some reason. Just took a pass. It looks mostly done - I just had a bunch of nits and a test request. |
|
Mmm. I seem to be having some issues building and testing on my laptop. It keeps failing when building Catalyst. I'll try this first thing in the morning at work and push if it passes tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.sortByKey? I think the mapPartitions doesn't need braces, just parens, but that's tiny.
…to account for aliasing of commons' KS test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I'd make the above one "Sample follows the theoretical distribution" or make the bottom one "Both samples follow same distribution".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. Modified the first. Thanks
…ical functions are package private and can be tested more directly
|
jenkins, test this please |
|
This LGTM pending jenkins |
|
test this please |
|
Test build #39872 has finished for PR 7075 at commit
|
|
ugh, did not reword the tests in pyspark after we slightly cleaned up the grammar in the 2 sample test. I will make the ks 2 sample test hypothesis statement match the grammar in the first. Sorry about this! |
… unit test failure by reversing prior grammar change
|
@mengxr is it too late to get this in to 1.5? @josepablocam are you able to resolve merge conflicts? |
|
@sryza fixed merge conflicts |
|
Test build #58715 has finished for PR 7075 at commit
|
|
Test build #69329 has finished for PR 7075 at commit
|
|
@josepablocam, it looks the conflicts were not resolved cleanly. Would you resolve them? |
|
I'm not involved in this anymore. If you're interested in making those
fixes, feel free. Thanks.
…On Jun 19, 2017 5:44 AM, "Hyukjin Kwon" ***@***.***> wrote:
@josepablocam <https://github.com/josepablocam>, it looks the conflicts
were not resolved cleanly. Would you resolve them?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#7075 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AE3GlO5eHVTNq47py-_XuJbhLTFkn89Eks5sFe6tgaJpZM4FNtLA>
.
|
|
We are closing it due to inactivity. please do reopen if you want to push it forward. Thanks! |
## What changes were proposed in this pull request? This PR proposes to close stale PRs, mostly the same instances with apache#18017 I believe the author in apache#14807 removed his account. Closes apache#7075 Closes apache#8927 Closes apache#9202 Closes apache#9366 Closes apache#10861 Closes apache#11420 Closes apache#12356 Closes apache#13028 Closes apache#13506 Closes apache#14191 Closes apache#14198 Closes apache#14330 Closes apache#14807 Closes apache#15839 Closes apache#16225 Closes apache#16685 Closes apache#16692 Closes apache#16995 Closes apache#17181 Closes apache#17211 Closes apache#17235 Closes apache#17237 Closes apache#17248 Closes apache#17341 Closes apache#17708 Closes apache#17716 Closes apache#17721 Closes apache#17937 Added: Closes apache#14739 Closes apache#17139 Closes apache#17445 Closes apache#18042 Closes apache#18359 Added: Closes apache#16450 Closes apache#16525 Closes apache#17738 Added: Closes apache#16458 Closes apache#16508 Closes apache#17714 Added: Closes apache#17830 Closes apache#14742 ## How was this patch tested? N/A Author: hyukjinkwon <[email protected]> Closes apache#18417 from HyukjinKwon/close-stale-pr.
The current patch implements a 2-sample, 2-sided Kolmogorov Smirnov test. Similarly to the 1-sample implementation, we seek to reduce the shuffles necessary for computation. The user can provide 2 RDD[Double] and the Statistics.ksTest function allows them to test the null hypothesis that both samples came from the same probability distribution.
This patch includes the 1-sample test (so that reviewers can see the broader context of the change), however, that portion (and relevant tests) are being reviewed at https://issues.apache.org/jira/browse/SPARK-8598.