-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-3250] Implement Gap Sampling optimization for random sampling #2455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you test whether rdd.randomSplit() will produce non-overlapping subsets with this change?
|
add to whitelist |
|
this is ok to test |
|
Can one of the admins verify this patch? |
82f461f to
e29a0ae
Compare
|
@mengxr you're right, a data partitioning use case like rddSplit doesn't work with gap sampling, so I restored a "partitioning" RandomSampler subclass that works for those cases. |
|
test this please |
|
Test FAILed. |
|
Jenkins looks like it failed trying to fetch the repo. |
|
@erikerlandson Jenkins is not very stable. You are on the whitelist, feel free to ask Jenkins to retest this PR. |
|
test this please |
|
QA tests have started for PR 2455 at commit
|
|
QA tests have finished for PR 2455 at commit
|
|
Test FAILed. |
e29a0ae to
b89b591
Compare
|
test this please |
1 similar comment
|
test this please |
|
QA tests have started for PR 2455 at commit
|
|
QA tests have finished for PR 2455 at commit
|
|
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you change the comment to use javadoc style? e.g.
/**
* Default gap sampling maximum.
* ...
*/|
@srowen, regarding the testing for iterator types, inside of 'dd', that was the only way I found (so far) that scala would accept. The best solution (imo) would be if Scala defined a random-access-optimized iterator subclass that I could match on, but there is no such animal. I've been considering requesting one in a Scala PR. |
|
@erikerlandson The feature freeze deadline for v1.2 is this Sat. Just want to check with you and see whether you are going to update the PR this week. |
|
@mengxr, coincidentally I'm working through the PR comments today, I plan to have an update pushed this evening |
|
@erikerlandson Great! Thanks for the heads up. |
|
Was about to push, but looks like commit for SPARK-4022 broke my updates so I'm going to have to make more edits to rebase |
d832364 to
46cb9fa
Compare
|
Test build #22475 has started for PR 2455 at commit
|
|
Test build #22475 has finished for PR 2455 at commit
|
|
Test PASSed. |
|
@mengxr latest updates are rebased and passing Jenkins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.5 -> 0.4?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.5 is what I recommend as an initial guess if one is using a new RNG. (0.4 is what I got by experimenting with the current RNG)
46cb9fa to
72496bc
Compare
|
@mengxr, I changed |
|
Test build #22576 has started for PR 2455 at commit
|
|
Test build #22576 has finished for PR 2455 at commit
|
|
Test PASSed. |
|
LGTM. Merged into master. Thanks for implementing gap sampling! |
More efficient sampling, based on Gap Sampling optimization:
http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/