SPARK-1438 RDD make seed optional in RDD methods sam... #462
Changes from all commits
```diff
@@ -466,6 +466,12 @@ class RDDSuite extends FunSuite with SharedSparkContext {
   test("takeSample") {
     val data = sc.parallelize(1 to 100, 2)
 
+    for (num <- List(5,20,100)) {
+      val sample = data.takeSample(withReplacement=false, num=num)
```

Contributor: I think there might be some tab character weirdness going on, because these statements don't line up correctly.

Contributor (author): @advancedxy In java.util.Random the default seed is a function of System.nanoTime (at least in the OpenJDK code). In Python it's based on time.time(), which is at millisecond precision. Not sure if there is a Python method to get nanoTime.

Contributor: @smartnut007 Based on https://docs.python.org/2/library/time.html#time.time, time.time() is at second precision. But since it returns a float, I think we can use long(time.time() * 10**9) to get nanosecond precision.
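
As an aside, here is a minimal, self-contained Scala sketch of the default-seed pattern under discussion (the names SeedDemo and takeSampleSketch are invented for illustration and are not part of the patch): when the caller omits the seed, it falls back to System.nanoTime, which is roughly what java.util.Random's no-argument constructor derives its seed from.

```scala
import scala.util.Random

object SeedDemo {
  // Hypothetical helper, not Spark code: an optional seed defaulting to a
  // nanosecond-based value, so omitting it gives a different sample each run.
  def takeSampleSketch(data: Seq[Int], num: Int, seed: Long = System.nanoTime): Seq[Int] = {
    val rng = new Random(seed)
    rng.shuffle(data).take(num)
  }

  def main(args: Array[String]): Unit = {
    val data = 1 to 100
    println(takeSampleSketch(data, 5))            // seed drawn from System.nanoTime
    println(takeSampleSketch(data, 5, seed = 42)) // reproducible with an explicit seed
  }
}
```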
```diff
+      assert(sample.size === num) // Got exactly num elements
+      assert(sample.toSet.size === num) // Elements are distinct
+      assert(sample.forall(x => 1 <= x && x <= 100), "elements not in [1, 100]")
+    }
     for (seed <- 1 to 5) {
       val sample = data.takeSample(withReplacement=false, 20, seed)
       assert(sample.size === 20) // Got exactly 20 elements
@@ -483,6 +489,19 @@ class RDDSuite extends FunSuite with SharedSparkContext {
       assert(sample.size === 20) // Got exactly 20 elements
       assert(sample.forall(x => 1 <= x && x <= 100), "elements not in [1, 100]")
     }
+    {
+      val sample = data.takeSample(withReplacement=true, num=20)
+      assert(sample.size === 20) // Got exactly 20 elements
+      assert(sample.toSet.size <= 20, "sampling with replacement returned all distinct elements")
+      assert(sample.forall(x => 1 <= x && x <= 100), "elements not in [1, 100]")
+    }
+    {
+      val sample = data.takeSample(withReplacement=true, num=100)
+      assert(sample.size === 100) // Got exactly 100 elements
+      // Chance of getting all distinct elements is astronomically low, so test we got < 100
+      assert(sample.toSet.size < 100, "sampling with replacement returned all distinct elements")
+      assert(sample.forall(x => 1 <= x && x <= 100), "elements not in [1, 100]")
+    }
     for (seed <- 1 to 5) {
       val sample = data.takeSample(withReplacement=true, 100, seed)
       assert(sample.size === 100) // Got exactly 100 elements
```
|
```diff
@@ -256,10 +256,11 @@ class SchemaRDD(
    * @group Query
    */
   @Experimental
+  override
   def sample(
-      fraction: Double,
       withReplacement: Boolean = true,
-      seed: Int = (math.random * 1000).toInt) =
+      fraction: Double,
+      seed: Long) =
```

Contributor: Did you intend to remove the default behavior here?

Contributor (author): Scala does not allow multiple overloaded methods to have default parameters, so once seed was given a default in RDD.sample, this method had to be modified; it was changed in the standard way. Also, I believe the author intended to override RDD.sample, not overload it. More details are in the PR comment.
```diff
     new SchemaRDD(sqlContext, Sample(fraction, withReplacement, seed, logicalPlan))
 
   /**
```

I think it's important to use Int instead of Long. Since the current code is written against Int, changing to Long means old code using the sample API can no longer compile because of the type mismatch.

Long is going to be more standard; certainly Java uses long seeds in its APIs. It gives more bits of seed, which is good too. There may be some API changes, but anyone calling with an Int seed should be able to call an API with a Long seed, right?
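
A quick standalone check of that claim (a sketch with made-up names, not the Spark API): at a Scala call site an Int argument widens to Long automatically, so most existing source keeps compiling, although the changed signature does break binary compatibility and callers would need to be recompiled.

```scala
object SeedWideningCheck {
  // Hypothetical post-change signature: the seed parameter is now a Long.
  def takeSample(withReplacement: Boolean, num: Int, seed: Long): String =
    s"num=$num seed=$seed replacement=$withReplacement"

  def main(args: Array[String]): Unit = {
    val oldStyleSeed: Int = 12345
    // A call site written against the old Int seed still compiles: Int widens to Long.
    println(takeSample(withReplacement = false, num = 20, seed = oldStyleSeed))
  }
}
```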

Yes, it is more standard to use Long, but we would just need to rewrite some code. We should ask @mateiz or @pwendell for advice; maybe they chose Int for some reason we don't know.

We can have deprecated overloaded methods for backward compatibility if needed.