@erikerlandson
Contributor

More efficient sampling, based on Gap Sampling optimization:
http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/
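
For readers unfamiliar with the technique, here is a minimal sketch of gap sampling for sampling without replacement. It is an illustration under stated assumptions, not the code in this PR: the object name `GapSamplingSketch`, the `sample` signature, and the `epsilon` value are all hypothetical. Instead of drawing one random number per element, it draws one per accepted element and uses that draw to decide how many elements to skip.

```scala
import scala.util.Random

object GapSamplingSketch {
  // Illustrative lower bound on RNG draws so math.log never receives 0.0.
  val epsilon: Double = 5e-11

  /** Keep each element of `data` with probability `fraction`, 0 < fraction < 1. */
  def sample[T](data: Iterator[T], fraction: Double, rng: Random): Iterator[T] = {
    require(fraction > 0.0 && fraction < 1.0, "fraction must be in (0, 1)")
    val lnQ = math.log(1.0 - fraction) // ln(1 - fraction), always negative here

    // Skip a geometrically distributed number of elements: the number of
    // rejections a per-element Bernoulli trial would see before an acceptance.
    def skipGap(): Unit = {
      val u = math.max(rng.nextDouble(), epsilon)
      var gap = (math.log(u) / lnQ).toInt
      while (gap > 0 && data.hasNext) { data.next(); gap -= 1 }
    }

    new Iterator[T] {
      skipGap() // position on the first accepted element
      def hasNext: Boolean = data.hasNext
      def next(): T = {
        val x = data.next()
        skipGap() // position on the next accepted element
        x
      }
    }
  }
}
```

For example, `GapSamplingSketch.sample((0 until 100000).iterator, 0.1, new Random(42))` should yield roughly 10,000 elements in their original order. The saving comes from making about `fraction * n` RNG draws rather than `n`, which matters most when `fraction` is small.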

Contributor

Did you test whether rdd.randomSplit() will produce non-overlapping subsets with this change?

@mengxr
Contributor

mengxr commented Sep 19, 2014

add to whitelist

@mengxr
Contributor

mengxr commented Sep 19, 2014

this is ok to test

@SparkQA

SparkQA commented Sep 19, 2014

Can one of the admins verify this patch?

@erikerlandson
Contributor Author

@mengxr you're right: a data-partitioning use case like rdd.randomSplit() doesn't work with gap sampling, so I restored a "partitioning" RandomSampler subclass that works for those cases.
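
To make the partitioning concern concrete, here is a hedged sketch of a range-based ("cell") sampler. It is not the class from this PR: `CellSamplerSketch`, its `sample` signature, and the explicit `seed` parameter are assumptions made for illustration. The idea is that every split replays the same per-element sequence of random draws and keeps the elements whose draw lands in its own `[lb, ub)` range, so disjoint ranges give disjoint subsets. Gap sampling cannot provide this, because it skips elements without drawing a number for them.

```scala
import scala.util.Random

object CellSamplerSketch {
  /** Keep the elements whose per-element random draw falls in [lb, ub).
    * With complement = true, keep the elements whose draw falls outside it. */
  def sample[T](data: Iterator[T], lb: Double, ub: Double, seed: Long,
                complement: Boolean = false): Iterator[T] = {
    require(lb <= ub, "lb must not exceed ub")
    val rng = new Random(seed) // same seed => same draw sequence for every split
    data.filter { _ =>
      val x = rng.nextDouble() // exactly one draw per element, in iteration order
      val inCell = x >= lb && x < ub
      inCell != complement
    }
  }
}
```

Calling `sample` twice over the same data with the same `seed`, once with the range `[0.0, 0.6)` and once with `[0.6, 1.0)`, should produce two subsets that do not overlap and together cover the input, provided the data is iterated in the same order each time.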

@mengxr
Contributor

mengxr commented Sep 25, 2014

test this please

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20790/

@erikerlandson
Contributor Author

It looks like Jenkins failed while trying to fetch the repo.

@mengxr
Contributor

mengxr commented Sep 25, 2014

@erikerlandson Jenkins is not very stable. You are on the whitelist, so feel free to ask Jenkins to retest this PR.

@mengxr
Contributor

mengxr commented Sep 25, 2014

test this please

@SparkQA

SparkQA commented Sep 25, 2014

QA tests have started for PR 2455 at commit e29a0ae.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 25, 2014

QA tests have finished for PR 2455 at commit e29a0ae.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BernoulliPartitionSampler[T](lb: Double, ub: Double, complement: Boolean = false)
    • class BernoulliSampler[T: ClassTag](fraction: Double) extends RandomSampler[T, T]
    • class PoissonSampler[T: ClassTag](fraction: Double) extends RandomSampler[T, T]
    • class GapSamplingIterator[T: ClassTag](var data: Iterator[T], f: Double,
    • class GapSamplingReplacementIterator[T: ClassTag](var data: Iterator[T], f: Double,

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20811/

@erikerlandson
Contributor Author

test this please

@mengxr
Contributor

mengxr commented Sep 25, 2014

test this please

@SparkQA

SparkQA commented Sep 25, 2014

QA tests have started for PR 2455 at commit b89b591.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 26, 2014

QA tests have finished for PR 2455 at commit b89b591.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BernoulliPartitionSampler[T](lb: Double, ub: Double, complement: Boolean = false)
    • class BernoulliSampler[T: ClassTag](fraction: Double) extends RandomSampler[T, T]
    • class PoissonSampler[T: ClassTag](fraction: Double) extends RandomSampler[T, T]
    • class GapSamplingIterator[T: ClassTag](var data: Iterator[T], f: Double,
    • class GapSamplingReplacementIterator[T: ClassTag](var data: Iterator[T], f: Double,

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20828/

Contributor

Can you change the comment to use Javadoc style? For example:

/**
 * Default gap sampling maximum.
 * ...
 */

@erikerlandson
Contributor Author

@srowen, regarding the testing for iterator types inside of 'dd': that was the only way I found (so far) that Scala would accept. The best solution (IMO) would be for Scala to define a random-access-optimized iterator subclass that I could match on, but there is no such animal. I've been considering requesting one in a Scala PR.

@mengxr
Contributor

mengxr commented Oct 28, 2014

@erikerlandson The feature freeze deadline for v1.2 is this Sat. Just want to check with you and see whether you are going to update the PR this week.

@erikerlandson
Contributor Author

@mengxr, coincidentally I'm working through the PR comments today, and I plan to have an update pushed this evening.

@mengxr
Contributor

mengxr commented Oct 28, 2014

@erikerlandson Great! Thanks for the heads up.

@erikerlandson
Contributor Author

I was about to push, but it looks like the commit for SPARK-4022 broke my updates, so I'm going to have to make more edits in order to rebase.

@SparkQA

SparkQA commented Oct 29, 2014

Test build #22475 has started for PR 2455 at commit 46cb9fa.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 29, 2014

Test build #22475 has finished for PR 2455 at commit 46cb9fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BernoulliCellSampler[T](lb: Double, ub: Double, complement: Boolean = false)
    • class BernoulliSampler[T: ClassTag](fraction: Double) extends RandomSampler[T, T]
    • class PoissonSampler[T: ClassTag](fraction: Double) extends RandomSampler[T, T]
    • class GapSamplingIterator[T: ClassTag](var data: Iterator[T], f: Double,
    • class GapSamplingReplacementIterator[T: ClassTag](var data: Iterator[T], f: Double,
    • class DeferredObjectAdapter(oi: ObjectInspector) extends DeferredObject

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22475/

@erikerlandson
Contributor Author

@mengxr the latest updates are rebased and passing Jenkins.

Contributor

0.5 -> 0.4?

Contributor Author

0.5 is what I recommend as an initial guess if one is using a new RNG. (0.4 is what I got by experimenting with the current RNG.)

@erikerlandson
Contributor Author

@mengxr, I renamed fractionEpsilon to rngEpsilon, which is more suggestive of its purpose. I also updated its documentation, which I think is now clearer about what rngEpsilon is for.

@SparkQA

SparkQA commented Oct 31, 2014

Test build #22576 has started for PR 2455 at commit 72496bc.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 31, 2014

Test build #22576 has finished for PR 2455 at commit 72496bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22576/

@mengxr
Contributor

mengxr commented Oct 31, 2014

LGTM. Merged into master. Thanks for implementing gap sampling!
