1 change: 1 addition & 0 deletions docs/mllib-guide.md
@@ -34,6 +34,7 @@ We list major functionality from both below, with links to detailed guides.
* [correlations](mllib-statistics.html#correlations)
* [stratified sampling](mllib-statistics.html#stratified-sampling)
* [hypothesis testing](mllib-statistics.html#hypothesis-testing)
* [streaming significance testing](mllib-statistics.html#streaming-significance-testing)
* [random data generation](mllib-statistics.html#random-data-generation)
* [Classification and regression](mllib-classification-regression.html)
* [linear models (SVMs, logistic regression, linear regression)](mllib-linear-methods.html)
25 changes: 25 additions & 0 deletions docs/mllib-statistics.md
@@ -521,6 +521,31 @@ print(testResult) # summary of the test including the p-value, test statistic,
</div>
</div>

### Streaming Significance Testing
MLlib provides online implementations of some tests to support use cases
like A/B testing. These tests may be performed on a Spark Streaming
`DStream[(Boolean,Double)]` where the first element of each tuple
indicates control group (`false`) or treatment group (`true`) and the
second element is the value of an observation.

Streaming significance testing supports the following parameters:

* `peacePeriod` - The number of initial data points from the stream to
ignore, used to mitigate novelty effects.
* `windowSize` - The number of past batches to perform hypothesis
testing over. Setting to `0` will perform cumulative processing using
all prior batches.


<div class="codetabs">
<div data-lang="scala" markdown="1">
[`StreamingTest`](api/scala/index.html#org.apache.spark.mllib.stat.test.StreamingTest)
provides streaming hypothesis testing.

{% include_example scala/org/apache/spark/examples/mllib/StreamingTestExample.scala %}
</div>
</div>


## Random data generation

@@ -64,6 +64,7 @@ object StreamingTestExample {
dir.toString
})

// $example on$
val data = ssc.textFileStream(dataDir).map(line => line.split(",") match {
case Array(label, value) => (label.toBoolean, value.toDouble)
})
@@ -75,6 +76,7 @@ object StreamingTestExample {

val out = streamingTest.registerStream(data)
out.print()
// $example off$

// Stop processing if test becomes significant or we time out
var timeoutCounter = numBatchesTimeout
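
The `windowSize` semantics described in the documentation change (`0` means cumulative, otherwise only the most recent batches are tested) can be illustrated without Spark. The following is a minimal sketch on plain Scala collections, not MLlib's implementation. It assumes a Welch two-sample t-test, which the patch itself does not name, and every identifier in it (`WelchSketch`, `Summary`, `selectWindow`, `welch`, `run`) is hypothetical. It computes only the test statistic and the degrees of freedom, not a p-value.

```scala
// Illustrative sketch only: plain Scala, no Spark dependency.
// All names below are hypothetical and are not part of the MLlib API.
object WelchSketch {
  // Sample size, mean, and unbiased sample variance of one group.
  final case class Summary(n: Int, mean: Double, variance: Double)

  def summarize(xs: Seq[Double]): Summary = {
    require(xs.size >= 2, "need at least two observations per group")
    val mean = xs.sum / xs.size
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1)
    Summary(xs.size, mean, variance)
  }

  // windowSize == 0 keeps every batch (cumulative processing);
  // otherwise only the last `windowSize` batches are kept.
  def selectWindow[A](batches: Seq[Seq[A]], windowSize: Int): Seq[Seq[A]] = {
    require(windowSize >= 0, "windowSize must be non-negative")
    if (windowSize == 0) batches else batches.takeRight(windowSize)
  }

  // Welch's t statistic (control mean minus treatment mean, an assumed
  // sign convention) and the Welch-Satterthwaite degrees of freedom.
  def welch(control: Summary, treatment: Summary): (Double, Double) = {
    val a = control.variance / control.n
    val b = treatment.variance / treatment.n
    val t = (control.mean - treatment.mean) / math.sqrt(a + b)
    val df = (a + b) * (a + b) /
      (a * a / (control.n - 1) + b * b / (treatment.n - 1))
    (t, df)
  }

  // Each observation is (isTreatment, value), matching the tuple layout
  // described for the input stream: false = control, true = treatment.
  def run(batches: Seq[Seq[(Boolean, Double)]], windowSize: Int): (Double, Double) = {
    val observations = selectWindow(batches, windowSize).flatten
    val (treatment, control) = observations.partition(_._1)
    welch(summarize(control.map(_._2)), summarize(treatment.map(_._2)))
  }
}
```

Calling `WelchSketch.run(batches, 0)` tests over all batches seen so far, while `WelchSketch.run(batches, 2)` restricts the test to the two most recent batches. If both groups have zero variance the statistic is not finite, which a real implementation would need to handle.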