[SPARK-19635][ML] DataFrame-based API for chi square test by jkbradley · Pull Request #17110 · apache/spark

jkbradley · 2017-03-01T02:36:12Z

What changes were proposed in this pull request?

Wrapper taking and returning a DataFrame

How was this patch tested?

Copied unit tests from RDD-based API
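
For illustration, a call to the DataFrame-based wrapper might look like the sketch below. It assumes the entry point is `ChiSquareTest.test` in `org.apache.spark.ml.stat`, as it appears in released Spark versions, and that the result is a single row with `pValues`, `degreesOfFreedom`, and `statistics` columns. The sample data and the object name `ChiSquareTestExample` are made up for this example.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.sql.SparkSession

// Illustrative driver; the object name and the data are assumptions.
object ChiSquareTestExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ChiSquareTestExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Each row is (label, features); every feature is treated as categorical.
    val data = Seq(
      (0.0, Vectors.dense(0.5, 10.0)),
      (0.0, Vectors.dense(1.5, 20.0)),
      (1.0, Vectors.dense(1.5, 30.0)),
      (0.0, Vectors.dense(3.5, 30.0)),
      (0.0, Vectors.dense(3.5, 40.0)),
      (1.0, Vectors.dense(3.5, 40.0))
    )
    val df = data.toDF("label", "features")

    // One independence test per feature against the label, returned as one row.
    val result = ChiSquareTest.test(df, "features", "label").head
    println(s"pValues = ${result.getAs[Vector]("pValues")}")
    println(s"degreesOfFreedom = ${result.getAs[Seq[Int]]("degreesOfFreedom").mkString("[", ",", "]")}")
    println(s"statistics = ${result.getAs[Vector]("statistics")}")

    spark.stop()
  }
}
```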

SparkQA · 2017-03-01T03:30:22Z

Test build #73644 has finished for PR 17110 at commit a9a8225.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

imatiach-msft · 2017-03-01T22:17:19Z

+    import spark.implicits._
+
+    SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT)
+    SchemaUtils.checkNumericType(dataset.schema, labelCol)


shouldn't chi square test work for binary type as well? or we don't want to support that?

Sounds reasonable, but let's do that in the future; this is already a lot more types than the RDD-based API supports.

imatiach-msft · 2017-03-01T23:05:35Z

+    SchemaUtils.checkNumericType(dataset.schema, labelCol)
+    val rdd = dataset.select(col(labelCol).cast("double"), col(featuresCol)).as[(Double, Vector)]
+      .rdd.map { case (label, features) => OldLabeledPoint(label, OldVectors.fromML(features)) }
+    val testResults = OldStatistics.chiSqTest(rdd)


it would be nice to optimize this in the future -- since we have schema, if the label and features have been converted to categorical, we can get the unique values right away instead of having to re-generate the maps for distinct labels and features

Definitely; feel free to make a JIRA for it.
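
A rough sketch of the optimization suggested above, assuming the columns carry ML attribute metadata (for example from `StringIndexer` or `VectorIndexer`). The helper names `numCategories`, `labelCategories`, and `featureCategories` are hypothetical and not part of this PR; only the `org.apache.spark.ml.attribute` classes are existing Spark APIs.

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, BinaryAttribute, NominalAttribute}
import org.apache.spark.sql.types.StructType

// Hypothetical helpers: read category counts from schema metadata instead of
// scanning the data for distinct values.
object CategoricalMetadata {
  // Number of categories if the attribute is known to be categorical.
  def numCategories(attr: Attribute): Option[Int] = attr match {
    case nominal: NominalAttribute => nominal.getNumValues
    case _: BinaryAttribute => Some(2)
    case _ => None // numeric or unresolved: would fall back to a data scan
  }

  // Category count for a numeric label column, if its metadata records one.
  def labelCategories(schema: StructType, labelCol: String): Option[Int] =
    numCategories(Attribute.fromStructField(schema(labelCol)))

  // Per-feature category counts for a vector column; None when the column
  // has no per-feature attributes attached.
  def featureCategories(schema: StructType, featuresCol: String): Option[Array[Option[Int]]] =
    AttributeGroup.fromStructField(schema(featuresCol)).attributes.map(_.map(numCategories))
}
```

Whenever either helper returns `None`, the wrapper would still need the existing pass over the data to build the distinct-value maps.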

imatiach-msft · 2017-03-01T23:08:56Z

+    // Detect continuous features or labels
+    val random = new Random(11L)
+    val continuousLabel =
+      Seq.fill(100000)(LabeledPoint(random.nextDouble(), Vectors.dense(random.nextInt(2))))


can the special value that is above the max categorical limit of 10000 be refactored to a constant?

Good idea, done now.
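
As a minimal sketch of that refactoring, the limit of 10000 mentioned above can live in one named constant shared by the check and its tests. The names `ChiSqLimits`, `MaxCategories`, and `requireCategorical` are hypothetical and do not claim to match the identifiers in the actual commit.

```scala
import scala.collection.mutable

// Hypothetical names; only the limit value (10000) comes from the discussion.
object ChiSqLimits {
  // Single source of truth for the maximum number of distinct categories.
  val MaxCategories: Int = 10000

  // Counts distinct values and fails fast once the limit is exceeded,
  // which is how a continuous label or feature would be detected.
  def requireCategorical(values: Iterable[Double], what: String): Int = {
    val distinct = mutable.HashSet.empty[Double]
    val it = values.iterator
    while (it.hasNext) {
      distinct += it.next()
      if (distinct.size > MaxCategories) {
        throw new IllegalArgumentException(
          s"Chi-squared test expects categorical $what, but found more than " +
            s"$MaxCategories distinct values.")
      }
    }
    distinct.size
  }
}
```

A test can then size its input as `ChiSqLimits.MaxCategories + 1` distinct values rather than hard-coding a number that happens to exceed the limit.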

jkbradley · 2017-03-03T19:28:08Z

Actually, synced with @thunterdb and will update design doc to put everything under a "Statistics" object. I'll wait until #17108 gets merged.

imatiach-msft · 2017-03-06T05:51:39Z

cool, I'll hold off on reviewing this for now then

jkbradley · 2017-03-08T23:12:36Z

I just reversed my opinion about a shared "Statistics" object. See #17108 (comment) for details.

I pushed an update per your review @imatiach-msft

SparkQA · 2017-03-09T00:05:36Z

Test build #74227 has finished for PR 17110 at commit 19fa02a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-03-13T23:31:08Z

Ping @imatiach-msft any more comments after the update?

imatiach-msft · 2017-03-13T23:32:50Z

LGTM! nice addition :)

imatiach-msft · 2017-03-13T23:38:00Z

I guess my only concern would be the testing is a bit sparse, but more tests can be added in the future (especially when the MLlib part is removed). It seems it would be better to move more tests from MLlib -> ML over time.

thunterdb · 2017-03-14T21:12:02Z

@jkbradley LGTM, thanks!

jkbradley · 2017-03-17T00:09:41Z

OK merging with master
Thanks @imatiach-msft and @thunterdb !

@imatiach-msft I agree about sparse testing. This has all of the MLlib tests, but we should add more in the future.

DF-based api for chi square test

a9a8225

imatiach-msft reviewed Mar 1, 2017


imatiach-msft approved these changes Mar 1, 2017


update max on num categories for chisqtest to be stored as static val

19fa02a

jkbradley mentioned this pull request Mar 8, 2017

[SPARK-19636][ML] Feature parity for correlation statistics in MLlib #17108

Closed

asfgit closed this in 4c32005 Mar 17, 2017
