[SPARK-19635][ML] DataFrame-based API for chi square test#17110
[SPARK-19635][ML] DataFrame-based API for chi square test#17110jkbradley wants to merge 2 commits intoapache:masterfrom
Conversation
|
Test build #73644 has finished for PR 17110 at commit
|
| import spark.implicits._ | ||
|
|
||
| SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT) | ||
| SchemaUtils.checkNumericType(dataset.schema, labelCol) |
There was a problem hiding this comment.
shouldn't chi square test work for binary type as well? or we don't want to support that?
There was a problem hiding this comment.
Sounds reasonable, but let's do that in the future; this is already a lot more types than the RDD-based API supports.
| SchemaUtils.checkNumericType(dataset.schema, labelCol) | ||
| val rdd = dataset.select(col(labelCol).cast("double"), col(featuresCol)).as[(Double, Vector)] | ||
| .rdd.map { case (label, features) => OldLabeledPoint(label, OldVectors.fromML(features)) } | ||
| val testResults = OldStatistics.chiSqTest(rdd) |
There was a problem hiding this comment.
it would be nice to optimize this in the future -- since we have schema, if the label and features have been converted to categorical, we can get the unique values right away instead of having to re-generate the maps for distinct labels and features
There was a problem hiding this comment.
Definitely; feel free to make a JIRA for it.
| // Detect continuous features or labels | ||
| val random = new Random(11L) | ||
| val continuousLabel = | ||
| Seq.fill(100000)(LabeledPoint(random.nextDouble(), Vectors.dense(random.nextInt(2)))) |
There was a problem hiding this comment.
can the special value that is above the max categorical limit of 10000 be refactored to a constant?
|
Actually, synced with @thunterdb and will update design doc to put everything under a "Statistics" object. I'll wait until #17108 gets merged. |
|
cool, I'll hold off on reviewing this for now then |
|
I just reversed my opinion about a shared "Statistics" object. See #17108 (comment) for details. I pushed an update per your review @imatiach-msft |
|
Test build #74227 has finished for PR 17110 at commit
|
|
Ping @imatiach-msft any more comments after the update? |
|
LGTM! nice addition :) |
|
I guess my only concern would be the testing is a bit sparse, but more tests can be added in the future (especially when the MLlib part is removed). It seems it would be better to move more tests from MLlib -> ML over time. |
|
@jkbradley LGTM, thanks! |
|
OK merging with master @imatiach-msft I agree about sparse testing. This has all of the MLlib tests, but we should add more in the future. |
What changes were proposed in this pull request?
Wrapper taking and return a DataFrame
How was this patch tested?
Copied unit tests from RDD-based API