[SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should handle invalid for both features and label column. by yanboliang · Pull Request #18613 · apache/spark

yanboliang · 2017-07-12T14:27:49Z

What changes were proposed in this pull request?

RFormula should handle invalid values in both the features and label columns.
#18496 only handles invalid values in the features column. This PR adds handling of invalid values in the label column, along with test cases.

How was this patch tested?

Add test cases.

yanboliang · 2017-07-12T14:38:20Z

cc @felixcheung @wangmiao1981

SparkQA · 2017-07-12T15:32:04Z

Test build #79561 has finished for PR 18613 at commit 2df7e35.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2017-07-12T17:00:11Z

R/pkg/tests/fulltests/test_mllib_tree.R

  model <- spark.randomForest(traindf, clicked ~ ., type = "classification",
                             maxDepth = 10, maxBins = 10, numTrees = 10,
-                             handleInvalid = "skip")
+                             handleInvalid = "keep")


Because R always sets forceIndexLabel, the label is indexed whether it is numeric or string type, so 0.0 and 0 are treated as different labels in R. If we choose "skip", all labels become unseen. I think this is a bug; maybe we should fix it in a separate PR.
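To illustrate the mismatch described above, here is a minimal plain-Scala sketch. It is a simplified model of string-based label indexing, not Spark's StringIndexer code: the object `LabelIndexSketch` and its `fit` and `transform` helpers are hypothetical names, and indices are assigned in sorted order purely for determinism. It assumes the training labels were rendered as "0.0" and "1.0" while the incoming labels are rendered as "0" and "1".

```scala
// Simplified, hypothetical model of label indexing; not Spark's implementation.
object LabelIndexSketch {
  // Build a label -> index map from the string form of the training labels.
  // Sorted order is an assumption made here only to keep the sketch deterministic.
  def fit(labels: Seq[String]): Map[String, Double] =
    labels.distinct.sorted.zipWithIndex.map { case (l, i) => l -> i.toDouble }.toMap

  // Apply the map under one of the three policies: "error", "skip" or "keep".
  def transform(
      index: Map[String, Double],
      labels: Seq[String],
      handleInvalid: String): Seq[Double] =
    handleInvalid match {
      // Drop every label that was not seen during fitting.
      case "skip" => labels.flatMap(index.get)
      // Put every unseen label into one extra bucket after the known indices.
      case "keep" => labels.map(l => index.getOrElse(l, index.size.toDouble))
      // Fail on the first unseen label.
      case "error" =>
        labels.map(l => index.getOrElse(l, throw new NoSuchElementException(s"Unseen label: $l")))
      case other =>
        throw new IllegalArgumentException(s"Unknown handleInvalid: $other")
    }
}
```

With `val index = LabelIndexSketch.fit(Seq(0.0, 1.0).map(_.toString))`, the call `LabelIndexSketch.transform(index, Seq(0, 1).map(_.toString), "skip")` returns an empty sequence, because "0" and "1" never match "0.0" and "1.0". The same call with "keep" maps both labels to the extra index 2.0, which is why the test above passes with "keep" but not with "skip".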

felixcheung · 2017-07-12T17:32:30Z

In #18496 we discussed the behavior of the output prediction (#18496 (comment)), and similarly in #18613 (comment). I'd suggest we step back and review how handleInvalid should work in Scala first.

I think we can still make progress in this PR and #18605, but we will likely need some changes in Scala.

SparkQA · 2017-07-12T18:00:35Z

Test build #79566 has finished for PR 18613 at commit 9132640.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangmiao1981 · 2017-07-12T18:52:39Z

@felixcheung I agree. We should make changes on the Scala side.

yanboliang · 2017-07-13T16:06:50Z

@felixcheung @wangmiao1981 In Scala, we set handleInvalid on both the estimator and the model, although it only takes effect at model prediction. The reason is that we need to support pipeline training and transform, so we must be able to set model params while fitting the estimator.
On the R side, I proposed setting handleInvalid for predict because R has no pipeline concept, and in other native R algorithms like glm, handleInvalid is set for predict. I'm open to hearing your thoughts.
BTW, this PR is not related to the above discussion, so let's move that discussion to the other PR. This PR fixes a bug so that the label column also handles invalid values according to the user's setting. Would you mind taking a look at this one? Thanks.

yanboliang · 2017-07-13T16:07:57Z

mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala

+    assert(result1.collect() === expected1.collect())
+    assert(result2.collect() === expected2.collect())
+
+    // Handle unseen labels.


The following test cases failed before this PR.

felixcheung · 2017-07-13T17:47:20Z

Isn't it confusing to silently drop features?
Also, one might want a different policy for handling invalid values in features vs. the label?

yanboliang · 2017-07-14T03:32:46Z

@felixcheung We don't silently drop features; we use handleInvalid to let users decide how to handle invalid features or labels. The behavior is consistent with Scala, which supports handling invalid features and labels. BTW, this PR adds support for handling invalid label values; the original PR by @wangmiao1981 already supports handling invalid features.
As for different policies, this is worth discussing. We could provide two separate options for features and label, but I think handling invalid values in features and label with the same policy is the most common use case, and I don't want to complicate the code before we see strong demand from users. We can split it into two params that control features and label separately later, if it becomes really necessary. However, I don't have a strong opinion about this and look forward to hearing your thoughts. Thanks.
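The single-policy design described above can be sketched in plain Scala. This is a simplified model under stated assumptions, not RFormula's implementation: `Example`, `SharedPolicySketch` and `applyPolicy` are hypothetical names, and the sets of seen values stand in for whatever the fitted model learned. It shows one handleInvalid setting governing both a string feature and the label, with "skip" dropping whole rows rather than columns.

```scala
// Simplified, hypothetical model of one shared handleInvalid policy.
final case class Example(feature: String, label: String)

object SharedPolicySketch {
  def applyPolicy(
      seenFeatures: Set[String],
      seenLabels: Set[String],
      rows: Seq[Example],
      handleInvalid: String): Seq[Example] = {
    // A row is valid only if both its feature value and its label were seen in training.
    def valid(r: Example): Boolean = seenFeatures(r.feature) && seenLabels(r.label)

    handleInvalid match {
      // Drop whole rows that contain an unseen feature value or an unseen label.
      case "skip" => rows.filter(valid)
      // Keep every row; unseen values would be mapped to an extra index downstream.
      case "keep" => rows
      // Fail as soon as any row contains an unseen value.
      case "error" =>
        rows.find(r => !valid(r)).foreach { r =>
          throw new IllegalArgumentException(s"Unseen value in $r")
        }
        rows
      case other =>
        throw new IllegalArgumentException(s"Unknown handleInvalid: $other")
    }
  }
}
```

Splitting the policy later would amount to passing two settings instead of one, for example one for features and one for the label, which is the extension deferred above until users ask for it.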

felixcheung · 2017-07-14T08:23:37Z

@yanboliang that's what I mean. To elaborate:
if handleInvalid = "skip" and this applies to features, then that feature will just be ignored silently?

I get that part from #18496. I actually asked about it (https://github.com/apache/spark/pull/18496/files#r125154606) and thought it was confusing.

OK, I agree with your assessment on starting with the same policy.

yanboliang · 2017-07-15T12:56:35Z

Merged into master. Thanks for all the reviews.

RFormula should handle invalid for both features and label column.

2df7e35

Fix SparkR test.

9132640


felixcheung approved these changes Jul 14, 2017


asfgit closed this in 69e5282 Jul 15, 2017

yanboliang deleted the spark-20307 branch July 15, 2017 12:59

wangmiao1981 mentioned this pull request Jul 15, 2017

[SparkR][SPARK-21381]:SparkR: pass on setHandleInvalid for classification algorithms #18605

Closed

yanboliang wants to merge 2 commits into apache:master from yanboliang:spark-20307
