[SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should handle invalid for both features and label column. #18613
yanboliang wants to merge 2 commits into apache:master
|
Test build #79561 has finished for PR 18613 at commit
|
  model <- spark.randomForest(traindf, clicked ~ ., type = "classification",
                              maxDepth = 10, maxBins = 10, numTrees = 10,
-                             handleInvalid = "skip")
+                             handleInvalid = "keep")
Because R always sets forceIndexLabel, the label is indexed whether it is numeric or string type, so 0.0 and 0 are treated as different labels in R. If we choose skip, all labels become unseen. I think this is a bug; maybe we should fix it in a separate PR.
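To illustrate the comment above, here is a minimal pure-Scala sketch of string-based label indexing. It is a toy model, not Spark's StringIndexer or RFormula code: the object `LabelIndexSketch` and its `fit` and `transform` helpers are hypothetical names, and the assumption is that the label is rendered as "0.0" at fit time and as "0" at transform time. The three policy names (error, skip, keep) are the ones discussed in this thread.

```scala
// Toy model of label indexing; NOT Spark's implementation.
object LabelIndexSketch {
  // Fit: assign an index to each distinct label string, most frequent first.
  def fit(labels: Seq[String]): Map[String, Double] =
    labels.groupBy(identity).toSeq
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map(_._1)
      .zipWithIndex
      .map { case (label, i) => label -> i.toDouble }
      .toMap

  // Transform: look up each label and apply the handleInvalid policy to unseen ones.
  def transform(index: Map[String, Double],
                labels: Seq[String],
                handleInvalid: String): Seq[Double] =
    labels.flatMap { label =>
      val out: Option[Double] = index.get(label) match {
        case Some(i) => Some(i)
        case None => handleInvalid match {
          case "skip" => None                        // drop the row
          case "keep" => Some(index.size.toDouble)   // extra bucket after the known labels
          case _ => throw new IllegalArgumentException(s"Unseen label: $label")
        }
      }
      out
    }

  def main(args: Array[String]): Unit = {
    val index = fit(Seq("0.0", "1.0", "1.0"))   // labels as seen at fit time (assumed)
    val incoming = Seq("0", "1")                // same values, rendered differently (assumed)
    println(transform(index, incoming, "skip")) // every row is dropped
    println(transform(index, incoming, "keep")) // every row lands in the extra bucket
  }
}
```

Under these assumptions, "skip" leaves no rows at all because none of the incoming strings match a fitted label, while "keep" preserves the rows but assigns them all to the extra bucket.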
|
In #18496 we discussed the behavior of the output prediction (#18496 (comment)), and similarly in #18613 (comment). I'd suggest we step back and review how handleInvalid should work in Scala first. I think we can still make progress in this PR and #18605, but we likely need some changes in Scala.
|
Test build #79566 has finished for PR 18613 at commit
|
|
@felixcheung I agree. We should make the changes on the Scala side.
|
@felixcheung @wangmiao1981 In Scala, we set |
assert(result1.collect() === expected1.collect())
assert(result2.collect() === expected2.collect())

// Handle unseen labels.
The following test cases failed before this PR.
|
Isn't it confusing to silently drop features?
|
@felixcheung We don't silently drop features, we use |
|
@yanboliang That's what I mean. To elaborate, I get that part from #18496 (I actually asked about it at https://github.com/apache/spark/pull/18496/files#r125154606) and thought it was confusing. OK, I agree with your assessment on starting with the same policy.
|
Merged into master. Thanks for all the reviews.
What changes were proposed in this pull request?
RFormula should handle invalid values for both the features and label columns. #18496 only handles invalid values in the features column. This PR adds handling of invalid values for the label column, along with test cases.
How was this patch tested?
Added test cases.