[SPARK-32972][ML] Pass all UTs of `mllib` module in Scala 2.13 #29857

LuciferYang · 2020-09-24T02:14:23Z

What changes were proposed in this pull request?

The purpose of this pr is to resolve SPARK-32972, total of 51 Scala failed test cases and 3 Java failed test cases were fixed, the main change of this pr as follow:

Specified Seq to scala.collection.Seq in case match Seq scene and x.asInstanceOf[Seq[T]] scene
Use Row.getSeq[T] instead of Row.getAs[Seq]
Manual call toMap method to convert MapView to Map in Scala 2.13
Change the tol in the last test to 0.75 to pass RandomForestRegressorSuite#training with sample weights in Scala 2.13

Why are the changes needed?

We need to support a Scala 2.13 build.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Scala 2.12: Pass the Jenkins or GitHub Action
Scala 2.13: Pass GitHub 2.13 Build Action

Do the follow:

dev/change-scala-version.sh 2.13
mvn clean install -DskipTests  -pl mllib -Pscala-2.13 -am
mvn test -pl mllib -Pscala-2.13 -fn

Before

[ERROR] Errors: 
[ERROR]   JavaVectorIndexerSuite.vectorIndexerAPI:51 » ClassCast scala.collection.conver...
[ERROR]   JavaWord2VecSuite.testJavaWord2Vec:51 » Spark Job aborted due to stage failure...
[ERROR]   JavaPrefixSpanSuite.runPrefixSpanSaveLoad:79 » Spark Job aborted due to stage ...

Tests: succeeded 1567, failed 51, canceled 0, ignored 7, pending 0
*** 51 TESTS FAILED ***

After

[INFO] Tests run: 122, Failures: 0, Errors: 0, Skipped: 0

Tests: succeeded 1617, failed 0, canceled 0, ignored 7, pending 0
All tests passed.

SparkQA · 2020-09-24T03:21:18Z

Test build #129054 has finished for PR 29857 at commit b7f2f47.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

LuciferYang · 2020-09-24T04:11:29Z

cc @dongjoon-hyun #29861 fix GitHub 2.13 build Action, related to k8s module, I will rebase this pr after it merged.

LuciferYang · 2020-09-24T04:55:01Z

cc @srowen The remaining failed case is

RandomForestRegressorSuite:
- training with sample weights *** FAILED ***
  0.756 was not greater than or equal to 0.78 (MLTestingUtils.scala:285)

spark/mllib/src/test/scala/org/apache/spark/ml/regression/RandomForestRegressorSuite.scala

Lines 171 to 200 in 0bc0e91

    
           test("training with sample weights") { 
        
             val df = linearRegressionData 
        
             val numClasses = 0 
        
             // (numTrees, maxDepth, subsamplingRate, fractionInTol) 
        
             val testParams = Seq( 
        
               (50, 5, 1.0, 0.75), 
        
               (50, 10, 1.0, 0.75), 
        
               (50, 10, 0.95, 0.78) 
        
             ) 
        
             for ((numTrees, maxDepth, subsamplingRate, tol) <- testParams) { 
        
               val estimator = new RandomForestRegressor() 
        
                 .setNumTrees(numTrees) 
        
                 .setMaxDepth(maxDepth) 
        
                 .setSubsamplingRate(subsamplingRate) 
        
                 .setSeed(seed) 
        
                 .setMinWeightFractionPerNode(0.05) 
        
               MLTestingUtils.testArbitrarilyScaledWeights[RandomForestRegressionModel, 
        
                 RandomForestRegressor](df.as[LabeledPoint], estimator, 
        
                 MLTestingUtils.modelPredictionEquals(df, _ ~= _ relTol 0.2, tol)) 
        
               MLTestingUtils.testOutliersWithSmallWeights[RandomForestRegressionModel, 
        
                 RandomForestRegressor](df.as[LabeledPoint], estimator, 
        
                 numClasses, MLTestingUtils.modelPredictionEquals(df, _ ~= _ relTol 0.2, tol), 
        
                 outlierRatio = 2) 
        
               MLTestingUtils.testOversamplingVsWeighting[RandomForestRegressionModel, 
        
                 RandomForestRegressor](df.as[LabeledPoint], estimator, 
        
                 MLTestingUtils.modelPredictionEquals(df, _ ~= _ relTol 0.2, tol), seed) 
        
             } 
        
           }

Input (50, 10, 0.95, 0.78) with

MLTestingUtils.testOversamplingVsWeighting[RandomForestRegressionModel,
        RandomForestRegressor](df.as[LabeledPoint], estimator,
        MLTestingUtils.modelPredictionEquals(df, _ ~= _ relTol 0.2, tol), seed)

failed.

I found that the following RandomForest.runBagged behave differently for the same input in Scala 2.12 and Scala 2.13, maybe related to the follow code block:

spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala

Lines 191 to 215 in 0bc0e91

    
           while (nodeStack.nonEmpty) { 
        
             // Collect some nodes to split, and choose features for each node (if subsampling). 
        
             // Each group of nodes may come from one or multiple trees, and at multiple levels. 
        
             val (nodesForGroup, treeToNodeToIndexInfo) = 
        
             RandomForest.selectNodesToSplit(nodeStack, maxMemoryUsage, metadata, rng) 
        
             // Sanity check (should never occur): 
        
             assert(nodesForGroup.nonEmpty, 
        
               s"RandomForest selected empty nodesForGroup.  Error for unknown reason.") 
        
             // Only send trees to worker if they contain nodes being split this iteration. 
        
             val topNodesForGroup: Map[Int, LearningNode] = 
        
               nodesForGroup.keys.map(treeIdx => treeIdx -> topNodes(treeIdx)).toMap 
        
             // Choose node splits, and enqueue new nodes as needed. 
        
             timer.start("findBestSplits") 
        
             val bestSplit = RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, 
        
               nodesForGroup, treeToNodeToIndexInfo, bcSplits, nodeStack, timer, nodeIds, 
        
               outputBestSplits = strategy.useNodeIdCache) 
        
             if (strategy.useNodeIdCache) { 
        
               nodeIds = updateNodeIds(baggedInput, nodeIds, bcSplits, bestSplit) 
        
               nodeIdCheckpointer.update(nodeIds) 
        
             } 
        
             timer.stop("findBestSplits") 
        
           }

but I am not familiar with this algorithm and I not find root cause, I think we need an expert to guide how to fix it

LuciferYang · 2020-09-24T04:57:04Z

spark/mllib/src/test/scala/org/apache/spark/ml/regression/RandomForestRegressorSuite.scala

Lines 45 to 54 in 0bc0e91

    
           override def beforeAll(): Unit = { 
        
             super.beforeAll() 
        
             orderedLabeledPoints50_1000 = 
        
               sc.parallelize(EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures = 50, 1000) 
        
                 .map(_.asML)) 
        
             linearRegressionData = sc.parallelize(LinearDataGenerator.generateLinearInput( 
        
               intercept = 6.3, weights = Array(4.7, 7.2), xMean = Array(0.9, -1.3), 
        
               xVariance = Array(0.7, 1.2), nPoints = 1000, seed, eps = 0.5), 2).map(_.asML).toDF() 
        
           }

And I found if we change nPoints from 1000 to 1150, the test will pass

dongjoon-hyun · 2020-09-24T04:59:27Z

@LuciferYang . I closed your #29861 because master branch is already fixed two hours ago.

https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/MountVolumesFeatureStep.scala#L117

dongjoon-hyun · 2020-09-24T04:59:57Z

You need to rebase this branch .

LuciferYang · 2020-09-24T05:38:16Z

@dongjoon-hyun Address 4f5eac5 rebase master

SparkQA · 2020-09-24T06:49:02Z

Test build #129065 has finished for PR 29857 at commit 4f5eac5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2020-09-24T17:22:27Z

For the random forest test, it's probably reasonably to simply change the tol in the last test to 0.75 like the others. I don't know why it should be higher.

If you like you can try making the ordering of the Map in this code deterministic to see if that does it:

val topNodesForGroup: Map[Int, LearningNode] = 
     nodesForGroup.keys.map(treeIdx => treeIdx -> topNodes(treeIdx)).toMap

But, I don't even know if the result the test is complaining about is 'wrong'.

LuciferYang · 2020-09-25T02:04:19Z

f you like you can try making the ordering of the Map in this code deterministic to see if that does it:

val topNodesForGroup: Map[Int, LearningNode] =
nodesForGroup.keys.map(treeIdx => treeIdx -> topNodes(treeIdx)).toMap
But, I don't even know if the result the test is complaining about is 'wrong'.

OK ~ I will try this first and feedback later.

But change 0.78 to 0.75 seems simpler, haha :)

LuciferYang · 2020-09-25T09:13:54Z

Synchronize the test result:

spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala

Lines 194 to 202 in 0bc0e91

    
           val (nodesForGroup, treeToNodeToIndexInfo) = 
        
           RandomForest.selectNodesToSplit(nodeStack, maxMemoryUsage, metadata, rng) 
        
           // Sanity check (should never occur): 
        
           assert(nodesForGroup.nonEmpty, 
        
             s"RandomForest selected empty nodesForGroup.  Error for unknown reason.") 
        
           // Only send trees to worker if they contain nodes being split this iteration. 
        
           val topNodesForGroup: Map[Int, LearningNode] = 
        
             nodesForGroup.keys.map(treeIdx => treeIdx -> topNodes(treeIdx)).toMap

Test A： Change to use LinkedHashMap to store topNodesForGroup and use nodesForGroup.keys.toSeq.sorted to init it was no essential impact.
Test B： Change to use LinkedHashMap to store topNodesForGroup and nodesForGroup seems that the goal of unified behavior in Scala 2.12 and 2.13 can be achieved, but need change the tol in the last test to 0.77(the result is 0.771)
Test C： Only change the tol in the last test to 0.75 can workaround

Which do you recommend better @srowen , B or C, B can passed all test case.

srowen · 2020-09-25T12:08:27Z

I think the latter option, changing the tol, is fine.

LuciferYang · 2020-09-25T12:26:45Z

ok ~

LuciferYang · 2020-09-25T12:40:47Z

Address f2a26c5 fix RandomForestRegressorSuite.

SparkQA · 2020-09-25T13:27:41Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33730/

SparkQA · 2020-09-25T13:45:38Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33730/

SparkQA · 2020-09-25T13:49:09Z

Test build #129109 has finished for PR 29857 at commit f2a26c5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

LuciferYang · 2020-09-27T01:39:59Z

@srowen Is there any other problem in this pr that needs to be fixed?thx ~

LuciferYang · 2020-09-27T02:45:22Z

From the first round of code checking, may be this is the last module to fix Scala version compatibility

srowen · 2020-09-27T15:27:09Z

Merged to master

LuciferYang · 2020-09-28T02:51:03Z

thx @srowen ~

LuciferYang · 2020-09-28T02:55:17Z

@dongjoon-hyun Should we consider adding some new GitHubActions to check test in Scala 2.13?

LuciferYang added 2 commits September 23, 2020 18:26

fix mllib failed cases

2971ec0

use getSeq[T] instead of getAs[scala.collection.Seq]

b7f2f47

probot-autolabeler bot added the ML label Sep 24, 2020

Merge branch 'upmaster' into fix-mllib-2

4f5eac5

fix RandomForestRegressorSuite

f2a26c5

LuciferYang changed the title ~~[SPARK-32972][ML] Fix UTs of mllib module in Scala 2.13 except RandomForestRegressorSuite~~ [SPARK-32972][ML] Pass all UTs of mllib module in Scala 2.13 Sep 25, 2020

srowen closed this in bb6d5e7 Sep 27, 2020

LuciferYang deleted the fix-mllib-2 branch June 6, 2022 03:44

[SPARK-32972][ML] Pass all UTs of mllib module in Scala 2.13 #29857

[SPARK-32972][ML] Pass all UTs of mllib module in Scala 2.13 #29857

Uh oh!

Conversation

LuciferYang commented Sep 24, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

SparkQA commented Sep 24, 2020

Uh oh!

LuciferYang commented Sep 24, 2020

Uh oh!

LuciferYang commented Sep 24, 2020

Uh oh!

LuciferYang commented Sep 24, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dongjoon-hyun commented Sep 24, 2020

Uh oh!

dongjoon-hyun commented Sep 24, 2020

Uh oh!

LuciferYang commented Sep 24, 2020

Uh oh!

SparkQA commented Sep 24, 2020

Uh oh!

srowen commented Sep 24, 2020

Uh oh!

LuciferYang commented Sep 25, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LuciferYang commented Sep 25, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

srowen commented Sep 25, 2020

Uh oh!

LuciferYang commented Sep 25, 2020

Uh oh!

LuciferYang commented Sep 25, 2020

Uh oh!

SparkQA commented Sep 25, 2020

Uh oh!

SparkQA commented Sep 25, 2020

Uh oh!

SparkQA commented Sep 25, 2020

Uh oh!

LuciferYang commented Sep 27, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LuciferYang commented Sep 27, 2020

Uh oh!

srowen commented Sep 27, 2020

Uh oh!

LuciferYang commented Sep 28, 2020

Uh oh!

LuciferYang commented Sep 28, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[SPARK-32972][ML] Pass all UTs of `mllib` module in Scala 2.13 #29857

[SPARK-32972][ML] Pass all UTs of `mllib` module in Scala 2.13 #29857

LuciferYang commented Sep 24, 2020 •

edited

Loading

LuciferYang commented Sep 24, 2020 •

edited

Loading

LuciferYang commented Sep 25, 2020 •

edited

Loading

LuciferYang commented Sep 25, 2020 •

edited

Loading

LuciferYang commented Sep 27, 2020 •

edited

Loading