Skip to content

Conversation

@LuciferYang
Copy link
Contributor

@LuciferYang LuciferYang commented Sep 24, 2020

What changes were proposed in this pull request?

The purpose of this pr is to resolve SPARK-32972, total of 51 Scala failed test cases and 3 Java failed test cases were fixed, the main change of this pr as follow:

  • Specified Seq to scala.collection.Seq in case match Seq scene and x.asInstanceOf[Seq[T]] scene

  • Use Row.getSeq[T] instead of Row.getAs[Seq]

  • Manual call toMap method to convert MapView to Map in Scala 2.13

  • Change the tol in the last test to 0.75 to pass RandomForestRegressorSuite#training with sample weights in Scala 2.13

Why are the changes needed?

We need to support a Scala 2.13 build.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Scala 2.12: Pass the Jenkins or GitHub Action

  • Scala 2.13: Pass GitHub 2.13 Build Action

Do the follow:

dev/change-scala-version.sh 2.13
mvn clean install -DskipTests  -pl mllib -Pscala-2.13 -am
mvn test -pl mllib -Pscala-2.13 -fn

Before

[ERROR] Errors: 
[ERROR]   JavaVectorIndexerSuite.vectorIndexerAPI:51 » ClassCast scala.collection.conver...
[ERROR]   JavaWord2VecSuite.testJavaWord2Vec:51 » Spark Job aborted due to stage failure...
[ERROR]   JavaPrefixSpanSuite.runPrefixSpanSaveLoad:79 » Spark Job aborted due to stage ...

Tests: succeeded 1567, failed 51, canceled 0, ignored 7, pending 0
*** 51 TESTS FAILED ***

After

[INFO] Tests run: 122, Failures: 0, Errors: 0, Skipped: 0

Tests: succeeded 1617, failed 0, canceled 0, ignored 7, pending 0
All tests passed.

@SparkQA
Copy link

SparkQA commented Sep 24, 2020

Test build #129054 has finished for PR 29857 at commit b7f2f47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LuciferYang
Copy link
Contributor Author

cc @dongjoon-hyun #29861 fix GitHub 2.13 build Action, related to k8s module, I will rebase this pr after it merged.

@LuciferYang
Copy link
Contributor Author

cc @srowen The remaining failed case is

RandomForestRegressorSuite:
- training with sample weights *** FAILED ***
  0.756 was not greater than or equal to 0.78 (MLTestingUtils.scala:285)

test("training with sample weights") {
val df = linearRegressionData
val numClasses = 0
// (numTrees, maxDepth, subsamplingRate, fractionInTol)
val testParams = Seq(
(50, 5, 1.0, 0.75),
(50, 10, 1.0, 0.75),
(50, 10, 0.95, 0.78)
)
for ((numTrees, maxDepth, subsamplingRate, tol) <- testParams) {
val estimator = new RandomForestRegressor()
.setNumTrees(numTrees)
.setMaxDepth(maxDepth)
.setSubsamplingRate(subsamplingRate)
.setSeed(seed)
.setMinWeightFractionPerNode(0.05)
MLTestingUtils.testArbitrarilyScaledWeights[RandomForestRegressionModel,
RandomForestRegressor](df.as[LabeledPoint], estimator,
MLTestingUtils.modelPredictionEquals(df, _ ~= _ relTol 0.2, tol))
MLTestingUtils.testOutliersWithSmallWeights[RandomForestRegressionModel,
RandomForestRegressor](df.as[LabeledPoint], estimator,
numClasses, MLTestingUtils.modelPredictionEquals(df, _ ~= _ relTol 0.2, tol),
outlierRatio = 2)
MLTestingUtils.testOversamplingVsWeighting[RandomForestRegressionModel,
RandomForestRegressor](df.as[LabeledPoint], estimator,
MLTestingUtils.modelPredictionEquals(df, _ ~= _ relTol 0.2, tol), seed)
}
}

Input (50, 10, 0.95, 0.78) with

MLTestingUtils.testOversamplingVsWeighting[RandomForestRegressionModel,
        RandomForestRegressor](df.as[LabeledPoint], estimator,
        MLTestingUtils.modelPredictionEquals(df, _ ~= _ relTol 0.2, tol), seed)

failed.

I found that the following RandomForest.runBagged behave differently for the same input in Scala 2.12 and Scala 2.13, maybe related to the follow code block:

while (nodeStack.nonEmpty) {
// Collect some nodes to split, and choose features for each node (if subsampling).
// Each group of nodes may come from one or multiple trees, and at multiple levels.
val (nodesForGroup, treeToNodeToIndexInfo) =
RandomForest.selectNodesToSplit(nodeStack, maxMemoryUsage, metadata, rng)
// Sanity check (should never occur):
assert(nodesForGroup.nonEmpty,
s"RandomForest selected empty nodesForGroup. Error for unknown reason.")
// Only send trees to worker if they contain nodes being split this iteration.
val topNodesForGroup: Map[Int, LearningNode] =
nodesForGroup.keys.map(treeIdx => treeIdx -> topNodes(treeIdx)).toMap
// Choose node splits, and enqueue new nodes as needed.
timer.start("findBestSplits")
val bestSplit = RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup,
nodesForGroup, treeToNodeToIndexInfo, bcSplits, nodeStack, timer, nodeIds,
outputBestSplits = strategy.useNodeIdCache)
if (strategy.useNodeIdCache) {
nodeIds = updateNodeIds(baggedInput, nodeIds, bcSplits, bestSplit)
nodeIdCheckpointer.update(nodeIds)
}
timer.stop("findBestSplits")
}

but I am not familiar with this algorithm and I not find root cause, I think we need an expert to guide how to fix it

@LuciferYang
Copy link
Contributor Author

LuciferYang commented Sep 24, 2020

override def beforeAll(): Unit = {
super.beforeAll()
orderedLabeledPoints50_1000 =
sc.parallelize(EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures = 50, 1000)
.map(_.asML))
linearRegressionData = sc.parallelize(LinearDataGenerator.generateLinearInput(
intercept = 6.3, weights = Array(4.7, 7.2), xMean = Array(0.9, -1.3),
xVariance = Array(0.7, 1.2), nPoints = 1000, seed, eps = 0.5), 2).map(_.asML).toDF()
}

And I found if we change nPoints from 1000 to 1150, the test will pass

@dongjoon-hyun
Copy link
Member

@LuciferYang . I closed your #29861 because master branch is already fixed two hours ago.

@dongjoon-hyun
Copy link
Member

You need to rebase this branch .

@LuciferYang
Copy link
Contributor Author

@dongjoon-hyun Address 4f5eac5 rebase master

@SparkQA
Copy link

SparkQA commented Sep 24, 2020

Test build #129065 has finished for PR 29857 at commit 4f5eac5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Sep 24, 2020

For the random forest test, it's probably reasonably to simply change the tol in the last test to 0.75 like the others. I don't know why it should be higher.

If you like you can try making the ordering of the Map in this code deterministic to see if that does it:

val topNodesForGroup: Map[Int, LearningNode] = 
     nodesForGroup.keys.map(treeIdx => treeIdx -> topNodes(treeIdx)).toMap 

But, I don't even know if the result the test is complaining about is 'wrong'.

@LuciferYang
Copy link
Contributor Author

LuciferYang commented Sep 25, 2020

f you like you can try making the ordering of the Map in this code deterministic to see if that does it:

val topNodesForGroup: Map[Int, LearningNode] =
nodesForGroup.keys.map(treeIdx => treeIdx -> topNodes(treeIdx)).toMap
But, I don't even know if the result the test is complaining about is 'wrong'.

OK ~ I will try this first and feedback later.

But change 0.78 to 0.75 seems simpler, haha :)

@LuciferYang
Copy link
Contributor Author

LuciferYang commented Sep 25, 2020

Synchronize the test result:

val (nodesForGroup, treeToNodeToIndexInfo) =
RandomForest.selectNodesToSplit(nodeStack, maxMemoryUsage, metadata, rng)
// Sanity check (should never occur):
assert(nodesForGroup.nonEmpty,
s"RandomForest selected empty nodesForGroup. Error for unknown reason.")
// Only send trees to worker if they contain nodes being split this iteration.
val topNodesForGroup: Map[Int, LearningNode] =
nodesForGroup.keys.map(treeIdx => treeIdx -> topNodes(treeIdx)).toMap

  • Test A: Change to use LinkedHashMap to store topNodesForGroup and use nodesForGroup.keys.toSeq.sorted to init it was no essential impact.

  • Test B: Change to use LinkedHashMap to store topNodesForGroup and nodesForGroup seems that the goal of unified behavior in Scala 2.12 and 2.13 can be achieved, but need change the tol in the last test to 0.77(the result is 0.771)

  • Test C: Only change the tol in the last test to 0.75 can workaround

Which do you recommend better @srowen , B or C, B can passed all test case.

@srowen
Copy link
Member

srowen commented Sep 25, 2020

I think the latter option, changing the tol, is fine.

@LuciferYang
Copy link
Contributor Author

ok ~

@LuciferYang
Copy link
Contributor Author

Address f2a26c5 fix RandomForestRegressorSuite.

@LuciferYang LuciferYang changed the title [SPARK-32972][ML] Fix UTs of mllib module in Scala 2.13 except RandomForestRegressorSuite [SPARK-32972][ML] Pass all UTs of mllib module in Scala 2.13 Sep 25, 2020
@SparkQA
Copy link

SparkQA commented Sep 25, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33730/

@SparkQA
Copy link

SparkQA commented Sep 25, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33730/

@SparkQA
Copy link

SparkQA commented Sep 25, 2020

Test build #129109 has finished for PR 29857 at commit f2a26c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LuciferYang
Copy link
Contributor Author

LuciferYang commented Sep 27, 2020

@srowen Is there any other problem in this pr that needs to be fixed?thx ~

@LuciferYang
Copy link
Contributor Author

From the first round of code checking, may be this is the last module to fix Scala version compatibility

@srowen srowen closed this in bb6d5e7 Sep 27, 2020
@srowen
Copy link
Member

srowen commented Sep 27, 2020

Merged to master

@LuciferYang
Copy link
Contributor Author

thx @srowen ~

@LuciferYang
Copy link
Contributor Author

@dongjoon-hyun Should we consider adding some new GitHubActions to check test in Scala 2.13?

@LuciferYang LuciferYang deleted the fix-mllib-2 branch June 6, 2022 03:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants