[SPARK-3158][MLLIB]Avoid 1 extra aggregation for DecisionTree training #2708

chouqin · 2014-10-08T08:16:56Z

Currently, the implementation does one unnecessary aggregation step. The aggregation step for level L (to choose splits) gives enough information to set the predictions of any leaf nodes at level L+1. We can use that info and skip the aggregation step for the last level of the tree (which only has leaf nodes).
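
As a rough illustration of why the extra pass is unnecessary (a sketch only, not the code in this patch): the per-class label counts collected to score a candidate split at level L already determine the prediction and impurity of each child at level L+1. The names `ChildStats`, `ChildStatsSketch`, `gini`, and `statsFor` below are hypothetical, and Gini impurity for classification is assumed.

```scala
// Hypothetical container for what a child node needs; not an MLlib class.
final case class ChildStats(predict: Double, prob: Double, impurity: Double)

object ChildStatsSketch {
  // Gini impurity from per-class label counts: 1 - sum over classes of p_k^2.
  def gini(counts: Array[Double]): Double = {
    val total = counts.sum
    if (total == 0.0) 0.0
    else 1.0 - counts.map { c => val p = c / total; p * p }.sum
  }

  // Majority-class prediction, its empirical probability, and the impurity,
  // all derived from the same counts that were aggregated to score the split.
  // Assumes `counts` has at least one class.
  def statsFor(counts: Array[Double]): ChildStats = {
    val total = counts.sum
    val best = counts.indices.maxBy(i => counts(i))
    val prob = if (total == 0.0) 0.0 else counts(best) / total
    ChildStats(best.toDouble, prob, gini(counts))
  }
}
```

Given the left and right label counts of the chosen split, calling `statsFor` on each side yields everything needed to finalize a leaf child, so no further pass over the data is required for the last level.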

Implementation Details

Each node now has an impurity field, and predict is changed from type Double to type Predict (this can be used to compute the prediction probability in the future). When computing the best splits for each node, we also compute impurity and predict for the child nodes, which are used to construct the newly allocated child nodes. So at level L, we have already set impurity and predict for the nodes at level L+1.
If level L+1 is the last level, then we can avoid aggregation. What's more, calculation of parent impurity in

The top node of each tree needs to be treated differently because we have to compute its impurity and predict first. In binsToBestSplit, if the current node is a top node (level == 0), we calculate impurity and predict first. After finding the best split, the top node's predict and impurity are set to the calculated values. Non-top nodes' impurity and predict are already calculated and don't need to be recalculated. I considered adding an initialization step to set the top nodes' impurity and predict so that all nodes could be treated the same way, but this would require a lot of code duplication (all the code for the seq operation (BinSeqOp) would need to be duplicated), so I chose the current approach.
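
The top-node special case described above might look roughly like the following. This is a simplified sketch under assumptions, not the patch itself: `SketchPredict`, `SketchNode`, and `RootHandlingSketch` are hypothetical stand-ins, `computeFromAggregate` is a placeholder for deriving the values from the aggregated statistics, and the sketch stores the values on the top node immediately rather than after the best split is found.

```scala
// Hypothetical, simplified stand-ins for the tree classes discussed above.
final case class SketchPredict(predict: Double, prob: Double = 0.0)

final class SketchNode(val id: Int) {
  var predict: SketchPredict = SketchPredict(0.0)
  var impurity: Double = -1.0 // -1.0 marks "not computed yet" in this sketch
}

object RootHandlingSketch {
  // For the top node (level == 0) nothing has been precomputed, so the values
  // are derived from the aggregate and stored on the node. For every other
  // node, the parent's split already set them, so they are simply reused.
  def impurityAndPredict(
      node: SketchNode,
      level: Int,
      computeFromAggregate: () => (SketchPredict, Double)): (SketchPredict, Double) = {
    if (level == 0) {
      val (p, imp) = computeFromAggregate()
      node.predict = p
      node.impurity = imp
      (p, imp)
    } else {
      (node.predict, node.impurity)
    }
  }
}
```

Keeping the branch inside the split-finding step avoids a separate initialization pass, which is the trade-off the description above settles on.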

CC @mengxr @manishamde @jkbradley, please help me review this, thanks.

SparkQA · 2014-10-08T08:19:39Z

QA tests have started for PR 2708 at commit 7ad7a71.

This patch merges cleanly.

SparkQA · 2014-10-08T09:16:04Z

QA tests have finished for PR 2708 at commit 7ad7a71.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-08T09:16:08Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21456/
Test FAILed.

SparkQA · 2014-10-08T10:39:38Z

QA tests have started for PR 2708 at commit c41b1b6.

This patch merges cleanly.

SparkQA · 2014-10-08T11:43:07Z

QA tests have finished for PR 2708 at commit c41b1b6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-08T11:43:10Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21460/
Test PASSed.

jkbradley · 2014-10-08T18:58:06Z

mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

We can also check stats.leftImpurity and rightImpurity. If stats.leftImpurity = 0, then we know the left child will be a leaf. Same for the right child.

Once this is done, it might be good to add 1 more test to make sure it works. A slight modification of the test you already added should work.
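
A minimal sketch of the suggested check, assuming a child is treated as a leaf either when it sits at the maximum depth or when its impurity is already zero. The object and method names are hypothetical, and the parameters merely stand in for values such as stats.leftImpurity.

```scala
object LeafCheckSketch {
  // A child needs no further splitting (and hence no aggregation) if it is at
  // the maximum depth or if it is already pure (impurity exactly zero).
  def childIsLeaf(childLevel: Int, maxDepth: Int, childImpurity: Double): Boolean =
    childLevel == maxDepth || childImpurity == 0.0
}
```

A test for this could train on data where one side of the first split is pure and assert that the corresponding child is marked as a leaf.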

jkbradley · 2014-10-08T19:40:05Z

@chouqin Thank you for the PR! It looks almost ready, except for the items I noted above. I will try some timing tests and post results here when done.

SparkQA · 2014-10-09T01:29:50Z

QA tests have started for PR 2708 at commit eefeef1.

This patch merges cleanly.

chouqin · 2014-10-09T01:32:05Z

@jkbradley Thanks for your comments. I have adjusted the code accordingly. I look forward to your timing tests and hope they show some performance gain.

SparkQA · 2014-10-09T02:31:53Z

QA tests have finished for PR 2708 at commit eefeef1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-09T02:31:56Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21505/
Test PASSed.

jkbradley · 2014-10-09T05:20:29Z

@chouqin Thanks for the updates! LGTM. I ran some timing tests and saw consistent speedups, but the speedups were not as big as I would have expected. (It was about 1.1X or 1.2X faster, but I would hope for close to 2X faster.)
@mengxr If it looks ready to you, I think this is ready to merge. In the meantime, I am running some larger timing tests to see if I can get better speedups.

mengxr · 2014-10-09T05:48:59Z

mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

val nodes = new Array[Node](numNodes)

AmplabJenkins · 2014-10-09T07:17:20Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21527/
Test FAILed.

mengxr · 2014-10-09T07:24:33Z

test this please

SparkQA · 2014-10-09T07:29:45Z

QA tests have started for PR 2708 at commit 8e269ea.

This patch merges cleanly.

SparkQA · 2014-10-09T08:33:20Z

QA tests have finished for PR 2708 at commit 8e269ea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-09T08:33:24Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21528/
Test PASSed.

mengxr · 2014-10-09T08:38:46Z

Merged into master. Thanks!

chouqin added 4 commits October 8, 2014 12:07

SPARK-3158: Avoid 1 extra aggregation for DecisionTree training

6cc0333

fix bug in test suite

e41d715

add comments and unit test

822c912

fix unit test

7ad7a71

fix pyspark unit test

c41b1b6

adjust comments and check child nodes' impurity

eefeef1

adjust code and comments

8e269ea

asfgit closed this in 14f222f Oct 9, 2014
