@chouqin
Contributor

@chouqin chouqin commented Oct 8, 2014

Currently, the implementation does one unnecessary aggregation step. The aggregation step for level L (to choose splits) gives enough information to set the predictions of any leaf nodes at level L+1. We can use that info and skip the aggregation step for the last level of the tree (which only has leaf nodes).
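As a rough illustration of why the level-L aggregation is enough: once a split is chosen, the class counts on each side of it are already known, and a child's prediction and impurity follow directly from those counts. The sketch below assumes a classification tree with Gini impurity; `ChildStats`, `ChildStatsSketch`, `gini`, and `childStats` are hypothetical names used only for illustration, not the MLlib classes touched by this patch.

```scala
// Hypothetical, simplified stand-in for per-child statistics; not the MLlib classes.
final case class ChildStats(predict: Int, prob: Double, impurity: Double)

object ChildStatsSketch {
  // Gini impurity of a class-count vector: 1 - sum over classes of p_k^2.
  def gini(counts: Array[Double]): Double = {
    val total = counts.sum
    if (total == 0.0) 0.0
    else 1.0 - counts.map { c => val p = c / total; p * p }.sum
  }

  // Majority-class prediction, its probability, and the impurity, all derived
  // from the counts already aggregated for one side of the chosen split.
  def childStats(counts: Array[Double]): ChildStats = {
    require(counts.nonEmpty, "need at least one class")
    val total = counts.sum
    val best = counts.indexOf(counts.max)
    val prob = if (total == 0.0) 0.0 else counts(best) / total
    ChildStats(best, prob, gini(counts))
  }
}
```

For example, `ChildStatsSketch.childStats(Array(3.0, 1.0))` yields prediction 0 with probability 0.75 and impurity 0.375, without another pass over the data.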

Implementation Details

Each node now has an impurity field, and predict changes from type Double to type Predict (which can be used to compute prediction probabilities in the future). When computing the best split for each node, we also compute impurity and predict for its child nodes, and use them to construct the newly allocated child nodes. So at level L, we have already set impurity and predict for the nodes at level L+1.
If level L+1 is the last level, then we can avoid aggregation. What's more, calculation of parent impurity in

The top node of each tree needs to be treated differently, because we have to compute its impurity and predict first. In binsToBestSplit, if the current node is a top node (level == 0), we calculate impurity and predict first.
After finding the best split, the top node's predict and impurity are set to the calculated values. Non-top nodes' impurity and predict are already calculated and don't need to be recalculated. I considered adding an initialization step that sets the top nodes' impurity and predict so that all nodes could be treated the same way, but this would require a lot of code duplication (all the code for the seq operation (BinSeqOp) would need to be duplicated), so I chose the current approach.
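The special case for top nodes can be sketched roughly as follows. This is a simplified illustration, not the code in binsToBestSplit: `TopNodeSketch`, `Stats`, `fromCounts`, and `statsForNode` are hypothetical names, and Gini impurity over class counts is assumed.

```scala
object TopNodeSketch {
  // Hypothetical stand-in for the (predict, impurity) pair stored on a node.
  final case class Stats(predict: Int, impurity: Double)

  // Compute prediction and Gini impurity from aggregated class counts.
  def fromCounts(counts: Array[Double]): Stats = {
    require(counts.nonEmpty, "need at least one class")
    val total = counts.sum
    val impurity =
      if (total == 0.0) 0.0
      else 1.0 - counts.map { c => val p = c / total; p * p }.sum
    Stats(counts.indexOf(counts.max), impurity)
  }

  // Returns the stats for the node being split and whether they were computed now.
  def statsForNode(
      level: Int,
      stored: Option[Stats],
      counts: Array[Double]): (Stats, Boolean) =
    stored match {
      // Non-top node: the values were set when its parent was split.
      case Some(s) if level > 0 => (s, false)
      // Top node (level == 0): nothing was set earlier, so compute it here.
      case _ => (fromCounts(counts), true)
    }
}
```

Keeping the branch inside the split search, rather than adding a separate initialization pass, matches the trade-off described above: one extra condition in exchange for not duplicating the aggregation code.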

CC @mengxr @manishamde @jkbradley, please help me review this, thanks.

@SparkQA

SparkQA commented Oct 8, 2014

QA tests have started for PR 2708 at commit 7ad7a71.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 8, 2014

QA tests have finished for PR 2708 at commit 7ad7a71.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21456/

@SparkQA

SparkQA commented Oct 8, 2014

QA tests have started for PR 2708 at commit c41b1b6.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 8, 2014

QA tests have finished for PR 2708 at commit c41b1b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21460/

Member

We can also check stats.leftImpurity and rightImpurity. If stats.leftImpurity = 0, then we know the left child will be a leaf. Same for the right child.
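A check along these lines might look like the sketch below. `LeafCheckSketch`, `childIsLeaf`, `childLevel`, and `maxDepth` are hypothetical names, and the sketch assumes that nodes at level `maxDepth` are always leaves.

```scala
object LeafCheckSketch {
  // A child is a leaf if it sits on the last level, or if it is already pure
  // (impurity 0), since no further split could improve it.
  def childIsLeaf(childLevel: Int, maxDepth: Int, childImpurity: Double): Boolean =
    childLevel == maxDepth || childImpurity == 0.0
}
```

With this, `LeafCheckSketch.childIsLeaf(1, 5, 0.0)` is true even though level 1 is well above the depth limit.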

Member

Once this is done, it might be good to add 1 more test to make sure it works. A slight modification of the test you already added should work.

@jkbradley
Member

@chouqin Thank you for the PR! It looks almost ready, except for the items I noted above. I will try some timing tests and post results here when done.

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have started for PR 2708 at commit eefeef1.

  • This patch merges cleanly.

@chouqin
Contributor Author

chouqin commented Oct 9, 2014

@jkbradley Thanks for your comments. I have adjusted the code accordingly. I look forward to your timing tests and hope they show a performance gain.

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have finished for PR 2708 at commit eefeef1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21505/

@jkbradley
Member

@chouqin Thanks for the updates! LGTM. I ran some timing tests and saw consistent speedups, but the speedups were not as big as I would have expected. (It was about 1.1X or 1.2X faster, but I would hope for close to 2X faster.)
@mengxr If it looks ready to you, I think this is ready to merge. In the meantime, I am running some larger timing tests to see if I can get better speedups.

Contributor

val nodes = new Array[Node](numNodes)

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21527/

@mengxr
Contributor

mengxr commented Oct 9, 2014

test this please

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have started for PR 2708 at commit 8e269ea.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have finished for PR 2708 at commit 8e269ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21528/

@mengxr
Contributor

mengxr commented Oct 9, 2014

Merged into master. Thanks!

@asfgit closed this in 14f222f on Oct 9, 2014