[SPARK-3042] [mllib] DecisionTree Filter top-down instead of bottom-up #1975

jkbradley · 2014-08-15T22:37:18Z

DecisionTree needs to match each example to a node at each iteration. It currently does this with a set of filters very inefficiently: For each example, it examines each node at the current level and traces up to the root to see if that example should be handled by that node.

Fix: Filter top-down using the partly built tree itself.

Major changes:

Eliminated Filter class, findBinsForLevel() method.
Set up node parent links in main loop over levels in train().
Added predictNodeIndex() for filtering top-down.
Added DTMetadata class

Other changes:

Pre-compute set of unorderedFeatures.

Notes for following expected PR based on [https://issues.apache.org/jira/browse/SPARK-3043]:

The unorderedFeatures set will next be stored in a metadata structure to simplify function calls (to store other items such as the data in strategy).

I've done initial tests indicating that this speeds things up, but am only now running large-scale ones.

CC: @mengxr @manishamde @chouqin Any comments are welcome---thanks!

Conflicts: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

* Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)

Optimization: Added TreePoint representation so we only call findBin once for each example, feature. Also, calculateGainsForAllNodeSplits now only searches over actual splits, not empty/unused ones. BUG FIX: isSampleValid * isSampleValid used to treat unordered categorical features incorrectly: It treated the bins as if indexed by featured values, rather than by subsets of values/categories. * exhibited for unordered features (multi-class classification with categorical features of low arity) * Fix: Index bins correctly for unordered categorical features. Also: some commented-out debugging println calls in DecisionTree, to be removed later

DecisionTree.scala * Eliminated filters, replaced by building tree on the fly and filtering top-down. ** Aggregation over examples now skips examples which do not reach the current level. * Only calculate unorderedFeatures once (in findSplitsBins) Node: Renamed predictIfLeaf to predict Bin, Split: Updated doc

* Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification

…emoved debugging println calls from DecisionTree. Made TreePoint extend Serialiable

* Updated doc * Made some methods private Changed timer to report time in seconds.

Conflicts: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala

…ng println calls in DecisionTree.scala.

…disk, not just memory. Details: DecisionTree * Changed: .cache() -> .persist(StorageLevel.MEMORY_AND_DISK) ** This gave major performance improvements on small tests. E.g., 500K examples, 500 features, depth 5, on MacBook, took 292 sec with cache() and 112 when using disk as well. * Change for to while loops * Small cleanups TimeTracker * Removed useless timing in DecisionTree TreePoint * Renamed features to binnedFeatures

Conflicts: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala Merge is OK except one DT Suite test to fix.

…te treePointToNodeIndex with groupShift.

SparkQA · 2014-08-15T22:40:30Z

QA tests have started for PR 1975 at commit 931a3a7.

This patch merges cleanly.

SparkQA · 2014-08-15T22:41:15Z

QA tests have finished for PR 1975 at commit 931a3a7.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2014-08-15T22:55:45Z

I will fix the scala style QA failure by bringing the metadata class from my next planned PR into this one. Updating soon...

mengxr · 2014-08-16T01:24:37Z

mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

minor: Is it cleaner if we offset node index by 1? So the root node has index 1 instead of 0, and then parentNodeIndex = nodeIndex >> 1.

Sounds good, but I will save that for the next PR so that there are fewer changes to propagate.

mengxr · 2014-08-16T23:01:35Z

mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

Should get numFeatures from metadata. Calling first() triggers at least one job.

Thanks! I forgot to port that change from my next PR.

jkbradley · 2014-08-17T03:57:52Z

@mengxr Thanks for the comments; updates made!

mengxr · 2014-08-17T04:06:04Z

Jenkins, test this please.

SparkQA · 2014-08-17T04:10:11Z

QA tests have started for PR 1975 at commit 3726d20.

This patch merges cleanly.

mengxr · 2014-08-17T04:14:36Z

LGTM. Waiting for Jenkins ...

Btw, I created a JIRA for using 1-indexing: https://issues.apache.org/jira/browse/SPARK-3086 .

manishamde · 2014-08-17T04:23:14Z

mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

May be a comment explaining this calculation will help. Even the previous code might be a good comment. :-)

manishamde · 2014-08-17T05:02:33Z

Apart from minor comments, LGTM!

This is an awesome PR. The Filter class had to go for a long time -- thanks for doing it. :-) And the predictNodeIndex is an awesome non-obvious optimization.

manishamde · 2014-08-17T05:02:49Z

Thanks @mengxr for the review as well.

SparkQA · 2014-08-17T05:02:53Z

QA tests have finished for PR 1975 at commit 3726d20.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2014-08-17T05:48:17Z

@mengxr @manishamde Thanks for the feedback! I pushed my last updates, and everything should be ready, Jenkins agreeing.

SparkQA · 2014-08-17T05:50:08Z

QA tests have started for PR 1975 at commit a0ed0da.

This patch merges cleanly.

SparkQA · 2014-08-17T06:42:58Z

QA tests have finished for PR 1975 at commit a0ed0da.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2014-08-17T06:54:01Z

Merged into master and branch-1.1. Thanks!!

DecisionTree needs to match each example to a node at each iteration. It currently does this with a set of filters very inefficiently: For each example, it examines each node at the current level and traces up to the root to see if that example should be handled by that node. Fix: Filter top-down using the partly built tree itself. Major changes: * Eliminated Filter class, findBinsForLevel() method. * Set up node parent links in main loop over levels in train(). * Added predictNodeIndex() for filtering top-down. * Added DTMetadata class Other changes: * Pre-compute set of unorderedFeatures. Notes for following expected PR based on [https://issues.apache.org/jira/browse/SPARK-3043]: * The unorderedFeatures set will next be stored in a metadata structure to simplify function calls (to store other items such as the data in strategy). I've done initial tests indicating that this speeds things up, but am only now running large-scale ones. CC: mengxr manishamde chouqin Any comments are welcome---thanks! Author: Joseph K. Bradley <[email protected]> Closes #1975 from jkbradley/dt-opt2 and squashes the following commits: a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata. Small doc updates. 3726d20 [Joseph K. Bradley] Small code improvements based on code review. ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main change: Now using << instead of math.pow. db0d773 [Joseph K. Bradley] scala style fix 6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code 931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2 797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second level. Needed to update treePointToNodeIndex with groupShift. f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2 5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint 6b5651e [Joseph K. Bradley] Updates based on code review. 1 major change: persisting to memory + disk, not just memory. 2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1 26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used. Removed debugging println calls in DecisionTree.scala. 356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2 430d782 [Joseph K. Bradley] Added more debug info on binning error. Added some docs. d036089 [Joseph K. Bradley] Print timing info to logDebug. e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private 8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up. Removed debugging println calls from DecisionTree. Made TreePoint extend Serialiable a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1 c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt 0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree 3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging) f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing 511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing a95bc22 [Joseph K. Bradley] timing for DecisionTree internals (cherry picked from commit 73ab7f1) Signed-off-by: Xiangrui Meng <[email protected]>

DecisionTree needs to match each example to a node at each iteration. It currently does this with a set of filters very inefficiently: For each example, it examines each node at the current level and traces up to the root to see if that example should be handled by that node. Fix: Filter top-down using the partly built tree itself. Major changes: * Eliminated Filter class, findBinsForLevel() method. * Set up node parent links in main loop over levels in train(). * Added predictNodeIndex() for filtering top-down. * Added DTMetadata class Other changes: * Pre-compute set of unorderedFeatures. Notes for following expected PR based on [https://issues.apache.org/jira/browse/SPARK-3043]: * The unorderedFeatures set will next be stored in a metadata structure to simplify function calls (to store other items such as the data in strategy). I've done initial tests indicating that this speeds things up, but am only now running large-scale ones. CC: mengxr manishamde chouqin Any comments are welcome---thanks! Author: Joseph K. Bradley <[email protected]> Closes apache#1975 from jkbradley/dt-opt2 and squashes the following commits: a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata. Small doc updates. 3726d20 [Joseph K. Bradley] Small code improvements based on code review. ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main change: Now using << instead of math.pow. db0d773 [Joseph K. Bradley] scala style fix 6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code 931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2 797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second level. Needed to update treePointToNodeIndex with groupShift. f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2 5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint 6b5651e [Joseph K. Bradley] Updates based on code review. 1 major change: persisting to memory + disk, not just memory. 2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1 26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used. Removed debugging println calls in DecisionTree.scala. 356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2 430d782 [Joseph K. Bradley] Added more debug info on binning error. Added some docs. d036089 [Joseph K. Bradley] Print timing info to logDebug. e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private 8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up. Removed debugging println calls from DecisionTree. Made TreePoint extend Serialiable a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1 c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt 0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree 3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging) f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing 511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing a95bc22 [Joseph K. Bradley] timing for DecisionTree internals

jkbradley added 22 commits August 5, 2014 11:17

timing for DecisionTree internals

a95bc22

Merge remote-tracking branch 'upstream/master' into dt-timing

511ec85

Merge remote-tracking branch 'upstream/master' into dt-timing

bcf874a

Conflicts: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

Merge remote-tracking branch 'upstream/master' into dt-timing

f61e9d2

Optimizing DecisionTree

3211f02

* Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)

Merge remote-tracking branch 'upstream/master' into dt-opt

b2ed1f3

Small DecisionTree updates:

c1565a5

* Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification

Merge remote-tracking branch 'upstream/master' into dt-opt1

a87e08f

Moved TimeTracker to tree/impl/ in its own file, and cleaned it up. R…

8464a6e

…emoved debugging println calls from DecisionTree. Made TreePoint extend Serialiable

TreePoint

e66f1b1

* Updated doc * Made some methods private Changed timer to report time in seconds.

Print timing info to logDebug.

d036089

Added more debug info on binning error. Added some docs.

430d782

Merge branch 'dt-opt1' into dt-opt2

356daba

Conflicts: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala

Removed tree/model/Filter.scala since no longer used. Removed debuggi…

26d10dd

…ng println calls in DecisionTree.scala.

Merge remote-tracking branch 'upstream/master' into dt-opt1

2d2aaaf

Fixed scalastyle issue in TreePoint

5f2dec2

Merge branch 'dt-opt1' into dt-opt2

f40381c

Conflicts: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala Merge is OK except one DT Suite test to fix.

Fixed DecisionTreeSuite bug for training second level. Needed to upda…

797f68a

…te treePointToNodeIndex with groupShift.

Merge remote-tracking branch 'upstream/master' into dt-opt2

931a3a7

mengxr reviewed Aug 16, 2014
View reviewed changes

Small code improvements based on code review.

3726d20

manishamde reviewed Aug 17, 2014
View reviewed changes

Renamed DTMetadata to DecisionTreeMetadata. Small doc updates.

a0ed0da

asfgit closed this in 73ab7f1 Aug 17, 2014

[SPARK-3042] [mllib] DecisionTree Filter top-down instead of bottom-up #1975

[SPARK-3042] [mllib] DecisionTree Filter top-down instead of bottom-up #1975

Uh oh!

Conversation

jkbradley commented Aug 15, 2014

Uh oh!

SparkQA commented Aug 15, 2014

Uh oh!

SparkQA commented Aug 15, 2014

Uh oh!

jkbradley commented Aug 15, 2014

Uh oh!

mengxr Aug 16, 2014

Choose a reason for hiding this comment

Uh oh!

jkbradley Aug 16, 2014

Choose a reason for hiding this comment

Uh oh!

mengxr Aug 16, 2014

Choose a reason for hiding this comment

Uh oh!

jkbradley Aug 17, 2014

Choose a reason for hiding this comment

Uh oh!

jkbradley commented Aug 17, 2014

Uh oh!

mengxr commented Aug 17, 2014

Uh oh!

SparkQA commented Aug 17, 2014

Uh oh!

mengxr commented Aug 17, 2014

Uh oh!

manishamde Aug 17, 2014

Choose a reason for hiding this comment

Uh oh!

manishamde commented Aug 17, 2014

Uh oh!

manishamde commented Aug 17, 2014

Uh oh!

SparkQA commented Aug 17, 2014

Uh oh!

jkbradley commented Aug 17, 2014

Uh oh!

SparkQA commented Aug 17, 2014

Uh oh!

SparkQA commented Aug 17, 2014

Uh oh!

mengxr commented Aug 17, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants