
Tree #3 (Open)

wants to merge 19 commits into master

Conversation

@manishamde (Contributor)

A Decision Tree algorithm implemented on top of Spark RDDs.

Key features:

  • Supports both classification and regression
  • Supports Gini, entropy, and variance for information gain calculation (a small illustrative sketch of these measures follows this list)
  • Supports calculating quantiles using a configurable fraction of the data
  • Accuracy verified by comparing results with scikit-learn
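
For reference, here is a minimal, self-contained sketch of the three impurity measures named above (Gini, entropy, and variance). It is illustrative only and is not the code in this pull request.

```scala
// Illustrative impurity calculations, not the PR's implementation.
object ImpurityExample {
  // Gini impurity for a two-class node, given the per-class counts.
  def gini(c0: Double, c1: Double): Double = {
    val total = c0 + c1
    if (total == 0.0) 0.0
    else {
      val f0 = c0 / total
      val f1 = c1 / total
      1.0 - f0 * f0 - f1 * f1
    }
  }

  // Entropy (base 2) for a two-class node.
  def entropy(c0: Double, c1: Double): Double = {
    val total = c0 + c1
    def term(f: Double) = if (f == 0.0) 0.0 else -f * math.log(f) / math.log(2.0)
    if (total == 0.0) 0.0 else term(c0 / total) + term(c1 / total)
  }

  // Variance of the labels in a regression node.
  def variance(labels: Seq[Double]): Double = {
    val n = labels.size
    if (n == 0) 0.0
    else {
      val mean = labels.sum / n
      labels.map(y => (y - mean) * (y - mean)).sum / n
    }
  }
}
```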

@etrain (Contributor) commented Oct 11, 2013

This looks awesome, thanks so much for the contribution Manish!

The big question I have is whether you looked at using MLTable and its API for your input. Were there big hurdles preventing that from being an option? We'd like to build ML algorithms around that API, so if there are things we need to change to support this case, let us know! Decision trees are fairly different from algorithms that work by evaluating some linear loss function and optimizing via gradient descent, so this is a good test of something different that may not fit our existing model.

@manishamde (Contributor, PR author)

Thanks Evan. Looking forward to contributing more to the library.

Unfortunately, I haven't looked at MLTable, since the code was written prior to the open-sourcing of the MLI library. As I mentioned in an earlier comment, I will work on making this code compatible with the MLI API and give feedback on any needed improvements.

The fixes should not take me too long. The non-linear data generator will be the trickiest part.

When do you think we can start testing performance once I am done?

@manishamde (Contributor, PR author)

Evan,

I have just performed a major refactoring of the code based on your feedback without changing functionality.

A few tasks remain:

  1. Making the interface to the tree algorithm consistent with the MLI API. I am wondering whether some "implicit" magic might make the algorithm work with both MLTable and RDD. However, I couldn't find any variance calculation logic for MLTable features; I am currently using the StatCounter class in the Spark library for that purpose (a small sketch follows below).
  2. Using the shared utils and deprecating the TreeUtils class. I don't think this is a showstopper for now; TreeUtils could be deprecated in the future. TreeRunner and TreeUtils are helper/example classes for getting started.
  3. Non-linear data generator. I will think about this problem a little more. We will need to come up with a configurable (features, size, etc.) data generator.
  4. Using a CLI parser (scopt, sumac, etc.).

I think task 1 is the most important for now. Task 2 can be done in the future. I am wondering whether, while we work on task 3, we can use the same data you might have used for testing logistic regression or SVM for performance testing. Task 4 is again one for the future.
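
As a concrete illustration of task 1, here is a minimal sketch (not code from this PR) of computing a single feature's variance with Spark's StatCounter; the RDD[Array[Double]] input shape and the helper name are assumptions for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.util.StatCounter

// Variance of one continuous feature column, aggregated across the RDD.
def featureVariance(data: RDD[Array[Double]], featureIndex: Int): Double = {
  val stats: StatCounter = data.aggregate(new StatCounter())(
    (counter, row) => counter.merge(row(featureIndex)), // absorb one value
    (c1, c2) => c1.merge(c2)                            // combine partition results
  )
  stats.variance
}
```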

@manishamde (Contributor, PR author)

Some more changes.

  1. Quantile and split calculations are now performed in memory (a significant improvement in performance; a rough sketch of the idea follows this list).
  2. Calculated the training error after tree building. I could not find the best place to store the training error in the tree model; I didn't want to hack it in right now, but I will do it more systematically while building ensembles.
  3. Started testing on a dataset with ~0.5 million instances, but I am limited by lack of access to a Spark cluster to accurately measure performance.
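
A rough sketch of the sample-then-sort quantile idea described in item 1 (this is illustrative only; the function name and shapes are assumptions, not the PR's code):

```scala
import org.apache.spark.rdd.RDD

// Sample a configurable fraction of one feature's values, sort on the driver,
// and read off evenly spaced order statistics as candidate split thresholds.
def approxQuantiles(values: RDD[Double], fraction: Double, numBins: Int): Array[Double] = {
  val sample = values.sample(false, fraction, 42).collect().sorted
  require(sample.nonEmpty, "sample fraction too small")
  (1 until numBins).map { i =>
    sample((i * sample.length / numBins).min(sample.length - 1))
  }.toArray
}
```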

@etrain (Contributor) commented Oct 20, 2013

This is awesome, thanks Manish. We'll plan to test your code for scalability on a cluster this week.


@manishamde (Contributor, PR author)

Sounds great, Evan!

@@ -15,3 +15,7 @@ project/plugins/project/
#Eclipse specific
.classpath
.project

#IDEA specific
.idea
Inline review comment (Contributor):
This is a good idea, thanks.

* Add logging
* Move metrics to a different package

#Extensions
Inline review comment (Contributor):
👍

@etrain (Contributor) commented Nov 8, 2013

This is terrific work! Basic functionality is there and it scales well for large datasets in my tests. However, I don't see any special logic differentiating between continuous and categorical features (maybe I'm just missing something).

We should think about optimizing the inner loop a bit more; in particular, see my comments about vectorization and avoiding intermediate caching.

Really good stuff and a welcome contribution!

@manishamde (Contributor, PR author)

Attempted vectorization of findBestSplit calculation in the recent commit.

@hirakendu

Hi guys, this is some great work and sorry for coming late to the party :), but I have two high-level reservations at this stage.

  1. The implementation doesn't support categorical features, which I think is important for decision tree applications, and this may require some invasive changes. I see that the histograms are calculated for splits instead of for bins of feature values. That is fine for continuous features, where the bins are always arranged in the same order, namely the order of the feature quantiles. But for categorical features, the ordering of bins is based on the average (or centroid) of the output values of the samples in each bin, so the bins can only be ordered after the histograms are calculated (a small sketch of this ordering follows the list).
  2. The design is too tightly coupled to binary classification and its loss functions. I don't see hooks for regression and other loss functions, or a general interface for implementing them; one would essentially have to write a very different decision tree algorithm for regression. Sorry if the refactoring was meant to be done at a later stage.
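
A small sketch of the bin ordering described in point 1 (illustrative only, not code from either implementation): for a categorical feature, order the categories by the mean label of the samples in each category, and then scan splits over that ordering as one would for continuous bins.

```scala
// labeledValues: (categoryId, label) pairs for one categorical feature.
// Returns the category ids ordered by the centroid (mean) of their labels.
def orderCategoriesByCentroid(labeledValues: Seq[(Int, Double)]): Seq[Int] =
  labeledValues
    .groupBy(_._1)                                         // one bin per category
    .mapValues(pairs => pairs.map(_._2).sum / pairs.size)  // centroid = mean label
    .toSeq
    .sortBy(_._2)                                          // order bins by centroid
    .map(_._1)
```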

As referenced in the previous (automatic) comment, I have submitted a pull request for an implementation of decision trees for Spark/MLlib. It is based on the boosted trees implementation I have been working on, and it addresses the above two issues: it supports both categorical and continuous features, and I have defined a (pretty awesome) generic loss function interface. Observing that we calculate loss functions over groupings of feature bins (each part of a split is a group of bins), the interface requires specifying summary statistics for bins that can be "added", from which the loss can be calculated (a condensed sketch of this idea follows below). The actual decision tree algorithm is generic: it uses this loss interface and calculates histograms of loss statistics. One can use the algorithm for a variety of loss functions by simply defining and implementing suitable loss statistics and related methods. Both regression and classification tree derivatives are provided, using square loss and entropy loss respectively. There are also basic tests, synthetic data generators, and an ample amount of Scaladoc comments as documentation. Overall, it is quite robust and performant to my liking based on initial tests and benchmarks.
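
A condensed sketch of the kind of "addable" loss-statistics interface described above (the names are illustrative and are not the actual traits in the referenced pull request):

```scala
// Each bin carries an additive summary from which a loss can be computed.
trait LossStats[S <: LossStats[S]] {
  def add(label: Double): S   // absorb one sample's label into the bin
  def merge(other: S): S      // "add" two bins' statistics together
  def loss: Double            // loss implied by the aggregated statistics
}

// Square-loss statistics for regression: count, sum, and sum of squares
// suffice to recover the sum of squared deviations of a bin.
case class SquareLossStats(count: Long, sum: Double, sumSq: Double)
    extends LossStats[SquareLossStats] {
  def add(label: Double) =
    SquareLossStats(count + 1, sum + label, sumSq + label * label)
  def merge(o: SquareLossStats) =
    SquareLossStats(count + o.count, sum + o.sum, sumSq + o.sumSq)
  def loss: Double = if (count == 0) 0.0 else sumSq - sum * sum / count
}
```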

Additional details of this implementation are at hirakendu/boosted_trees/doc. To give it a try, a precompiled mllib jar based on the 0.8.0-incubating release is available at hirakendu/boosted_trees/code/spark_mllib_boosted_trees/target. It would be nice if its performance could be tested alongside this implementation. Comments on the design and implementation would be highly appreciated in the other discussion.

</advertisement>

I am curious about the vectorization optimizations. I believe this is related to my previous observation that using Arrays to calculate and store the histograms of RDD partitions and then merging them at the master is faster than reduceByKey; it runs in almost half the time in my tests, since reduceByKey involves quite a lot of shuffle data. I will have a deeper look and see whether this vectorization, instead of flatMap, can improve performance. (A sketch of the two aggregation strategies follows below.)
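
A sketch of the two aggregation strategies being compared; the shapes (an RDD of bin indices, a dense Long array per partition) are assumptions for illustration, not code from either pull request.

```scala
import org.apache.spark.rdd.RDD

// Strategy A: build one dense Array per partition, then reduce at the driver.
def histogramViaArrays(binIds: RDD[Int], numBins: Int): Array[Long] =
  binIds.mapPartitions { iter =>
    val counts = new Array[Long](numBins)
    iter.foreach(b => counts(b) += 1L)
    Iterator.single(counts)                 // one dense histogram per partition
  }.reduce { (a, b) =>
    var i = 0
    while (i < a.length) { a(i) += b(i); i += 1 }
    a
  }

// Strategy B: emit (bin, count) pairs and use reduceByKey (more shuffle data).
def histogramViaReduceByKey(binIds: RDD[Int]): Map[Int, Long] =
  binIds.map(b => (b, 1L)).reduceByKey(_ + _).collect().toMap
```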

I also agree with one of the previous comments about caching at every node; I think this has already been addressed. I have been debating training level by level instead of node by node, but there is some book-keeping involved, and at moderately deep levels, say depth 5, it may already mean too many bins for the histograms. On a related note, I am going by the rough estimate that we can handle a million bins (say, 1000 features with 1000 quantiles or categories each) on a single machine.

I will try to write a version of boosted trees for MLI as well. Overall I like the MLTable and Algorithm/Model interfaces, not to mention the support for categorical and other types of features. One thing that is curious in both MLlib and MLI is the absence of an interface for loss functions; it currently looks to be part of the optimization interface, which appears specific to linear models. Another curiosity is the non-standard data format for labeled point instances. I think tab-delimited values are the standard in Hadoop and elsewhere, with schema/header files explaining the columns, and the first column is generally taken to be the label. While it's an easy transformation in Spark (sketched below), it would be good to have it as a standard input format.
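
For concreteness, the "easy transformation" mentioned above might look like this for a tab-delimited file whose first column is the label (the function name and (label, features) pair output are assumptions for illustration):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Parse a TSV file into (label, features) pairs.
def loadTsv(sc: SparkContext, path: String): RDD[(Double, Array[Double])] =
  sc.textFile(path).map { line =>
    val fields = line.split('\t').map(_.toDouble)
    (fields.head, fields.tail)   // first column is the label, the rest are features
  }
```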
