Tree #3
base: master
Conversation
This looks awesome, thanks so much for the contribution Manish! The big question I have is whether you looked at using MLTable and its API for your input? Were there big hurdles preventing that from being an option? We'd like to build ML algorithms around that API, so if there are things we need to change to add this case, let us know! Decision trees are fairly different from algorithms that work by evaluating some linear loss function and optimizing via gradient descent, so this is a good test for something different that may not fit our existing model.
Thanks Evan. Looking forward to contributing more to the library. Unfortunately, I haven't looked at MLTable since the code was written prior to the open sourcing of the MLI library. As I mentioned in an earlier comment, I will look to make this code compatible with the MLI API and give feedback for any improvements. The fixes should not take me too long. The non-linear data generator will be the trickiest part. When do you think we can start testing performance once I am done?
Evan, I have just performed a major refactoring of the code based on your feedback without changing functionality. A few tasks remain:
I think task 1 is the most important for now. Task 2 can be done in the future. I am wondering whether we can use the same data that you might have used for testing logistic regression or SVM for performance testing while we work on task 3. Task 4 is again one for the future.
Some more changes.
This is awesome, thanks Manish - we'll plan to test your code for
Sounds great Evan!
@@ -15,3 +15,7 @@ project/plugins/project/
#Eclipse specific
.classpath
.project

#IDEA specific
.idea
This is a good idea, thanks.
* Add logging
* Move metrics to a different package

#Extensions
👍
This is terrific work! Basic functionality is there and scaling well for large datasets based on my tests. However, I don't see special logic differentiating between continuous and categorical features (maybe I'm just missing something). We should think about optimizing the inner loop a bit more; in particular, see my comments about vectorization and avoiding intermediate caching. Really good stuff and a welcome contribution!
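For readers unfamiliar with the distinction raised here, split candidates are usually generated differently for the two feature types: thresholds for continuous features and category subsets for categorical ones. The sketch below is illustrative only and is not taken from this PR; every name in it is hypothetical.

```scala
// Illustrative only: all names here are hypothetical, not from this PR.
sealed trait FeatureType
case object Continuous extends FeatureType
case object Categorical extends FeatureType

sealed trait Split { def feature: Int }
// Continuous feature: a sample goes left if value <= threshold.
case class ContinuousSplit(feature: Int, threshold: Double) extends Split
// Categorical feature: a sample goes left if its category is in the chosen subset.
case class CategoricalSplit(feature: Int, leftCategories: Set[Double]) extends Split

object SplitCandidates {
  def candidateSplits(feature: Int, featureType: FeatureType, values: Array[Double]): Seq[Split] =
    featureType match {
      case Continuous =>
        // Candidate thresholds from the distinct sorted values (or quantiles of them).
        values.distinct.sorted.toSeq.map(t => ContinuousSplit(feature, t))
      case Categorical =>
        // Ordered one-vs-rest style subsets keep the candidate count linear
        // in the number of categories.
        val categories = values.distinct.toSeq
        categories.indices.map(i => CategoricalSplit(feature, categories.take(i + 1).toSet))
    }
}
```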
Attempted vectorization of findBestSplit calculation in the recent commit.
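The commit itself is not shown in this thread. As a rough sketch of what "vectorizing" a best-split search can mean in this setting, one can fill per-bin label statistics in a single pass over the samples and then scan cumulative bin totals, instead of re-scanning the samples for every candidate split. Everything below (names, binary labels, Gini scoring) is an assumption for illustration, not the PR's findBestSplit.

```scala
// Illustrative sketch, not the findBestSplit from this PR.
object BestSplitSketch {
  /** Returns (best bin index to split after, its weighted Gini impurity). */
  def bestSplitForFeature(
      binOfSample: Array[Int],   // precomputed bin index of each sample for this feature
      labels: Array[Double],     // binary labels, 0.0 or 1.0
      numBins: Int): (Int, Double) = {
    val counts = new Array[Double](numBins)
    val positives = new Array[Double](numBins)
    var i = 0
    while (i < binOfSample.length) {          // single pass over the samples
      val b = binOfSample(i)
      counts(b) += 1.0
      positives(b) += labels(i)
      i += 1
    }
    val totalCount = counts.sum
    val totalPos = positives.sum
    def gini(pos: Double, n: Double): Double =
      if (n == 0) 0.0 else { val p = pos / n; 2.0 * p * (1.0 - p) }
    var leftCount = 0.0
    var leftPos = 0.0
    var bestBin = -1
    var bestScore = Double.MaxValue
    var b = 0
    while (b < numBins - 1) {                 // scan bins instead of re-scanning samples
      leftCount += counts(b)
      leftPos += positives(b)
      val rightCount = totalCount - leftCount
      val rightPos = totalPos - leftPos
      val score = (leftCount * gini(leftPos, leftCount) +
        rightCount * gini(rightPos, rightCount)) / totalCount
      if (score < bestScore) { bestScore = score; bestBin = b }
      b += 1
    }
    (bestBin, bestScore)
  }
}
```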
Hi guys, this is some great work and sorry for coming late to the party :), but I have two high-level reservations at this stage.
As referenced in the previous (automatic) comment, I have submitted a pull request for an implementation of decision trees for Spark/MLlib. It is based on the boosted trees implementation I have been working on, and it addresses the above two issues: it supports both categorical and continuous features, and I have defined a (pretty awesome) generic loss function interface.

Observing that loss functions are calculated over groupings of feature bins (each part of a split is a group of bins), the interface requires specifying summary statistics for bins that can be "added" together, from which the loss can be calculated. The actual decision tree algorithm implementation is generic: it uses this loss interface and calculates histograms of the loss statistics. One can use this algorithm for a variety of loss functions by simply defining and implementing suitable loss statistics and related methods. Both regression and classification tree derivatives are provided, using square loss and entropy loss functions.

There are also basic tests, synthetic data generators and an ample amount of Scaladoc comments as documentation. Overall, it is quite robust and performant to my liking from initial tests and benchmarks. Additional details of this implementation are at hirakendu/boosted_trees/doc. To give it a try, a precompiled mllib jar based on
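The interface itself is not included in this thread. The following is a minimal sketch of the idea as described above (addable per-bin summary statistics from which a loss is computed), using made-up names rather than the actual hirakendu/boosted_trees API.

```scala
// Sketch of the "addable loss statistics" idea described above; names are
// hypothetical and not taken from the boosted_trees implementation.
trait LossStats[S <: LossStats[S]] {
  def +(other: S): S   // combine statistics of two bins or bin groups
  def loss: Double     // loss of the group summarized by these statistics
}

// Square loss for regression trees: track count, sum and sum of squares.
case class SquareLossStats(count: Double, sum: Double, sumSq: Double)
    extends LossStats[SquareLossStats] {
  def +(other: SquareLossStats): SquareLossStats =
    SquareLossStats(count + other.count, sum + other.sum, sumSq + other.sumSq)
  // Sum of squared deviations from the group mean.
  def loss: Double = if (count == 0) 0.0 else sumSq - sum * sum / count
}

// Entropy loss for binary classification trees: track per-class counts.
case class EntropyLossStats(negatives: Double, positives: Double)
    extends LossStats[EntropyLossStats] {
  def +(other: EntropyLossStats): EntropyLossStats =
    EntropyLossStats(negatives + other.negatives, positives + other.positives)
  // Entropy of the group, weighted by its total count.
  def loss: Double = {
    val n = negatives + positives
    if (n == 0) 0.0
    else Seq(negatives, positives).map { c =>
      if (c == 0) 0.0 else { val p = c / n; -n * p * math.log(p) }
    }.sum
  }
}
```

With statistics like these, the split search only needs to add bin statistics for each candidate grouping and compare the resulting losses, which is what makes the tree algorithm itself loss-agnostic.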
I am curious about the vectorization optimizations. I believe it is related to my previous observation that using

I also agree with one of the previous comments about caching at every node; I think this has already been addressed. I have been debating about training level by level, instead of node by node, but there is some book-keeping involved. Secondly, at moderately deep levels, say depth 5, it may already become too many bins for the histograms. On a related note, I am going by the rough estimate that we can handle a million bins (say 1000 features with 1000 quantiles or categories) on a single machine.

I will try to write a version of boosted trees for MLI as well. Overall I like the
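As a quick sanity check of the million-bin estimate above (the per-bin payload of three Doubles is an assumption, not a figure from the thread):

```scala
// Back-of-envelope size of a per-node histogram with a million bins.
object BinBudget {
  val numFeatures = 1000
  val binsPerFeature = 1000
  val doublesPerBin = 3   // e.g. count, sum, sum of squares -- an assumption
  val bytesPerNode = numFeatures.toLong * binsPerFeature * doublesPerBin * 8
  def main(args: Array[String]): Unit =
    println(f"histogram size per node: ${bytesPerNode / 1e6}%.0f MB")  // ~24 MB
}
```

At roughly 24 MB per node this is comfortable when training node by node, but training a whole level at once multiplies it by the number of nodes at that level (around 32 at depth 5), which is consistent with the book-keeping concern above.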
Decision Tree algorithm implemented on top of Spark RDD.
Key features: