-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-7127] [MLLIB] Adding broadcast of model before prediction for ensembles #6300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if using the Broadcast variable as a parameter is a good idea
|
Thanks for the PR! I don't think we should introduce a new abstraction. The abstractions can be useful, but it seems like a lot of overhead for the small task of broadcasting models. It will also complicate things since we can't have multiple inheritance from 2 abstract classes. Can you please remove it for now? |
…assing a prediction function as param to transform
|
I agree @jkbradley and removed the new class. I changed this around a little in the last commit and I think it's cleaner. Now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mentioned that we might want to selectively broadcast the model, only if it's large enough. Do you think that is something we can do here automatically, or would it need to be a configuration setting?
|
Hi @jkbradley, just wondering if you could take a look at the changes from my last commit and see if they are ok. Thanks! |
|
@BryanCutler Sorry for the delay! I like the general idea, but I think it could be simpler. What if:
That should allow you to do the same thing, but you can have subclasses not override transform() and can eliminate predictImpl. (Also, currently, the subclasses skip schema validation in transform, which is a problem.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could call:
transformImpl(dataset)
|
Hi @jkbradley , thanks for checking this out! I'm not sure I understand a couple things from your suggestion. If a subclass implements Also, the model parameters need to be accessed from inside Sorry, maybe I am missing something, could you elaborate more on how you were thinking of using the broadcast variable in a map? |
|
Sorry, I should not have said "map." I agree we should not use map since that will create an RDD. By map, I really meant UDF. Here's the sketch of what I meant for transform/predict:
Does that make sense? |
…roadcasted model in callUDF to make prediction
Conflicts: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala
|
Yeah, that makes sense. It wasn't clear to me that using the broadcast model in the UDF would allow the member vars in |
|
Jenkins test this please |
|
@BryanCutler Nice, the updates look good, and it's a bit simpler now. LGTM pending tests. |
|
Test build #1077 has finished for PR 6300 at commit
|
|
Merging with master. |
|
Cool, thanks! |
|
can any one help me on, how to use transformImpl method for predictProbability method ? I see it's not implemented in transformImpl of RandomForestClassificationModel. hence my streaming job broad casting the RF model for every mini batch. Help me with way to implement. thanks |
Broadcast of ensemble models in transformImpl before call to predict