Conversation

@BryanCutler
Member

Broadcast of ensemble models in transformImpl before call to predict

Member Author

I'm not sure if using the Broadcast variable as a parameter is a good idea

@jkbradley
Member

Thanks for the PR! I don't think we should introduce a new abstraction. The abstractions can be useful, but it seems like a lot of overhead for the small task of broadcasting models. It will also complicate things since we can't have multiple inheritance from 2 abstract classes. Can you please remove it for now?

…assing a prediction function as param to transform
@BryanCutler
Member Author

I agree, @jkbradley, and I removed the new class. I changed this around a little in the last commit and I think it's cleaner. Now transformImpl takes the predict function as a parameter, and the overridden transform in a concrete ensemble classifier/regressor broadcasts the model and binds the broadcast var to the predictImpl where it is accessed. Let me know if this design seems ok and I'll implement it for the other ensembles. Thanks!
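
As a rough, hypothetical illustration of the design described in this comment (toy names such as ToyTransform and ToyEnsemble, hard-coded "features"/"prediction" columns, and the current SparkSession-based DataFrame API are assumptions, not the real Predictor code), transformImpl would receive the prediction function, and the ensemble's transform override would broadcast the model and bind the broadcast copy into that function:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

trait ToyTransform {
  // The caller supplies the function that produces the prediction for each row.
  protected def transformImpl(dataset: DataFrame, predictFunc: Vector => Double): DataFrame = {
    val predictUDF = udf(predictFunc)
    dataset.withColumn("prediction", predictUDF(col("features")))
  }
}

class ToyEnsemble(weights: Array[Double]) extends ToyTransform with Serializable {
  def predict(features: Vector): Double =
    features.toArray.zip(weights).map { case (x, w) => x * w }.sum

  def transform(dataset: DataFrame): DataFrame = {
    // Broadcast the model and bind the broadcast copy into the prediction function.
    val bcast: Broadcast[ToyEnsemble] = dataset.sparkSession.sparkContext.broadcast(this)
    transformImpl(dataset, features => bcast.value.predict(features))
  }
}
```

The downside of this shape, discussed below, is that the prediction function has to be threaded through as an extra parameter, which the later simplification removes.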

Member Author

You mentioned that we might want to selectively broadcast the model, only if it's large enough. Do you think that is something we can do here automatically, or would it need to be a configuration setting?

@BryanCutler
Member Author

Hi @jkbradley, just wondering if you could take a look at the changes from my last commit and see if they are ok. Thanks!

@jkbradley
Member

@BryanCutler Sorry for the delay! I like the general idea, but I think it could be simpler. What if:

  • Predictor.transform still handled everything, except the actual prediction. For that, it would call transformImpl(dataset). I'll note what I mean inline.
  • Predictor.transformImpl would by default use predict(), as before.
  • Subclasses like RandomForestClassifier could override transformImpl to broadcast the model and then use that broadcast variable in a map (which would use predict()).

That should allow you to do the same thing, but subclasses would no longer need to override transform(), and predictImpl could be eliminated. (Also, currently, the subclasses skip schema validation in transform, which is a problem.)
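
A minimal sketch of the structure being suggested here, with the caveat that ToyPredictor and the hard-coded "features"/"prediction" column names are illustrative assumptions rather than the real Spark ML Predictor class: the base class keeps schema validation in transform and delegates to transformImpl, whose default implementation wraps predict() in a UDF, so an ensemble subclass only needs to override transformImpl.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

abstract class ToyPredictor extends Serializable {
  def predict(features: Vector): Double

  // Schema validation (and any other shared bookkeeping) stays in one place.
  final def transform(dataset: DataFrame): DataFrame = {
    require(dataset.schema.fieldNames.contains("features"), "missing 'features' column")
    transformImpl(dataset)
  }

  // Default behavior: wrap predict() in a UDF, as the existing transform does.
  protected def transformImpl(dataset: DataFrame): DataFrame = {
    val predictUDF = udf((features: Vector) => predict(features))
    dataset.withColumn("prediction", predictUDF(col("features")))
  }
}
```

Note that the default transformImpl captures the model in the task closure; the point of overriding it in the ensembles is to replace that capture with a broadcast variable.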

Member

This could call:

transformImpl(dataset)

@BryanCutler
Member Author

Hi @jkbradley , thanks for checking this out! I'm not sure I understand a couple things from your suggestion.

If a subclass implements transformImpl(dataset: DataFrame), then broadcasts and proceeds with a dataset.map(...), the result is now an RDD, which would have to be converted back into a DataFrame to return. This seems like an inefficient step, which is why I tried to just stick with DataFrames.

Also, the model parameters need to be accessed from inside predict(features: Vector) of a subclass, like RandomForestClassificationModel, so the only way to do this is to change the signature of predict to take the broadcast var as a parameter, or to make the broadcast var a member of RandomForestClassificationModel. Both of those seemed like bad ideas, which is why I added a predictImpl that could share the same code for both broadcast and non-broadcast models.

Sorry, maybe I am missing something, could you elaborate more on how you were thinking of using the broadcast variable in a map?

@jkbradley
Member

Sorry, I should not have said "map." I agree we should not use map since that will create an RDD. By map, I really meant UDF.

Here's the sketch of what I meant for transform/predict:

  • Predictor.transform: check schema, and call transformImpl
  • Predictor.transformImpl: call predict in a UDF, as it does now
  • RF/GBT*.predict: keep as is
  • RF/GBT*.transformImpl (override): broadcast model, and call UDF. Inside the UDF, get model from the broadcast variable, and call predict on it.

Does that make sense?
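
Continuing the toy sketch from earlier (ToyPredictor and its hard-coded column names are assumptions, not the actual RandomForestClassificationModel internals), the shape described in these bullet points would look roughly like this: the ensemble overrides only transformImpl, broadcasts itself once, and calls predict on the broadcast copy inside the UDF.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

class ToyForestModel(weights: Array[Double]) extends ToyPredictor {
  override def predict(features: Vector): Double =
    features.toArray.zip(weights).map { case (x, w) => x * w }.sum

  // Broadcast the model once per transform call instead of shipping it with every task closure.
  override protected def transformImpl(dataset: DataFrame): DataFrame = {
    val bcastModel = dataset.sparkSession.sparkContext.broadcast(this)
    val predictUDF = udf((features: Vector) => bcastModel.value.predict(features))
    dataset.withColumn("prediction", predictUDF(col("features")))
  }
}
```

The UDF closure captures only the Broadcast handle, so the serialized model is shipped and cached once per executor rather than being serialized into every task closure.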

@BryanCutler
Member Author

Yeah, that makes sense. It wasn't clear to me that calling predict on the broadcast model inside the UDF would mean the member vars accessed in predict come from that broadcast instance. Hopefully this last iteration is more along the lines of what you were thinking. I also merged with the latest master and replaced callUDF with udf. Thanks for the help @jkbradley !

@jkbradley
Member

Jenkins test this please

@jkbradley
Member

@BryanCutler Nice, the updates look good, and it's a bit simpler now. LGTM pending tests.

@SparkQA

SparkQA commented Jul 15, 2015

Test build #1077 has finished for PR 6300 at commit 86e73de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler changed the title from "[SPARK-7127] [MLLIB] [WIP] Adding broadcast of model before prediction in RandomForestClassifier" to "[SPARK-7127] [MLLIB] Adding broadcast of model before prediction for ensembles" on Jul 16, 2015
@jkbradley
Member

Merging with master.
Thank you for the contribution!

@asfgit closed this in 8b8be1f on Jul 17, 2015
@BryanCutler
Member Author

Cool, thanks!

@BryanCutler deleted the bcast-ensemble-models-7127 branch on November 18, 2015, 21:37
@tamilselvanv1be

Can anyone help me with how to use the transformImpl method for the predictProbability method? I see it's not implemented in transformImpl of RandomForestClassificationModel, so my streaming job ends up broadcasting the RF model for every mini batch. Please help me with a way to implement this. Thanks.
