Conversation

@BryanCutler
Member

Broadcast of ensemble models in transformImpl before call to predict

Member Author

I'm not sure if using the Broadcast variable as a parameter is a good idea

@jkbradley
Member

Thanks for the PR! I don't think we should introduce a new abstraction. The abstractions can be useful, but it seems like a lot of overhead for the small task of broadcasting models. It will also complicate things since we can't have multiple inheritance from 2 abstract classes. Can you please remove it for now?

…assing a prediction function as param to transform
@BryanCutler
Member Author

I agree, @jkbradley, and I removed the new class. I changed this around a little in the last commit and I think it's cleaner. Now transformImpl takes the predict function as a parameter, and the overridden transform in a concrete ensemble classifier/regressor broadcasts the model and binds the broadcast var to the predictImpl where it is accessed. Let me know if this design seems ok and I'll implement it for the other ensembles. Thanks!
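
As a rough, hypothetical illustration of the design described in this comment (toy names such as ToyTransform and ToyEnsemble, hard-coded "features"/"prediction" columns, and the current SparkSession-based DataFrame API are assumptions, not the real Predictor code), transformImpl would receive the prediction function, and the ensemble's transform override would broadcast the model and bind the broadcast copy into that function:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

trait ToyTransform {
  // The caller supplies the function that produces the prediction for each row.
  protected def transformImpl(dataset: DataFrame, predictFunc: Vector => Double): DataFrame = {
    val predictUDF = udf(predictFunc)
    dataset.withColumn("prediction", predictUDF(col("features")))
  }
}

class ToyEnsemble(weights: Array[Double]) extends ToyTransform with Serializable {
  def predict(features: Vector): Double =
    features.toArray.zip(weights).map { case (x, w) => x * w }.sum

  def transform(dataset: DataFrame): DataFrame = {
    // Broadcast the model and bind the broadcast copy into the prediction function.
    val bcast: Broadcast[ToyEnsemble] = dataset.sparkSession.sparkContext.broadcast(this)
    transformImpl(dataset, features => bcast.value.predict(features))
  }
}
```

The downside of this shape, discussed below, is that the prediction function has to be threaded through as an extra parameter, which the later simplification removes.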

Member Author

You mentioned that we might want to selectively broadcast the model, only if it's large enough. Do you think that is something we can do here automatically, or would it need to be a configuration setting?

@BryanCutler
Member Author

Hi @jkbradley, just wondering if you could take a look at the changes from my last commit and see if they are ok. Thanks!

@jkbradley
Member

@BryanCutler Sorry for the delay! I like the general idea, but I think it could be simpler. What if:

  • Predictor.transform still handled everything, except the actual prediction. For that, it would call transformImpl(dataset). I'll note what I mean inline.
  • Predictor.transformImpl would by default use predict(), as before.
  • Subclasses like RandomForestClassifier could override transformImpl to broadcast the model and then use that broadcast variable in a map (which would use predict()).

That should allow you to do the same thing, but subclasses would no longer need to override transform(), and predictImpl could be eliminated. (Also, currently, the subclasses skip schema validation in transform, which is a problem.)
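
A minimal sketch of the structure being suggested here, with the caveat that ToyPredictor and the hard-coded "features"/"prediction" column names are illustrative assumptions rather than the real Spark ML Predictor class: the base class keeps schema validation in transform and delegates to transformImpl, whose default implementation wraps predict() in a UDF, so an ensemble subclass only needs to override transformImpl.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

abstract class ToyPredictor extends Serializable {
  def predict(features: Vector): Double

  // Schema validation (and any other shared bookkeeping) stays in one place.
  final def transform(dataset: DataFrame): DataFrame = {
    require(dataset.schema.fieldNames.contains("features"), "missing 'features' column")
    transformImpl(dataset)
  }

  // Default behavior: wrap predict() in a UDF, as the existing transform does.
  protected def transformImpl(dataset: DataFrame): DataFrame = {
    val predictUDF = udf((features: Vector) => predict(features))
    dataset.withColumn("prediction", predictUDF(col("features")))
  }
}
```

Note that the default transformImpl captures the model in the task closure; the point of overriding it in the ensembles is to replace that capture with a broadcast variable.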

Member

This could call:

transformImpl(dataset)

@BryanCutler
Member Author

Hi @jkbradley , thanks for checking this out! I'm not sure I understand a couple things from your suggestion.

If a subclass implements transformImpl(dataset: DataFrame), then broadcasts and proceeds with a dataset.map(...), the result is now an RDD, which would have to be converted back into a DataFrame to return. This seems like an inefficient step, which is why I tried to just stick with DataFrames.

Also, the model parameters need to be accessed from inside predict(features: Vector) of a subclass, like RandomForestClassificationModel, so the only way to do this is to change the signature of predict to take the broadcast var as a parameter, or to make the broadcast var a member of RandomForestClassificationModel. Both of those seemed like bad ideas, which is why I added a predictImpl that could share the same code for both broadcast and non-broadcast models.

Sorry, maybe I am missing something, could you elaborate more on how you were thinking of using the broadcast variable in a map?

@jkbradley
Member

Sorry, I should not have said "map." I agree we should not use map since that will create an RDD. By map, I really meant UDF.

Here's the sketch of what I meant for transform/predict:

  • Predictor.transform: check schema, and call transformImpl
  • Predictor.transformImpl: call predict in a UDF, as it does now
  • RF/GBT*.predict: keep as is
  • RF/GBT*.transformImpl (override): broadcast model, and call UDF. Inside the UDF, get model from the broadcast variable, and call predict on it.

Does that make sense?
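
Continuing the toy sketch from earlier (ToyPredictor and its hard-coded column names are assumptions, not the actual RandomForestClassificationModel internals), the shape described in these bullet points would look roughly like this: the ensemble overrides only transformImpl, broadcasts itself once, and calls predict on the broadcast copy inside the UDF.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

class ToyForestModel(weights: Array[Double]) extends ToyPredictor {
  override def predict(features: Vector): Double =
    features.toArray.zip(weights).map { case (x, w) => x * w }.sum

  // Broadcast the model once per transform call instead of shipping it with every task closure.
  override protected def transformImpl(dataset: DataFrame): DataFrame = {
    val bcastModel = dataset.sparkSession.sparkContext.broadcast(this)
    val predictUDF = udf((features: Vector) => bcastModel.value.predict(features))
    dataset.withColumn("prediction", predictUDF(col("features")))
  }
}
```

The UDF closure captures only the Broadcast handle, so the serialized model is shipped and cached once per executor rather than being serialized into every task closure.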

@BryanCutler
Member Author

Yeah, that makes sense. It wasn't clear to me that calling predict on the broadcast model inside the UDF would mean the member vars accessed in predict come from that broadcast instance. Hopefully this last iteration is more along the lines of what you were thinking. I also merged with the latest master and replaced callUDF with udf. Thanks for the help @jkbradley !

@jkbradley
Member

Jenkins test this please

@jkbradley
Member

@BryanCutler Nice, the updates look good, and it's a bit simpler now. LGTM pending tests.

@SparkQA

SparkQA commented Jul 15, 2015

Test build #1077 has finished for PR 6300 at commit 86e73de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler changed the title from "[SPARK-7127] [MLLIB] [WIP] Adding broadcast of model before prediction in RandomForestClassifier" to "[SPARK-7127] [MLLIB] Adding broadcast of model before prediction for ensembles" on Jul 16, 2015
@jkbradley
Member

Merging with master.
Thank you for the contribution!

@asfgit closed this in 8b8be1f on Jul 17, 2015
@BryanCutler
Member Author

Cool, thanks!

@BryanCutler deleted the bcast-ensemble-models-7127 branch on November 18, 2015, 21:37
@tamilselvanv1be

Can anyone help me with how to use the transformImpl method for the predictProbability method? I see it's not implemented in transformImpl of RandomForestClassificationModel, so my streaming job ends up broadcasting the RF model for every mini batch. Please help me with a way to implement this. Thanks.
