[SPARK-19979][MLLIB] Allow multiple pipelines when tuning #17306
Conversation
leifker commented on Mar 15, 2017
## What changes were proposed in this pull request?
Updated CrossValidator and TrainValidationSplit to be able to
accept multiple pipelines for testing different algorithms
and/or being able to better control tuning combinations.
## How was this patch tested?
Unit tests
Author: David Leifker <[email protected]>
Can one of the admins verify this patch?
Have a look at http://spark.apache.org/contributing.html first. So the idea is to search over pipelines with different components, potentially, not just one set of components with varying parameters? What you have here amounts to running n pipeline evaluations with n grid searches, which could just be done with the existing machinery run n times. In this new model it seems hard to work out which pipeline was selected, except by inspecting it. It makes some sense, but the alternative isn't much code either: just combine the results of n grid searches.
Looking over the contributing link, it seems I should open a JIRA issue. The intent is, as you said, to run the CrossValidator with different pipelines. The same could be done using an external iterative approach: build different pipelines, throw each into a CrossValidator, take the best model from each of those CrossValidators, and then finally pick the best of those. This is the initial approach I explored. It resulted in a lot of boilerplate code that felt like it shouldn't need to exist if the API simply allowed arrays of estimators and their parameters. A couple of advantages of this implementation to consider come from keeping the functional interface to the CrossValidator.
Both of those behind-the-scenes optimizations are possible because the CrossValidator is given the data and the complete set of pipelines/estimators to evaluate up front, allowing the implementation to be abstracted away.
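The external iterative approach described above can be sketched roughly as follows. This is a hedged sketch against the Spark 2.x ML API, not code from the PR; the candidate pipelines, their grids, and `trainingData` are assumed to be defined elsewhere:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel}

// One (pipeline, paramGrid) pair per candidate algorithm; assumed defined elsewhere.
val candidates: Seq[(Pipeline, Array[ParamMap])] = ???

val evaluator = new BinaryClassificationEvaluator()

// Run a separate cross-validation per pipeline...
val models: Seq[CrossValidatorModel] = candidates.map { case (pipeline, grid) =>
  new CrossValidator()
    .setEstimator(pipeline)
    .setEvaluator(evaluator)
    .setEstimatorParamMaps(grid)
    .setNumFolds(3)
    .fit(trainingData)   // each fit re-splits and re-caches its own folds
}

// ...then pick the overall winner by its best average metric
// (assuming a larger-is-better metric such as areaUnderROC).
val overallBest = models.maxBy(_.avgMetrics.max)
```

This per-pipeline loop is the boilerplate being referred to, and because each `fit` call sees only its own pipeline, the fold splits cannot be shared across candidates.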
Yes re: the JIRA. I agree about reusing the cached folds; that's a good point. I think this is worth considering.
I commented on the linked JIRA also. In principle I think this can be a useful enhancement, and yes, the better efficiency on the caching side is a good benefit. I'd actually been thinking about a (simpler) version that would allow multiple …; this is a generalization of that idea. I think it gels with SPARK-19071 and the next phase of that work, which would look at smarter caching and re-use of stages in the pipeline. It's also related to a subtask of that ticket, the parallel CV work in SPARK-19357. Since the Spark 2.2 code freeze is imminent, I think this will have to be delayed a little, but it would be good to coordinate with #16774 where appropriate.
Thanks @leifker for the PR, this is a good idea. I think, though, that it can already be accomplished with the current param grid builder. Since the stages of a pipeline are themselves a param, you can add them to the param grid, and cross-validation will evaluate the different pipelines while still reusing the cached splits. I tried this out by modifying the example:

```scala
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
val dt = new DecisionTreeClassifier()
  .setMaxDepth(5)
val pipeline = new Pipeline()
val pipeline1: Array[PipelineStage] = Array(tokenizer, hashingTF, lr)
val pipeline2: Array[PipelineStage] = Array(tokenizer, hashingTF, dt)
val paramGrid = new ParamGridBuilder()
  .addGrid[Array[PipelineStage]](pipeline.stages, Array(pipeline1, pipeline2))
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()
```
Sorry for the delayed response @BryanCutler. That's pretty neat; however, it will perform unneeded work, since it executes nonsensical combinations of parameters. For example, when pipeline2 (the decision tree) is selected, the grid will still run it with lr.regParam set to 0.1 and to 0.01, which makes no difference because the decision tree never reads that param.
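To make the overhead concrete, here is a small stand-alone sketch (plain Scala, no Spark required; the counts mirror the example grid above) comparing the size of the single combined grid with two per-pipeline grids:

```scala
// A param grid's size is the cartesian product of each param's value count.
def gridSize(paramValueCounts: Seq[Int]): Int = paramValueCounts.product

// Combined grid: 2 pipelines x 3 numFeatures values x 2 regParam values.
val combined = gridSize(Seq(2, 3, 2))                  // 12 fits per fold

// Separate grids: (3 x 2) for the LR pipeline, plus 3 for the DT pipeline.
val separate = gridSize(Seq(3, 2)) + gridSize(Seq(3))  // 9 fits per fold

// The 3 extra fits in the combined grid are DT runs that only
// vary a param the decision tree never uses.
println(s"combined=$combined, separate=$separate")
```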
Yeah, that's true in this case. You could just build the grids separately and combine them like this:

```scala
val pipeline1_grid = new ParamGridBuilder()
  .baseOn(pipeline.stages -> pipeline1)
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()
val pipeline2_grid = new ParamGridBuilder()
  .baseOn(pipeline.stages -> pipeline2)
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .build()
val paramGrid = pipeline1_grid ++ pipeline2_grid
```

Maybe not all that intuitive though...
Interesting, let me think about this a bit. I think there is probably a better API around this approach for sure.
Closing this PR; I will get back to this eventually, but I'm dealing with some other priorities at the moment.