Conversation

leifker commented Mar 15, 2017

## What changes were proposed in this pull request?

Updated CrossValidator and TrainValidationSplit to accept multiple
pipelines, making it possible to test different algorithms and/or
to better control tuning combinations. Maintains a
backwards-compatible API and reads legacy serialized objects.

## How was this patch tested?

Unit tests

Author: David Leifker <[email protected]>

AmplabJenkins commented

Can one of the admins verify this patch?

srowen (Member) commented Mar 15, 2017

Have a look at http://spark.apache.org/contributing.html first

So the idea is to search over pipelines with potentially different components, not just one set of components with varying parameters?

What you have here amounts to running n pipeline evaluations with n grid searches. That could just be done with the existing machinery, run n times. In this new model it seems hard to work out which pipeline was selected, except by inspecting it.

It makes some sense, but the alternative isn't much code either: just combine the results of n grid searches.

leifker (Author) commented Mar 15, 2017

Looking over the contributing link, it seems I should open a JIRA issue?

The intent is, like you said, to run the CrossValidator with different pipelines.

The same could be done using an external iterative approach: build different pipelines, throw each into a CrossValidator, take the best model from each of those CrossValidators, and finally pick the best from those. This is the initial approach I explored. It resulted in a lot of boilerplate code that felt like it shouldn't need to exist if the API simply allowed for arrays of estimators and their parameters.
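
For illustration, a minimal sketch of that external approach (assuming `pipelines` and `grids` are parallel sequences holding the candidate pipelines and their param grids, and `training` is the input DataFrame):

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.CrossValidator

    val evaluator = new BinaryClassificationEvaluator()

    // One CrossValidator per pipeline; each fit builds and caches its own folds
    val cvModels = pipelines.zip(grids).map { case (pipe, grid) =>
      new CrossValidator()
        .setEstimator(pipe)
        .setEstimatorParamMaps(grid)
        .setEvaluator(evaluator)
        .setNumFolds(3)
        .fit(training)
    }

    // Finally, pick the overall winner (assuming a larger metric is better)
    val best = cvModels.maxBy(_.avgMetrics.max).bestModel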

A couple of advantages of this implementation come from keeping the functional interface to the CrossValidator:

  1. The caching of the folds is better utilized. An external iterative approach creates a new set of k folds for each CrossValidator fit, and the folds are discarded after each CrossValidator run. In this implementation a single set of k folds is created and cached for all of the pipelines (see the sketch at the end of this comment).
  2. Another potential advantage of this implementation is future parallelization of the pipelines within the CrossValidator. It is of course possible to handle the parallelization outside of the CrossValidator here too; however, I believe there is already work in progress to parallelize the grid parameters, and that could be extended to multiple pipelines.

Both of those behind-the-scenes optimizations are possible because the CrossValidator is given the data and the complete set of pipelines/estimators to evaluate up front, which allows the implementation details to be abstracted away.
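
For reference, a rough paraphrase of how fold creation works inside `CrossValidator.fit` today (a simplified sketch of the Spark 2.x code, not a verbatim excerpt; `df`, `spark`, `numFolds`, and `seed` are stand-ins):

    import org.apache.spark.mllib.util.MLUtils

    // Each CrossValidator.fit call builds and caches its own k folds, which
    // are discarded when fit returns. Running n separate CrossValidators
    // repeats this n times; accepting all pipelines up front lets one set of
    // cached folds serve every pipeline.
    val splits = MLUtils.kFold(df.rdd, numFolds, seed)
    splits.foreach { case (training, validation) =>
      val trainingDF = spark.createDataFrame(training, df.schema).cache()
      val validationDF = spark.createDataFrame(validation, df.schema).cache()
      // ... fit each candidate on trainingDF, evaluate on validationDF ...
      trainingDF.unpersist()
      validationDF.unpersist()
    }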

srowen (Member) commented Mar 16, 2017

Yes re: JIRA
CC @MLnick or @jkbradley as well

I agree about reusing the cached folds, that's a good point. I think this is worth considering.

leifker changed the title from "[MLLIB] Allow multiple pipelines when tuning" to "[SPARK-19979][MLLIB] Allow multiple pipelines when tuning" on Mar 16, 2017
MLnick (Contributor) commented Mar 16, 2017

I commented on the linked JIRA also.

In principle I think this can be a useful enhancement, and yes, the better efficiency on the caching side is a good benefit. I'd actually been thinking about a (simpler) version that would allow multiple Predictors to be the last stages of a pipeline and return the best from those, so evaluating a set of algorithms at the same time (but keeping the preceding pipeline stages common).

This is a generalization of that idea.

I think it gels with SPARK-19071 and the next phase of that work, which would look at smarter caching & re-use of stages in the pipeline. Also related is the subtask of that ticket, the parallel CV work in SPARK-19357.

Since the Spark 2.2 code freeze is imminent, I think this will have to be delayed a little, but it would be good to coordinate with #16774 where appropriate.

BryanCutler (Member) commented

Thanks @leifker for the PR, this is a good idea. I think, though, that it can already be accomplished with the current param grid builder. Since the stages of a pipeline are actually a param, you can add them to the param grid, and it will evaluate the different pipelines while also reusing the cached splits in cross-val. I tried this out by modifying the ModelSelectionViaCrossValidation example and it seems to work:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.ml.tuning.ParamGridBuilder

    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
    val dt = new DecisionTreeClassifier()
      .setMaxDepth(5)
    val pipeline = new Pipeline()

    // Two alternative stage sequences: same preprocessing, different classifier
    val pipeline1: Array[PipelineStage] = Array(tokenizer, hashingTF, lr)
    val pipeline2: Array[PipelineStage] = Array(tokenizer, hashingTF, dt)

    // `stages` is itself a Param, so the stage arrays can go into the grid
    val paramGrid = new ParamGridBuilder()
      .addGrid[Array[PipelineStage]](pipeline.stages, Array(pipeline1, pipeline2))
      .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()
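
Wiring that grid into cross-validation then looks roughly like this (a sketch following that example; `training` is assumed to be the labeled input DataFrame):

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.CrossValidator

    // The grid varies pipeline.stages, so the outer estimator is the bare Pipeline
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(paramGrid)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setNumFolds(2)

    val cvModel = cv.fit(training)  // best model across both stage sequences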

leifker (Author) commented Mar 20, 2017

Sorry for the delayed response @BryanCutler, that's pretty neat, however this will perform unneeded work because it executes nonsensical combinations of parameters. For example, if pipeline2 is selected/executed (decision tree), the code will run the pipeline with lr.regParam set to 0.1 and 0.01, which doesn't matter since we're using the decision tree algorithm. The grid above expands to 2 × 3 × 2 = 12 candidate maps, and the six decision-tree maps differ only in an unused lr.regParam, so three of them are redundant fits.

BryanCutler (Member) commented

Yeah, that's true in this case. You could just build the grids separately and combine them like this:

    // Grid for the logistic regression pipeline: vary numFeatures and regParam
    val pipeline1_grid = new ParamGridBuilder()
      .baseOn(pipeline.stages -> pipeline1)
      .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()

    // Grid for the decision tree pipeline: only numFeatures applies
    val pipeline2_grid = new ParamGridBuilder()
      .baseOn(pipeline.stages -> pipeline2)
      .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
      .build()

    // Concatenate: 6 + 3 = 9 candidate maps, none of them nonsensical
    val paramGrid = pipeline1_grid ++ pipeline2_grid

Maybe not all that intuitive though...
It might be worth adding some smarts to the ParamGridBuilder or CrossValidator to prune the grid when a candidate pipeline doesn't use some of its params in any of the stages.
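
For what it's worth, a rough sketch of what that pruning could look like as a standalone helper (hypothetical code, not an existing Spark API; it drops param pairs whose parent stage is absent from a candidate's stage list, then collapses the duplicates that creates):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.param.ParamMap

    def pruneGrid(pipeline: Pipeline, grid: Array[ParamMap]): Array[ParamMap] = {
      val pruned = grid.map { candidate =>
        // uids of the stages actually present in this candidate
        val stageUids = candidate.get(pipeline.stages)
          .map(_.map(_.uid).toSet).getOrElse(Set.empty[String])
        // keep the stages param itself, plus params belonging to those stages
        val kept = candidate.toSeq.filter { pair =>
          pair.param == pipeline.stages || stageUids.contains(pair.param.parent)
        }
        kept.foldLeft(ParamMap.empty) { (m, pair) => m.put(pair) }
      }
      // dropping pairs leaves duplicate maps behind; collapse them, keyed on
      // toString since ParamMap does not define structural equality
      pruned.groupBy(_.toString).values.map(_.head).toArray
    }

Applied to the first example's cross-product paramGrid, this would cut the six decision-tree maps down to three.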

leifker (Author) commented Mar 21, 2017

Interesting, let me think about this a bit. I think there is probably a better API around this approach, for sure.

leifker (Author) commented Apr 2, 2017

Closing this PR; I'll get back to this eventually, but I'm dealing with some other priorities at the moment.

leifker closed this Apr 2, 2017