Conversation

leifker commented Mar 15, 2017

## What changes were proposed in this pull request?

Updated CrossValidator and TrainValidationSplit to accept multiple
pipelines, making it possible to test different algorithms and/or
to better control tuning combinations. Maintains a
backwards-compatible API and reads legacy serialized objects.

## How was this patch tested?

Unit tests

Author: David Leifker <[email protected]>

AmplabJenkins commented

Can one of the admins verify this patch?

srowen (Member) commented Mar 15, 2017

Have a look at http://spark.apache.org/contributing.html first

So the idea is to search over pipelines with potentially different components, not just one set of components with varying parameters?

What you have here amounts to running n pipeline evaluations with n grid searches. That could just be done with the existing machinery, run n times. In this new model it seems hard to work out which pipeline was selected, except by inspecting it.

It makes some sense, but the alternative isn't much code either: just combine the results of n grid searches.

leifker (Author) commented Mar 15, 2017

Looking over the contributing link, it seems I should open a JIRA issue?

The intent is, like you said, to run the CrossValidator with different pipelines.

The same could be done using an external iterative approach: build different pipelines, throw each into a CrossValidator, take the best model from each of those CrossValidators, and finally pick the best from those. This is the initial approach I explored. It resulted in a lot of boilerplate code that felt like it shouldn't need to exist if the API simply allowed for arrays of estimators and their parameters.
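
For illustration, a minimal sketch of that external approach (assuming `pipelines` and `grids` are parallel sequences holding the candidate pipelines and their param grids, and `training` is the input DataFrame):

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.CrossValidator

    val evaluator = new BinaryClassificationEvaluator()

    // One CrossValidator per pipeline; each fit builds and caches its own folds
    val cvModels = pipelines.zip(grids).map { case (pipe, grid) =>
      new CrossValidator()
        .setEstimator(pipe)
        .setEstimatorParamMaps(grid)
        .setEvaluator(evaluator)
        .setNumFolds(3)
        .fit(training)
    }

    // Finally, pick the overall winner (assuming a larger metric is better)
    val best = cvModels.maxBy(_.avgMetrics.max).bestModel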

A couple of advantages of this implementation come from keeping the functional interface to the CrossValidator:

  1. The caching of the folds is better utilized. An external iterative approach creates a new set of k folds for each CrossValidator fit, and the folds are discarded after each CrossValidator run. In this implementation a single set of k folds is created and cached for all of the pipelines (see the sketch at the end of this comment).
  2. Another potential advantage of this implementation is future parallelization of the pipelines within the CrossValidator. It is of course possible to handle the parallelization outside of the CrossValidator here too; however, I believe there is already work in progress to parallelize the grid parameters, and that could be extended to multiple pipelines.

Both of those behind-the-scenes optimizations are possible because the CrossValidator is given the data and the complete set of pipelines/estimators to evaluate up front, which allows the implementation details to be abstracted away.
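
For reference, a rough paraphrase of how fold creation works inside `CrossValidator.fit` today (a simplified sketch of the Spark 2.x code, not a verbatim excerpt; `df`, `spark`, `numFolds`, and `seed` are stand-ins):

    import org.apache.spark.mllib.util.MLUtils

    // Each CrossValidator.fit call builds and caches its own k folds, which
    // are discarded when fit returns. Running n separate CrossValidators
    // repeats this n times; accepting all pipelines up front lets one set of
    // cached folds serve every pipeline.
    val splits = MLUtils.kFold(df.rdd, numFolds, seed)
    splits.foreach { case (training, validation) =>
      val trainingDF = spark.createDataFrame(training, df.schema).cache()
      val validationDF = spark.createDataFrame(validation, df.schema).cache()
      // ... fit each candidate on trainingDF, evaluate on validationDF ...
      trainingDF.unpersist()
      validationDF.unpersist()
    }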

srowen (Member) commented Mar 16, 2017

Yes re: JIRA
CC @MLnick or @jkbradley as well

I agree about reusing the cached folds, that's a good point. I think this is worth considering.

leifker changed the title from "[MLLIB] Allow multiple pipelines when tuning" to "[SPARK-19979][MLLIB] Allow multiple pipelines when tuning" on Mar 16, 2017
MLnick (Contributor) commented Mar 16, 2017

I commented on the linked JIRA also.

In principle I think this can be a useful enhancement, and yes, the better efficiency on the caching side is a good benefit. I'd actually been thinking about a (simpler) version that would allow multiple Predictors to be the last stages of a pipeline and return the best from those, so evaluating a set of algorithms at the same time (but keeping the preceding pipeline stages common).

This is a generalization of that idea.

I think it gels with SPARK-19071 and the next phase of that work, which would look at smarter caching & re-use of stages in the pipeline. Also related is the subtask of that ticket, the parallel CV work in SPARK-19357.

Since the Spark 2.2 code freeze is imminent, I think this will have to be delayed a little, but it would be good to coordinate with #16774 where appropriate.

BryanCutler (Member) commented

Thanks @leifker for the PR, this is a good idea. I think, though, that it can already be accomplished with the current param grid builder. Since the stages of a pipeline are actually a param, you can add them to the param grid, and it will evaluate the different pipelines while also reusing the cached splits in cross-val. I tried this out by modifying the ModelSelectionViaCrossValidation example and it seems to work:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.ml.tuning.ParamGridBuilder

    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
    val dt = new DecisionTreeClassifier()
      .setMaxDepth(5)
    val pipeline = new Pipeline()

    // Two alternative stage sequences: same preprocessing, different classifier
    val pipeline1: Array[PipelineStage] = Array(tokenizer, hashingTF, lr)
    val pipeline2: Array[PipelineStage] = Array(tokenizer, hashingTF, dt)

    // `stages` is itself a Param, so the stage arrays can go into the grid
    val paramGrid = new ParamGridBuilder()
      .addGrid[Array[PipelineStage]](pipeline.stages, Array(pipeline1, pipeline2))
      .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()
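
Wiring that grid into cross-validation then looks roughly like this (a sketch following that example; `training` is assumed to be the labeled input DataFrame):

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.CrossValidator

    // The grid varies pipeline.stages, so the outer estimator is the bare Pipeline
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(paramGrid)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setNumFolds(2)

    val cvModel = cv.fit(training)  // best model across both stage sequences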

leifker (Author) commented Mar 20, 2017

Sorry for the delayed response @BryanCutler, that's pretty neat, however this will perform unneeded work because it executes nonsensical combinations of parameters. For example, if pipeline2 is selected/executed (decision tree), the code will run the pipeline with lr.regParam set to 0.1 and 0.01, which doesn't matter since we're using the decision tree algorithm. The grid above expands to 2 × 3 × 2 = 12 candidate maps, and the six decision-tree maps differ only in an unused lr.regParam, so three of them are redundant fits.

BryanCutler (Member) commented

Yeah, that's true in this case. You could just build the grids separately and combine them like this:

    // Grid for the logistic regression pipeline: vary numFeatures and regParam
    val pipeline1_grid = new ParamGridBuilder()
      .baseOn(pipeline.stages -> pipeline1)
      .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()

    // Grid for the decision tree pipeline: only numFeatures applies
    val pipeline2_grid = new ParamGridBuilder()
      .baseOn(pipeline.stages -> pipeline2)
      .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
      .build()

    // Concatenate: 6 + 3 = 9 candidate maps, none of them nonsensical
    val paramGrid = pipeline1_grid ++ pipeline2_grid

Maybe not all that intuitive though...
It might be worth adding some smarts to the ParamGridBuilder or CrossValidator to prune the grid when a candidate pipeline doesn't use some of its params in any of the stages.
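
For what it's worth, a rough sketch of what that pruning could look like as a standalone helper (hypothetical code, not an existing Spark API; it drops param pairs whose parent stage is absent from a candidate's stage list, then collapses the duplicates that creates):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.param.ParamMap

    def pruneGrid(pipeline: Pipeline, grid: Array[ParamMap]): Array[ParamMap] = {
      val pruned = grid.map { candidate =>
        // uids of the stages actually present in this candidate
        val stageUids = candidate.get(pipeline.stages)
          .map(_.map(_.uid).toSet).getOrElse(Set.empty[String])
        // keep the stages param itself, plus params belonging to those stages
        val kept = candidate.toSeq.filter { pair =>
          pair.param == pipeline.stages || stageUids.contains(pair.param.parent)
        }
        kept.foldLeft(ParamMap.empty) { (m, pair) => m.put(pair) }
      }
      // dropping pairs leaves duplicate maps behind; collapse them, keyed on
      // toString since ParamMap does not define structural equality
      pruned.groupBy(_.toString).values.map(_.head).toArray
    }

Applied to the first example's cross-product paramGrid, this would cut the six decision-tree maps down to three.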

leifker (Author) commented Mar 21, 2017

Interesting, let me think about this a bit. I think there is probably a better API around this approach, for sure.

leifker (Author) commented Apr 2, 2017

Closing this PR; I'll get back to this eventually, but I'm dealing with some other priorities at the moment.

leifker closed this Apr 2, 2017