Contributor

holdenk commented Dec 4, 2017

What changes were proposed in this pull request?

Adds PMML export support to Spark ML pipelines in the style of Spark's DataSource API to allow library authors to add their own model export formats.

Includes a specific implementation for Spark ML linear regression PMML export.

In addition to adding PMML to reach parity with our current MLlib implementation, this approach will allow other libraries & formats (like PFA) to implement and export models with a unified API.
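
As a rough illustration of the DataSource-style delegation described above, here is a minimal, Spark-free sketch in plain Scala. `ToyLinearModel`, `ExportFormat`, `ToyPmmlFormat`, `SketchWriter`, and the `"toy-pmml"` format name are hypothetical stand-ins invented for this example, not the classes added by this PR. The sketch also returns a string from `save()` rather than writing to a path, purely so that it runs without a SparkSession or a file system.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a fitted model; not a Spark class.
final case class ToyLinearModel(coefficients: Seq[Double], intercept: Double)

// Hypothetical exporter interface: one implementation per output format.
trait ExportFormat {
  def render(model: ToyLinearModel, options: collection.Map[String, String]): String
}

// Toy exporter producing an XML-like string (this is NOT valid PMML).
object ToyPmmlFormat extends ExportFormat {
  def render(model: ToyLinearModel, options: collection.Map[String, String]): String = {
    val name = options.getOrElse("modelName", "model")
    val terms = model.coefficients.map(c => s"""<coef value="$c"/>""").mkString
    s"""<model name="$name" intercept="${model.intercept}">$terms</model>"""
  }
}

// Writer that picks an exporter by format name, in the spirit of df.write.format(...).
final class SketchWriter(model: ToyLinearModel, registry: Map[String, ExportFormat]) {
  private var source = "internal"
  private val options = mutable.Map.empty[String, String]

  def format(name: String): this.type = { source = name; this }
  def option(key: String, value: String): this.type = { options(key) = value; this }

  // Delegates to whichever exporter is registered under the requested format name.
  def save(): String = registry.get(source) match {
    case Some(exporter) => exporter.render(model, options)
    case None => throw new IllegalArgumentException(s"Unknown export format: $source")
  }
}
```

With this sketch, `new SketchWriter(model, Map("toy-pmml" -> ToyPmmlFormat)).format("toy-pmml").save()` dispatches to the toy exporter, and an unregistered format name fails with an `IllegalArgumentException`.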

How was this patch tested?

Basic unit test.

holdenk changed the title from "[WIP][ML][SPARK-11171] spark 11237 Add PMML export to Spark ML pipelines" to "[WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to Spark ML pipelines" on Dec 4, 2017

SparkQA commented Dec 4, 2017

Test build #84426 has finished for PR 19876 at commit de86190.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • trait PMMLReadWriteTest extends TempDirectory

SparkQA commented Dec 5, 2017

Test build #84470 has finished for PR 19876 at commit 8b1c752.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 5, 2017

Test build #84473 has finished for PR 19876 at commit 72b509f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 5, 2017

Test build #84474 has finished for PR 19876 at commit b8362a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

holdenk commented Dec 6, 2017

You can see some of the past discussion on #9207 cc @sethah @MLnick

Contributor Author

holdenk commented Dec 10, 2017

@sethah: want to publish the comments from when we were chatting in Singapore? :p

Contributor

sethah commented Dec 12, 2017

@holdenk Do you mind leaving some comments on the intentions/benefits of this new API for the benefit of other reviewers? For example, what use cases may exist (adding third-party PFA support, or other third-party export tools?), and also why we need to add PMML support when there are already tools that do this, such as jpmml-sparkml.

Also, this is two changes in one PR: adding an API for generic model export and adding PMML to LinearRegression. I think it makes sense to separate the two, and just focus on the new API here. What do you think?

Contributor

sethah left a comment

Still reviewing, but gonna publish what I have for now.

val writer = writerCls.newInstance().asInstanceOf[MLWriterFormat]
writer.write(path, sparkSession, optionMap, stage)
} else {
throw new SparkException("ML source $source is not a valid MLWriterFormat")

Contributor

nit: need string interpolation here

Contributor Author

Good catch, I've added a test for this error message.
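
For readers unfamiliar with the nit: without the `s` prefix, Scala treats `$source` as literal text, so the error message would never name the offending format. A small standalone sketch of the difference (the value assigned to `source` is made up for illustration):

```scala
object InterpolationDemo {
  val source = "com.example.MissingWriter" // hypothetical format name

  // Plain literal: "$source" is kept verbatim in the message.
  val plain = "ML source $source is not a valid MLWriterFormat"

  // s-interpolated literal: "$source" is replaced by the value of `source`.
  val interpolated = s"ML source $source is not a valid MLWriterFormat"
}
```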

* Function write the provided pipeline stage out.
*/
def write(path: String, session: SparkSession, optionMap: mutable.Map[String, String],
stage: PipelineStage)

Contributor

return type?

*/
@Since("1.6.0")
- override def write: MLWriter = new LinearRegressionModel.LinearRegressionModelWriter(this)
+ override def write: GeneralMLWriter = new GeneralMLWriter(this)

Contributor

The doc above this is wrong.

Contributor Author

fixed

testPMMLWrite(sc, model, checkModel)
}

test("unsupported export format") {

Contributor

Would be great to have a test that verifies that this works with third party implementations. Specifically, that something like model.write.format("org.apache.spark.ml.MyDummyWriter").save(path) works.

Contributor Author

Sure, I'll put a dummy writer in test so it doesn't clog up our class space.
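
A Spark-free sketch of what such a test exercises: resolving a writer from its fully qualified class name by reflection, and rejecting classes that do not implement the expected interface. `SketchWriterFormat`, `DummySketchWriter`, and `WriterLoader` are hypothetical names for this illustration, and the lookup in the PR may differ in detail. The sketch assumes the definitions are compiled as top-level classes, so that the dummy writer has a public zero-argument constructor.

```scala
// Hypothetical stand-in for the writer interface under review.
trait SketchWriterFormat {
  def write(path: String, options: Map[String, String]): String
}

// A "third-party" writer; reflection needs its public zero-argument constructor.
class DummySketchWriter extends SketchWriterFormat {
  def write(path: String, options: Map[String, String]): String = s"dummy export to $path"
}

object WriterLoader {
  // Resolve a class name to a writer instance, or explain why it cannot be used.
  def load(className: String): Either[String, SketchWriterFormat] =
    try {
      val cls = Class.forName(className)
      if (classOf[SketchWriterFormat].isAssignableFrom(cls)) {
        Right(cls.getDeclaredConstructor().newInstance().asInstanceOf[SketchWriterFormat])
      } else {
        Left(s"ML source $className is not a valid writer format")
      }
    } catch {
      case _: ClassNotFoundException => Left(s"Could not load requested format $className")
    }
}
```

A test in this style would pass `classOf[DummySketchWriter].getName` and check that the dummy's `write` was the one invoked, then pass an unrelated class such as `java.lang.String` and check for the error.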

/**
* A ML Writer which delegates based on the requested format.
*/
class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging {

Contributor

Perhaps for another PR, but maybe we could add a method here:

  def pmml(path: String): Unit = {
    this.source = "pmml"
    save(path)
  }

Contributor Author

So I don't think that belongs in the base GeneralMLWriter, but we could make a trait for writers which support PMML to mix in?

Contributor Author

The follow up issue to track this is https://issues.apache.org/jira/browse/SPARK-11241
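
One way the mix-in idea could look, as a Spark-free sketch. `FormatWriter`, `PMMLShortcut`, `LinearModelWriter`, and `OtherModelWriter` are hypothetical names, and `save` returns a description string here only so the example runs without a file system or a SparkSession.

```scala
// Hypothetical minimal base writer with a selectable format.
class FormatWriter {
  protected var source: String = "internal"
  def format(name: String): this.type = { source = name; this }
  def save(path: String): String = s"saved as $source to $path"
}

// Opt-in shortcut: only writers whose models support PMML mix this in.
trait PMMLShortcut { self: FormatWriter =>
  def pmml(path: String): String = format("pmml").save(path)
}

// A writer for a model type that supports PMML export.
class LinearModelWriter extends FormatWriter with PMMLShortcut

// A writer for a model type without PMML support; it exposes no pmml method.
class OtherModelWriter extends FormatWriter
```

The appeal of this shape is that calling `pmml(...)` on a writer without PMML support fails at compile time rather than at save time.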

Contributor Author

holdenk commented Dec 20, 2017

@sethah So I'm hesitant to push an API without an implementation to make sure it's actually usable for our goal. But I'm fine splitting it out into a separate PR.

SparkQA commented Dec 20, 2017

Test build #85204 has finished for PR 19876 at commit 6e9cdc3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 20, 2017

Test build #85207 has finished for PR 19876 at commit b8844c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

holdenk changed the title from "[WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to Spark ML pipelines" to "[ML][SPARK-11171][SPARK-11239] Add PMML export to Spark ML pipelines" on Dec 20, 2017

Contributor Author

holdenk commented Dec 29, 2017

So @MLnick / @sethah I'd like to put something like this in before we roll 2.3 RC1. Do y'all have any feedback before that?

/**
* Abstract class for utility classes that can save ML instances.
*/
@deprecated("Use GeneralMLWriter instead. Will be removed in Spark 3.0.0", "2.3.0")

Contributor Author

I'm debating if this should be deprecated in 2.4 and just have this as a new option in 2.3. What do you think @sethah / @MLnick ?

Contributor Author

holdenk commented Jan 8, 2018

I'm going to update this tomorrow, but if no one has anything by EOW would folks be OK with this as an experimental developer API for 2.3? cc @JoshRosen ?

Contributor

sethah left a comment

Another pass :)

* ML export formats for should implement this trait so that users can specify a shortname rather
* than the fully qualified class name of the exporter.
*
* A new instance of this class will be instantiated each time a DDL call is made.

Contributor

Was this supposed to be retained from the DataSourceRegister?

val loader = Utils.getContextOrSparkClassLoader
val serviceLoader = ServiceLoader.load(classOf[MLFormatRegister], loader)
val stageName = stage.getClass.getName
val targetName = s"${source}+${stageName}"

Contributor

don't need brackets

*
* {{{
* override def shortName(): String =
* "pmml+org.apache.spark.ml.regression.LinearRegressionModel"

Contributor

what about making a second abstract field def stageName(): String, instead of having it packed into one string?
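
A sketch of this suggestion, assuming a trait shaped roughly like the one under review. `FormatRegisterSketch`, `PmmlLinearRegressionRegister`, and `RegisterLookup` are hypothetical names; the case-insensitive match on the format is an assumption of this example, and whether the combined key stays public is a design choice it does not settle.

```scala
// Hypothetical registration trait with the format and stage as separate members.
trait FormatRegisterSketch {
  def format(): String     // e.g. "pmml"
  def stageName(): String  // fully qualified class name of the supported stage

  // The combined lookup key is derived rather than hand-written by implementers.
  final def shortName(): String = s"${format()}+${stageName()}"
}

class PmmlLinearRegressionRegister extends FormatRegisterSketch {
  def format(): String = "pmml"
  def stageName(): String = "org.apache.spark.ml.regression.LinearRegressionModel"
}

object RegisterLookup {
  // Assumption: match case-insensitively on the format, exactly on the stage class name.
  def find(
      regs: Seq[FormatRegisterSketch],
      format: String,
      stage: String): Option[FormatRegisterSketch] =
    regs.find(r => r.format().equalsIgnoreCase(format) && r.stageName() == stage)
}
```

Keeping the two parts separate means an implementer cannot get the separator or the ordering of the packed string wrong.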

trait MLFormatRegister {
/**
* The string that represents the format that this data source provider uses. This is
* overridden by children to provide a nice alias for the data source. For example:

Contributor

"data source" -> "model format"?


/**
* Overwrites if the output path already exists.
* Specifies the format of ML export (e.g. PMML, internal, or

Contributor

sethah, Jan 9, 2018

change to e.g. "pmml", "internal", or the fully qualified class name for export).

@InterfaceStability.Evolving
trait MLWriterFormat {
/**
* Function write the provided pipeline stage out.

Contributor

Should add a full doc here with param annotations. Also should it be "Function to write ..."?

/**
* A ML Writer which delegates based on the requested format.
*/
class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging {

Contributor

need @Since("2.3.0") here?

}

// override for Java compatibility
override def session(sparkSession: SparkSession): this.type = super.session(sparkSession)

Contributor

since tags here

* @since 2.3.0
*/
@InterfaceStability.Evolving
trait MLWriterFormat {

Contributor

do we need the actual since annotations here, though?

}
}

test("dummy export format is called") {

Contributor

We can also add tests for the MLFormatRegister similar to DDLSourceLoadSuite. Just add a META-INF/services/ directory to src/test/resources/
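
For context, a Spark-free sketch of the `java.util.ServiceLoader` mechanism such a test would rely on. `ExportRegister` and `ServiceLookup` are hypothetical names. Providers are discovered from a resource file under `META-INF/services/` that is named after the interface's fully qualified name and lists one implementation class per line. No such file ships with this snippet, so run as-is the lookup finds nothing.

```scala
import java.util.ServiceLoader

// Hypothetical service interface; implementations need a public zero-argument constructor.
trait ExportRegister {
  def shortName(): String
}

object ServiceLookup {
  // Collect every provider visible to `loader` whose short name matches, ignoring case.
  def find(name: String, loader: ClassLoader): List[ExportRegister] = {
    val it = ServiceLoader.load(classOf[ExportRegister], loader).iterator()
    var found = List.empty[ExportRegister]
    while (it.hasNext) {
      val candidate = it.next()
      if (candidate.shortName().equalsIgnoreCase(name)) found = candidate :: found
    }
    found.reverse
  }
}
```

Adding a services file under `src/test/resources/META-INF/services/` that names a dummy implementation would make `find` return it, which is the behaviour a registration test would assert.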

Contributor

MLnick commented Jan 16, 2018

So to be clear this doesn't handle the read path at all? Would there be a plan to implement a similar read API? Though often with "model serving" formats like PMML, the write is what's important.

Overall I like the idea of an open API for plugging in model serialization formats (as I've commented on the previous PRs etc). This might be a bit too much to put in for 2.3 though?

Contributor Author

holdenk commented Jan 17, 2018

@MLnick I think the read path could follow a similar approach, but for models which we export from Spark and wish to load back into Spark the internal format is probably the best option.

As far as options, this keeps the same stringly typed options interface as we have with Datasources, and individual writers are free to specialize and add their own special methods.

I'm open to the idea of punting this, but we've punted PMML export since we introduced ML pipelines, and users are still asking for general export support, so the feature request hasn't gone away either. If we think the design needs another pass that's fine, but I'd like to suggest that this is probably a good starting block that we can add more things on top of.

If folks agree the foundation isn't bad I think putting it in for 2.3 would be useful (and it's not like we haven't had a long time to consider this design). This specific PR has been open since Dec 4th but we've been looking at variants on this idea for years.

SparkQA commented Jan 17, 2018

Test build #86226 has finished for PR 19876 at commit 6411054.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 17, 2018

Test build #86228 has finished for PR 19876 at commit 4047239.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 17, 2018

Test build #86229 has finished for PR 19876 at commit cd330f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 17, 2018

Test build #86230 has finished for PR 19876 at commit 41312e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

holdenk commented Jan 18, 2018

re-ping @MLnick + ping @jkbradley , thoughts?

Contributor Author

holdenk commented Jan 19, 2018

also maybe @dbtsai ?

Contributor Author

holdenk commented Jan 21, 2018

re-ping folks ?

Contributor Author

holdenk commented Feb 28, 2018

So now that it looks like 2.3 is pretty much wrapped up, do folks have any thoughts? @MLnick @jkbradley @sethah ?

Contributor Author

holdenk commented Mar 5, 2018

So if no one has comments by March 15th I'm going to update the tags & push forward with this API, since we need something and this is the design most folks seem to be interested in from the last proposal.

Contributor Author

holdenk commented Mar 13, 2018

March 15th is soon, any thoughts @MLnick @jkbradley @sethah ?

Contributor Author

holdenk commented Mar 20, 2018

Will do a last pass myself and merge on Friday if no one else has opinions.

Contributor Author

holdenk commented Mar 22, 2018

See some related thoughts in how to support Spark in kubeflow: kubeflow/spark-operator#119

SparkQA commented Mar 23, 2018

Test build #88534 has finished for PR 19876 at commit 9075626.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*
* Must have a valid zero argument constructor which will be called to instantiate.
*
* @since 2.3.0

Contributor Author

Need to update since annotations to 2.4.0

* ML export formats for should implement this trait so that users can specify a shortname rather
* than the fully qualified class name of the exporter.
*
* A new instance of this class will be instantiated each time a save call is made.

Contributor Author

Add a comment about zero arg constructor requirement

Contributor Author

done

Contributor Author

holdenk commented Mar 23, 2018

LGTM; will merge pending Jenkins.

holdenk changed the title from "[ML][SPARK-11171][SPARK-11239] Add PMML export to Spark ML pipelines" to "[ML][SPARK-23783][SPARK-11239] Add PMML export to Spark ML pipelines" on Mar 23, 2018
@asfgit asfgit closed this in 95c03cb Mar 23, 2018

SparkQA commented Mar 23, 2018

Test build #88546 has finished for PR 19876 at commit cb6fd70.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/** A writer for LinearRegression that handles the "pmml" format */
private class PMMLLinearRegressionModelWriter
extends MLWriterFormat with MLFormatRegister {

Contributor

Should be two space indentation:
  extends MLWriterFormat with MLFormatRegister {

Contributor Author

Thanks for pointing this out, I'll fix it in a follow up.

Contributor Author

I've included this in #20907

Contributor

Thanks!
