[ML][SPARK-23783][SPARK-11239] Add PMML export to Spark ML pipelines #19876
Conversation
Test build #84426 has finished for PR 19876 at commit
Test build #84470 has finished for PR 19876 at commit
Test build #84473 has finished for PR 19876 at commit
Test build #84474 has finished for PR 19876 at commit
@sethah: want to publish the comments from when we were chatting in Singapore? :p
@holdenk Do you mind leaving some comments on the intentions/benefits of this new API for the benefit of other reviewers? For example, what use cases may exist - adding third party PFA support (other third party export tools?) - and also why we need to add PMML support when there are already tools that do this (e.g. jpmml-sparkml). Also, this is two changes in one PR: adding an API for generic model export and adding PMML to LinearRegression. I think it makes sense to separate the two and just focus on the new API here. What do you think?
sethah left a comment
Still reviewing, but gonna publish what I have for now.
    val writer = writerCls.newInstance().asInstanceOf[MLWriterFormat]
    writer.write(path, sparkSession, optionMap, stage)
  } else {
    throw new SparkException("ML source $source is not a valid MLWriterFormat")
nit: need string interpolation here
Good catch, I've added a test for this error message.
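For reference, a minimal sketch of the interpolated form the nit is asking for, using the variables from the snippet above:

  // With the s prefix, $source is substituted into the exception message.
  throw new SparkException(s"ML source $source is not a valid MLWriterFormat")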
 * Function write the provided pipeline stage out.
 */
def write(path: String, session: SparkSession, optionMap: mutable.Map[String, String],
    stage: PipelineStage)
return type?
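A minimal sketch of the signature with the return type spelled out, assuming the method is meant to return Unit:

  def write(path: String, session: SparkSession, optionMap: mutable.Map[String, String],
      stage: PipelineStage): Unit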
 */
@Since("1.6.0")
override def write: MLWriter = new LinearRegressionModel.LinearRegressionModelWriter(this)
override def write: GeneralMLWriter = new GeneralMLWriter(this)
The doc above this is wrong.
fixed
  testPMMLWrite(sc, model, checkModel)
}

test("unsupported export format") {
Would be great to have a test that verifies that this works with third party implementations. Specifically, that something like model.write.format("org.apache.spark.ml.MyDummyWriter").save(path) works.
Sure, I'll put a dummy writer in test so it doesn't clog up our class space.
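A rough sketch of what such a test-only dummy writer could look like; the class name, package locations in the imports, and the exception used to observe the call are illustrative, not necessarily what the PR ends up adding:

  import scala.collection.mutable
  import org.apache.spark.SparkException
  import org.apache.spark.ml.PipelineStage
  import org.apache.spark.ml.util.MLWriterFormat
  import org.apache.spark.sql.SparkSession

  // Hypothetical test-only format whose only job is to prove it was invoked.
  class MyDummyWriter extends MLWriterFormat {
    override def write(path: String, session: SparkSession,
        optionMap: mutable.Map[String, String], stage: PipelineStage): Unit = {
      throw new SparkException("MyDummyWriter was called")
    }
  }

  // In a test, roughly:
  //   intercept[SparkException] {
  //     model.write.format("org.apache.spark.ml.MyDummyWriter").save(path)
  //   }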
/**
 * A ML Writer which delegates based on the requested format.
 */
class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging {
Perhaps for another PR, but maybe we could add a method here:

def pmml(path: String): Unit = {
  this.source = "pmml"
  save(path)
}
So I don't think that belongs in the base GeneralMLWriter, but we could make a trait for writers which support PMML to mix in?
The follow-up issue tracking this is https://issues.apache.org/jira/browse/SPARK-11241
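As a sketch only (not part of this PR), the mix-in idea could look something like the following, reusing the format()/save() calls already discussed in this thread; the trait name is hypothetical:

  // Hypothetical convenience trait for writers whose stage has a PMML export format registered.
  trait PMMLWritable { self: GeneralMLWriter =>
    def pmml(path: String): Unit = {
      format("pmml")
      save(path)
    }
  }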
@sethah So I'm hesitant to push an API without an implementation, to make sure it's actually usable for our goal. But I'm fine splitting it out into a separate PR.
Test build #85204 has finished for PR 19876 at commit
Test build #85207 has finished for PR 19876 at commit
/**
 * Abstract class for utility classes that can save ML instances.
 */
@deprecated("Use GeneralMLWriter instead. Will be removed in Spark 3.0.0", "2.3.0")
I'm going to update this tomorrow, but if no one has anything by EOW, would folks be OK with this as an experimental developer API for 2.3? cc @JoshRosen ?
sethah left a comment
Another pass :)
 * ML export formats for should implement this trait so that users can specify a shortname rather
 * than the fully qualified class name of the exporter.
 *
 * A new instance of this class will be instantiated each time a DDL call is made.
Was this supposed to be retained from the DataSourceRegister?
val loader = Utils.getContextOrSparkClassLoader
val serviceLoader = ServiceLoader.load(classOf[MLFormatRegister], loader)
val stageName = stage.getClass.getName
val targetName = s"${source}+${stageName}"
don't need brackets
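i.e. the braces can be dropped for simple identifiers:

  val targetName = s"$source+$stageName"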
 *
 * {{{
 *   override def shortName(): String =
 *     "pmml+org.apache.spark.ml.regression.LinearRegressionModel"
what about making a second abstract field def stageName(): String, instead of having it packed into one string?
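A sketch of the suggested split, keeping the existing shortName() for the format alias and adding a separate stageName(); the exact member names are just for illustration:

  trait MLFormatRegister {
    /** Short format alias, e.g. "pmml". */
    def shortName(): String
    /** Fully qualified class name of the stage this format handles,
     *  e.g. "org.apache.spark.ml.regression.LinearRegressionModel". */
    def stageName(): String
  }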
trait MLFormatRegister {
  /**
   * The string that represents the format that this data source provider uses. This is
   * overridden by children to provide a nice alias for the data source. For example:
"data source" -> "model format"?
/**
 * Overwrites if the output path already exists.
 * Specifies the format of ML export (e.g. PMML, internal, or
change to e.g. "pmml", "internal", or the fully qualified class name for export).
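For context, the three ways a caller could then specify the export format (the paths and the third-party class name are illustrative):

  model.write.format("pmml").save("/tmp/lr-pmml")          // registered short name
  model.write.format("internal").save("/tmp/lr-internal")  // Spark's own format
  model.write.format("com.example.MyWriterFormat").save("/tmp/lr-custom")  // fully qualified class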
@InterfaceStability.Evolving
trait MLWriterFormat {
  /**
   * Function write the provided pipeline stage out.
Should add a full doc here with param annotations. Also should it be "Function to write ..."?
/**
 * A ML Writer which delegates based on the requested format.
 */
class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging {
need @Since("2.3.0") here?
}

// override for Java compatibility
override def session(sparkSession: SparkSession): this.type = super.session(sparkSession)
since tags here
 * @since 2.3.0
 */
@InterfaceStability.Evolving
trait MLWriterFormat {
do we need the actual since annotations here, though?
  }
}

test("dummy export format is called") {
We can also add tests for the MLFormatRegister similar to DDLSourceLoadSuite. Just add a META-INF/services/ directory to src/test/resources/
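Concretely, that registration is just a plain-text ServiceLoader descriptor; assuming MLFormatRegister lives in org.apache.spark.ml.util and the dummy register class is named as below (both are illustrative), the test resource would be a file at

  src/test/resources/META-INF/services/org.apache.spark.ml.util.MLFormatRegister

containing one fully qualified implementation class per line:

  org.apache.spark.ml.util.MyDummyFormatRegister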
So to be clear this doesn't handle the
Overall I like the idea of an open API for plugging in model serialization formats (as I've commented on the previous PRs etc). This might be a bit too much to put in for 2.3 though?
…could write a multi-format export class I suppose)
@MLnick I think the read path could follow a similar approach, but for models which we export from Spark and wish to load back into Spark, the internal format is probably the best option. As far as options, this keeps the same stringly typed options interface as we have with Datasources, and individual writers are free to specialize and add their own special methods. I'm open to the idea of punting this, but we've punted PMML export since we introduced ML pipelines and users are still asking for general export support, so the feature request hasn't gone away either. If we think the design needs another pass that's fine, but I'd like to suggest that this is probably a good starting block on top of which we can add more things. If folks agree the foundation isn't bad I think putting it in for 2.3 would be useful (and it's not like we haven't had a long time to consider this design). This specific PR has been open since Dec 4th, but we've been looking at variants on this idea for years.
Test build #86226 has finished for PR 19876 at commit
Test build #86228 has finished for PR 19876 at commit
Test build #86229 has finished for PR 19876 at commit
Test build #86230 has finished for PR 19876 at commit
re-ping @MLnick + ping @jkbradley, thoughts?

also maybe @dbtsai?

re-ping folks?
So now that it looks like 2.3 is pretty much wrapped up, do folks have any thoughts? @MLnick @jkbradley @sethah ?

So if no one has comments by March 15th, I'm going to update the tags & push forward with this API, since we need something and this is the design most folks seem to be interested in from the last proposal.
March 15th is soon; any thoughts, @MLnick @jkbradley @sethah ?
Will do a last pass myself and merge on Friday if no one else has opinions.
See some related thoughts on how to support Spark in kubeflow: kubeflow/spark-operator#119
Test build #88534 has finished for PR 19876 at commit
 *
 * Must have a valid zero argument constructor which will be called to instantiate.
 *
 * @since 2.3.0
Need to update since annotations to 2.4.0
 * ML export formats for should implement this trait so that users can specify a shortname rather
 * than the fully qualified class name of the exporter.
 *
 * A new instance of this class will be instantiated each time a save call is made.
Add a comment about zero arg constructor requirement
done
LGTM pending Jenkins; will merge.
Test build #88546 has finished for PR 19876 at commit
/** A writer for LinearRegression that handles the "pmml" format */
private class PMMLLinearRegressionModelWriter
    extends MLWriterFormat with MLFormatRegister {
Should be two space indentation:
  extends MLWriterFormat with MLFormatRegister {
Thanks for pointing this out, I'll fix it in a follow-up.
I've included this in #20907
Thanks!
What changes were proposed in this pull request?
Adds PMML export support to Spark ML pipelines in the style of Spark's DataSource API to allow library authors to add their own model export formats.
Includes a specific implementation for Spark ML linear regression PMML export.
In addition to adding PMML to reach parity with our current MLlib implementation, this approach will allow other libraries & formats (like PFA) to implement and export models with a unified API.
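As a rough usage sketch of the resulting API (the training DataFrame and output paths here are placeholders):

  import org.apache.spark.ml.regression.LinearRegression

  val lr = new LinearRegression()
  val model = lr.fit(training)  // training: DataFrame with "features"/"label" columns (placeholder)

  // Existing internal format, unchanged:
  model.write.save("/tmp/lr-internal")

  // New: delegate to a registered export format, here PMML for LinearRegressionModel:
  model.write.format("pmml").save("/tmp/lr-pmml")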
How was this patch tested?
Basic unit test.