[ML][SPARK-23783][SPARK-11239] Add PMML export to Spark ML pipelines #19876
Conversation
Test build #84426 has finished for PR 19876 at commit
Test build #84470 has finished for PR 19876 at commit
Test build #84473 has finished for PR 19876 at commit
Test build #84474 has finished for PR 19876 at commit
@sethah: want to publish the comments from when we were chatting in Singapore? :p
@holdenk Do you mind leaving some comments on the intentions/benefits of this new API for the benefit of other reviewers? For example, what use cases may exist - adding third party PFA support (other third party export tools?) - and also why we need to add PMML support when there are already tools that do this (e.g. jpmml-sparkml). Also, this is two changes in one PR: adding an API for generic model export and adding PMML to LinearRegression. I think it makes sense to separate the two and just focus on the new API here. What do you think?
sethah left a comment
Still reviewing, but gonna publish what I have for now.
    val writer = writerCls.newInstance().asInstanceOf[MLWriterFormat]
    writer.write(path, sparkSession, optionMap, stage)
  } else {
    throw new SparkException("ML source $source is not a valid MLWriterFormat")
nit: need string interpolation here
Good catch, I've added a test for this error message.
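For reference, a minimal sketch of the interpolated form the nit is asking for, using the variables from the snippet above:

  // With the s prefix, $source is substituted into the exception message.
  throw new SparkException(s"ML source $source is not a valid MLWriterFormat")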
 * Function write the provided pipeline stage out.
 */
def write(path: String, session: SparkSession, optionMap: mutable.Map[String, String],
    stage: PipelineStage)
return type?
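A minimal sketch of the signature with the return type spelled out, assuming the method is meant to return Unit:

  def write(path: String, session: SparkSession, optionMap: mutable.Map[String, String],
      stage: PipelineStage): Unit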
 */
@Since("1.6.0")
override def write: MLWriter = new LinearRegressionModel.LinearRegressionModelWriter(this)
override def write: GeneralMLWriter = new GeneralMLWriter(this)
The doc above this is wrong.
fixed
  testPMMLWrite(sc, model, checkModel)
}

test("unsupported export format") {
Would be great to have a test that verifies that this works with third party implementations. Specifically, that something like model.write.format("org.apache.spark.ml.MyDummyWriter").save(path) works.
Sure, I'll put a dummy writer in test so it doesn't clog up our class space.
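A rough sketch of what such a test-only dummy writer could look like; the class name, package locations in the imports, and the exception used to observe the call are illustrative, not necessarily what the PR ends up adding:

  import scala.collection.mutable
  import org.apache.spark.SparkException
  import org.apache.spark.ml.PipelineStage
  import org.apache.spark.ml.util.MLWriterFormat
  import org.apache.spark.sql.SparkSession

  // Hypothetical test-only format whose only job is to prove it was invoked.
  class MyDummyWriter extends MLWriterFormat {
    override def write(path: String, session: SparkSession,
        optionMap: mutable.Map[String, String], stage: PipelineStage): Unit = {
      throw new SparkException("MyDummyWriter was called")
    }
  }

  // In a test, roughly:
  //   intercept[SparkException] {
  //     model.write.format("org.apache.spark.ml.MyDummyWriter").save(path)
  //   }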
/**
 * A ML Writer which delegates based on the requested format.
 */
class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging {
Perhaps for another PR, but maybe we could add a method here:

def pmml(path: String): Unit = {
  this.source = "pmml"
  save(path)
}
So I don't think that belongs in the base GeneralMLWriter, but we could make a trait for writers which support PMML to mix in?
The follow-up issue tracking this is https://issues.apache.org/jira/browse/SPARK-11241
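As a sketch only (not part of this PR), the mix-in idea could look something like the following, reusing the format()/save() calls already discussed in this thread; the trait name is hypothetical:

  // Hypothetical convenience trait for writers whose stage has a PMML export format registered.
  trait PMMLWritable { self: GeneralMLWriter =>
    def pmml(path: String): Unit = {
      format("pmml")
      save(path)
    }
  }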
@sethah So I'm hesitant to push an API without an implementation, to make sure it's actually usable for our goal. But I'm fine splitting it out into a separate PR.
Test build #85204 has finished for PR 19876 at commit
Test build #85207 has finished for PR 19876 at commit
/**
 * Abstract class for utility classes that can save ML instances.
 */
@deprecated("Use GeneralMLWriter instead. Will be removed in Spark 3.0.0", "2.3.0")
I'm going to update this tomorrow, but if no one has anything by EOW, would folks be OK with this as an experimental developer API for 2.3? cc @JoshRosen ?
sethah left a comment
Another pass :)
 * ML export formats for should implement this trait so that users can specify a shortname rather
 * than the fully qualified class name of the exporter.
 *
 * A new instance of this class will be instantiated each time a DDL call is made.
Was this supposed to be retained from the DataSourceRegister?
val loader = Utils.getContextOrSparkClassLoader
val serviceLoader = ServiceLoader.load(classOf[MLFormatRegister], loader)
val stageName = stage.getClass.getName
val targetName = s"${source}+${stageName}"
don't need brackets
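i.e. the braces can be dropped for simple identifiers:

  val targetName = s"$source+$stageName"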
 *
 * {{{
 *   override def shortName(): String =
 *     "pmml+org.apache.spark.ml.regression.LinearRegressionModel"
what about making a second abstract field def stageName(): String, instead of having it packed into one string?
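A sketch of the suggested split, keeping the existing shortName() for the format alias and adding a separate stageName(); the exact member names are just for illustration:

  trait MLFormatRegister {
    /** Short format alias, e.g. "pmml". */
    def shortName(): String
    /** Fully qualified class name of the stage this format handles,
     *  e.g. "org.apache.spark.ml.regression.LinearRegressionModel". */
    def stageName(): String
  }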
trait MLFormatRegister {
  /**
   * The string that represents the format that this data source provider uses. This is
   * overridden by children to provide a nice alias for the data source. For example:
"data source" -> "model format"?
/**
 * Overwrites if the output path already exists.
 * Specifies the format of ML export (e.g. PMML, internal, or
change to e.g. "pmml", "internal", or the fully qualified class name for export).
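For context, the three ways a caller could then specify the export format (the paths and the third-party class name are illustrative):

  model.write.format("pmml").save("/tmp/lr-pmml")          // registered short name
  model.write.format("internal").save("/tmp/lr-internal")  // Spark's own format
  model.write.format("com.example.MyWriterFormat").save("/tmp/lr-custom")  // fully qualified class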
@InterfaceStability.Evolving
trait MLWriterFormat {
  /**
   * Function write the provided pipeline stage out.
Should add a full doc here with param annotations. Also should it be "Function to write ..."?
/**
 * A ML Writer which delegates based on the requested format.
 */
class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging {
need @Since("2.3.0") here?
}

// override for Java compatibility
override def session(sparkSession: SparkSession): this.type = super.session(sparkSession)
since tags here
 * @since 2.3.0
 */
@InterfaceStability.Evolving
trait MLWriterFormat {
do we need the actual since annotations here, though?
  }
}

test("dummy export format is called") {
We can also add tests for the MLFormatRegister similar to DDLSourceLoadSuite. Just add a META-INF/services/ directory to src/test/resources/
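Concretely, that registration is just a plain-text ServiceLoader descriptor; assuming MLFormatRegister lives in org.apache.spark.ml.util and the dummy register class is named as below (both are illustrative), the test resource would be a file at

  src/test/resources/META-INF/services/org.apache.spark.ml.util.MLFormatRegister

containing one fully qualified implementation class per line:

  org.apache.spark.ml.util.MyDummyFormatRegister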
So to be clear this doesn't handle the
Overall I like the idea of an open API for plugging in model serialization formats (as I've commented on the previous PRs etc). This might be a bit too much to put in for 2.3 though?
…could write a multi-format export class I suppose)
@MLnick I think the read path could follow a similar approach, but for models which we export from Spark and wish to load back into Spark, the internal format is probably the best option. As far as options, this keeps the same stringly typed options interface as we have with Datasources, and individual writers are free to specialize and add their own special methods. I'm open to the idea of punting this, but we've punted PMML export since we introduced ML pipelines and users are still asking for general export support, so the feature request hasn't gone away either. If we think the design needs another pass that's fine, but I'd like to suggest that this is probably a good starting block on top of which we can add more things. If folks agree the foundation isn't bad I think putting it in for 2.3 would be useful (and it's not like we haven't had a long time to consider this design). This specific PR has been open since Dec 4th, but we've been looking at variants on this idea for years.
Test build #86226 has finished for PR 19876 at commit
Test build #86228 has finished for PR 19876 at commit
Test build #86229 has finished for PR 19876 at commit
Test build #86230 has finished for PR 19876 at commit
re-ping @MLnick + ping @jkbradley, thoughts?

also maybe @dbtsai?

re-ping folks?
So now that it looks like 2.3 is pretty much wrapped up, do folks have any thoughts? @MLnick @jkbradley @sethah ?

So if no one has comments by March 15th, I'm going to update the tags & push forward with this API, since we need something and this is the design most folks seem to be interested in from the last proposal.
March 15th is soon; any thoughts, @MLnick @jkbradley @sethah ?
Will do a last pass myself and merge on Friday if no one else has opinions.
See some related thoughts on how to support Spark in kubeflow: kubeflow/spark-operator#119
Test build #88534 has finished for PR 19876 at commit
 *
 * Must have a valid zero argument constructor which will be called to instantiate.
 *
 * @since 2.3.0
Need to update since annotations to 2.4.0
 * ML export formats for should implement this trait so that users can specify a shortname rather
 * than the fully qualified class name of the exporter.
 *
 * A new instance of this class will be instantiated each time a save call is made.
Add a comment about zero arg constructor requirement
done
LGTM pending Jenkins; will merge.
Test build #88546 has finished for PR 19876 at commit
/** A writer for LinearRegression that handles the "pmml" format */
private class PMMLLinearRegressionModelWriter
    extends MLWriterFormat with MLFormatRegister {
Should be two space indentation:
  extends MLWriterFormat with MLFormatRegister {
Thanks for pointing this out, I'll fix it in a follow-up.
I've included this in #20907
Thanks!
What changes were proposed in this pull request?
Adds PMML export support to Spark ML pipelines in the style of Spark's DataSource API to allow library authors to add their own model export formats.
Includes a specific implementation for Spark ML linear regression PMML export.
In addition to adding PMML to reach parity with our current MLlib implementation, this approach will allow other libraries & formats (like PFA) to implement and export models with a unified API.
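As a rough usage sketch of the resulting API (the training DataFrame and output paths here are placeholders):

  import org.apache.spark.ml.regression.LinearRegression

  val lr = new LinearRegression()
  val model = lr.fit(training)  // training: DataFrame with "features"/"label" columns (placeholder)

  // Existing internal format, unchanged:
  model.write.save("/tmp/lr-internal")

  // New: delegate to a registered export format, here PMML for LinearRegressionModel:
  model.write.format("pmml").save("/tmp/lr-pmml")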
How was this patch tested?
Basic unit test.