
[Spark-17025][ML][Python] Persistence for Pipelines with Python-only Stages#18888

Closed
ajaysaini725 wants to merge 11 commits into apache:master from ajaysaini725:PythonPipelines

Conversation

@ajaysaini725
Contributor

What changes were proposed in this pull request?

Implemented a Python-only persistence framework for pipelines containing stages that cannot be saved using Java.

How was this patch tested?

Created a custom Python-only UnaryTransformer, included it in a Pipeline, and saved/loaded the pipeline. The loaded pipeline was compared against the original using _compare_pipelines() in tests.py.

@ajaysaini725 ajaysaini725 changed the title [Spark-17025][ML][Python] Persistence for Custom Python-only Pipelines [Spark-17025][ML][Python] Persistence for Pipelines with Python-only Stages Aug 8, 2017
@ajaysaini725
Contributor Author

@jkbradley @MrBago @WeichenXu123 Can you please review this?

@SparkQA

SparkQA commented Aug 8, 2017

Test build #80421 has finished for PR 18888 at commit 85a98d6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 9, 2017

Test build #80426 has finished for PR 18888 at commit ba4402c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 9, 2017

Test build #80427 has finished for PR 18888 at commit 22ebe3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@WeichenXu123 left a comment

The PR looks good overall, I think, apart from some minor problems.

In the _to_java method we could add user-friendly exception handling: when a custom Python stage is encountered, throw an exception with a detailed description.
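A minimal sketch of that suggestion, using stand-in classes rather than the real pyspark.ml types (JavaStage and pipeline_to_java are hypothetical names for illustration):

```python
class JavaStage:
    """Stand-in for a stage that has a Java counterpart."""
    def _to_java(self):
        return "<java object>"

def pipeline_to_java(stages):
    # Fail fast with a descriptive message instead of a cryptic Py4J error
    # when a Python-only stage reaches the Java save path.
    for stage in stages:
        if not hasattr(stage, "_to_java"):
            raise ValueError(
                "Pipeline cannot be saved with the Java writer: stage %r "
                "is Python-only. Use the Python persistence path instead."
                % stage)
    return [stage._to_java() for stage in stages]
```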

    if not isinstance(stage, JavaMLWritable):
        allStagesAreJava = False
if allStagesAreJava:
    return JavaMLWriter(self)
Contributor

I see similar logic twice; can you move it into a util function?

stageUids = [stage.uid for stage in stages]
jsonParams = {'stageUids': stageUids, 'savedAsPython': True}
DefaultParamsWriter.saveMetadata(instance, path, sc, paramMap=jsonParams)
stagesDir = os.path.join(path, "stages")
Contributor

Here os.path.join is used to generate the full path, which may be risky because it depends on the local OS path format.
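One way around the OS dependence the reviewer alludes to is posixpath.join, which always emits forward slashes, matching what Hadoop-style paths expect (a sketch only; the PR itself kept os.path.join):

```python
import posixpath

def stage_dir(path):
    # "/" separator regardless of whether the driver runs on Windows or Unix
    return posixpath.join(path, "stages")
```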

Contributor

@jkbradley, what's the right way to handle Paths in pyspark? Scala has org.apache.hadoop.fs.Path, is there something similar in pyspark?

Member

This is as good as it gets, as far as I know

@SparkQA

SparkQA commented Aug 9, 2017

Test build #80465 has finished for PR 18888 at commit cdcd1cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@MrBago left a comment

This looks good @ajaysaini725, mostly minor comments.

One more major concern I have is with __get_class in DefaultParamsReader. I know it's outside this PR, but it looks a little brittle on first pass. Are we testing this method with different module structures and notebooks?
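For context, a __get_class-style helper typically resolves a fully qualified name with importlib, along these lines (a generic sketch, not the DefaultParamsReader code). Classes defined in a notebook land in the __main__ module, which is exactly where this approach gets brittle:

```python
import importlib

def get_class(clazz):
    """Resolve a fully qualified name like 'pkg.module.ClassName'."""
    parts = clazz.split(".")
    module_name, class_name = ".".join(parts[:-1]), parts[-1]
    # Raises ImportError / AttributeError if the module or class is missing,
    # e.g. for notebook-defined classes whose module cannot be re-imported.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```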

from pyspark.ml.base import Estimator, Model, Transformer
from pyspark.ml.param import Param, Params
from pyspark.ml.util import JavaMLWriter, JavaMLReader, MLReadable, MLWritable
from pyspark.ml.util import *
Contributor

can we do import pyspark.ml.util as mlutil?

Member

@jkbradley Aug 11, 2017

I'm OK either way, though mlutil would be cleaner

Comment thread python/pyspark/ml/pipeline.py Outdated
stages = self.getStages()
for stage in stages:
    if not isinstance(stage, JavaMLWritable):
        allStagesAreJava = False
Contributor

How about allStagesAreJava = all(isinstance(stage, JavaMLWritable) for stage in self.getStages())?

    return (metadata['uid'], stages)

@staticmethod
def getStagePath(stageUid, stageIdx, numStages, stagesDir):
Contributor

stageIdx isn't used by this method, is that intentional?

Contributor Author

It should be used. Fixed this. Thanks!
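After the fix, a getStagePath that uses the index might look like the sketch below, following the stages/IDX_UID layout described elsewhere in this PR, with the index zero-padded to the width of numStages as in the Scala writer (a sketch of the convention, not the exact merged code):

```python
import os

def get_stage_path(stage_uid, stage_idx, num_stages, stages_dir):
    # Pad the index so directory names sort correctly, e.g. "02_lr_abc"
    # when there are 10-99 stages.
    idx_digits = len(str(num_stages))
    stage_dir = str(stage_idx).zfill(idx_digits) + "_" + stage_uid
    return os.path.join(stages_dir, stage_dir)
```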

    self._compare_pipelines(model, loaded_model)
finally:
    try:
        rmtree(temp_path)
Contributor

Why do we need this in a try block? I worry about silencing errors in tests because it's a good way to miss issues.

Contributor Author

This is the same pattern that exists in all other tests so I just followed it for this one.

Member

As I recall, it was because we didn't want tests to fail because of cleanup failing. I forget if/when cleanup failures were causing a problem...

@SparkQA

SparkQA commented Aug 11, 2017

Test build #80513 has finished for PR 18888 at commit cf1a08d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment thread python/pyspark/ml/pipeline.py Outdated
"""
for stage in stages:
if not isinstance(stage, MLWritable):
raise ValueError("Pipeline write will fail on this pipline " +
Member

typo: pipline

Comment thread python/pyspark/ml/pipeline.py Outdated


@inherit_doc
class SharedReadWrite():
Member

Rename to something with "Pipeline" such as "PipelineSharedReadWrite"

Member

Also, note that it is either private or DeveloperApi

Comment thread python/pyspark/ml/pipeline.py Outdated
class SharedReadWrite():
    """
    Functions for :py:class:`MLReader` and :py:class:`MLWriter` shared between
    :py:class:`Pipeline` and :py:class`PipelineModel`
Member

missing colon

Comment thread python/pyspark/ml/pipeline.py Outdated
- save stages to stages/IDX_UID
"""
stageUids = [stage.uid for stage in stages]
jsonParams = {'stageUids': stageUids, 'savedAsPython': True}
Member

It just occurred to me: For future extensibility, it would make sense to change this to something like 'language': 'Python' since there may be something analogous for R or other languages in the future.
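The suggestion amounts to metadata like the sketch below, where a 'language' field replaces the Python-specific boolean so other frontends can record their own value (key and helper names are illustrative):

```python
import json

def build_metadata(stage_uids, language="Python"):
    # "language" generalizes the savedAsPython flag: R or another frontend
    # can store its own value without adding a new boolean per language.
    return json.dumps({"stageUids": stage_uids, "language": language})
```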


Member

@jkbradley left a comment

Done with review. Thanks!

Comment thread python/pyspark/ml/pipeline.py Outdated

@staticmethod
def checkStagesForJava(stages):
    allStagesAreJava = True
Member

Copying @MrBago 's comment here:
How about allStagesAreJava = all(isinstance(stage, JavaMLWritable) for stage in self.getStages())?

@SparkQA

SparkQA commented Aug 11, 2017

Test build #80547 has finished for PR 18888 at commit 18c902c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class PipelineSharedReadWrite():

@jkbradley
Member

LGTM pending tests!

@SparkQA

SparkQA commented Aug 11, 2017

Test build #80548 has finished for PR 18888 at commit 2b63eea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ajaysaini725
Contributor Author

@jkbradley Quick reminder to merge this since the tests have passed!

@asfgit asfgit closed this in 35db3b9 Aug 12, 2017
5 participants