
[Spark-17025][ML][Python] Persistence for Pipelines with Python-only Stages#18888

Closed
ajaysaini725 wants to merge 11 commits into apache:master from ajaysaini725:PythonPipelines

Conversation

@ajaysaini725
Contributor

What changes were proposed in this pull request?

Implemented a Python-only persistence framework for pipelines containing stages that cannot be saved using Java.

How was this patch tested?

Created a custom Python-only UnaryTransformer, included it in a Pipeline, and saved/loaded the pipeline. The loaded pipeline was compared against the original using _compare_pipelines() in tests.py.

@ajaysaini725 ajaysaini725 changed the title [Spark-17025][ML][Python] Persistence for Custom Python-only Pipelines [Spark-17025][ML][Python] Persistence for Pipelines with Python-only Stages Aug 8, 2017
@ajaysaini725
Contributor Author

@jkbradley @MrBago @WeichenXu123 Can you please review this?

@SparkQA

SparkQA commented Aug 8, 2017

Test build #80421 has finished for PR 18888 at commit 85a98d6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 9, 2017

Test build #80426 has finished for PR 18888 at commit ba4402c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 9, 2017

Test build #80427 has finished for PR 18888 at commit 22ebe3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@WeichenXu123 left a comment

The PR looks good overall, I think, apart from some minor problems.

In the _to_java method we could add user-friendly exception handling: when a custom Python stage is encountered, throw an exception with a detailed description.
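A minimal sketch of that suggestion, using stand-in classes rather than the real pyspark.ml types (JavaStage and pipeline_to_java are hypothetical names for illustration):

```python
class JavaStage:
    """Stand-in for a stage that has a Java counterpart."""
    def _to_java(self):
        return "<java object>"

def pipeline_to_java(stages):
    # Fail fast with a descriptive message instead of a cryptic Py4J error
    # when a Python-only stage reaches the Java save path.
    for stage in stages:
        if not hasattr(stage, "_to_java"):
            raise ValueError(
                "Pipeline cannot be saved with the Java writer: stage %r "
                "is Python-only. Use the Python persistence path instead."
                % stage)
    return [stage._to_java() for stage in stages]
```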

    if not isinstance(stage, JavaMLWritable):
        allStagesAreJava = False
if allStagesAreJava:
    return JavaMLWriter(self)
Contributor

I see similar logic twice; can you move it into a util function?

stageUids = [stage.uid for stage in stages]
jsonParams = {'stageUids': stageUids, 'savedAsPython': True}
DefaultParamsWriter.saveMetadata(instance, path, sc, paramMap=jsonParams)
stagesDir = os.path.join(path, "stages")
Contributor

Here os.path.join is used to generate the full path, which may be risky because it depends on the local OS path format.
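One way around the OS dependence the reviewer alludes to is posixpath.join, which always emits forward slashes, matching what Hadoop-style paths expect (a sketch only; the PR itself kept os.path.join):

```python
import posixpath

def stage_dir(path):
    # "/" separator regardless of whether the driver runs on Windows or Unix
    return posixpath.join(path, "stages")
```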

Contributor

@jkbradley, what's the right way to handle Paths in pyspark? Scala has org.apache.hadoop.fs.Path, is there something similar in pyspark?

Member

This is as good as it gets, as far as I know

@SparkQA

SparkQA commented Aug 9, 2017

Test build #80465 has finished for PR 18888 at commit cdcd1cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@MrBago left a comment

This looks good @ajaysaini725, mostly minor comments.

One more major concern I have is with __get_class in DefaultParamsReader. I know it's outside this PR, but it looks a little brittle on first pass. Are we testing this method with different module structures and notebooks?
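For context, a __get_class-style helper typically resolves a fully qualified name with importlib, along these lines (a generic sketch, not the DefaultParamsReader code). Classes defined in a notebook land in the __main__ module, which is exactly where this approach gets brittle:

```python
import importlib

def get_class(clazz):
    """Resolve a fully qualified name like 'pkg.module.ClassName'."""
    parts = clazz.split(".")
    module_name, class_name = ".".join(parts[:-1]), parts[-1]
    # Raises ImportError / AttributeError if the module or class is missing,
    # e.g. for notebook-defined classes whose module cannot be re-imported.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```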

from pyspark.ml.base import Estimator, Model, Transformer
from pyspark.ml.param import Param, Params
from pyspark.ml.util import JavaMLWriter, JavaMLReader, MLReadable, MLWritable
from pyspark.ml.util import *
Contributor

can we do import pyspark.ml.util as mlutil?

Member

@jkbradley Aug 11, 2017

I'm OK either way, though mlutil would be cleaner

Comment thread python/pyspark/ml/pipeline.py Outdated
stages = self.getStages()
for stage in stages:
    if not isinstance(stage, JavaMLWritable):
        allStagesAreJava = False
Contributor

How about allStagesAreJava = all(isinstance(stage, JavaMLWritable) for stage in self.getStages())?

    return (metadata['uid'], stages)

@staticmethod
def getStagePath(stageUid, stageIdx, numStages, stagesDir):
Contributor

stageIdx isn't used by this method, is that intentional?

Contributor Author

It should be used. Fixed this. Thanks!
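After the fix, a getStagePath that uses the index might look like the sketch below, following the stages/IDX_UID layout described elsewhere in this PR, with the index zero-padded to the width of numStages as in the Scala writer (a sketch of the convention, not the exact merged code):

```python
import os

def get_stage_path(stage_uid, stage_idx, num_stages, stages_dir):
    # Pad the index so directory names sort correctly, e.g. "02_lr_abc"
    # when there are 10-99 stages.
    idx_digits = len(str(num_stages))
    stage_dir = str(stage_idx).zfill(idx_digits) + "_" + stage_uid
    return os.path.join(stages_dir, stage_dir)
```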

    self._compare_pipelines(model, loaded_model)
finally:
    try:
        rmtree(temp_path)
Contributor

Why do we need this in a try block? I worry about silencing errors in tests because it's a good way to miss issues.

Contributor Author

This is the same pattern that exists in all other tests so I just followed it for this one.

Member

As I recall, it was because we didn't want tests to fail because of cleanup failing. I forget if/when cleanup failures were causing a problem...

@SparkQA

SparkQA commented Aug 11, 2017

Test build #80513 has finished for PR 18888 at commit cf1a08d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment thread python/pyspark/ml/pipeline.py Outdated
"""
for stage in stages:
if not isinstance(stage, MLWritable):
raise ValueError("Pipeline write will fail on this pipline " +
Member

typo: pipline

Comment thread python/pyspark/ml/pipeline.py Outdated


@inherit_doc
class SharedReadWrite():
Member

Rename to something with "Pipeline" such as "PipelineSharedReadWrite"

Member

Also, note that it is either private or DeveloperApi

Comment thread python/pyspark/ml/pipeline.py Outdated
class SharedReadWrite():
    """
    Functions for :py:class:`MLReader` and :py:class:`MLWriter` shared between
    :py:class:`Pipeline` and :py:class`PipelineModel`
Member

missing colon

Comment thread python/pyspark/ml/pipeline.py Outdated
- save stages to stages/IDX_UID
"""
stageUids = [stage.uid for stage in stages]
jsonParams = {'stageUids': stageUids, 'savedAsPython': True}
Member

It just occurred to me: For future extensibility, it would make sense to change this to something like 'language': 'Python' since there may be something analogous for R or other languages in the future.
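The suggestion amounts to metadata like the sketch below, where a 'language' field replaces the Python-specific boolean so other frontends can record their own value (key and helper names are illustrative):

```python
import json

def build_metadata(stage_uids, language="Python"):
    # "language" generalizes the savedAsPython flag: R or another frontend
    # can store its own value without adding a new boolean per language.
    return json.dumps({"stageUids": stage_uids, "language": language})
```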


Member

@jkbradley left a comment

Done with review. Thanks!

Comment thread python/pyspark/ml/pipeline.py Outdated

@staticmethod
def checkStagesForJava(stages):
    allStagesAreJava = True
Member

Copying @MrBago 's comment here:
How about allStagesAreJava = all(isinstance(stage, JavaMLWritable) for stage in self.getStages())?

@SparkQA

SparkQA commented Aug 11, 2017

Test build #80547 has finished for PR 18888 at commit 18c902c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class PipelineSharedReadWrite():

@jkbradley
Member

LGTM pending tests!

@SparkQA

SparkQA commented Aug 11, 2017

Test build #80548 has finished for PR 18888 at commit 2b63eea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ajaysaini725
Contributor Author

@jkbradley Quick reminder to merge this since the tests have passed!

@asfgit asfgit closed this in 35db3b9 Aug 12, 2017
5 participants