
[SPARK-20601][PYTHON][ML] Python API Changes for Constrained Logistic Regression Params #17922

Closed

zero323 wants to merge 4 commits into apache:master from zero323:SPARK-20601


Conversation

zero323 (Member) commented May 9, 2017

What changes were proposed in this pull request?

  • Add new Params to pyspark.ml.classification.LogisticRegression.
  • Add toMatrix method to pyspark.ml.param.TypeConverters.
  • Add generate_multinomial_logistic_input helper to pyspark.ml.tests.

How was this patch tested?

Unit tests
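
For illustration, here is a minimal usage sketch of the proposed params, assuming they land with the same names as in the Scala API (lowerBoundsOnCoefficients / upperBoundsOnCoefficients); the data and bound values below are made up for the example:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Matrices, Vectors

spark = SparkSession.builder.getOrCreate()

# Toy binomial dataset: a label column and a 3-dimensional feature vector.
df = spark.createDataFrame(
    [(1.0, Vectors.dense(1.0, 0.0, 3.0)),
     (0.0, Vectors.dense(2.0, 1.0, 1.0)),
     (1.0, Vectors.dense(0.0, 2.0, 4.0)),
     (0.0, Vectors.dense(3.0, 0.5, 0.0))],
    ["label", "features"])

# For binomial regression the bound matrix has shape (1, number of features).
lower = Matrices.dense(1, 3, [0.0, 0.0, 0.0])  # constrain coefficients to be non-negative

lr = LogisticRegression(maxIter=10, lowerBoundsOnCoefficients=lower)
model = lr.fit(df)
print(model.coefficients)  # every entry should respect the lower bound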

zero323 changed the title from [SPARK-2060][PYTHON][ML] Python API Changes for Constrained Logistic Regression Params to [SPARK-20601][PYTHON][ML] Python API Changes for Constrained Logistic Regression Params on May 9, 2017
SparkQA commented May 9, 2017

Test build #76671 has finished for PR 17922 at commit 1c0cb74.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 9, 2017

Test build #76672 has finished for PR 17922 at commit d4eeb0f.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 9, 2017

Test build #76674 has finished for PR 17922 at commit 550e165.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Since we're voting on 2.2 now, I presume this will make it for 2.3.

zero323 (Member Author)

Probably. I've seen that the Scala version has been targeted for 2.2.1, so who knows? But let's aim for 2.3.

Contributor

"Assign class", though IMO you could also just do away with the comments in this section.

SparkQA commented May 9, 2017

Test build #76687 has finished for PR 17922 at commit 57faa5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 10, 2017

Test build #76756 has finished for PR 17922 at commit aa219c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

yanboliang (Contributor) left a comment

@zero323 Could you resolve the merge conflict, then I can review this? Thanks.

Contributor

Usually there's no need to write exactly the same test as Scala in PySpark; we can use a simple test that loads a dataset or generates a very simple one and runs constrained LR on it. You can refer to the test cases in test.py or other tests like this.

zero323 (Member Author) commented Jun 3, 2017

Sure @yanboliang. Give me a sec.

SparkQA commented Jun 3, 2017

Test build #77705 has finished for PR 17922 at commit 649bf28.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@staticmethod
def toMatrix(value):
"""
Contributor

ML -> MLlib; MLlib is the only official name for both the spark.mllib and spark.ml packages.

zero323 (Member Author)

While I am aware of this, the distinction between ml.linalg and mllib.linalg is a common source of confusion for PySpark users. Of course, we could be more forgiving and automatically convert objects to the required class.

Contributor

This is not a big issue, but you can still refer to the clarification in the MLlib user guide to see the naming convention used in MLlib.
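
To illustrate the confusion mentioned above: the bound matrices have to come from pyspark.ml.linalg (the DataFrame-based API), not the older pyspark.mllib.linalg package; the two Matrix classes are distinct types, so a type check like toMatrix would reject the latter. A small sketch (the bound values are arbitrary):

from pyspark.ml.linalg import Matrices       # correct package for spark.ml estimators
# from pyspark.mllib.linalg import Matrices  # RDD-based API; its Matrix would fail an ml type check

# A (1, 3) lower-bound matrix for a binomial model with three features.
lower_bounds = Matrices.dense(1, 3, [-1.0, -1.0, -1.0])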

LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5]
)

def test_binomial_logistic_regression_bounds(self):
Contributor

@zero323 We usually run PySpark MLlib tests by loading a dataset from data/mllib/ or manually generating a dummy/hard-coded dataset, rather than rewriting the same test case as Scala. We keep PySpark tests as simple as possible. You can refer to this test case. Thanks.

zero323 (Member Author)

Example datasets are not that good for checking constraints, and a generator seems like a better idea than creating a large enough example by hand. I can of course remove it if this is an issue.
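
For context, a rough sketch of what a generator along these lines might look like; the function name comes from the PR description, but the signature and body here are illustrative rather than the code in this PR:

import numpy as np
from pyspark.ml.linalg import Vectors

def generate_multinomial_logistic_input(spark, coefficients, n_features, n_classes,
                                        n_points, seed=42):
    """Sample (label, features) rows from a multinomial logistic model defined by
    `coefficients` (shape (n_classes, n_features)), so the fitted model has a known
    structure that bound constraints can be checked against."""
    rng = np.random.RandomState(seed)
    x = rng.standard_normal((n_points, n_features))
    margins = x.dot(np.asarray(coefficients).T)                # (n_points, n_classes)
    probs = np.exp(margins - margins.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    labels = [float(rng.choice(n_classes, p=p)) for p in probs]
    rows = [(label, Vectors.dense(row)) for label, row in zip(labels, x)]
    return spark.createDataFrame(rows, ["label", "features"])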

yanboliang (Contributor) commented Jun 23, 2017

For PySpark, we should only check that the output is consistent with Scala. The most straightforward way for this test is to load data directly and run constrained LR on it:

data_path = "data/mllib/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)
......

This will make the test case simple and time-saving. Thanks.
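
To make the suggestion concrete, here is a minimal sketch of what such a simplified test could look like (the method body, bound values, and assertion are illustrative, not the test that was eventually merged; it assumes a SparkSessionTestCase-style base class providing self.spark, plus LogisticRegression and Matrices imported from pyspark.ml):

def test_multinomial_logistic_regression_bounds(self):
    # The bundled dataset has 3 classes and 4 features.
    data_path = "data/mllib/sample_multiclass_classification_data.txt"
    df = self.spark.read.format("libsvm").load(data_path)

    # For multinomial regression the bound matrix has shape
    # (number of classes, number of features).
    lower = Matrices.dense(3, 4, [0.0] * 12)
    lr = LogisticRegression(family="multinomial", maxIter=5,
                            lowerBoundsOnCoefficients=lower)
    model = lr.fit(df)

    # On the Python side it is enough to check that the bounds were passed
    # through and respected by the fitted coefficients.
    self.assertTrue((model.coefficientMatrix.toArray() >= 0.0).all())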

Member

I agree this is probably overkill for testing this. The functionality is already in Scala and should be tested there; here in Python we are just setting the parameters.

BryanCutler (Member) left a comment

Thanks for doing this @zero323! I commented on some minor style issues, and the tests should be simplified a bit. Otherwise looks fine.

"(1, number of features) for binomial regression, or "
"(number of classes, number of features) "
"for multinomial regression.",
typeConverter=TypeConverters.toMatrix)
Member

I think you can condense this and the above text blocks some more
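
For reference, the general shape of such a Param declaration in pyspark.ml.param (it assumes the toMatrix converter added in this PR); the doc wording below is one possible condensed version, not the exact text under review:

from pyspark.ml.param import Param, Params, TypeConverters

lowerBoundsOnCoefficients = Param(
    Params._dummy(), "lowerBoundsOnCoefficients",
    "The lower bounds on coefficients if fitting under bound constrained optimization. "
    "The bound matrix must be compatible with shape (1, number of features) for binomial "
    "regression, or (number of classes, number of features) for multinomial regression.",
    typeConverter=TypeConverters.toMatrix)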

rawPredictionCol="rawPrediction", standardization=True, weightCol=None,
aggregationDepth=2, family="auto"):
aggregationDepth=2, family="auto",
lowerBoundsOnCoefficients=None, upperBoundsOnCoefficients=None,
Member

should fill up the previous line before starting another, here and below

"""
Convert a value to ML Matrix, if possible
"""
if isinstance(value, Matrix):
Member

Is this method really necessary? It's not actually converting anything, just checking the type. If this wasn't here and the user put something other than a Matrix, what error would be raised?
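
For reference, a converter of this shape usually just validates the type and raises a TypeError with a clear message; without it, an unsupported value would only fail later, deep in the Py4J call when the param is sent to the JVM. A minimal sketch of how the body could look (not necessarily the exact code under review):

from pyspark.ml.linalg import Matrix

class TypeConverters(object):
    # ... existing converters (toFloat, toVector, ...) elided ...

    @staticmethod
    def toMatrix(value):
        """
        Convert a value to an MLlib Matrix, if possible.
        """
        if isinstance(value, Matrix):
            return value
        raise TypeError("Could not convert %s to matrix" % value)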


zero323 (Member Author) commented Jul 13, 2017

@BryanCutler @yanboliang @nchammas Thanks for all the comments. Unfortunately I don't have access to hardware I can use for development at the moment, and most likely I won't in the upcoming weeks. I'm going to close this PR, but I'd really appreciate it if one of you could pick it up from here. TIA

zero323 closed this Jul 13, 2017
ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 2, 2017
## What changes were proposed in this pull request?
Python API for Constrained Logistic Regression based on apache#17922, thanks for the original contribution from zero323.

## How was this patch tested?
Unit tests.

Author: zero323 <zero323@users.noreply.github.com>
Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#18759 from yanboliang/SPARK-20601.
zero323 deleted the SPARK-20601 branch February 2, 2020 17:45