
[SPARK-20601][PYTHON][ML] Python API Changes for Constrained Logistic Regression Params #17922

Closed

zero323 wants to merge 4 commits into apache:master from zero323:SPARK-20601


Conversation

zero323 (Member) commented May 9, 2017

What changes were proposed in this pull request?

  • Add new Params to pyspark.ml.classification.LogisticRegression.
  • Add toMatrix method to pyspark.ml.param.TypeConverters.
  • Add generate_multinomial_logistic_input helper to pyspark.ml.tests.

How was this patch tested?

Unit tests
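
For illustration, here is a minimal usage sketch of the proposed params, assuming they land with the same names as in the Scala API (lowerBoundsOnCoefficients / upperBoundsOnCoefficients); the data and bound values below are made up for the example:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Matrices, Vectors

spark = SparkSession.builder.getOrCreate()

# Toy binomial dataset: a label column and a 3-dimensional feature vector.
df = spark.createDataFrame(
    [(1.0, Vectors.dense(1.0, 0.0, 3.0)),
     (0.0, Vectors.dense(2.0, 1.0, 1.0)),
     (1.0, Vectors.dense(0.0, 2.0, 4.0)),
     (0.0, Vectors.dense(3.0, 0.5, 0.0))],
    ["label", "features"])

# For binomial regression the bound matrix has shape (1, number of features).
lower = Matrices.dense(1, 3, [0.0, 0.0, 0.0])  # constrain coefficients to be non-negative

lr = LogisticRegression(maxIter=10, lowerBoundsOnCoefficients=lower)
model = lr.fit(df)
print(model.coefficients)  # every entry should respect the lower bound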

zero323 changed the title from [SPARK-2060][PYTHON][ML] Python API Changes for Constrained Logistic Regression Params to [SPARK-20601][PYTHON][ML] Python API Changes for Constrained Logistic Regression Params on May 9, 2017
SparkQA commented May 9, 2017

Test build #76671 has finished for PR 17922 at commit 1c0cb74.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 9, 2017

Test build #76672 has finished for PR 17922 at commit d4eeb0f.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 9, 2017

Test build #76674 has finished for PR 17922 at commit 550e165.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Since we're voting on 2.2 now, I presume this will make it for 2.3.

zero323 (Member Author)

Probably. I've seen that the Scala version has been targeted for 2.2.1, so who knows? But let's aim for 2.3.

Contributor

"Assign class", though IMO you could also just do away with the comments in this section.

SparkQA commented May 9, 2017

Test build #76687 has finished for PR 17922 at commit 57faa5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 10, 2017

Test build #76756 has finished for PR 17922 at commit aa219c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

yanboliang (Contributor) left a comment

@zero323 Could you resolve the merge conflict, then I can review this? Thanks.

Contributor

Usually there's no need to write exactly the same test as Scala in PySpark; we can use a simple test that loads a dataset or generates a very simple one and runs constrained LR on it. You can refer to the test cases in test.py or other tests like this.

zero323 (Member Author) commented Jun 3, 2017

Sure @yanboliang. Give me a sec.

SparkQA commented Jun 3, 2017

Test build #77705 has finished for PR 17922 at commit 649bf28.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@staticmethod
def toMatrix(value):
"""
Contributor

ML -> MLlib; MLlib is the only official name for both the spark.mllib and spark.ml packages.

zero323 (Member Author)

While I am aware of this, the distinction between ml.linalg and mllib.linalg is a common source of confusion for PySpark users. Of course, we could be more forgiving and automatically convert objects to the required class.

Contributor

This is not a big issue, but you can still refer to the clarification in the MLlib user guide to see the naming convention used in MLlib.
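
To illustrate the confusion mentioned above: the bound matrices have to come from pyspark.ml.linalg (the DataFrame-based API), not the older pyspark.mllib.linalg package; the two Matrix classes are distinct types, so a type check like toMatrix would reject the latter. A small sketch (the bound values are arbitrary):

from pyspark.ml.linalg import Matrices       # correct package for spark.ml estimators
# from pyspark.mllib.linalg import Matrices  # RDD-based API; its Matrix would fail an ml type check

# A (1, 3) lower-bound matrix for a binomial model with three features.
lower_bounds = Matrices.dense(1, 3, [-1.0, -1.0, -1.0])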

LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5]
)

def test_binomial_logistic_regression_bounds(self):
Contributor

@zero323 We usually run PySpark MLlib tests by loading a dataset from data/mllib/ or manually generating a dummy/hard-coded dataset, rather than rewriting the same test case as Scala. We keep PySpark tests as simple as possible. You can refer to this test case. Thanks.

zero323 (Member Author)

Example datasets are not that good for checking constraints, and a generator seems like a better idea than creating a large enough example by hand. I can of course remove it if this is an issue.
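
For context, a rough sketch of what a generator along these lines might look like; the function name comes from the PR description, but the signature and body here are illustrative rather than the code in this PR:

import numpy as np
from pyspark.ml.linalg import Vectors

def generate_multinomial_logistic_input(spark, coefficients, n_features, n_classes,
                                        n_points, seed=42):
    """Sample (label, features) rows from a multinomial logistic model defined by
    `coefficients` (shape (n_classes, n_features)), so the fitted model has a known
    structure that bound constraints can be checked against."""
    rng = np.random.RandomState(seed)
    x = rng.standard_normal((n_points, n_features))
    margins = x.dot(np.asarray(coefficients).T)                # (n_points, n_classes)
    probs = np.exp(margins - margins.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    labels = [float(rng.choice(n_classes, p=p)) for p in probs]
    rows = [(label, Vectors.dense(row)) for label, row in zip(labels, x)]
    return spark.createDataFrame(rows, ["label", "features"])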

yanboliang (Contributor) commented Jun 23, 2017

For PySpark, we should only check that the output is consistent with Scala. The most straightforward way for this test is to load data directly and run constrained LR on it:

data_path = "data/mllib/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)
......

This will make the test case simple and time-saving. Thanks.
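
To make the suggestion concrete, here is a minimal sketch of what such a simplified test could look like (the method body, bound values, and assertion are illustrative, not the test that was eventually merged; it assumes a SparkSessionTestCase-style base class providing self.spark, plus LogisticRegression and Matrices imported from pyspark.ml):

def test_multinomial_logistic_regression_bounds(self):
    # The bundled dataset has 3 classes and 4 features.
    data_path = "data/mllib/sample_multiclass_classification_data.txt"
    df = self.spark.read.format("libsvm").load(data_path)

    # For multinomial regression the bound matrix has shape
    # (number of classes, number of features).
    lower = Matrices.dense(3, 4, [0.0] * 12)
    lr = LogisticRegression(family="multinomial", maxIter=5,
                            lowerBoundsOnCoefficients=lower)
    model = lr.fit(df)

    # On the Python side it is enough to check that the bounds were passed
    # through and respected by the fitted coefficients.
    self.assertTrue((model.coefficientMatrix.toArray() >= 0.0).all())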

Member

I agree this is probably overkill for testing this. The functionality is already in Scala and should be tested there; here in Python we are just setting the parameters.

BryanCutler (Member) left a comment

Thanks for doing this @zero323! I commented on some minor style issues, and the tests should be simplified a bit. Otherwise looks fine.

"(1, number of features) for binomial regression, or "
"(number of classes, number of features) "
"for multinomial regression.",
typeConverter=TypeConverters.toMatrix)
Member

I think you can condense this and the above text blocks some more
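
For reference, the general shape of such a Param declaration in pyspark.ml.param (it assumes the toMatrix converter added in this PR); the doc wording below is one possible condensed version, not the exact text under review:

from pyspark.ml.param import Param, Params, TypeConverters

lowerBoundsOnCoefficients = Param(
    Params._dummy(), "lowerBoundsOnCoefficients",
    "The lower bounds on coefficients if fitting under bound constrained optimization. "
    "The bound matrix must be compatible with shape (1, number of features) for binomial "
    "regression, or (number of classes, number of features) for multinomial regression.",
    typeConverter=TypeConverters.toMatrix)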

rawPredictionCol="rawPrediction", standardization=True, weightCol=None,
aggregationDepth=2, family="auto"):
aggregationDepth=2, family="auto",
lowerBoundsOnCoefficients=None, upperBoundsOnCoefficients=None,
Member

should fill up the previous line before starting another, here and below

"""
Convert a value to ML Matrix, if possible
"""
if isinstance(value, Matrix):
Member

Is this method really necessary? It's not actually converting anything, just checking the type. If this wasn't here and the user put something other than a Matrix, what error would be raised?
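
For reference, a converter of this shape usually just validates the type and raises a TypeError with a clear message; without it, an unsupported value would only fail later, deep in the Py4J call when the param is sent to the JVM. A minimal sketch of how the body could look (not necessarily the exact code under review):

from pyspark.ml.linalg import Matrix

class TypeConverters(object):
    # ... existing converters (toFloat, toVector, ...) elided ...

    @staticmethod
    def toMatrix(value):
        """
        Convert a value to an MLlib Matrix, if possible.
        """
        if isinstance(value, Matrix):
            return value
        raise TypeError("Could not convert %s to matrix" % value)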


zero323 (Member Author) commented Jul 13, 2017

@BryanCutler @yanboliang @nchammas Thanks for all the comments. Unfortunately I don't have access to hardware I can use for development at the moment, and most likely I won't in the upcoming weeks. I'm going to close this PR, but I'd really appreciate it if one of you could pick it up from here. TIA

zero323 closed this Jul 13, 2017
ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 2, 2017
## What changes were proposed in this pull request?
Python API for Constrained Logistic Regression based on apache#17922, thanks for the original contribution from zero323.

## How was this patch tested?
Unit tests.

Author: zero323 <zero323@users.noreply.github.com>
Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#18759 from yanboliang/SPARK-20601.
zero323 deleted the SPARK-20601 branch February 2, 2020 17:45