[SPARK-20601][PYTHON][ML] Python API Changes for Constrained Logistic Regression Params#17922
zero323 wants to merge 4 commits into apache:master
Conversation
Test build #76671 has finished for PR 17922 at commit
Test build #76672 has finished for PR 17922 at commit
Test build #76674 has finished for PR 17922 at commit
python/pyspark/ml/classification.py
Since we're voting on 2.2 now, I presume this will make it for 2.3.
Probably. I've seen that the Scala version has been targeted for 2.2.1, so who knows? But let's aim for 2.3.
python/pyspark/ml/tests.py
"Assign class", though IMO you could also just do away with the comments in this section.
Test build #76687 has finished for PR 17922 at commit
Test build #76756 has finished for PR 17922 at commit
yanboliang left a comment:
@zero323 Could you resolve the merge conflict, then I can review this? Thanks.
python/pyspark/ml/tests.py
Sure @yanboliang. Give me a sec.
Test build #77705 has finished for PR 17922 at commit
@staticmethod
def toMatrix(value):
    """
ML -> MLlib; MLlib is the only official name for both the spark.mllib and spark.ml packages.
While I am aware of this, the distinction between ml.linalg and mllib.linalg is a common source of confusion for PySpark users. Of course we could be more forgiving and automatically convert objects to the required class.
This is not a big issue, but you can still refer to the clarification in the MLlib user guide to get the convention in MLlib.
LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5]
)

def test_binomial_logistic_regression_bounds(self):
@zero323 We usually run PySpark MLlib tests by loading a dataset from data/mllib/ or by manually generating a dummy/hard-coded dataset, rather than rewriting the same test case as in Scala. We keep PySpark tests as simple as possible. You can refer to this test case. Thanks.
Example datasets are not that good for checking constraints, and a generator seems like a better idea than creating a large enough example by hand. I can of course remove it, if this is an issue.
For PySpark, we should only check that the output is consistent with Scala. The most straightforward way to write this test would be to load the data directly and run constrained LR on it:
data_path = "data/mllib/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)
......
This will make the test case simple and time-saving. Thanks.
I agree this is probably overkill for testing this. The functionality is already in Scala and should be tested there; here in Python we are just setting the parameters.
BryanCutler left a comment:
Thanks for doing this @zero323! I commented on some minor style issues, and the tests should be simplified a bit. Otherwise looks fine.
"(1, number of features) for binomial regression, or "
"(number of classes, number of features) "
"for multinomial regression.",
typeConverter=TypeConverters.toMatrix)
I think you can condense this and the above text blocks some more
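As an aside, the shape rule described in the doc string above can be made concrete with a small sketch (plain Python; the function name here is hypothetical and not part of the PR): bounds on coefficients must have shape (1, number of features) for binomial regression and (number of classes, number of features) for multinomial regression.

```python
def expected_bounds_shape(num_classes, num_features):
    """Required shape for lowerBoundsOnCoefficients /
    upperBoundsOnCoefficients, per the param doc above.
    (Illustrative helper, not part of the actual PR.)"""
    if num_classes <= 2:
        # binomial: a single row of per-feature bounds
        return (1, num_features)
    # multinomial: one row of per-feature bounds per class
    return (num_classes, num_features)
```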
rawPredictionCol="rawPrediction", standardization=True, weightCol=None,
aggregationDepth=2, family="auto"):
aggregationDepth=2, family="auto",
lowerBoundsOnCoefficients=None, upperBoundsOnCoefficients=None,
You should fill up the previous line before starting another, here and below.
"""
Convert a value to ML Matrix, if possible
"""
if isinstance(value, Matrix):
Is this method really necessary? It's not actually converting anything, just checking the type. If this wasn't here and the user put something other than a Matrix, what error would be raised?
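For context, a minimal sketch of what such a "converter" amounts to (using a stand-in Matrix class here rather than pyspark.ml.linalg.Matrix, so this is an illustration, not the actual Spark code): if the value is not already a Matrix, a TypeError is raised, so the method acts as a type guard rather than a true converter.

```python
# Stand-in for pyspark.ml.linalg.Matrix; illustration only.
class Matrix(object):
    pass


class DenseMatrix(Matrix):
    def __init__(self, numRows, numCols, values):
        self.numRows, self.numCols, self.values = numRows, numCols, values


def toMatrix(value):
    """
    Convert a value to an ML Matrix, if possible.
    """
    if isinstance(value, Matrix):
        # already a Matrix: pass it through unchanged
        return value
    # anything else is rejected outright
    raise TypeError("Could not convert %s to matrix" % str(value))
```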
@BryanCutler @yanboliang @nchammas Thanks for all the comments. Unfortunately I don't have access to hardware I can use for development at the moment, and most likely I won't in the upcoming weeks. I'm going to close this PR, but I'd really appreciate it if one of you could pick it up from here. TIA
## What changes were proposed in this pull request?

Python API for Constrained Logistic Regression based on apache#17922; thanks for the original contribution from zero323.

## How was this patch tested?

Unit tests.

Author: zero323 &lt;zero323@users.noreply.github.com&gt;
Author: Yanbo Liang &lt;ybliang8@gmail.com&gt;

Closes apache#18759 from yanboliang/SPARK-20601.
What changes were proposed in this pull request?

- Constrained logistic regression Params added to pyspark.ml.classification.LogisticRegression
- toMatrix method added to pyspark.ml.param.TypeConverters
- generate_multinomial_logistic_input helper added to pyspark.ml.tests

How was this patch tested?

Unit tests
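For reference, the generate_multinomial_logistic_input helper mentioned above produces a dummy multinomial dataset for the tests. A rough self-contained sketch of that idea follows (plain Python with stdlib only; the signature and details here are hypothetical — the actual helper in pyspark.ml.tests works with NumPy arrays and Spark vectors):

```python
import math
import random


def generate_multinomial_logistic_input(weights, num_features, num_classes,
                                        num_points, seed=42):
    """Generate (label, features) rows from a multinomial logistic model.

    weights: flat list of length num_classes * num_features (hypothetical
    layout; the real helper also supports intercepts).
    """
    rnd = random.Random(seed)
    rows = []
    for _ in range(num_points):
        # random feature vector
        x = [rnd.gauss(0.0, 1.0) for _ in range(num_features)]
        # un-normalized per-class scores
        margins = [
            sum(weights[c * num_features + j] * x[j] for j in range(num_features))
            for c in range(num_classes)
        ]
        # softmax probabilities (shifted by the max for numerical stability)
        mx = max(margins)
        exps = [math.exp(m - mx) for m in margins]
        total = sum(exps)
        probs = [e / total for e in exps]
        # sample a label from the class distribution
        u, label, acc = rnd.random(), num_classes - 1, 0.0
        for c, p in enumerate(probs):
            acc += p
            if u <= acc:
                label = c
                break
        rows.append((label, x))
    return rows
```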