[SPARK-32232][ML][PySpark] Make sure ML has the same default solver values between Scala and Python #29060
Conversation
Test build #125511 has finished for PR 29060 at commit
retest this please
also cc @BryanCutler
srowen left a comment:
Great analysis. Anything we can do to improve consistency probably avoids many more issues down the road. Thanks! LGTM pending tests.
Test build #125515 has finished for PR 29060 at commit
LGTM!
Just so I'm clear, this is a standalone fix that can go into 3.0, but you might make other similar changes?
@srowen Right. I will make other similar changes in separate PRs. |
[SPARK-32232][ML][PySpark] Make sure ML has the same default solver values between Scala and Python
### What changes were proposed in this pull request?
Current problems:
```
import tempfile

from pyspark.ml.classification import (MultilayerPerceptronClassifier,
                                       MultilayerPerceptronClassificationModel)

mlp = MultilayerPerceptronClassifier(layers=[2, 2, 2], seed=123)
model = mlp.fit(df)
path = tempfile.mkdtemp()
model_path = path + "/mlp"
model.save(model_path)
model2 = MultilayerPerceptronClassificationModel.load(model_path)
self.assertEqual(model2.getSolver(), "l-bfgs")  # fails: getSolver() returns 'auto'
model2.transform(df)
# fails with pyspark.sql.utils.IllegalArgumentException:
# MultilayerPerceptronClassifier_dec859ed24ec parameter solver given invalid value auto.
```
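For context, the snippet above is excerpted from a test case (hence self.assertEqual). A minimal setup that makes it runnable, assuming a local SparkSession and an XOR-style dataset of my own construction (not part of the PR), would be:
```
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[1]").getOrCreate()
# Tiny two-feature dataset; any small two-class classification df works
# here, since the bug is in param handling, not in training.
df = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 0.0])),
     (1.0, Vectors.dense([0.0, 1.0])),
     (1.0, Vectors.dense([1.0, 0.0])),
     (0.0, Vectors.dense([1.0, 1.0]))],
    ["label", "features"])
```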
FMClassifier/FMRegressor and GeneralizedLinearRegression have the same problems.
Here are the root causes of the problems:
1. In HasSolver, both Scala and Python default solver to 'auto'.
2. On the Scala side, mlp overrides the default solver to 'l-bfgs', FMClassifier/FMRegressor override it to 'adamW', and glr overrides it to 'irls'.
3. On the Scala side, mlp overrides the default solver in MultilayerPerceptronClassificationParams, so both MultilayerPerceptronClassifier and MultilayerPerceptronClassificationModel have 'l-bfgs' as the default.
4. On the Python side, mlp overrides the default solver in MultilayerPerceptronClassifier, so its default is 'l-bfgs', but MultilayerPerceptronClassificationModel doesn't override it, so it inherits the 'auto' default from HasSolver. In theory, we don't care about the solver value, or any other param values, for MultilayerPerceptronClassificationModel, because we already have the fitted model. That's why, on the Python side, we never set default values for any XXXModel.
5. When getSolver is called on the loaded mlp model, this code runs under the hood:
```
def _transfer_params_from_java(self):
    """
    Transforms the embedded params from the companion Java object.
    """
    # ...
    # SPARK-14931: Only check set params back to avoid default params mismatch.
    if self._java_obj.isSet(java_param):
        value = _java2py(sc, self._java_obj.getOrDefault(java_param))
        self._set(**{param.name: value})
    # ...
```
That's why model2.getSolver() returns 'auto': the code doesn't fetch the Scala-side default ('l-bfgs' here) to set on the Python param, so the Python-side default ('auto' here) is used instead. The set-vs-default distinction is probed in the sketch below.
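Here is a small pure-Python probe of that distinction, with no JVM involved. _Demo is a hypothetical params holder made up for illustration; it just mixes in the real HasSolver:
```
from pyspark.ml.param.shared import HasSolver

class _Demo(HasSolver):
    """Hypothetical params holder, only to probe set vs. default."""
    pass

d = _Demo()
print(d.hasDefault(d.solver))  # True  -- HasSolver defaults solver to 'auto'
print(d.isSet(d.solver))       # False -- never explicitly set, so the
                               #          isSet() guard above skips it
d._set(solver="l-bfgs")        # _set is internal API, used here only to
                               # flip the "explicitly set" state
print(d.isSet(d.solver))       # True  -- now the value would be transferred
```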
6. When model2.transform(df) is called, this runs under the hood:
```
def _transfer_params_to_java(self):
    """
    Transforms the embedded params to the companion Java object.
    """
    # ...
    if self.hasDefault(param):
        pair = self._make_java_param_pair(param, self._defaultParamMap[param])
        pair_defaults.append(pair)
    # ...
```
Again, it picks up the Python default solver, 'auto', and this causes the exception; a pure-Python analogue of the failing check follows.
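The names below are mine, not Spark's; the allowed values mirror the Scala mlp solver param's validator (its supported solvers are 'l-bfgs' and 'gd'), which is what rejects the transferred 'auto':
```
# Mirrors MultilayerPerceptronClassifier's supported solvers on the Scala side.
SUPPORTED_MLP_SOLVERS = ("l-bfgs", "gd")

def validate_solver(value):
    # Hypothetical stand-in for the Scala-side inArray param validation.
    if value not in SUPPORTED_MLP_SOLVERS:
        raise ValueError("parameter solver given invalid value %s." % value)
    return value

validate_solver("l-bfgs")  # passes
validate_solver("auto")    # raises, matching the exception in the repro
```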
7. Currently, on the Scala side, some algorithms set the default values in the shared XXXParams trait, so both the estimator and the model get the default; others set the default only in the estimator, so the XXXModel doesn't get it. On the Python side, we never set defaults for the XXXModel. This causes the default-value inconsistency.
8. My proposed solution: set the default params in the shared XXXParams for both Scala and Python, so the estimator and the model have the same default values on both sides (see the sketch after this list). This PR only changes solver; if everyone is OK with the fix, I will change all the other params as well.
I hope my explanation makes sense to you folks :)
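As a sketch of the proposed fix on the Python side (class names here are hypothetical; the real change edits the existing shared params mixins): move the default out of the estimator and into the mixin, so the model inherits it too.
```
from pyspark.ml.param.shared import HasSolver

class _SharedParams(HasSolver):
    """Hypothetical shared mixin, standing in for e.g. the mlp params trait."""
    def __init__(self):
        super(_SharedParams, self).__init__()
        # Override HasSolver's 'auto' once, here, so every class mixing
        # this in agrees on the default.
        self._setDefault(solver="l-bfgs")

class DemoEstimator(_SharedParams):   # analogue of the estimator
    pass

class DemoModel(_SharedParams):       # analogue of the fitted model
    pass

print(DemoEstimator().getSolver())  # 'l-bfgs'
print(DemoModel().getSolver())      # 'l-bfgs' -- no more 'auto' mismatch
```
With the default defined once in the shared mixin, the loaded model no longer falls back to HasSolver's 'auto', so both getSolver() and the param transfer back to the JVM agree with Scala.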
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing and new tests.
Closes #29060 from huaxingao/solver_parity.
Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 99b4b06)
Signed-off-by: Sean Owen <[email protected]>
Merged to master/3.0. In your own time you're welcome to apply similar fixes. Thanks for tracking it down!
Thank you everyone!