
[SPARK-28243][PYSPARK][ML] Remove setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams#25046

Closed

huaxingao wants to merge 2 commits into apache:master from huaxingao:spark-28243


Conversation

@huaxingao
Contributor

What changes were proposed in this pull request?

Remove deprecated setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams

How was this patch tested?

Use existing tests.

@SparkQA

SparkQA commented Jul 3, 2019

Test build #107187 has finished for PR 25046 at commit 8eeb6fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28243][PYSPARK][ML]remove setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams [SPARK-28243][PYSPARK][ML] Remove setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams Jul 3, 2019
@srowen
Member

srowen commented Jul 17, 2019

@huaxingao I think you're probably right on this, but can you remind us here why you also removed the subsampling rate? The feature strategy setter is still on the Scala side; is it also meant to just move rather than go away?

@huaxingao
Contributor Author

@srowen Sorry, I didn't make it clear in the PR description.

On the Scala side, both setSubsamplingRate and setFeatureSubsetStrategy were initially in the trait TreeEnsembleParams in treeParams.scala. These two setters were deprecated and moved from the trait TreeEnsembleParams to RandomForestClassifier/Regressor and GBTClassifier/Regressor in 3.0.0.

In this PR, I did the same thing on the Python side: I moved setSubsamplingRate from TreeEnsembleParams to RandomForestClassifier/Regressor and GBTClassifier/Regressor. setFeatureSubsetStrategy is already in the Python RandomForestClassifier/Regressor and GBTClassifier/Regressor, so I simply removed it from TreeEnsembleParams.
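The move described above can be sketched roughly like this (simplified stand-in classes, not the actual PySpark source; PySpark's real params machinery uses a Param/Params framework):

```python
class TreeEnsembleParams:
    """Shared params mixin: keeps the param and its getter, but no longer the setter."""
    def __init__(self):
        self.subsamplingRate = 1.0  # param default

    def getSubsamplingRate(self):
        return self.subsamplingRate


class RandomForestClassifier(TreeEnsembleParams):
    """Concrete estimator: the setter now lives here instead of on the mixin."""
    def setSubsamplingRate(self, value):
        self.subsamplingRate = value
        return self  # PySpark setters return self to allow chaining


rf = RandomForestClassifier().setSubsamplingRate(0.8)
print(rf.getSubsamplingRate())  # 0.8
```

The key point is that the setter is no longer inherited by every class mixing in TreeEnsembleParams; only the estimators that actually support the param expose it.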

@srowen
Member

srowen commented Jul 17, 2019

OK, this is really a follow-up of 4aa9ccb#diff-6b8a041f558af2b7bc50d930b1ad2670 then. I wonder if we missed any other setters that were removed in Scala? But this seems OK. CC @mgaido91 FYI

@huaxingao
Contributor Author

I initially did #21413, so I knew I had marked TreeEnsembleParams.setFeatureSubsetStrategy deprecated and needed to remove it later on. When I removed it, I saw setSubsamplingRate and removed it too.

I will also move all the other deprecated setters in 4aa9ccb#diff-6b8a041f558af2b7bc50d930b1ad2670. Would you prefer I do it in this PR or in a separate PR? @srowen

@srowen
Member

srowen commented Jul 17, 2019

You can do it here if it's also just the PySpark part of the change, for consistency. Thanks!

@mgaido91
Contributor

Thanks for checking! Yes, it would be great to check them all. Thanks.

@huaxingao
Contributor Author

I made modifications in Python to match the changes in 4aa9ccb#diff-6b8a041f558af2b7bc50d930b1ad2670. However, I didn't move the following four setters:
HasMaxIter.setMaxIter
HasSeed.setSeed
HasCheckpointInterval.setCheckpointInterval
HasStepSize.setStepSize

The reason I didn't change these four setters is that, besides DecisionTreeClassifier/Regressor, GBTClassifier/Regressor, and RandomForestClassifier/Regressor (the files changed in 4aa9ccb#diff-6b8a041f558af2b7bc50d930b1ad2670), quite a few other classes in PySpark also implement these HasXXX mixins; for example, LogisticRegression, GaussianMixture, KMeans, LDA, etc. implement HasMaxIter and rely on its setMaxIter. So I will leave HasMaxIter.setMaxIter and the other three setters as is.
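The constraint described above can be illustrated with a minimal sketch (simplified stand-in classes, not the actual PySpark source): the shared mixin's setter is inherited by many unrelated estimators, so removing it from the mixin would force re-adding it to every subclass.

```python
class HasMaxIter:
    """Shared params mixin: many estimators inherit this one setter."""
    def __init__(self):
        self.maxIter = 100  # param default

    def setMaxIter(self, value):
        self.maxIter = value
        return self  # chaining, as in PySpark setters


# Tree ensembles are not the only users of HasMaxIter; removing
# HasMaxIter.setMaxIter would break every one of these classes.
class LogisticRegression(HasMaxIter):
    pass


class KMeans(HasMaxIter):
    pass


print(LogisticRegression().setMaxIter(50).maxIter)  # 50
print(KMeans().setMaxIter(20).maxIter)              # 20
```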

@SparkQA

SparkQA commented Jul 18, 2019

Test build #107865 has finished for PR 25046 at commit 0dd3ba6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment


I think this looks good.

@srowen
Member

srowen commented Jul 20, 2019

Merged to master

@srowen srowen closed this in 72c80ee Jul 20, 2019
yiheng pushed a commit to yiheng/spark that referenced this pull request Jul 24, 2019
…samplingRate from Python TreeEnsembleParams

## What changes were proposed in this pull request?

Remove deprecated setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams

## How was this patch tested?

Use existing tests.

Closes apache#25046 from huaxingao/spark-28243.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@zhengruifeng
Contributor

@huaxingao @srowen @mgaido91
I agree that we should remove those setters from the Python side. However, we should not directly touch param/shared.py; instead, we have to modify _shared_params_code_gen.py and then run python _shared_params_code_gen.py > shared.py.

This is because _shared_params_code_gen.py automatically generates both the setter and the getter, while on the Scala side only the getter is generated.
Also, on the Scala side, DecisionTreeParams is not placed in sharedParams.scala.
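The code-generation pattern described above can be sketched roughly like this (a simplified, hypothetical stand-in for _shared_params_code_gen.py, which emits the real shared.py; the template and helper names here are illustrative only):

```python
# Template for one shared-param mixin; the generator expands it per param.
TEMPLATE = '''
class Has{Name}:
    def set{Name}(self, value):
        self.{name} = value
        return self

    def get{Name}(self):
        return self.{name}
'''


def gen_shared_param(name):
    """Render the mixin source for a single param name, e.g. 'maxIter'."""
    return TEMPLATE.format(Name=name[0].upper() + name[1:], name=name)


# Generate source text for a couple of params.
source = "".join(gen_shared_param(p) for p in ["maxIter", "seed"])

# The real workflow redirects this text into shared.py; here we just
# exec it into a namespace to show the generated classes work.
namespace = {}
exec(source, namespace)
print(namespace["HasMaxIter"]().setMaxIter(10).getMaxIter())  # 10
```

Because both the setter and the getter come out of one template, removing only the setters means hand-editing the generated file drifts out of sync with the generator, which is the problem being pointed out here.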

There are too many design conflicts between the Scala and Python class hierarchies; it's too confusing and cannot be maintained easily. Maybe it is time to reorganize the Python side to keep it in line with the Scala side.

I found this while adding Implement Tree-Based Feature Transformation #25383 on the Python side.

@mgaido91
Contributor

There are too many design conflicts between the Scala and Python class hierarchies; it's too confusing and cannot be maintained easily. Maybe it is time to reorganize the Python side to keep it in line with the Scala side.

Yes, I do agree with you. We could also think about having a script which generates both APIs, to make sure they stay in sync. WDYT?

@srowen
Member

srowen commented Aug 10, 2019

Oh, hm, OK. I am not even sure whether shared.py is in sync with what the script produces then. If it's easy to fix later or separately, great, but I'm not even sure this is a good strategy for maintaining the code.

@huaxingao
Contributor Author

I will modify _shared_params_code_gen.py and use it to generate shared.py. I am out of town and will work on this after I come back on 8/15.
I also agree that there are too many design conflicts between the Scala and Python class hierarchies. It's a good idea to reorganize the Python side to keep it consistent with the Scala side.

@huaxingao huaxingao deleted the spark-28243 branch August 11, 2019 07:54
@zhengruifeng
Contributor

@mgaido91 It is a good idea. I think we may start with a script that only checks the parity.

@srowen shared.py is not in sync with _shared_params_code_gen.py right now.

@huaxingao I intend to change this part in #25383, maybe by moving DecisionTreeParams out of shared.py. Hope you can help review.

