
[SPARK-28243][PYSPARK][ML] Remove setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams#25046

Closed

huaxingao wants to merge 2 commits into apache:master from huaxingao:spark-28243


Conversation

@huaxingao
Contributor

What changes were proposed in this pull request?

Remove deprecated setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams

How was this patch tested?

Use existing tests.

@SparkQA

SparkQA commented Jul 3, 2019

Test build #107187 has finished for PR 25046 at commit 8eeb6fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28243][PYSPARK][ML]remove setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams [SPARK-28243][PYSPARK][ML] Remove setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams Jul 3, 2019
@srowen
Member

srowen commented Jul 17, 2019

@huaxingao I think you're probably right on this, but can you remind us here why you also removed the subsampling rate? The feature strategy setter is still on the Scala side; is it also meant to just move rather than go away?

@huaxingao
Contributor Author

@srowen Sorry, I didn't make it clear in the PR description.

On the Scala side, both setSubsamplingRate and setFeatureSubsetStrategy were initially in the trait TreeEnsembleParams in treeParams.scala. These two setters were deprecated and moved from the trait TreeEnsembleParams to RandomForestClassifier/Regressor and GBTClassifier/Regressor in 3.0.0.

In this PR, I did the same thing on the Python side: I moved setSubsamplingRate from TreeEnsembleParams to RandomForestClassifier/Regressor and GBTClassifier/Regressor. setFeatureSubsetStrategy is already in the Python RandomForestClassifier/Regressor and GBTClassifier/Regressor, so I simply removed it from TreeEnsembleParams.
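The move described above can be sketched roughly like this (simplified stand-in classes, not the actual PySpark source; PySpark's real params machinery uses a Param/Params framework):

```python
class TreeEnsembleParams:
    """Shared params mixin: keeps the param and its getter, but no longer the setter."""
    def __init__(self):
        self.subsamplingRate = 1.0  # param default

    def getSubsamplingRate(self):
        return self.subsamplingRate


class RandomForestClassifier(TreeEnsembleParams):
    """Concrete estimator: the setter now lives here instead of on the mixin."""
    def setSubsamplingRate(self, value):
        self.subsamplingRate = value
        return self  # PySpark setters return self to allow chaining


rf = RandomForestClassifier().setSubsamplingRate(0.8)
print(rf.getSubsamplingRate())  # 0.8
```

The key point is that the setter is no longer inherited by every class mixing in TreeEnsembleParams; only the estimators that actually support the param expose it.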

@srowen
Member

srowen commented Jul 17, 2019

OK, this is really a follow-up of 4aa9ccb#diff-6b8a041f558af2b7bc50d930b1ad2670 then. I wonder if we missed any other setters that were removed in Scala? But this seems OK. CC @mgaido91 FYI

@huaxingao
Contributor Author

I initially did #21413, so I knew I had marked TreeEnsembleParams.setFeatureSubsetStrategy deprecated and needed to remove it later on. When I removed it, I saw setSubsamplingRate and removed it too.

I will also move all the other deprecated setters in 4aa9ccb#diff-6b8a041f558af2b7bc50d930b1ad2670. Would you prefer I do it in this PR or in a separate PR? @srowen

@srowen
Member

srowen commented Jul 17, 2019

You can do it here if it's also just the PySpark part of the change, for consistency. Thanks!

@mgaido91
Contributor

Thanks for checking! Yes, it would be great to check them all. Thanks.

@huaxingao
Contributor Author

I made modifications in Python to match the changes in 4aa9ccb#diff-6b8a041f558af2b7bc50d930b1ad2670. However, I didn't move the following four setters:
HasMaxIter.setMaxIter
HasSeed.setSeed
HasCheckpointInterval.setCheckpointInterval
HasStepSize.setStepSize

The reason I didn't change these four setters is that, besides DecisionTreeClassifier/Regressor, GBTClassifier/Regressor, and RandomForestClassifier/Regressor (the files changed in 4aa9ccb#diff-6b8a041f558af2b7bc50d930b1ad2670), quite a few other classes in PySpark also implement these HasXXX mixins; for example, LogisticRegression, GaussianMixture, KMeans, LDA, etc. implement HasMaxIter and rely on its setMaxIter. So I will leave HasMaxIter.setMaxIter and the other three setters as is.
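The constraint described above can be illustrated with a minimal sketch (simplified stand-in classes, not the actual PySpark source): the shared mixin's setter is inherited by many unrelated estimators, so removing it from the mixin would force re-adding it to every subclass.

```python
class HasMaxIter:
    """Shared params mixin: many estimators inherit this one setter."""
    def __init__(self):
        self.maxIter = 100  # param default

    def setMaxIter(self, value):
        self.maxIter = value
        return self  # chaining, as in PySpark setters


# Tree ensembles are not the only users of HasMaxIter; removing
# HasMaxIter.setMaxIter would break every one of these classes.
class LogisticRegression(HasMaxIter):
    pass


class KMeans(HasMaxIter):
    pass


print(LogisticRegression().setMaxIter(50).maxIter)  # 50
print(KMeans().setMaxIter(20).maxIter)              # 20
```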

@SparkQA

SparkQA commented Jul 18, 2019

Test build #107865 has finished for PR 25046 at commit 0dd3ba6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment


I think this looks good.

@srowen
Member

srowen commented Jul 20, 2019

Merged to master

@srowen srowen closed this in 72c80ee Jul 20, 2019
yiheng pushed a commit to yiheng/spark that referenced this pull request Jul 24, 2019
…samplingRate from Python TreeEnsembleParams

## What changes were proposed in this pull request?

Remove deprecated setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams

## How was this patch tested?

Use existing tests.

Closes apache#25046 from huaxingao/spark-28243.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@zhengruifeng
Contributor

@huaxingao @srowen @mgaido91
I agree that we should remove those setters from the Python side. However, we should not directly touch param/shared.py; instead, we have to modify _shared_params_code_gen.py and then run python _shared_params_code_gen.py > shared.py.

This is because _shared_params_code_gen.py automatically generates both the setter and the getter, while on the Scala side only the getter is generated.
Also, on the Scala side, DecisionTreeParams is not placed in sharedParams.scala.
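The code-generation pattern described above can be sketched roughly like this (a simplified, hypothetical stand-in for _shared_params_code_gen.py, which emits the real shared.py; the template and helper names here are illustrative only):

```python
# Template for one shared-param mixin; the generator expands it per param.
TEMPLATE = '''
class Has{Name}:
    def set{Name}(self, value):
        self.{name} = value
        return self

    def get{Name}(self):
        return self.{name}
'''


def gen_shared_param(name):
    """Render the mixin source for a single param name, e.g. 'maxIter'."""
    return TEMPLATE.format(Name=name[0].upper() + name[1:], name=name)


# Generate source text for a couple of params.
source = "".join(gen_shared_param(p) for p in ["maxIter", "seed"])

# The real workflow redirects this text into shared.py; here we just
# exec it into a namespace to show the generated classes work.
namespace = {}
exec(source, namespace)
print(namespace["HasMaxIter"]().setMaxIter(10).getMaxIter())  # 10
```

Because both the setter and the getter come out of one template, removing only the setters means hand-editing the generated file drifts out of sync with the generator, which is the problem being pointed out here.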

There are too many design conflicts between the Scala and Python class hierarchies; it's too confusing and cannot be maintained easily. Maybe it is time to reorganize the Python side to keep it in line with the Scala side.

I found this while adding Implement Tree-Based Feature Transformation #25383 on the Python side.

@mgaido91
Contributor

There are too many design conflicts between the Scala and Python class hierarchies; it's too confusing and cannot be maintained easily. Maybe it is time to reorganize the Python side to keep it in line with the Scala side.

Yes, I do agree with you. We could also think about having a script which generates both APIs, to make sure they stay in sync. WDYT?

@srowen
Member

srowen commented Aug 10, 2019

Oh, hm, OK. I am not even sure whether shared.py is in sync with what the script produces then. If it's easy to fix later or separately, great, but I'm not even sure this is a good strategy for maintaining the code.

@huaxingao
Contributor Author

I will modify _shared_params_code_gen.py and use it to generate shared.py. I am out of town and will work on this after I come back on 8/15.
I also agree that there are too many design conflicts between the Scala and Python class hierarchies. It's a good idea to reorganize the Python side to keep it consistent with the Scala side.

@huaxingao huaxingao deleted the spark-28243 branch August 11, 2019 07:54
@zhengruifeng
Contributor

@mgaido91 It is a good idea. I think we may start with a script that only checks the parity.

@srowen shared.py is not in sync with _shared_params_code_gen.py right now.

@huaxingao I intend to change this part in #25383, maybe by moving DecisionTreeParams out of shared.py. Hope you can help review.

