Skip to content

Comments

[SPARK-30533][ML][PYSPARK] Add classes to represent Java Regressors and RegressionModels#27241

Closed
zero323 wants to merge 2 commits intoapache:masterfrom
zero323:SPARK-30533
Closed

[SPARK-30533][ML][PYSPARK] Add classes to represent Java Regressors and RegressionModels#27241
zero323 wants to merge 2 commits intoapache:masterfrom
zero323:SPARK-30533

Conversation

@zero323
Copy link
Member

@zero323 zero323 commented Jan 16, 2020

What changes were proposed in this pull request?

This PR adds:

  • pyspark.ml.regression.JavaRegressor
  • pyspark.ml.regression.JavaRegressionModel

classes and replaces JavaPredictor and JavaPredictionModel in

  • LinearRegression / LinearRegressionModel
  • DecisionTreeRegressor / DecisionTreeRegressionModel (just addition as JavaPredictionModel hasn't been used)
  • RandomForestRegressor / RandomForestRegressionModel (just addition as JavaPredictionModel hasn't been used)
  • GBTRegressor / GBTRegressionModel (just addition as JavaPredictionModel hasn't been used)
  • AFTSurvivalRegression / AFTSurvivalRegressionModel
  • GeneralizedLinearRegression / GeneralizedLinearRegressionModel
  • FMRegressor / FMRegressionModel

Why are the changes needed?

  • Internal PySpark consistency.
  • Feature parity with Scala.
  • Intermediate step towards implementing SPARK-29212

Does this PR introduce any user-facing change?

It adds new base classes, so it will affect mro. Otherwise interfaces should stay intact.

How was this patch tested?

Existing tests.

@SparkQA
Copy link

SparkQA commented Jan 16, 2020

Test build #116868 has finished for PR 27241 at commit 8c5988a.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class JavaRegressor(JavaPredictor, _JavaPredictorParams):
  • class JavaRegressionModel(JavaPredictionModel, _JavaPredictorParams):
  • class LinearRegression(JavaRegressor, _LinearRegressionParams, JavaMLWritable, JavaMLReadable):
  • class LinearRegressionModel(JavaRegressionModel, _LinearRegressionParams, GeneralJavaMLWritable,
  • class DecisionTreeRegressor(JavaRegressor, _DecisionTreeRegressorParams, JavaMLWritable,
  • class DecisionTreeRegressionModel(JavaRegressionModel, _DecisionTreeModel, _DecisionTreeRegressorParams,
  • class RandomForestRegressor(JavaRegressor, _RandomForestRegressorParams, JavaMLWritable,
  • class RandomForestRegressionModel(JavaRegressionModel, _TreeEnsembleModel, _RandomForestRegressorParams,
  • class GBTRegressor(JavaRegressor, _GBTRegressorParams, JavaMLWritable, JavaMLReadable):
  • class GBTRegressionModel(JavaRegressionModel, _TreeEnsembleModel, _GBTRegressorParams, JavaMLWritable, JavaMLReadable):
  • class AFTSurvivalRegression(JavaRegressor, _AFTSurvivalRegressionParams,
  • class AFTSurvivalRegressionModel(JavaRegressionModel, _AFTSurvivalRegressionParams,
  • class GeneralizedLinearRegression(JavaRegressor, _GeneralizedLinearRegressionParams,
  • class GeneralizedLinearRegressionModel(JavaRegressionModel, _GeneralizedLinearRegressionParams,
  • class FMRegressor(JavaRegressor, _FactorizationMachinesParams, JavaMLWritable, JavaMLReadable):
  • class FMRegressionModel(JavaRegressionModel, _FactorizationMachinesParams, JavaMLWritable,

@SparkQA
Copy link

SparkQA commented Jan 16, 2020

Test build #116869 has finished for PR 27241 at commit b8127b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DecisionTreeRegressionModel(
  • class RandomForestRegressionModel(
  • class GBTRegressionModel(

@zero323 zero323 changed the title [WIP][SPARK-30533][ML][PYSPARK] Add classes to represent Java Regressors and RegressionModels [SPARK-30533][ML][PYSPARK] Add classes to represent Java Regressors and RegressionModels Jan 16, 2020
Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this more for symmetry than anything else? I don't have a strong opinion. I'd defer to @huaxingao and @zhengruifeng who have been watching this much more closely.

@zero323
Copy link
Member Author

zero323 commented Jan 17, 2020

Is this more for symmetry than anything else? I don't have a strong opinion. I'd defer to @huaxingao and @zhengruifeng who have been watching this much more closely.

To some extent. In general it useful to be able to distinguish between Regressors, Classiffiers and other types of Params (more about here). Additionally we cannot really get Scala parity without these.

@zhengruifeng
Copy link
Contributor

Looks reasonable

@huaxingao
Copy link
Contributor

I think on scala side, it makes more sense to make all Regression classes extend Regressor, because the Regressor class is already there and some of the Regression classes already implement Regressor, so it's better to make all Regression classes follow the same standard. But I am not so sure about python side: the Regressor class is basically empty and I am not sure if we should add another layer of abstraction, so I chose not to add Regressor/Regressor class on python side when I did #27168. But I know you can argue that we need change python too to keep the parity between scala and python. I am OK either way.

@zero323
Copy link
Member Author

zero323 commented Jan 17, 2020

The Regressor class is basically empty and I am not sure if we should add another layer of abstraction, so I chose not to add Regressor/Regressor class on python side when I did #27168. But I know you can argue that we need change python too to keep the parity between scala and python. I am OK either way.

I am actually not that interested in parity (as it is right now it provides little or no value to the end user, inflates pyspark.ml codebase, and actually increases effort required to maintain the whole thing) as much as practical value. As I argued in discussion around #25776 (comment) ability to distinguish between types of predictors is fundamental for building complex ML workflows, and current API is not sufficient to that (Classifiers and non classifier Predictors usually have the same API, and produce identical output schema for predictions).

the Regressor class is basically empty

I am afraid that's, for good or bad, argument you can make against significant chunk of the API.

@huaxingao
Copy link
Contributor

LGTM

@srowen srowen closed this in 3228732 Jan 18, 2020
@srowen
Copy link
Member

srowen commented Jan 18, 2020

Merged to master

@zero323 zero323 deleted the SPARK-30533 branch January 18, 2020 01:42
@zero323
Copy link
Member Author

zero323 commented Jan 18, 2020

Thanks @huaxingao @srowen @zhengruifeng

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants