[SPARK-13677][ML] Implement Tree-Based Feature Transformation for ML by zhengruifeng · Pull Request #25383 · apache/spark

zhengruifeng · 2019-08-08T03:06:58Z

What changes were proposed in this pull request?

Tree-based feature transformation is widely used and already implemented in many popular libraries, such as sklearn, xgboost, lightgbm, and catboost, but it is still missing in ML.
The previous discussions and design doc can be found in SPARK-13677, which is the only remaining subtask of the 'GBT improvement umbrella' SPARK-14047.

This PR adds tree-based feature transformation.
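
As an illustration of how leaf indices are commonly consumed downstream in other libraries, here is a minimal plain-Scala sketch that one-hot encodes the leaf reached in each tree of an ensemble. The object and method names (`LeafOneHotSketch`, `oneHot`) are hypothetical and not part of this PR; the PR itself only emits the leaf indices.

```scala
object LeafOneHotSketch {
  /**
   * Builds one one-hot block per tree and concatenates them.
   * leafIds(i) is the leaf reached in tree i; numLeaves(i) is that tree's leaf count.
   * (Hypothetical helper for illustration; not Spark API.)
   */
  def oneHot(leafIds: Seq[Int], numLeaves: Seq[Int]): Array[Double] = {
    require(leafIds.length == numLeaves.length, "expected one leaf id per tree")
    // offsets(i) is the position where tree i's block starts; the last entry is the total width.
    val offsets = numLeaves.scanLeft(0)(_ + _)
    val out = Array.fill(offsets.last)(0.0)
    leafIds.zipWithIndex.foreach { case (leaf, tree) =>
      require(leaf >= 0 && leaf < numLeaves(tree), s"leaf $leaf out of range for tree $tree")
      out(offsets(tree) + leaf) = 1.0
    }
    out
  }
}
```

With two trees of 3 and 2 leaves, a row that lands in leaf 1 of the first tree and leaf 0 of the second becomes a 5-element vector with exactly two non-zero entries.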

How was this patch tested?

existing and added suites

SparkQA · 2019-08-08T04:23:56Z

Test build #108794 has finished for PR 25383 at commit 3af602e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

Hm, it does add some extra complexity to be sure. It sounds kind of useful though.

srowen · 2019-08-09T15:18:18Z

mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala


    if ($(predictionCol).nonEmpty) {
-      val predictUDF = udf { (features: Vector) => predict(features) }
+      val predictUDF = udf { vector: Vector => predict(vector) }


Total nit but do you need to change these?

This removes the unnecessary parentheses, and I renamed the variable to be consistent with other places. I'm fine with reverting these.

srowen · 2019-08-09T15:22:43Z

mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala

+  def setLeafCol(value: String): this.type = set(leafCol, value)
+
+  @Since("3.0.0")
+  def predictLeaf(features: Vector): Double = predictLeafImpl(features)


Can this go in a superclass? Is there any need to separate predictLeaf from predictLeafImpl?
Likewise can setLeafCol go in a superclass to avoid repeating it?

@srowen I prefer not to move setLeafCol into the superclass, since keeping setters out of the superclass looks like the mllib convention, e.g. setVarianceCol in DecisionTreeRegressionModel.

zhengruifeng · 2019-08-10T04:04:57Z

@WeichenXu123 @mgaido91 Could you please help review this too? This feature involves some complexity, but should be useful.

SparkQA · 2019-08-10T04:54:11Z

Test build #108904 has finished for PR 25383 at commit f104995.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-08-10T05:52:05Z

Test build #108907 has finished for PR 25383 at commit 9c01d4b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-08-10T06:23:12Z

I was going to add it to the Python side, and found that I have to add a new param leafCol to DecisionTreeParams in shared.py.
After I added leafCol in _shared_params_code_gen.py and ran python _shared_params_code_gen.py > shared.py, I found that a setter for leafCol is also generated, although it should not appear there; moreover, #25046 was reverted.

I'll leave the Python side for now, since it needs to touch many Python files to pass, such as moving DecisionTreeParams out of shared.py.

Due to the confusing class hierarchy on the Python side, there are many conflicts between the Scala side and the Python side. It has been more than six years since Spark began supporting mllib on the Python side; however, for a prediction model, we still cannot set the input/output columns.

I'm wondering if we can begin to reorganize the hierarchy of mllib on the Python side? @srowen

mgaido91 · 2019-08-10T08:17:48Z

mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala


+  /** @group setParam */
+  @Since("3.0.0")
+  def setLeafCol(value: String): this.type = set(leafCol, value)


I think we can do some refactoring here, in order to dedup this. Can we add it to a trait?

This seems to follow mllib's convention of not adding setters to the xxxParams-like traits, e.g. setVarianceCol in DecisionTreeRegressionModel.

mgaido91 · 2019-08-10T08:25:41Z

mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala

+private[ml] trait GBTClassifierParams extends GBTParams with HasVarianceImpurity
+  with ProbabilisticClassifierParams {
+
+  override protected def validateAndTransformSchema(


In general, many parts of the code in this PR are copy-pasted... can we dedup the code?

Good point, I will look into it.

@mgaido91 I tend to keep the current approach, because the superclasses are different:
1. The super.validateAndTransformSchema(schema, fitting, featuresDataType) called in GBTClassifierParams & RandomForestClassifierParams comes from ProbabilisticClassifierParams, which checks the columns probabilityCol, rawPredictionCol, featuresCol, labelCol, weightCol, predictionCol.
2. The super method called in RandomForestRegressorParams & GBTRegressorParams comes from PredictorParams, which only checks the columns featuresCol, labelCol, weightCol, predictionCol.

We could add two more traits, for classification and regression respectively, like TreeEnsembleClassifierParams & TreeEnsembleRegressorParams.
However, I think this may not be worthwhile, since there would be only two subclasses for each, and it would make the hierarchy more complex.
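
The constraint described above can be sketched in plain Scala, without Spark. The same override body resolves `super` to a different parent depending on which trait it is mixed into, which is why one shared trait per family (classification, regression) is needed. All names ending in `Like` are hypothetical stand-ins, and a "schema" is reduced to a set of column names for brevity.

```scala
object SchemaCheckSketch {
  // Stand-in for PredictorParams: checks and adds the basic columns.
  trait PredictorParamsLike {
    def validate(schema: Set[String]): Set[String] = {
      require(schema.contains("features"), "missing features column")
      schema + "prediction"
    }
  }

  // Stand-in for ProbabilisticClassifierParams: adds the classifier-only columns.
  trait ProbabilisticClassifierParamsLike extends PredictorParamsLike {
    override def validate(schema: Set[String]): Set[String] =
      super.validate(schema) + "rawPrediction" + "probability"
  }

  // Shared leaf-column logic that does not depend on the parent.
  trait LeafColLike {
    def leafCol: String
    protected def withLeaf(schema: Set[String]): Set[String] =
      if (leafCol.nonEmpty) {
        require(!schema.contains(leafCol), s"column $leafCol already exists")
        schema + leafCol
      } else {
        schema
      }
  }

  // One trait per family, because super.validate resolves to a different parent in each.
  trait TreeEnsembleRegressorParamsLike extends PredictorParamsLike with LeafColLike {
    override def validate(schema: Set[String]): Set[String] =
      withLeaf(super.validate(schema))
  }

  trait TreeEnsembleClassifierParamsLike
    extends ProbabilisticClassifierParamsLike with LeafColLike {
    override def validate(schema: Set[String]): Set[String] =
      withLeaf(super.validate(schema))
  }

  class GbtRegressorLike(val leafCol: String) extends TreeEnsembleRegressorParamsLike
  class GbtClassifierLike(val leafCol: String) extends TreeEnsembleClassifierParamsLike
}
```

The two override bodies are textually identical, but they cannot be merged into a single trait without fixing one parent, so the deduplication stops at two traits rather than one.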

srowen · 2019-08-14T14:20:46Z

mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala


+  /** @group setParam */
+  @Since("3.0.0")
+  def setLeafCol(value: String): this.type = set(leafCol, value)


Yes, agree, it shouldn't be in the model classes. So DecisionTreeClassifierParams doesn't help. Hm... per the discussion below, I agree it's extra weight to refactor the common elements into a superclass of two decision tree classifiers, but it might well be worth it. It looks like it would save a few hundred lines of duplicated code? That would mitigate the concern about the large change here. I'm lightly in favor of going that way. I wouldn't do it for 10 lines of code or something.

srowen · 2019-08-14T14:28:34Z

mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala

+  }
+
+  /** Returns the index of leaf given a input vector.
+   *  The leave are indexed from zero by pre-order.


Nit: might just render this as

/**
 * @return ...
 */

zhengruifeng · 2019-08-15T10:26:47Z

@srowen I updated the PR by adding two traits for ensemble models.
BTW, I renamed two internal impurity traits and added some comments, because the old TreeClassifierParams is quite misleading: it took me half an hour to find the trait conflict when changing the hierarchy.

SparkQA · 2019-08-16T02:19:32Z

Test build #109166 has finished for PR 25383 at commit 4a2fcfa.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-08-16T04:06:50Z

Test build #109175 has finished for PR 25383 at commit 092e115.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-08-16T08:25:27Z

Renaming the traits caused MiMa test failures, so I reverted it.

SparkQA · 2019-08-19T06:16:55Z

Test build #109313 has finished for PR 25383 at commit 068f7f9.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-08-19T06:24:06Z

retest this please

SparkQA · 2019-08-19T06:39:37Z

Test build #109316 has finished for PR 25383 at commit 068f7f9.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-08-19T07:04:12Z

Test build #109319 has finished for PR 25383 at commit 7638ee4.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-08-19T08:11:44Z

Test build #109321 has finished for PR 25383 at commit d69183b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-08-19T09:19:26Z

@srowen Since the leaf transform is independent of the model type (it depends only on the tree structure), I updated the suites to share the related data and tree structures, which removes more than 100 lines.

As for the Python side, since the current PR touches too many files, I tend to leave it alone for now and implement it in a follow-up PR. WDYT?

SparkQA · 2019-08-20T04:30:05Z

Test build #109383 has finished for PR 25383 at commit 5c5a76e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

Looking good, just some minor requests.

srowen · 2019-08-20T14:04:14Z

mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala

+      super.transform(dataset)
+        .withColumn($(leafCol), leafUDF(col($(featuresCol))))
+    } else {
+      super.transform(dataset)


It's trivial but I guess you could avoid calling this in two places ... call it once and either return it, or the result of it with a new column.
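
A minimal sketch of the suggested shape, in plain Scala rather than against the real DataFrame API: the parent `transform` is called once, and the leaf column is appended only when it is configured. `Table`, `BaseModelLike`, and `TreeModelLike` are hypothetical stand-ins, and the prediction and leaf rules are placeholders.

```scala
object TransformOnceSketch {
  // Column name -> values; a stand-in for a DataFrame.
  type Table = Map[String, Seq[Double]]

  class BaseModelLike {
    def transform(data: Table): Table =
      data + ("prediction" -> data("features").map(_ * 2.0)) // placeholder prediction
  }

  class TreeModelLike(leafCol: String) extends BaseModelLike {
    override def transform(data: Table): Table = {
      val predicted = super.transform(data) // called exactly once
      if (leafCol.nonEmpty) {
        // Placeholder leaf rule: a single split at 0.5 on the only feature.
        predicted + (leafCol -> data("features").map(x => if (x <= 0.5) 0.0 else 1.0))
      } else {
        predicted
      }
    }
  }
}
```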

srowen · 2019-08-20T14:06:30Z

mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala

+
+  /**
+   * @return the index of leaf given a input vector.
+   *         The leave are indexed from zero by pre-order.


Nit: leave -> leaves. I might just write: @return the index of the leaf corresponding to the feature vector. Leaves are indexed in pre-order from 0. (Same for other occurrences)
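
To make "indexed in pre-order from 0" concrete, here is a small self-contained sketch. The node types and the `predictLeaf` signature are hypothetical simplifications, not the classes in treeModels.scala, and the sketch assumes a row goes left when its feature value is at most the threshold.

```scala
object LeafIndexSketch {
  sealed trait Node
  final case class Leaf(prediction: Double) extends Node
  final case class Internal(feature: Int, threshold: Double, left: Node, right: Node) extends Node

  /** Number of leaves in the subtree rooted at `node`. */
  def numLeaves(node: Node): Int = node match {
    case _: Leaf => 1
    case Internal(_, _, l, r) => numLeaves(l) + numLeaves(r)
  }

  /**
   * Index of the leaf reached by `features`, counting leaves in pre-order from 0.
   * `offset` is the number of leaves that precede the current subtree.
   */
  @scala.annotation.tailrec
  def predictLeaf(node: Node, features: Array[Double], offset: Int = 0): Int = node match {
    case _: Leaf => offset
    case Internal(f, t, l, r) =>
      if (features(f) <= t) predictLeaf(l, features, offset)
      // Going right skips every leaf of the left subtree.
      else predictLeaf(r, features, offset + numLeaves(l))
  }
}
```

Recomputing `numLeaves` on every right turn keeps the sketch short; a real implementation would more likely precompute the leaf indices once per tree.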

srowen · 2019-08-20T14:08:09Z

mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala

+    val df = sc.parallelize(data, 1).toDF("leafId", "features")
+    model.transform(df).select("leafId", "predictedLeafId")
+      .collect().foreach {
+      case Row(leafId: Vector, predictedLeafId: Vector) =>


Total nit, but this should indent further, or just pull the previous line onto the line 2 lines above. Same below.

zhengruifeng · 2019-08-21T02:44:54Z

@srowen I made some extra modifications, which move setLeafCol into a superclass.
I am looking into the hierarchy on both the Python and Scala sides, and found that in many places setters are defined in both the Estimator and the Model, while in other places common setters are put in the corresponding param trait. I prefer the latter approach, and suggest we move common setters into superclasses in the future.
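
A sketch of the latter approach, assuming simplified stand-in classes rather than the real Params machinery: a setter declared once in a shared trait with return type `this.type` keeps the caller's concrete type, so chaining with subclass-specific setters still compiles. `HasLeafColLike` and `ClassifierLike` are hypothetical names.

```scala
object SharedSetterSketch {
  trait HasLeafColLike {
    private var leafColValue: String = ""
    def getLeafCol: String = leafColValue
    /** Returning this.type preserves the concrete subclass type for chaining. */
    def setLeafCol(value: String): this.type = {
      leafColValue = value
      this
    }
  }

  class ClassifierLike extends HasLeafColLike {
    private var maxDepthValue: Int = 5
    def getMaxDepth: Int = maxDepthValue
    def setMaxDepth(value: Int): this.type = {
      maxDepthValue = value
      this
    }
  }
}
```

Because `setLeafCol` returns `this.type`, a call such as `new ClassifierLike().setLeafCol("leafId").setMaxDepth(3)` type-checks even though `setMaxDepth` exists only on the subclass.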

SparkQA · 2019-08-21T02:47:27Z

Test build #109452 has finished for PR 25383 at commit 8247395.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-08-21T04:08:18Z

Test build #109454 has finished for PR 25383 at commit 4eb97be.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-08-21T17:14:23Z

Yeah, I think setters shouldn't be exposed on model objects. Some changes in the past have fixed some of that.

srowen · 2019-08-22T14:37:48Z

Merged to master

zhengruifeng · 2019-08-23T01:52:20Z

Thank you for reviewing! @srowen @mgaido91

dongjoon-hyun added the ML label Aug 8, 2019

srowen reviewed Aug 9, 2019

zhengruifeng mentioned this pull request Aug 10, 2019

[SPARK-28243][PYSPARK][ML] Remove setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams #25046

Closed

mgaido91 reviewed Aug 10, 2019

srowen reviewed Aug 14, 2019

zhengruifeng force-pushed the tree_path branch from 068f7f9 to 7638ee4 Compare August 19, 2019 06:49

zhengruifeng added 9 commits August 20, 2019 11:13

init

95b1d22

xxx

c17af00

init

92c2c86

add testsuites

c0ba410

update transform

8b297ff

update transform II

13027cb

nit

e46d3a1

move predictLeaf in to superclass

8709ed1

add some comments

eda4192

zhengruifeng added 8 commits August 20, 2019 11:13

add trait TreeEnsembleClassifierParams & TreeEnsembleRegressorParams

a4e60e3

rename trait of impurity

dd68ac6

nit

1d59303

revert trait renaming

3904602

make test suites more common

ae1cb9d

nit

b21f8e8

update suites

d9d7368

mv structure comment into the block

5c5a76e

zhengruifeng force-pushed the tree_path branch from d69183b to 5c5a76e Compare August 20, 2019 03:14

srowen requested changes Aug 20, 2019

update

8247395

revert setVarianceCol to avoid mima

4eb97be

srowen approved these changes Aug 22, 2019

srowen closed this in defb65e Aug 22, 2019

zhengruifeng deleted the tree_path branch August 23, 2019 01:51
