Skip to content

[SPARK-13677][ML] Implement Tree-Based Feature Transformation for ML#25383

Closed
zhengruifeng wants to merge 19 commits intoapache:masterfrom
zhengruifeng:tree_path
Closed

[SPARK-13677][ML] Implement Tree-Based Feature Transformation for ML#25383
zhengruifeng wants to merge 19 commits intoapache:masterfrom
zhengruifeng:tree_path

Conversation

@zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

Tree-based feature transformation is a widely used feature and already implemented in many famous libraries, like sklearn/xgboost/lightgbm/catboost. But is still missing in ML.
The previous discussions and design doc can be found in SPARK-13677, which is the only left subtask in 'GBT improvement umbrella' SPARK-14047.

This pr is to add tree-based feature transformation.

How was this patch tested?

existing and added suites

@SparkQA
Copy link

SparkQA commented Aug 8, 2019

Test build #108794 has finished for PR 25383 at commit 3af602e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, it does add some extra complexity to be sure. It sounds kind of useful though.


if ($(predictionCol).nonEmpty) {
val predictUDF = udf { (features: Vector) => predict(features) }
val predictUDF = udf { vector: Vector => predict(vector) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Total nit but do you need to change these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to remove the unnecessary brackets, then I rename the var to keep in line with other places. I am neutral to revert these.

def setLeafCol(value: String): this.type = set(leafCol, value)

@Since("3.0.0")
def predictLeaf(features: Vector): Double = predictLeafImpl(features)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this go in a superclass? Is there any need to separate predictLeaf from predictLeafImpl?
Likewise can setLeafCol go in a superclass to avoid repeating it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen I prefer not to move setLeafCol into the superclass, since it looks like the mllib's convention, such as setVarianceCol in DecisionTreeRegressionModel.

@zhengruifeng
Copy link
Contributor Author

@WeichenXu123 @mgaido91 Could you please help reviewing this too? This feature involve some complexity, but shoule be useful.

@SparkQA
Copy link

SparkQA commented Aug 10, 2019

Test build #108904 has finished for PR 25383 at commit f104995.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 10, 2019

Test build #108907 has finished for PR 25383 at commit 9c01d4b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

I was going to add it to the py side, and found I have to add a new param leafCol in DecisionTreeParams in shared.py.
After I add leafCol in _shared_params_code_gen.py and run cmd python _shared_params_code_gen.py > shared.py, I find that setter of leafCol is also generated however it should not appear here, moreover #25046 was reverted.

I leave the py side for now, since it needs to touch many py files to pass though, like moving DecisionTreeParams out of shared.py.

Due to the confusion of class hierarchy in the py side, there are many conflicts between scaia sidd and python side. It was more than six yeaars since spark support mllib in the python side, however, for a prediction model, we still cannot set the input/output columns.

I'm wondering if we can begin to re-org the hierarchy of mllib in the python side? @srowen


/** @group setParam */
@Since("3.0.0")
def setLeafCol(value: String): this.type = set(leafCol, value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do some refactoring here, in order to dedup this. Can we add it to a trait?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way seems like mllib's convention that not add setter into the xxxParam-like trait, like setVarianceCol in DecisionTreeRegressionModel

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agree, it shouldn't be in the model classes. So DecisionTreeClassifierParams doesn't help. Hm... per the discussion below, I agree it's extra weight to refactor the common elements into a superclass of two decision tree classifiers, but it might well be worth it. It looks like it would save a few hundred lines of duplicated code? that would mitigate the concern about the large change here. I'm lightly in favor of going that way. I wouldn't do it for 10 lines of code or something.

private[ml] trait GBTClassifierParams extends GBTParams with HasVarianceImpurity
with ProbabilisticClassifierParams {

override protected def validateAndTransformSchema(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general, in this PR there are many code parts which are copy-pasted...can we dedup the code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I will look into it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgaido91 I tend to keep current way, that is because the superclasses are different:
1,the super.validateAndTransformSchema(schema, fitting, featuresDataType) in GBTClassifierParams & RandomForestClassifierParams are from ProbabilisticClassifierParams, which check cols probabilityCol,rawPredictionCol,featuresCol,labelCol,weightCol,predictionCol
2,while the super method called in RandomForestRegressorParams & GBTRegressorParams are from PredictorParams, which only check cols featuresCol,labelCol,weightCol,predictionCol

We can add another two trait for classification and regression, respectively. Like TreeEnsembleClassifierParams & TreeEnsembleRegressorParams.
However, I think this maybe not worthwhile, since there will be only two subclasses for each, and this will make the hierarchy more complex.


/** @group setParam */
@Since("3.0.0")
def setLeafCol(value: String): this.type = set(leafCol, value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agree, it shouldn't be in the model classes. So DecisionTreeClassifierParams doesn't help. Hm... per the discussion below, I agree it's extra weight to refactor the common elements into a superclass of two decision tree classifiers, but it might well be worth it. It looks like it would save a few hundred lines of duplicated code? that would mitigate the concern about the large change here. I'm lightly in favor of going that way. I wouldn't do it for 10 lines of code or something.

}

/** Returns the index of leaf given a input vector.
* The leave are indexed from zero by pre-order.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: might just render this as

/**
 * @return ...
 */

@zhengruifeng
Copy link
Contributor Author

@srowen I update the pr by adding two trait for ensamble models.
BTW, I rename two internal traits of impurty and add some comments, because the old TreeClassifierParams is quite misleading that take me half an hour to find the trait conflict when changing the hierarchy.

@SparkQA
Copy link

SparkQA commented Aug 16, 2019

Test build #109166 has finished for PR 25383 at commit 4a2fcfa.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 16, 2019

Test build #109175 has finished for PR 25383 at commit 092e115.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

renaming the traits cause mima tests failure, so I revert it.

@SparkQA
Copy link

SparkQA commented Aug 19, 2019

Test build #109313 has finished for PR 25383 at commit 068f7f9.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Aug 19, 2019

Test build #109316 has finished for PR 25383 at commit 068f7f9.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 19, 2019

Test build #109319 has finished for PR 25383 at commit 7638ee4.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 19, 2019

Test build #109321 has finished for PR 25383 at commit d69183b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

@srowen Since the leaf transform is irrelevant to the type of model (it is only related to the tree struture)

I update the suites to make the realated data and tree structures in common, which reduce more than 100 lines.

As to the py side, since current pr touch to many files, I tend to leave it alone for now, and impl it in a follow up pr. WDYT?

@SparkQA
Copy link

SparkQA commented Aug 20, 2019

Test build #109383 has finished for PR 25383 at commit 5c5a76e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, just some minor requests.

super.transform(dataset)
.withColumn($(leafCol), leafUDF(col($(featuresCol))))
} else {
super.transform(dataset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's trivial but I guess you could avoid calling this in two places ... call it once and either return it, or the result of it with a new column.


/**
* @return the index of leaf given a input vector.
* The leave are indexed from zero by pre-order.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: leave -> leaves. I might just write: @return the index of the leaf corresponding to the feature vector. Leaves are indexed in pre-order from 0. (Same for other occurrences)

val df = sc.parallelize(data, 1).toDF("leafId", "features")
model.transform(df).select("leafId", "predictedLeafId")
.collect().foreach {
case Row(leafId: Vector, predictedLeafId: Vector) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Total nit, but this should indent further, or just pull the previous line onto the line 2 lines above. Same below.

@zhengruifeng
Copy link
Contributor Author

zhengruifeng commented Aug 21, 2019

@srowen I make extra modifications, which move setLeafCol into a superclass.
I am looking into the hierarchy of both py and scala sides, and found that in many places some setter are defined in both Estimator and Model, however there are still some places which put common setters in the corresponding param trait. I prefer the latter method, and suggest we may move common setters into superclasses in the future.

@SparkQA
Copy link

SparkQA commented Aug 21, 2019

Test build #109452 has finished for PR 25383 at commit 8247395.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 21, 2019

Test build #109454 has finished for PR 25383 at commit 4eb97be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Aug 21, 2019

Yeah, I think setters shouldn't be exposed on model objects. Some changes in the past have fixed some of that.

@srowen
Copy link
Member

srowen commented Aug 22, 2019

Merged to master

@srowen srowen closed this in defb65e Aug 22, 2019
@zhengruifeng zhengruifeng deleted the tree_path branch August 23, 2019 01:51
@zhengruifeng
Copy link
Contributor Author

Thank you for reviewing! @srowen @mgaido91

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants

Comments