[SPARK-13677][ML] Implement Tree-Based Feature Transformation for ML #25383
zhengruifeng wants to merge 19 commits into apache:master
Test build #108794 has finished for PR 25383 at commit
srowen left a comment
Hm, it does add some extra complexity to be sure. It sounds kind of useful though.
  if ($(predictionCol).nonEmpty) {
-   val predictUDF = udf { (features: Vector) => predict(features) }
+   val predictUDF = udf { vector: Vector => predict(vector) }
Total nit, but do you need to change these?
This removes the unnecessary brackets; I also renamed the variable to stay in line with other places. I am neutral on reverting these.
  def setLeafCol(value: String): this.type = set(leafCol, value)

  @Since("3.0.0")
  def predictLeaf(features: Vector): Double = predictLeafImpl(features)
Can this go in a superclass? Is there any need to separate predictLeaf from predictLeafImpl?
Likewise, can setLeafCol go in a superclass to avoid repeating it?
@srowen I prefer not to move setLeafCol into the superclass, since keeping it here follows mllib's convention, such as setVarianceCol in DecisionTreeRegressionModel.
@WeichenXu123 @mgaido91 Could you please help review this too? This feature involves some complexity, but should be useful.
Test build #108904 has finished for PR 25383 at commit
Test build #108907 has finished for PR 25383 at commit
I was going to add it to the Python side, and found I have to add a new param. I will leave the Python side for now, since it needs to touch many Python files to pass through, like moving [...] Due to the confusing class hierarchy on the Python side, there are many conflicts between the Scala side and the Python side. It has been more than six years since Spark started supporting mllib on the Python side; however, for a prediction model, we still cannot set the input/output columns. I'm wondering if we can begin to reorganize the mllib hierarchy on the Python side? @srowen
  /** @group setParam */
  @Since("3.0.0")
  def setLeafCol(value: String): this.type = set(leafCol, value)
I think we can do some refactoring here in order to dedup this. Can we add it to a trait?
This seems to follow mllib's convention of not adding setters to the xxxParams-like traits, like setVarianceCol in DecisionTreeRegressionModel.
Yes, agreed, it shouldn't be in the model classes. So DecisionTreeClassifierParams doesn't help. Hm... per the discussion below, I agree it's extra weight to refactor the common elements into a superclass of two decision tree classifiers, but it might well be worth it. It looks like it would save a few hundred lines of duplicated code? That would mitigate the concern about the large change here. I'm lightly in favor of going that way. I wouldn't do it for 10 lines of code or something.
  private[ml] trait GBTClassifierParams extends GBTParams with HasVarianceImpurity
    with ProbabilisticClassifierParams {

    override protected def validateAndTransformSchema(
In general, many code parts in this PR are copy-pasted... can we dedup the code?
Good point, I will look into it.
@mgaido91 I tend to keep the current way, because the superclasses are different:
1. The super.validateAndTransformSchema(schema, fitting, featuresDataType) called in GBTClassifierParams & RandomForestClassifierParams comes from ProbabilisticClassifierParams, which checks the columns probabilityCol, rawPredictionCol, featuresCol, labelCol, weightCol, predictionCol.
2. The super method called in RandomForestRegressorParams & GBTRegressorParams comes from PredictorParams, which only checks the columns featuresCol, labelCol, weightCol, predictionCol.
We could add another two traits, for classification and regression respectively, like TreeEnsembleClassifierParams & TreeEnsembleRegressorParams.
However, I think this may not be worthwhile, since there would be only two subclasses for each, and it would make the hierarchy more complex.
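For readers following the trade-off, here is a rough sketch of the two-trait option mentioned above. Everything in it is a simplified stand-in written for illustration: schemas are modeled as `Set[String]`, `HasLeafCol` is a hypothetical helper, and the `*Like` classes are not Spark classes. It only shows why each new trait has to sit on a different parent (the `super` call resolves to a different check in each case).

```scala
object TreeEnsembleParamsSketch {
  // Assumption: a schema is modeled as a plain set of column names.
  type Schema = Set[String]

  // Simplified stand-in: only appends the prediction column.
  trait PredictorParams {
    def validateAndTransformSchema(schema: Schema): Schema = schema + "prediction"
  }

  // Simplified stand-in: additionally appends the probabilistic output columns.
  trait ProbabilisticClassifierParams extends PredictorParams {
    override def validateAndTransformSchema(schema: Schema): Schema =
      super.validateAndTransformSchema(schema) ++ Set("rawPrediction", "probability")
  }

  // Hypothetical helper, not from the PR.
  trait HasLeafCol { def leafCol: String }

  // Regression side: super resolves to PredictorParams.
  trait TreeEnsembleRegressorParams extends PredictorParams with HasLeafCol {
    override def validateAndTransformSchema(schema: Schema): Schema = {
      val out = super.validateAndTransformSchema(schema)
      if (leafCol.nonEmpty) out + leafCol else out
    }
  }

  // Classification side: super resolves to ProbabilisticClassifierParams.
  trait TreeEnsembleClassifierParams extends ProbabilisticClassifierParams with HasLeafCol {
    override def validateAndTransformSchema(schema: Schema): Schema = {
      val out = super.validateAndTransformSchema(schema)
      if (leafCol.nonEmpty) out + leafCol else out
    }
  }

  // Illustrative concrete classes; each family would have two of these (GBT and random forest).
  final class GBTRegressorLike(val leafCol: String) extends TreeEnsembleRegressorParams
  final class RandomForestClassifierLike(val leafCol: String) extends TreeEnsembleClassifierParams
}
```

In this sketch the leaf-column logic is written twice instead of four times, at the cost of two more traits in the hierarchy, which is the trade-off being weighed in the comment.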
  }

  /** Returns the index of leaf given a input vector.
   * The leave are indexed from zero by pre-order.
Nit: might just render this as
/**
* @return ...
*/
@srowen I updated the PR by adding two traits for ensemble models.
Test build #109166 has finished for PR 25383 at commit
Test build #109175 has finished for PR 25383 at commit
Renaming the traits caused MiMa test failures, so I reverted it.
Test build #109313 has finished for PR 25383 at commit
retest this please
Test build #109316 has finished for PR 25383 at commit
068f7f9 to 7638ee4
Test build #109319 has finished for PR 25383 at commit
Test build #109321 has finished for PR 25383 at commit
@srowen Since the leaf transform is independent of the type of model (it only depends on the tree structure), I updated the suites to share the related data and tree structures, which removes more than 100 lines. As for the Python side, since the current PR touches too many files, I tend to leave it alone for now and implement it in a follow-up PR. WDYT?
d69183b to 5c5a76e
Test build #109383 has finished for PR 25383 at commit
srowen left a comment
Looking good, just some minor requests.
mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
    super.transform(dataset)
      .withColumn($(leafCol), leafUDF(col($(featuresCol))))
  } else {
    super.transform(dataset)
It's trivial but I guess you could avoid calling this in two places ... call it once and either return it, or the result of it with a new column.
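A minimal sketch of that suggestion, with hypothetical stand-ins so it runs without Spark: `Frame` plays the role of a DataFrame (just a list of column names) and `baseTransform` plays the role of `super.transform(dataset)`. The point is only the shape: compute the base result once, then either return it or return it with the extra column.

```scala
// Hypothetical stand-in for a DataFrame: only tracks column names.
final case class Frame(columns: Vector[String]) {
  def withColumn(name: String): Frame = Frame(columns :+ name)
}

object TransformOnce {
  // Stand-in for super.transform(dataset): appends the prediction column.
  def baseTransform(input: Frame): Frame = input.withColumn("prediction")

  // Call the base transform once, then optionally append the leaf column.
  def transform(input: Frame, leafCol: String): Frame = {
    val transformed = baseTransform(input)
    if (leafCol.nonEmpty) transformed.withColumn(leafCol) else transformed
  }
}
```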
  /**
   * @return the index of leaf given a input vector.
   * The leave are indexed from zero by pre-order.
Nit: leave -> leaves. I might just write: @return the index of the leaf corresponding to the feature vector. Leaves are indexed in pre-order from 0. (Same for other occurrences)
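To make the "pre-order from 0" wording concrete, here is a small sketch of leaf indexing on a binary tree. The `Node`, `Leaf`, and `Internal` types are hypothetical and much simpler than Spark's node classes, and the result is an `Int` here for readability, whereas the method in the diff returns a `Double`. A leaf's index equals the number of leaves visited before it in a pre-order walk, so descending to the right skips every leaf in the left subtree.

```scala
// Hypothetical minimal tree types; these are NOT Spark's node classes.
sealed trait Node
final case class Leaf(prediction: Double) extends Node
final case class Internal(featureIndex: Int, threshold: Double, left: Node, right: Node) extends Node

object LeafIndexing {
  /** Number of leaves in the subtree rooted at `node`. */
  def numLeaves(node: Node): Int = node match {
    case Leaf(_) => 1
    case Internal(_, _, l, r) => numLeaves(l) + numLeaves(r)
  }

  /** Index of the leaf reached by `features`; leaves are numbered in pre-order from 0. */
  def predictLeaf(root: Node, features: Array[Double]): Int = {
    @scala.annotation.tailrec
    def go(node: Node, offset: Int): Int = node match {
      case Leaf(_) => offset
      case Internal(f, t, l, r) =>
        if (features(f) <= t) go(l, offset)   // left subtree keeps the current offset
        else go(r, offset + numLeaves(l))      // right subtree skips all left leaves
    }
    go(root, 0)
  }
}
```

Recounting the left subtree on every right turn is wasteful for deep trees; a real implementation would more likely precompute leaf counts or indices once per tree.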
  val df = sc.parallelize(data, 1).toDF("leafId", "features")
  model.transform(df).select("leafId", "predictedLeafId")
  .collect().foreach {
  case Row(leafId: Vector, predictedLeafId: Vector) =>
Total nit, but this should indent further, or just pull the previous line onto the line 2 lines above. Same below.
@srowen I made extra modifications, which move
Test build #109452 has finished for PR 25383 at commit
Test build #109454 has finished for PR 25383 at commit
Yeah, I think setters shouldn't be exposed on model objects. Some changes in the past have fixed some of that.
Merged to master
What changes were proposed in this pull request?
Tree-based feature transformation is widely used and already implemented in many well-known libraries, such as sklearn, xgboost, lightgbm, and catboost, but it is still missing in ML.
The previous discussions and design doc can be found in SPARK-13677, which is the only remaining subtask of the 'GBT improvement umbrella', SPARK-14047.
This PR adds tree-based feature transformation.
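As background on what the transformed features are typically used for: in the libraries listed above, each tree maps an input to a leaf index, and those indices are commonly one-hot encoded and fed to a downstream linear model. The sketch below shows only that encoding step in plain Scala; `LeafOneHot` and its parameters are hypothetical names for illustration and are not part of this PR.

```scala
object LeafOneHot {
  /**
   * One-hot encode per-tree leaf indices into a single dense vector.
   * leafIndices(i) is the leaf reached in tree i; leavesPerTree(i) is that tree's leaf count.
   */
  def encode(leafIndices: Seq[Int], leavesPerTree: Seq[Int]): Array[Double] = {
    require(leafIndices.length == leavesPerTree.length, "one leaf index per tree")
    // offsets(i) is where tree i's block of slots starts in the output vector.
    val offsets = leavesPerTree.scanLeft(0)(_ + _)
    val out = Array.fill(offsets.last)(0.0)
    leafIndices.zipWithIndex.foreach { case (leaf, tree) =>
      require(leaf >= 0 && leaf < leavesPerTree(tree), "leaf index out of range")
      out(offsets(tree) + leaf) = 1.0
    }
    out
  }
}
```

For example, with two trees having 2 and 3 leaves, leaf indices `Seq(1, 0)` set slot 1 in the first block and slot 0 in the second block of a length-5 vector.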
How was this patch tested?
existing and added suites