[SPARK-31893][ML] Add a generic ClassificationSummary trait #28710
Conversation
* Field in "predictions" which gives the probability or rawPrediction of each class as a vector.
*/
@Since("3.1.0")
def scoreCol: String
Can't name it probabilityCol because not all the classifiers have a probabilityCol. If there is no probability, rawPrediction will be used instead to calculate the binary classification metrics.
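A minimal sketch of that fallback idea (illustrative only; the trait is simplified and ExampleBinarySummary is a hypothetical name, not code from this PR):

// Illustrative only: the generic trait exposes a single scoreCol; a concrete summary
// decides whether that column carries probabilities or raw predictions.
trait ClassificationSummary extends Serializable {
  /** Field in "predictions" holding the per-class score vector. */
  def scoreCol: String
}

// Hypothetical concrete summary (not from this PR): prefer probability, fall back to rawPrediction.
class ExampleBinarySummary(probabilityCol: String, rawPredictionCol: String)
    extends ClassificationSummary {
  override def scoreCol: String =
    if (probabilityCol.nonEmpty) probabilityCol else rawPredictionCol
}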
For backwards-compatibility, keep probabilityCol in the subclass as an alias? or else that's a breaking change?
Would we ever want to expose both a probability and some raw score like logits? That might argue for some mixin for probabilityCol or something but that may get too complex.
Test build #123450 has finished for PR 28710 at commit
srowen left a comment:
This mostly makes
mllib/src/main/scala/org/apache/spark/ml/classification/ClassificationSummary.scala (outdated; resolved)
I put probabilityCol back in the subclass and this should fix the test failure I had yesterday :)
also cc @zhengruifeng
Test build #123505 has finished for PR 28710 at commit
Test build #123509 has finished for PR 28710 at commit
Test build #123543 has finished for PR 28710 at commit
Test build #123545 has finished for PR 28710 at commit
project/MimaExcludes.scala (outdated)
//[SPARK-31840] Add instance weight support in LogisticRegressionSummary
// weightCol in org.apache.spark.ml.classification.LogisticRegressionSummary is present only in current version
ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.ml.classification.LogisticRegressionSummary.weightCol")
ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.ml.classification.LogisticRegressionSummary.weightCol"),
I'm thinking through what, if anything, breaks here in user code. Are all the subclasses sealed traits? Could any user code have subclassed the old traits in a way that would no longer compile? I don't think so, but I'm just going off reading the diff.
Same general question on the Python side; I'm not as sure about the possibility of breaking something there
Before this change, the code structure was like this:
sealed trait LogisticRegressionSummary extends Serializable
sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary
sealed trait BinaryLogisticRegressionSummary extends LogisticRegressionSummary
sealed trait BinaryLogisticRegressionTrainingSummary extends BinaryLogisticRegressionSummary with LogisticRegressionTrainingSummary
So it seems OK to me; no user code can subclass the old traits.
Python seems OK to me too. I only added a new method. Applications written for old versions of Spark shouldn't be affected by this.
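For contrast, a rough sketch of the layering after this change (simplified from the trait names discussed in this PR; the real definitions carry more members):

// Simplified sketch of the new layering (not the exact source; training-summary details omitted).
trait ClassificationSummary extends Serializable                  // shared multiclass metrics, weightCol, ...
trait BinaryClassificationSummary extends ClassificationSummary   // adds scoreCol, roc, areaUnderROC, ...

sealed trait LogisticRegressionSummary extends ClassificationSummary
sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary
sealed trait BinaryLogisticRegressionSummary
  extends LogisticRegressionSummary with BinaryClassificationSummary
sealed trait BinaryLogisticRegressionTrainingSummary
  extends BinaryLogisticRegressionSummary with LogisticRegressionTrainingSummary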
project/MimaExcludes.scala (outdated)
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary.org$apache$spark$ml$classification$ClassificationSummary$_setter_$org$apache$spark$ml$classification$ClassificationSummary$$multiclassMetrics_="),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary.org$apache$spark$ml$classification$ClassificationSummary$$multiclassMetrics"),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary.weightCol"),
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.classification.LogisticRegressionSummary.asBinary"),
Though they are sealed traits, could a change of return type like this possibly break user code? Users would get a BinaryClassificationSummary instead of a BinaryLogisticRegressionSummary, for example.
@viirya Thanks for your comment. I think we are still OK, because summary is still an instance of BinaryLogisticRegressionSummary
val blorModel = lr.fit(smallBinaryDataset)
assert(blorModel.summary.isInstanceOf[BinaryLogisticRegressionTrainingSummary])
assert(blorModel.summary.asBinary.isInstanceOf[BinaryLogisticRegressionSummary])
assert(blorModel.binarySummary.isInstanceOf[BinaryLogisticRegressionTrainingSummary])
Similarly, in multinomial case, summary is still an instance of LogisticRegressionTrainingSummary
val mlorModel = lr.setFamily("multinomial").fit(smallMultinomialDataset)
assert(mlorModel.summary.isInstanceOf[LogisticRegressionTrainingSummary])
In the multinomial case, summary is still an instance of LogisticRegressionTrainingSummary. Is that expected?
I think a reasonable behavior should be:
for any classification model, a ClassificationSummary should be attached;
if and only if numClasses == 2, then this summary is also a BinaryClassificationSummary.
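A hedged sketch of what that contract would look like from a caller's point of view (accuracy and areaUnderROC are assumed to live on the respective traits, as they do on the existing LogisticRegression summaries; lr and trainingData are illustrative):

// Illustrative only: every classification model exposes a ClassificationSummary,
// and callers downcast to BinaryClassificationSummary only when numClasses == 2.
val model = lr.fit(trainingData)
val summary: ClassificationSummary = model.summary
summary match {
  case b: BinaryClassificationSummary => println(s"areaUnderROC = ${b.areaUnderROC}")
  case s => println(s"accuracy = ${s.accuracy}")
}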
import sparkSession.implicits._

// TODO: Allow the user to vary the number of bins using a setBins method in
// BinaryClassificationMetrics. For now the default is set to 100.
"For now the default is set to 100." is wrong; the default numBins in BinaryClassificationMetrics is always 1000.
This piece of code was used in LogisticRegression and I moved it here. I'm not really sure what the comment means, but it might mean "Allow the user to vary the number of bins using a setBins method in BinaryClassificationMetrics. For now set it to 100."
OK. I suggest changing it to the default value of 1000, so that the metrics are the same as the evaluator's and as BinaryClassificationMetrics with the default numBins.
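A sketch of the suggested change, assuming (per the comment above) that BinaryClassificationMetrics falls back to a default numBins of 1000 when none is passed; the val name is illustrative:

// Illustrative: drop the hard-coded 100 and rely on the default numBins (1000, per the review).
val metrics = new BinaryClassificationMetrics(
  predictions.select(col(scoreCol), col(labelCol).cast(DoubleType), lit(1.0)).rdd.map {
    case Row(score: Vector, label: Double, weight: Double) => (score(1), label, weight)
  })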
predictions.select(col(scoreCol), col(labelCol).cast(DoubleType),
  lit(1.0)).rdd.map {
    case Row(score: Vector, label: Double, weight: Double) => (score(1), label, weight)
  }, 100
ditto
* Field in "predictions" which gives the probability or rawPrediction of each class as a vector.
*/
@Since("3.1.0")
def scoreCol: String
Since probabilityCol can be added in subclasses, can scoreCol be removed as well?
We still need to keep scoreCol in this generic trait because it is needed in BinaryClassificationSummary:
new BinaryClassificationMetrics(
  predictions.select(col(scoreCol), col(labelCol).cast(DoubleType),
    lit(1.0)).rdd.map {
      case Row(score: Vector, label: Double, weight: Double) => (score(1), label, weight)
    }, 100)
Is it possible to add another trait, ProbabilisticClassifierSummary, to avoid this scoreCol? It does not seem to exist in the current implementation.
Why is there no rawPredictionCol?
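A hedged sketch of the mixin suggested above (ProbabilisticClassifierSummary is hypothetical; it is not part of this PR or of Spark):

// Hypothetical mixin, following the suggestion above; not part of this PR.
trait ProbabilisticClassifierSummary extends ClassificationSummary {
  /** Field in "predictions" which gives the probability of each class as a vector. */
  def probabilityCol: String
  // A binary summary mixing this in could then simply use probabilityCol as its scoreCol.
}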
Test build #123789 has finished for PR 28710 at commit
So I guess we can add a
@zhengruifeng Thanks for your comments. However, currently BinaryLogisticRegressionSummary uses probabilityCol for BinaryClassificationMetrics, and this probabilityCol is in LogisticRegressionSummary instead of BinaryLogisticRegressionSummary. In order not to break the existing code, I need to make several changes to the above traits.
To implement summary for other classifiers: for RandomForestClassifier (also for DecisionTreeClassifier and GBTClassifier)
@huaxingao Then in
Test build #123980 has finished for PR 28710 at commit
Test build #124021 has finished for PR 28710 at commit
retest this please
Test build #124030 has finished for PR 28710 at commit
retest this please
/**
 * Abstraction for binary classification results for a given model.
 */
trait BinaryClassificationSummary extends ClassificationSummary {
private[classification]?
| @Since("1.5.0") | ||
| @transient lazy val recallByThreshold: DataFrame = { | ||
| binaryMetrics.recallByThreshold().toDF("threshold", "recall") | ||
| override def scoreCol: String = if (probabilityCol.nonEmpty) { |
private?
If probabilityCol.isEmpty, use rawPredictionCol instead?
Since the existing binary computation did not use rawPredictionCol, I think we can leave it alone for now.
In the future, I think we can make it a little heuristic: if probabilityCol.nonEmpty, use probabilityCol; otherwise try rawPredictionCol.
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.classification.BinaryLogisticRegressionSummary.org$apache$spark$ml$classification$ClassificationSummary$$multiclassMetrics"),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.classification.BinaryLogisticRegressionSummary.weightCol")
)
Are there some MiMa exclusions that can be removed after the latest change?
MiMa exclusions have already been updated
Test build #124045 has finished for PR 28710 at commit
retest this please
Test build #124110 has finished for PR 28710 at commit
retest this please
Test build #124128 has finished for PR 28710 at commit
retest this please
Test build #124129 has finished for PR 28710 at commit
retest this please
Test build #124131 has finished for PR 28710 at commit
I moved All the rest of the MiMa problems are InheritedNewAbstractMethodProblem. I think those are OK.
retest this please
/** Number of training iterations. */
@Since("3.1.0")
def totalIterations: Int = {
  assert(objectiveHistory.length > 0, s"objectiveHistory length should be greater than 1.")
The assert message looks like it should say "objectiveHistory length should be greater than 0"?
And the string interpolation s"" is unnecessary.
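A sketch of the corrected assertion, per the two comments above (illustrative only):

// Message now matches the condition, and a plain string literal replaces the s"" interpolation.
assert(objectiveHistory.length > 0, "objectiveHistory length should be greater than 0.")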
override def scoreCol: String = if (probabilityCol.nonEmpty) {
  probabilityCol
} else {
  throw new SparkException(s"probabilityCol is required for BinaryLogisticRegressionSummary.")
s"" is not needed.
python/pyspark/ml/classification.py (outdated)
@property
@since("2.0.0")
@since("3.1.0")
Why change this since version? Like probabilityCol above, shouldn't it stay unchanged?
Test build #124197 has finished for PR 28710 at commit
Test build #124198 has finished for PR 28710 at commit
Test build #124206 has finished for PR 28710 at commit
.. note:: This ignores instance weights (setting all to 1.0) from
`LogisticRegression.weightCol`. This will change in later Spark
versions.
I saw you removed this note in the latest commit. So now these methods use weightCol?
Yes. I added weightCol in another PR, but forgot to update the Python comment.
//[SPARK-31893] Add a generic ClassificationSummary trait
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.classification.LogisticRegressionTrainingSummary.org$apache$spark$ml$classification$ClassificationSummary$_setter_$org$apache$spark$ml$classification$ClassificationSummary$$multiclassMetrics_="),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.classification.LogisticRegressionTrainingSummary.org$apache$spark$ml$classification$ClassificationSummary$$multiclassMetrics"),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.classification.LogisticRegressionTrainingSummary.weightCol"),
Not sure if it applies to all cases, but this looks like it is just because weightCol is now inherited from ClassificationSummary instead of LogisticRegressionSummary.
I added weightCol to LogisticRegressionSummary in another PR (I just realized the MiMa exclude for that PR is not needed any more since I moved weightCol to ClassificationSummary).
3.0: no weightCol.
After my changes: ClassificationSummary has weightCol; LogisticRegressionSummary, LogisticRegressionTrainingSummary, BinaryLogisticRegressionSummary, and BinaryLogisticRegressionTrainingSummary all inherit weightCol, so there are four InheritedNewAbstractMethodProblem entries for weightCol.
Test build #124227 has finished for PR 28710 at commit
viirya left a comment:
Looks okay. The InheritedNewAbstractMethodProblem MiMa problems appear to be because some methods were moved to the traits created in this change.
@srowen Any more comments?
Just catching up here: so all the MiMa excludes don't actually represent public API changes, right? They're all sealed or private?
Right. They are all sealed or private. No public API changes. @srowen
Merged to master
Thank you all!
What changes were proposed in this pull request?
Add a generic ClassificationSummary trait
Why are the changes needed?
Add a generic ClassificationSummary trait so all the classification models can use it to implement summary.
Currently in classification, we only have summary implemented in LogisticRegression. There are requests to implement summary for LinearSVCModel in https://issues.apache.org/jira/browse/SPARK-20249 and to implement summary for RandomForestClassificationModel in https://issues.apache.org/jira/browse/SPARK-23631. If we add a generic ClassificationSummary trait and put all the common code there, we can easily add summary to LinearSVCModel and RandomForestClassificationModel, and also add summary to all the other classification models. We can use the same approach to add a generic RegressionSummary trait to the regression package and implement summary for all the regression models.
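For illustration, the end state this enables could look roughly like the following (a hedged sketch only; RandomForestClassificationModel had no summary before this work, and the member names here are assumptions, not part of this PR):

// Hypothetical usage once other classifiers gain a summary built on ClassificationSummary.
val rf = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features")
val rfModel = rf.fit(trainingData)   // trainingData: DataFrame (assumed to exist)
val summary = rfModel.summary        // assumed to return a ClassificationSummary
println(summary.accuracy)            // generic multiclass metric from the shared trait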
Does this PR introduce any user-facing change?
How was this patch tested?
Existing tests.