Skip to content

Conversation

@huaxingao
Copy link
Contributor

What changes were proposed in this pull request?

Add weight support in ClusteringEvaluator

Why are the changes needed?

Currently, BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator support instance weight, but ClusteringEvaluator doesn't, so we will add instance weight support in ClusteringEvaluator.

Does this PR introduce any user-facing change?

Yes.
ClusteringEvaluator.setWeightCol

How was this patch tested?

add new unit test

@SparkQA
Copy link

SparkQA commented May 16, 2020

Test build #122734 has finished for PR 28553 at commit ac5ea1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@zhengruifeng zhengruifeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@huaxingao
Copy link
Contributor Author

cc @srowen

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have missed it but does this check somewhere that the weights are positive?

@huaxingao
Copy link
Contributor Author

Added the check to make sure weights are all positive.
We probably need to do this for everything that has instance weight support. I will open a ticket to fix all of them?

@SparkQA
Copy link

SparkQA commented May 21, 2020

Test build #122945 has finished for PR 28553 at commit 06a04a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented May 22, 2020

Yeah there are several points in the code that check weights with a require. I don't know if they can be refactored easily. Your approach is fine. Other places check it farther downstream inside some other operation but it doesn't matter much.

@huaxingao
Copy link
Contributor Author

I mean in the following algorithms that have instance weight support, we probably also need to make sure the weights are positive.

  • DecisionTreeClassifier/Regressor
  • GBTClassifier/Regressor
  • LinearSVC
  • LogisticRegression
  • RandomForestClassifier/Regressor
  • LinearRegression
  • GeneralizedLinearRegression
  • KMean
  • BisectiongKMean
  • GaussianMixture
  • BinaryClassificationEvaluation
  • MulticlassClassificationEvaluation
  • RegressionEvaluation

I will take a look and fix them all. Not in this PR, though.

) =>
BLAS.axpy(1.0, features, featureSum)
(featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
require (weight >= 0.0, "illegal weight value: " + weight + " weight must be >= 0.0")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to do the check here so it doesn't require an extra pass to get all the weights.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good; it's consistent with your other change. Really minor: use string interpolation?

@SparkQA
Copy link

SparkQA commented May 23, 2020

Test build #123031 has finished for PR 28553 at commit 8494352.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

) =>
BLAS.axpy(1.0, features, featureSum)
(featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
require (weight >= 0.0, "illegal weight value: " + weight + " weight must be >= 0.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good; it's consistent with your other change. Really minor: use string interpolation?

(normalizedFeaturesSum, numOfPoints + 1)
case ((normalizedFeaturesSum: DenseVector, weightSum: Double),
(normalizedFeatures, weight)) =>
require (weight >= 0.0, "illegal weight value: " + weight + " weight must be >= 0.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here and nit: remove space after require

@huaxingao
Copy link
Contributor Author

I didn't resolve conflicts correctly. I will fix the problem.

@SparkQA
Copy link

SparkQA commented May 25, 2020

Test build #123070 has finished for PR 28553 at commit ebd8848.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 25, 2020

Test build #123069 has finished for PR 28553 at commit 4f78591.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 25, 2020

Test build #123073 has finished for PR 28553 at commit cab88cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ClusterStats(featureSum: Vector, squaredNormSum: Double, weightSum: Double)

@srowen srowen closed this in d400777 May 25, 2020
@srowen
Copy link
Member

srowen commented May 25, 2020

Merged to master. You may want to further change the nonnegativity check in your other PR to use the new method you introduced there.

@huaxingao
Copy link
Contributor Author

Thanks! @srowen @zhengruifeng
I have updated the nonnegativity check in the other PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants