-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case #16355
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 6 commits
f77904d
55ab179
75192a7
14c7ce3
a8cf2f0
138ab34
ba00ad0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -29,9 +29,12 @@ class BisectingKMeansSuite | |
| final val k = 5 | ||
| @transient var dataset: Dataset[_] = _ | ||
|
|
||
| @transient var sparseDataset: Dataset[_] = _ | ||
|
|
||
| override def beforeAll(): Unit = { | ||
| super.beforeAll() | ||
| dataset = KMeansSuite.generateKMeansData(spark, 50, 3, k) | ||
| sparseDataset = KMeansSuite.generateSparseData(spark, 10, 1000, 42) | ||
| } | ||
|
|
||
| test("default parameters") { | ||
|
|
@@ -51,6 +54,18 @@ class BisectingKMeansSuite | |
| assert(copiedModel.hasSummary) | ||
| } | ||
|
|
||
| test("SPARK-16473: Verify Bisecting K-Means does not fail in edge case where" + | ||
| "one cluster is empty after split") { | ||
| val bkm = new BisectingKMeans().setK(k).setMinDivisibleClusterSize(4).setMaxIter(4) | ||
|
|
||
| // Verify fit does not fail on very sparse data | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's not clear to me that this unit test actually tests the issue fixed in this PR. Is there a good way to see why it would? If not, then it would be great to write a tiny dataset by hand which would trigger the failure.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I added this check to verify: |
||
| val model = bkm.fit(sparseDataset) | ||
| val result = model.transform(sparseDataset) | ||
| val numClusters = result.select("prediction").distinct().collect().length | ||
| // Verify we hit the edge case | ||
| assert(numClusters < k && numClusters > 1) | ||
| } | ||
|
|
||
| test("setter/getter") { | ||
| val bkm = new BisectingKMeans() | ||
| .setK(9) | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please set the seed too since this test seems at risk of flakiness
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, added seed