[SPARK-18596][ML] add checking and caching to bisecting kmeans #16020
Conversation
Test build #69198 has finished for PR 16020 at commit
| @Since("2.0.0") | ||
| override def fit(dataset: Dataset[_]): BisectingKMeansModel = { | ||
| val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE |
By the way, I've been meaning to log a ticket for this issue, but have been tied up.
This will actually never work. dataset.rdd will always have storage level NONE. To see this:
scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel
scala> val df = spark.range(10).toDF("num")
df: org.apache.spark.sql.DataFrame = [num: bigint]
scala> df.storageLevel == StorageLevel.NONE
res0: Boolean = true
scala> df.persist
res1: df.type = [num: bigint]
scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
res2: Boolean = true
scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
res3: Boolean = false
scala> df.rdd.getStorageLevel == StorageLevel.NONE
res4: Boolean = true
So in fact all the algorithms that check the storage level via dataset.rdd are double-caching the data when the input DataFrame is already cached, because the derived RDD never appears to be cached.
So we should migrate all of these checks to use dataset.storageLevel, which was added in #13780.
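For reference, a minimal sketch of what the corrected check could look like inside fit (illustrative only, reusing the handlePersistence name from the diff above):

import org.apache.spark.storage.StorageLevel

// Sketch: Dataset.storageLevel reflects caching applied to the DataFrame itself,
// whereas dataset.rdd.getStorageLevel is evaluated on a newly derived RDD and is
// always StorageLevel.NONE, even when the DataFrame is cached.
val handlePersistence = dataset.storageLevel == StorageLevel.NONE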
Thanks for checking on this. I feel like we should have a unit test for this, but probably not here.
model.setSummary(Some(summary))
if (handlePersistence) instances.unpersist()
instr.logSuccess(model)
if (handlePersistence) {
Prefer to keep this form according to the style guide.
val summary = new BisectingKMeansSummary(
  model.transform(dataset), $(predictionCol), $(featuresCol), $(k))
model.setSummary(Some(summary))
if (handlePersistence) rdd.unpersist()
Prefer
if (handlePersistence) {
rdd.unpersist()
}
Test build #69247 has finished for PR 16020 at commit
if (handlePersistence) {
  instances.unpersist()
}
instr.logSuccess(model)
The handlePersistence check in KMeans at L309 should also be updated to use dataset.storageLevel. Since we're touching KMeans here anyway, we may as well do it now.
Thanks @MLnick. Sure, we can change KMeans as well. I'll send an update.
Yes, unit tests would be good to add. Tests may require using event listeners to check the caching of the intermediate dataset with/without cached initial data. Or at least that is the only way I thought of so far...
…On Fri, 2 Dec 2016 at 20:01, Yuhao Yang ***@***.***> wrote:
Sure we can change KMeans as well. And perhaps we can also add some unit tests. I'll send update.
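For illustration, here is one possible shape of the event-listener idea mentioned above. This is a sketch only; RDDCacheListener is a hypothetical name and this is not the test that was added in this PR.

import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}

// Hypothetical listener sketch: record every RDD block that gets cached, so a test
// can assert that fit does not persist an intermediate RDD when the input DataFrame
// is already cached.
class RDDCacheListener extends SparkListener {
  val cachedRddBlocks = mutable.Set.empty[String]

  override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {
    val info = event.blockUpdatedInfo
    if (info.blockId.isRDD && info.storageLevel.isValid) {
      cachedRddBlocks += info.blockId.name
    }
  }
}

// Usage sketch: register it with spark.sparkContext.addSparkListener(new RDDCacheListener()),
// run fit with and without a cached input DataFrame, then compare cachedRddBlocks.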
I just added a straightforward unit test to confirm we're using the correct way to check the cache level. Checking how to register an event listener...
Test build #69582 has finished for PR 16020 at commit
@MLnick Do you think we still need the event listener unit test?
Closing this as it's better resolved in https://issues.apache.org/jira/browse/SPARK-18608.
What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-18596
This is a follow up for https://issues.apache.org/jira/browse/SPARK-18356.
Check whether the DataFrame passed to BisectingKMeans is cached; if not, cache the converted RDD to ensure good performance for BisectingKMeans.
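As a rough illustration of the intended pattern (a sketch under the assumption that the storage-level check uses dataset.storageLevel as discussed in the review comments; fitSketch and its body are illustrative, not the actual fit implementation):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.storage.StorageLevel

// Sketch of the caching pattern: persist the converted RDD only when the input
// DataFrame is not already cached, and release it once training is done.
def fitSketch(dataset: Dataset[_], featuresCol: String): Unit = {
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  val rdd = dataset.select(featuresCol).rdd.map {
    case Row(point: Vector) => OldVectors.fromML(point)
  }
  if (handlePersistence) {
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
  }
  try {
    // run the underlying spark.mllib BisectingKMeans on `rdd` and build the model here
  } finally {
    if (handlePersistence) {
      rdd.unpersist()
    }
  }
}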
How was this patch tested?
Existing unit tests.