Skip to content

Conversation

@hhbyyh
Copy link
Contributor

@hhbyyh hhbyyh commented Nov 27, 2016

What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-18596
This is a follow up for https://issues.apache.org/jira/browse/SPARK-18356.

Check if the DataFrame sent to BisectingKMeans is cached, if not, we need to cache the converted RDD to ensure performance for BisectingKMeans.

How was this patch tested?

existing unit tests.

@SparkQA
Copy link

SparkQA commented Nov 27, 2016

Test build #69198 has finished for PR 16020 at commit a4c88ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@Since("2.0.0")
override def fit(dataset: Dataset[_]): BisectingKMeansModel = {
val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I've been meaning to log a ticket for this issue, but have been tied up.

This will actually never work. dataset.rdd will always have storage level NONE. To see this:

scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel

scala> val df = spark.range(10).toDF("num")
df: org.apache.spark.sql.DataFrame = [num: bigint]

scala> df.storageLevel == StorageLevel.NONE
res0: Boolean = true

scala> df.persist
res1: df.type = [num: bigint]

scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
res2: Boolean = true

scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
res3: Boolean = false

scala> df.rdd.getStorageLevel == StorageLevel.NONE
res4: Boolean = true

So in fact all the algorithms that are checking for storage level using dataset.rdd are actually double-caching the data if the input DataFrame is actually cached, because the RDD will not appear to be cached.

So we should migrate all the checks to use dataset.storageLevel which was added in #13780

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking on this. I feel like we should have a unit test for this, but probably not here.

model.setSummary(Some(summary))
if (handlePersistence) instances.unpersist()
instr.logSuccess(model)
if (handlePersistence) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer to keep this form according to style guide.

val summary = new BisectingKMeansSummary(
model.transform(dataset), $(predictionCol), $(featuresCol), $(k))
model.setSummary(Some(summary))
if (handlePersistence) rdd.unpersist()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer

if (handlePersistence) {
  rdd.unpersist()
}

@SparkQA
Copy link

SparkQA commented Nov 28, 2016

Test build #69247 has finished for PR 16020 at commit 0105220.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (handlePersistence) {
instances.unpersist()
}
instr.logSuccess(model)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The handlePersistence check in KMeans at L309 should also be updated to use dataset.storageLevel. Since we're touching KMeans here anyway we may as well do it now.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Dec 2, 2016

Thanks @MLnick . Sure we can change KMeans as well. I'll send update.

@MLnick
Copy link
Contributor

MLnick commented Dec 2, 2016 via email

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Dec 2, 2016

I just added a straight forward unit test to confirm we're using the correct way to check the cache level. Checking how to register a event listener...

@SparkQA
Copy link

SparkQA commented Dec 2, 2016

Test build #69582 has finished for PR 16020 at commit 678cd43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Dec 8, 2016

@MLnick Do you think we still need the event listener unit test?

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Feb 21, 2017

Close this as it's better resolved in https://issues.apache.org/jira/browse/SPARK-18608.
Thanks for the comments and discussion.

@hhbyyh hhbyyh closed this Feb 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants