[SPARK-18596][ML] add checking and caching to bisecting kmeans #16020
Conversation
Test build #69198 has finished for PR 16020 at commit
| @Since("2.0.0") | ||
| override def fit(dataset: Dataset[_]): BisectingKMeansModel = { | ||
| val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE |
By the way, I've been meaning to log a ticket for this issue, but have been tied up.
This will actually never work. dataset.rdd will always have storage level NONE. To see this:
scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel
scala> val df = spark.range(10).toDF("num")
df: org.apache.spark.sql.DataFrame = [num: bigint]
scala> df.storageLevel == StorageLevel.NONE
res0: Boolean = true
scala> df.persist
res1: df.type = [num: bigint]
scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
res2: Boolean = true
scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
res3: Boolean = false
scala> df.rdd.getStorageLevel == StorageLevel.NONE
res4: Boolean = true
So in fact all the algorithms that check the storage level via dataset.rdd are double-caching the data when the input DataFrame is already cached, because the derived RDD never appears to be cached.
So we should migrate all of these checks to use dataset.storageLevel, which was added in #13780.
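For reference, a minimal sketch of what the corrected check could look like inside fit (illustrative only, reusing the handlePersistence name from the diff above):

import org.apache.spark.storage.StorageLevel

// Sketch: Dataset.storageLevel reflects caching applied to the DataFrame itself,
// whereas dataset.rdd.getStorageLevel is evaluated on a newly derived RDD and is
// always StorageLevel.NONE, even when the DataFrame is cached.
val handlePersistence = dataset.storageLevel == StorageLevel.NONE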
Thanks for checking on this. I feel like we should have a unit test for this, but probably not here.
model.setSummary(Some(summary))
if (handlePersistence) instances.unpersist()
instr.logSuccess(model)
if (handlePersistence) {
Prefer to keep this form according to the style guide.
val summary = new BisectingKMeansSummary(
  model.transform(dataset), $(predictionCol), $(featuresCol), $(k))
model.setSummary(Some(summary))
if (handlePersistence) rdd.unpersist()
Prefer
if (handlePersistence) {
rdd.unpersist()
}
Test build #69247 has finished for PR 16020 at commit
if (handlePersistence) {
  instances.unpersist()
}
instr.logSuccess(model)
The handlePersistence check in KMeans at L309 should also be updated to use dataset.storageLevel. Since we're touching KMeans here anyway, we may as well do it now.
Thanks @MLnick. Sure, we can change KMeans as well. I'll send an update.
Yes, unit tests would be good to add. Tests may require using event listeners to check the caching of the intermediate dataset with/without cached initial data. Or at least that is the only way I thought of so far...
…On Fri, 2 Dec 2016 at 20:01, Yuhao Yang ***@***.***> wrote:
Sure we can change KMeans as well. And perhaps we can also add some unit tests. I'll send update.
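For illustration, here is one possible shape of the event-listener idea mentioned above. This is a sketch only; RDDCacheListener is a hypothetical name and this is not the test that was added in this PR.

import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}

// Hypothetical listener sketch: record every RDD block that gets cached, so a test
// can assert that fit does not persist an intermediate RDD when the input DataFrame
// is already cached.
class RDDCacheListener extends SparkListener {
  val cachedRddBlocks = mutable.Set.empty[String]

  override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {
    val info = event.blockUpdatedInfo
    if (info.blockId.isRDD && info.storageLevel.isValid) {
      cachedRddBlocks += info.blockId.name
    }
  }
}

// Usage sketch: register it with spark.sparkContext.addSparkListener(new RDDCacheListener()),
// run fit with and without a cached input DataFrame, then compare cachedRddBlocks.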
I just added a straightforward unit test to confirm we're using the correct way to check the cache level. Checking how to register an event listener...
Test build #69582 has finished for PR 16020 at commit
@MLnick Do you think we still need the event listener unit test?
Closing this as it's better resolved in https://issues.apache.org/jira/browse/SPARK-18608.
What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-18596
This is a follow up for https://issues.apache.org/jira/browse/SPARK-18356.
Check whether the DataFrame passed to BisectingKMeans is cached; if not, cache the converted RDD to ensure good performance for BisectingKMeans.
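As a rough illustration of the intended pattern (a sketch under the assumption that the storage-level check uses dataset.storageLevel as discussed in the review comments; fitSketch and its body are illustrative, not the actual fit implementation):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.storage.StorageLevel

// Sketch of the caching pattern: persist the converted RDD only when the input
// DataFrame is not already cached, and release it once training is done.
def fitSketch(dataset: Dataset[_], featuresCol: String): Unit = {
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  val rdd = dataset.select(featuresCol).rdd.map {
    case Row(point: Vector) => OldVectors.fromML(point)
  }
  if (handlePersistence) {
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
  }
  try {
    // run the underlying spark.mllib BisectingKMeans on `rdd` and build the model here
  } finally {
    if (handlePersistence) {
      rdd.unpersist()
    }
  }
}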
How was this patch tested?
Existing unit tests.