[SPARK-16063][SQL] Add storageLevel to Dataset #13780
Conversation
```scala
 *
 * @group basic
 * @since 2.0.0
 */
```
just call it storageLevel?
I thought it would be more familiar to RDD users, but I have no strong feeling about it. Will update.
Can you reset the non-relevant changes?
Test build #60836 has finished for PR 13780 at commit
Test build #60843 has finished for PR 13780 at commit
```scala
ds1.unpersist()
assert(ds1.storageLevel() == StorageLevel.NONE)
// non-default storage level
ds1.persist(StorageLevel.MEMORY_ONLY_2)
```
When writing black-box tests, I might just try all the levels in the test case. We could even include a customized StorageLevel that is different from the predefined ones.
```scala
import org.apache.spark.storage.StorageLevel._
Seq(NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER,
  MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER,
  MEMORY_AND_DISK_SER_2, OFF_HEAP).foreach { level =>
  ds1.persist(level)
  assert(ds1.storageLevel() == level)
  ds1.unpersist()
  assert(ds1.storageLevel() == StorageLevel.NONE)
}
```
I'm kinda neutral on this - it doesn't really seem necessary to me, since pretty much by definition if one storage level works then they all do.
I knew. : ) That is white box testing. Normally, writing test cases should not be done by the same person who wrote the code.
Overall LGTM, just one minor comment about the test case. BTW, maybe you can change the PR title after updating the function name? Thanks!
Can you make sure the Python DataFrame has this method too?
```diff
 @since(1.3)
 def cache(self):
-    """ Persists with the default storage level (C{MEMORY_ONLY}).
+    """Persists the :class:`DataFrame` with the default storage level (C{MEMORY_AND_DISK}).
```
@rxin I updated the default in the doc, as it was actually incorrect previously.
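For context, a rough sketch of how the corrected docstring sits in `pyspark/sql/dataframe.py`; the method body is reconstructed from my memory of the PySpark source rather than taken from this diff, so treat it as illustrative only:

```python
@since(1.3)
def cache(self):
    """Persists the :class:`DataFrame` with the default storage level (C{MEMORY_AND_DISK})."""
    self.is_cached = True
    self._jdf.cache()  # delegate to the JVM DataFrame, which applies the Scala-side default level
    return self
```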
Just some minor comments. This looks pretty good.
```python
after the first time it is computed. This can only be used to assign
a new storage level if the RDD does not have a storage level set yet.
If no storage level is specified defaults to (C{MEMORY_ONLY}).
def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK):
```
@rxin I updated the default here in persist, to match cache.
But actually, it's still not quite correct: the default storage levels for Python are all serialized, so the MEMORY-based ones don't match the Scala side (which are deserialized). This was done for RDDs but doesn't quite work for DataFrames, since a DataFrame on the Scala side is cached deserialized by default.
So here df.cache() results in MEMORY_AND_DISK (deserialized) while df.persist() results in MEMORY_AND_DISK (serialized). Ideally we don't want to encourage users to accidentally use the serialized forms for memory-based DataFrame caching (since it is less efficient, as I understand it?). Let me know what you think.
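To make the discrepancy concrete, a hypothetical pyspark-shell session; the storage-level notes in the comments reflect my reading of the Spark 2.0-era defaults, not verified output:

```python
df = spark.range(10)

df.cache()      # delegates to the JVM: MEMORY_AND_DISK, deserialized on the Scala side
df.unpersist()

df.persist()    # uses the Python-side default StorageLevel.MEMORY_AND_DISK, which is
                # defined with deserialized=False, i.e. the serialized variant
```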
One option is to set the default storage level here to None instead, and if it's not set call `_jdf.persist()` to ensure behaviour is the same as cache.
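A minimal sketch of that option, assuming the usual PySpark internals (`_jdf`, `_sc._getJavaStorageLevel`); this is not the code in the PR, just an illustration of the suggestion:

```python
def persist(self, storageLevel=None):
    """Persist with the given level, or with the JVM default level if none is given."""
    self.is_cached = True
    if storageLevel is None:
        # No level specified: defer to the JVM default so persist() matches cache()
        self._jdf.persist()
    else:
        javaStorageLevel = self._sc._getJavaStorageLevel(storageLevel)
        self._jdf.persist(javaStorageLevel)
    return self
```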
ping @rxin
ping @rxin on this comment?
One downside of that approach is that the user can't easily request an explicit deserialized in-memory-only cache, or a deserialized cache replicated on two machines.
Yes, this is true. The issue is that the deserialized versions were deprecated in #10092 and made equal to the serialized versions, so they can't actually be specified in Python any more. Hence we now have a discrepancy between Python RDDs (always stored serialized) and DataFrames (stored deserialized in the Tungsten/Spark SQL binary format by default; it is possible to store them serialized, though AFAIK that will always be non-optimal and should certainly be discouraged).
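For reference, the Python-side constants look roughly like this (flags quoted from memory of `pyspark/storagelevel.py`; the commented output is what I would expect, not captured):

```python
from pyspark import StorageLevel

# Flags are (useDisk, useMemory, useOffHeap, deserialized, replication).
# The MEMORY-based Python levels are defined with deserialized=False, so after the
# *_SER variants were folded into them there is no way to spell a deserialized level.
print(StorageLevel.MEMORY_AND_DISK)   # expected: StorageLevel(True, True, False, False, 1)
```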
Test build #60922 has finished for PR 13780 at commit
Test build #60925 has finished for PR 13780 at commit
ping @rxin @marmbrus @davies @gatorsmile for comment on the Python storage level issue I mention at #13780 (comment)
Just following up, because I know there is some other work loosely blocked on this. Do @rxin @marmbrus @davies @gatorsmile have any comments?
Sorry for the delay. I'm going to merge this to master. I'll update the since versions while merging. Thanks for working on this!
@marmbrus thanks for merging this. For me there is still an open question around the handling of deserialized storage levels on the PySpark side (see my comments at https://github.com/apache/spark/pull/13780/files#r67833027). I'd like to get your thoughts on that. What is blocked on this, by the way? (Just to understand.)
## What changes were proposed in this pull request?

Add storageLevel to DataFrame for SparkR. This is similar to this PR: apache#13780, but in R I do not make a class for `StorageLevel`; instead I add a method `storageToString`.

## How was this patch tested?

Test added.

Author: WeichenXu <[email protected]>

Closes apache#15516 from WeichenXu123/storageLevel_df_r.
[SPARK-11905](https://issues.apache.org/jira/browse/SPARK-11905) added support for `persist`/`cache` for `Dataset`. However, there is no user-facing API to check if a `Dataset` is cached and if so what the storage level is. This PR adds `getStorageLevel` to `Dataset`, analogous to `RDD.getStorageLevel`. Updated `DatasetCacheSuite`.

Author: Nick Pentreath <[email protected]>

Closes apache#13780 from MLnick/ds-storagelevel.

Signed-off-by: Michael Armbrust <[email protected]>
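A short usage sketch of the new API from PySpark (the Python property name `storageLevel`, and the exact not-cached value, are my assumptions based on the rename discussed in the review):

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.master("local[2]").appName("storage-level-demo").getOrCreate()
df = spark.range(10)

print(df.storageLevel)              # not cached yet, so expect the NONE-equivalent level
df.persist(StorageLevel.DISK_ONLY)
print(df.storageLevel)              # expect DISK_ONLY
df.unpersist()
print(df.storageLevel)              # back to the not-cached level
spark.stop()
```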