[SPARK-16063][SQL] Add storageLevel to Dataset #13780
Conversation
```scala
 *
 * @group basic
 * @since 2.0.0
 */
```
just call it storageLevel?
I thought it would be more familiar to RDD users, but I have no strong feeling about it. Will update.
Can you reset the non-relevant changes?
Test build #60836 has finished for PR 13780 at commit
Test build #60843 has finished for PR 13780 at commit
```scala
ds1.unpersist()
assert(ds1.storageLevel() == StorageLevel.NONE)
// non-default storage level
ds1.persist(StorageLevel.MEMORY_ONLY_2)
```
When writing black-box tests, I might just try all the levels in the test case. We could even include a customized StorageLevel that is different from the predefined ones.
```scala
import org.apache.spark.storage.StorageLevel._
Seq(NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER,
  MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER,
  MEMORY_AND_DISK_SER_2, OFF_HEAP).foreach { level =>
  ds1.persist(level)
  assert(ds1.storageLevel() == level)
  ds1.unpersist()
  assert(ds1.storageLevel() == StorageLevel.NONE)
}
```
I'm kinda neutral on this - it doesn't really seem necessary to me, since pretty much by definition if one storage level works then they all do.
I knew. : ) That is white box testing. Normally, writing test cases should not be done by the same person who wrote the code.
Overall LGTM, just one minor comment about the test case. BTW, maybe you can change the PR title after updating the function name? Thanks!
Can you make sure the Python DataFrame has this method too?
```diff
 @since(1.3)
 def cache(self):
-    """ Persists with the default storage level (C{MEMORY_ONLY}).
+    """Persists the :class:`DataFrame` with the default storage level (C{MEMORY_AND_DISK}).
```
@rxin I updated the default in the doc, as it was actually incorrect previously.
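For context, a rough sketch of how the corrected docstring sits in `pyspark/sql/dataframe.py`; the method body is reconstructed from my memory of the PySpark source rather than taken from this diff, so treat it as illustrative only:

```python
@since(1.3)
def cache(self):
    """Persists the :class:`DataFrame` with the default storage level (C{MEMORY_AND_DISK})."""
    self.is_cached = True
    self._jdf.cache()  # delegate to the JVM DataFrame, which applies the Scala-side default level
    return self
```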
Just some minor comments. This looks pretty good.
```python
after the first time it is computed. This can only be used to assign
a new storage level if the RDD does not have a storage level set yet.
If no storage level is specified defaults to (C{MEMORY_ONLY}).
def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK):
```
@rxin I updated the default here in persist, to match cache.
But actually, it's still not quite correct: the default storage levels for Python are all serialized, so the MEMORY-based ones don't match the Scala side (which are deserialized). This was done for RDDs but doesn't quite work for DataFrames, since a DataFrame on the Scala side is cached deserialized by default.
So here df.cache() results in MEMORY_AND_DISK (deserialized) while df.persist() results in MEMORY_AND_DISK (serialized). Ideally we don't want to encourage users to accidentally use the serialized forms for memory-based DataFrame caching (since it is less efficient, as I understand it?). Let me know what you think.
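To make the discrepancy concrete, a hypothetical pyspark-shell session; the storage-level notes in the comments reflect my reading of the Spark 2.0-era defaults, not verified output:

```python
df = spark.range(10)

df.cache()      # delegates to the JVM: MEMORY_AND_DISK, deserialized on the Scala side
df.unpersist()

df.persist()    # uses the Python-side default StorageLevel.MEMORY_AND_DISK, which is
                # defined with deserialized=False, i.e. the serialized variant
```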
One option is to set the default storage level here to None instead, and if it's not set call `_jdf.persist()` to ensure behaviour is the same as cache.
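A minimal sketch of that option, assuming the usual PySpark internals (`_jdf`, `_sc._getJavaStorageLevel`); this is not the code in the PR, just an illustration of the suggestion:

```python
def persist(self, storageLevel=None):
    """Persist with the given level, or with the JVM default level if none is given."""
    self.is_cached = True
    if storageLevel is None:
        # No level specified: defer to the JVM default so persist() matches cache()
        self._jdf.persist()
    else:
        javaStorageLevel = self._sc._getJavaStorageLevel(storageLevel)
        self._jdf.persist(javaStorageLevel)
    return self
```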
ping @rxin
ping @rxin on this comment?
One downside of that approach is that the user can't easily request an explicit deserialized in-memory-only cache, or a deserialized cache replicated on two machines.
Yes, this is true. The issue is that the deserialized versions were deprecated in #10092 and made equal to the serialized versions, so they can't actually be specified in Python any more. Hence we now have a discrepancy between Python RDDs (always stored serialized) and DataFrames (stored deserialized in the Tungsten/Spark SQL binary format by default; it is possible to store them serialized, though AFAIK that will always be non-optimal and should certainly be discouraged).
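For reference, the Python-side constants look roughly like this (flags quoted from memory of `pyspark/storagelevel.py`; the commented output is what I would expect, not captured):

```python
from pyspark import StorageLevel

# Flags are (useDisk, useMemory, useOffHeap, deserialized, replication).
# The MEMORY-based Python levels are defined with deserialized=False, so after the
# *_SER variants were folded into them there is no way to spell a deserialized level.
print(StorageLevel.MEMORY_AND_DISK)   # expected: StorageLevel(True, True, False, False, 1)
```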
Test build #60922 has finished for PR 13780 at commit
Test build #60925 has finished for PR 13780 at commit
ping @rxin @marmbrus @davies @gatorsmile for comment on the Python storage level issue I mention at #13780 (comment)
Just following up, because I know there is some other work loosely blocked on this. Do @rxin @marmbrus @davies @gatorsmile have any comments?
Sorry for the delay. I'm going to merge this to master. I'll update the since versions while merging. Thanks for working on this!
@marmbrus thanks for merging this. For me there is still an open question around the handling of deserialized storage levels on the PySpark side (see my comments at https://github.com/apache/spark/pull/13780/files#r67833027). I'd like to get your thoughts on that. What is blocked on this, by the way? (Just to understand.)
## What changes were proposed in this pull request?

Add storageLevel to DataFrame for SparkR. This is similar to this PR: apache#13780, but in R I do not make a class for `StorageLevel`; instead I add a method `storageToString`.

## How was this patch tested?

Test added.

Author: WeichenXu <[email protected]>

Closes apache#15516 from WeichenXu123/storageLevel_df_r.
[SPARK-11905](https://issues.apache.org/jira/browse/SPARK-11905) added support for `persist`/`cache` for `Dataset`. However, there is no user-facing API to check if a `Dataset` is cached and if so what the storage level is. This PR adds `getStorageLevel` to `Dataset`, analogous to `RDD.getStorageLevel`. Updated `DatasetCacheSuite`.

Author: Nick Pentreath <[email protected]>

Closes apache#13780 from MLnick/ds-storagelevel.

Signed-off-by: Michael Armbrust <[email protected]>
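A short usage sketch of the new API from PySpark (the Python property name `storageLevel`, and the exact not-cached value, are my assumptions based on the rename discussed in the review):

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.master("local[2]").appName("storage-level-demo").getOrCreate()
df = spark.range(10)

print(df.storageLevel)              # not cached yet, so expect the NONE-equivalent level
df.persist(StorageLevel.DISK_ONLY)
print(df.storageLevel)              # expect DISK_ONLY
df.unpersist()
print(df.storageLevel)              # back to the not-cached level
spark.stop()
```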