
Conversation

@MLnick MLnick commented Jun 20, 2016

SPARK-11905 added support for persist/cache for Dataset. However, there is no user-facing API to check if a Dataset is cached and if so what the storage level is. This PR adds getStorageLevel to Dataset, analogous to RDD.getStorageLevel.

How was this patch tested?

Updated DatasetCacheSuite.
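
For a quick sense of the API, here is a minimal PySpark sketch of the equivalent DataFrame property this PR adds (the Scala Dataset method is analogous); the session setup and exact reprs are illustrative, not taken from the patch:

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.range(10)

    # Not cached yet: no disk/memory flags are set
    print(repr(df.storageLevel))   # StorageLevel(False, False, False, False, 1)

    df.persist(StorageLevel.DISK_ONLY)
    print(repr(df.storageLevel))   # StorageLevel(True, False, False, False, 1)

    df.unpersist()
    print(repr(df.storageLevel))   # back to the NONE level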

Nick Pentreath added 2 commits June 17, 2016 19:49
MLnick commented Jun 20, 2016

cc @gatorsmile @marmbrus @rxin

    *
    * @group basic
    * @since 2.0.0
    */
Contributor

just call it storageLevel?

Contributor Author

I thought it would be more familiar for RDD users, but no strong feeling about it. Will update

rxin commented Jun 20, 2016

Can you reset the non-relevant changes?

SparkQA commented Jun 20, 2016

Test build #60836 has finished for PR 13780 at commit 8b4a490.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 20, 2016

Test build #60843 has finished for PR 13780 at commit 90684bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    ds1.unpersist()
    assert(ds1.storageLevel() == StorageLevel.NONE)
    // non-default storage level
    ds1.persist(StorageLevel.MEMORY_ONLY_2)
@gatorsmile gatorsmile Jun 20, 2016

When writing black-box tests, I might just try all the levels in the test case. We could even include a customized StorageLevel that differs from the predefined ones.

    import org.apache.spark.storage.StorageLevel._
    Seq(NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER,
      MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER,
      MEMORY_AND_DISK_SER_2, OFF_HEAP).foreach { level =>
      ds1.persist(level)
      assert(ds1.storageLevel() == level)
      ds1.unpersist()
      assert(ds1.storageLevel() == StorageLevel.NONE)
    }

Contributor Author

I'm kinda neutral on this - it doesn't really seem necessary to me, since pretty much by definition if one storage level works then they all do.

Member

I know. :) That is white-box testing. Normally, test cases should not be written by the same person who wrote the code.

@gatorsmile

Overall LGTM, just one minor comment about the test case. BTW, maybe you can change the PR title after updating the function name? Thanks!

rxin commented Jun 20, 2016

Can you make sure the Python DataFrame has this method too?

     @since(1.3)
     def cache(self):
-        """ Persists with the default storage level (C{MEMORY_ONLY}).
+        """Persists the :class:`DataFrame` with the default storage level (C{MEMORY_AND_DISK}).
Contributor Author

@rxin I updated the default in the doc, as it was actually incorrect previously.

rxin commented Jun 21, 2016

Just some minor comments. This looks pretty good.

    after the first time it is computed. This can only be used to assign
    a new storage level if the RDD does not have a storage level set yet.
    If no storage level is specified defaults to (C{MEMORY_ONLY}).
    def persist(self, storageLevel=StorageLevel.MEMORY_AND_DISK):
@MLnick MLnick Jun 21, 2016

@rxin I updated the default here in persist, to match cache.

But actually, it's still not quite correct: the default storage levels for Python are all serialized, so the MEMORY-based ones don't match the Scala side (which is deserialized). This was done for RDDs but doesn't quite work for DataFrames (since a DF on the Scala side is cached deserialized by default).

So here df.cache() results in MEMORY_AND_DISK (deser) while df.persist() results in MEMORY_AND_DISK (ser). Ideally I'd say we don't want to encourage users to accidentally use the serialized forms for memory-based DF caching (since it is less efficient, as I understand it?). Let me know what you think.
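
To make the discrepancy concrete, a hedged sketch of the current behaviour (PySpark, using the storageLevel property from this PR; reprs show useDisk, useMemory, useOffHeap, deserialized, replication):

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.range(10)

    # Python's MEMORY_AND_DISK constant is serialized (deserialized flag is False)
    print(repr(StorageLevel.MEMORY_AND_DISK))  # StorageLevel(True, True, False, False, 1)

    df.cache()                    # delegates to the Scala default: deserialized
    print(repr(df.storageLevel))  # StorageLevel(True, True, False, True, 1)

    df.unpersist()
    df.persist()                  # uses the Python default above: serialized
    print(repr(df.storageLevel))  # StorageLevel(True, True, False, False, 1)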

Contributor Author

One option is to set the default storage level here to None instead, and if it's not set call _jdf.persist() to ensure the behaviour is the same as cache.
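
A rough sketch of what that option might look like in pyspark/sql/dataframe.py (the internal names `_jdf` and `_sc._getJavaStorageLevel` follow the existing code; this is a sketch, not a definitive implementation):

    def persist(self, storageLevel=None):
        """Persists the :class:`DataFrame` with the given storage level; with no
        argument, defer to the JVM-side default so persist() matches cache()."""
        self.is_cached = True
        if storageLevel is None:
            # Scala Dataset.persist() uses MEMORY_AND_DISK, deserialized
            self._jdf.persist()
        else:
            javaStorageLevel = self._sc._getJavaStorageLevel(storageLevel)
            self._jdf.persist(javaStorageLevel)
        return self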

Contributor Author

ping @rxin

Contributor Author

ping @rxin on this comment?

Contributor

One downside of that approach is that the user can't easily and explicitly cache in-memory-only deserialized, or cache deserialized replicated on two machines.

Contributor Author

Yes, this is true. The issue is that the deser versions were deprecated in #10092 and made equal to the ser versions, so they can't actually be specified in Python any more. Hence there is now a discrepancy between Python RDDs (always stored serialized) and DataFrames (stored deserialized in the Tungsten/Spark SQL binary format by default; it is possible to store them serialized, though AFAIK that is always non-optimal and should certainly be discouraged).

cc @gatorsmile @davies @rxin
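
For reference, a sketch of the shape of the Python constants after #10092 (abbreviated, flags as I recall them; the point is that deserialized is False everywhere and the *_SER names survive only as deprecated aliases):

    from pyspark import StorageLevel

    # Shape of pyspark/storagelevel.py after apache#10092 (abbreviated sketch)
    StorageLevel.MEMORY_ONLY         = StorageLevel(False, True, False, False)
    StorageLevel.MEMORY_ONLY_SER     = StorageLevel.MEMORY_ONLY        # deprecated alias
    StorageLevel.MEMORY_AND_DISK     = StorageLevel(True, True, False, False)
    StorageLevel.MEMORY_AND_DISK_SER = StorageLevel.MEMORY_AND_DISK    # deprecated alias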

@MLnick MLnick changed the title from [SPARK-16063][SQL] Add getStoragelevel to Dataset to [SPARK-16063][SQL] Add storageLevel to Dataset Jun 21, 2016
SparkQA commented Jun 21, 2016

Test build #60922 has finished for PR 13780 at commit 93af5ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 21, 2016

Test build #60925 has finished for PR 13780 at commit cfe9ab7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

MLnick commented Aug 25, 2016

ping @rxin @marmbrus @davies @gatorsmile for comment on the Python storage level issue I mention at #13780 (comment)

holdenk commented Oct 7, 2016

So just following up, because I know there is some other work loosely blocked on this. Do @rxin @marmbrus @davies @gatorsmile have any comments?

@marmbrus

Sorry for the delay. I'm going to merge this to master. I'll update the since versions while merging. Thanks for working on this!

@asfgit asfgit closed this in 5aeb738 Oct 14, 2016
MLnick commented Oct 17, 2016

@marmbrus thanks for merging this. For me there is still an open question around handling of deser storage levels on the PySpark side (see my comments https://github.com/apache/spark/pull/13780/files#r67833027). Would like to get your thoughts on that.

What is blocked on this by the way? (Just to understand).

@MLnick MLnick deleted the ds-storagelevel branch October 17, 2016 07:44
ghost pushed a commit to dbtsai/spark that referenced this pull request Oct 26, 2016
## What changes were proposed in this pull request?

Add storageLevel to DataFrame for SparkR.
This is similar to this PR: apache#13780,
but in R I do not create a class for `StorageLevel`;
instead I add a method `storageToString`.

## How was this patch tested?

test added.

Author: WeichenXu <[email protected]>

Closes apache#15516 from WeichenXu123/storageLevel_df_r.
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
[SPARK-11905](https://issues.apache.org/jira/browse/SPARK-11905) added support for `persist`/`cache` for `Dataset`. However, there is no user-facing API to check if a `Dataset` is cached and if so what the storage level is. This PR adds `getStorageLevel` to `Dataset`, analogous to `RDD.getStorageLevel`.

Updated `DatasetCacheSuite`.

Author: Nick Pentreath <[email protected]>

Closes apache#13780 from MLnick/ds-storagelevel.

Signed-off-by: Michael Armbrust <[email protected]>
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017