[SPARK-12091] [PYSPARK] Deprecate the JAVA-specific deserialized storage levels #10092
Conversation
This is on purpose, see https://issues.apache.org/jira/browse/SPARK-2014 cc @mateiz
Test build #47041 has finished for PR 10092 at commit
@davies Thank you for showing me the original JIRA by @mateiz. It sounds like it does not make sense to keep data as deserialized Java objects, since data is serialized on the Python side. Is my understanding correct? I am wondering if we should automatically convert. Thank you!
It might be nice to only expose a smaller number of storage levels in Python, i.e. call them memory_only and memory_and_disk, but always use the serialized ones underneath.
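The suggestion above could look something like the following simplified sketch. It mirrors the shape of PySpark's `StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)` but is not the actual Spark source: the public names carry no `_SER` suffix, yet every level is declared with `deserialized=False`, since Python records are pickled anyway.

```python
# Simplified sketch (not the real pyspark.storagelevel module): the
# public names hide the serialization detail, but all levels store
# serialized data underneath.

class StorageLevel:
    def __init__(self, useDisk, useMemory, useOffHeap, deserialized, replication=1):
        self.useDisk = useDisk
        self.useMemory = useMemory
        self.useOffHeap = useOffHeap
        self.deserialized = deserialized
        self.replication = replication

    def __repr__(self):
        return ("StorageLevel(%s, %s, %s, %s, %s)" %
                (self.useDisk, self.useMemory, self.useOffHeap,
                 self.deserialized, self.replication))

# Public names with no _SER suffix; deserialized is always False.
StorageLevel.MEMORY_ONLY = StorageLevel(False, True, False, False)
StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, False, False)
StorageLevel.MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)
StorageLevel.DISK_ONLY = StorageLevel(True, False, False, False)
StorageLevel.DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
```

With this layout a Python user never has to reason about serialized versus deserialized variants; the choice simply does not exist in the public API.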
@mateiz Thank you for your answer! Will try to do it soon.
Just re-read the comments and will change the names soon. Thanks!
Test build #47116 has finished for PR 10092 at commit
Based on the comments of @mateiz, the following extra changes are made:
- Renaming MEMORY_ONLY_SER_2 to MEMORY_ONLY_2
- Renaming MEMORY_AND_DISK_SER to MEMORY_AND_DISK
- Renaming MEMORY_AND_DISK_SER_2 to MEMORY_AND_DISK_2
Thanks!
Removing these will break backward compatibility. I'd like to deprecate them and explain the difference between Python and Java (say that records will always be serialized in Python).
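One way to realize "deprecate rather than remove" is to keep the old `_SER` names as aliases that emit a `DeprecationWarning` on access. This is only an illustration of the pattern, not the code merged in the PR, and the placeholder string values stand in for real `StorageLevel` objects.

```python
# Hedged sketch: old Java-style _SER names kept for backward
# compatibility, warning the caller and forwarding to the new names.
import warnings

class _DeprecatedAlias:
    """Descriptor that forwards to a new attribute and warns on access."""
    def __init__(self, old, new):
        self.old, self.new = old, new

    def __get__(self, obj, owner):
        warnings.warn(
            "%s is deprecated since records are always serialized in "
            "Python; use %s instead" % (self.old, self.new),
            DeprecationWarning, stacklevel=2)
        return getattr(owner, self.new)

class StorageLevels:
    # New names (placeholder values standing in for StorageLevel objects).
    MEMORY_ONLY = "memory_only_serialized"
    MEMORY_AND_DISK = "memory_and_disk_serialized"
    # Old names kept so existing user code does not break.
    MEMORY_ONLY_SER = _DeprecatedAlias("MEMORY_ONLY_SER", "MEMORY_ONLY")
    MEMORY_AND_DISK_SER = _DeprecatedAlias("MEMORY_AND_DISK_SER", "MEMORY_AND_DISK")
```

Existing code that reads `StorageLevels.MEMORY_ONLY_SER` keeps working but now sees a warning pointing at the replacement name.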
Agreed! Just updated the code with the deprecation notes, trying to follow the existing PySpark style. Please check if they are good. : )
Not sure if this will be merged into 1.6; the note still says 1.6. Thank you!
It's too late for 1.6, and this change (an API change) is a good fit for 2.0. Sounds good?
Sure. Just changed it. : )
Test build #47119 has finished for PR 10092 at commit
Test build #47125 has finished for PR 10092 at commit
Test build #47129 has finished for PR 10092 at commit
Hi @davies, will this be merged, or does it need more updates? Thanks! : )
@gatorsmile These changes look good to me. Could you also update the docs/ (configuration and programming guide) to say that the storage levels of Python RDDs are different from the Java/Scala ones?
Sure, will do it. Thanks!
Test build #48041 has finished for PR 10092 at commit
@gatorsmile LGTM, merging into master, thanks!
The current default storage level of the Python persist API is MEMORY_ONLY_SER. This differs from the default level MEMORY_ONLY in the official documentation and the RDD APIs.
@davies Is this inconsistency intentional? Thanks!
Updates: Since the data is always serialized on the Python side, the Java-specific deserialized storage levels, such as MEMORY_ONLY, are not removed.
Updates: Based on the reviewers' feedback: since stored objects in Python will always be serialized with the Pickle library, it does not matter whether you choose a serialized level. The available storage levels in Python include MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, DISK_ONLY, DISK_ONLY_2, and OFF_HEAP.
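The point that a serialized level buys nothing in Python can be seen from a plain-pickle round trip. This is only a sketch: PySpark actually uses its own batched Pickle-based serializer, but the shape is the same, and what gets cached for a Python RDD partition is already opaque bytes by the time it reaches the block store.

```python
# Why "deserialized" storage is meaningless for Python RDDs: records
# are turned into pickled bytes before caching, and reading them back
# always goes through deserialization.
import pickle

records = [("a", 1), ("b", 2), ("c", 3)]

# What would be cached for a Python RDD partition: opaque bytes.
cached_bytes = [pickle.dumps(r) for r in records]
assert all(isinstance(b, bytes) for b in cached_bytes)

# Reading the partition back requires unpickling every record.
restored = [pickle.loads(b) for b in cached_bytes]
assert restored == records
```

Since both storage paths would hold the same pickled bytes, exposing separate serialized and deserialized levels in Python would only duplicate names without changing behavior.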