
Conversation

@gatorsmile (Member)

The current default storage level of the Python persist API is MEMORY_ONLY_SER. This differs from the default level MEMORY_ONLY in the official documentation and the RDD APIs.

@davies Is this inconsistency intentional? Thanks!

Updates: Since the data is always serialized on the Python side, the JAVA-specific deserialized storage levels, such as MEMORY_ONLY, are not removed; in Python they simply store serialized data.

Updates: Based on the reviewers' feedback: in Python, stored objects are always serialized with the Pickle library, so it does not matter whether you choose a serialized level. The available storage levels in Python include MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, DISK_ONLY, DISK_ONLY_2, and OFF_HEAP.
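
For context, a minimal sketch of how a level is passed to persist from PySpark (assuming a local SparkContext; the app name and master are placeholders, and the printed form of the level is implementation-dependent):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local[2]", "storage-level-demo")  # assumed local setup
    rdd = sc.parallelize(range(1000)).map(lambda x: (x, x * x))

    # Persisted Python objects are pickled, so the chosen level mainly controls
    # where the serialized bytes live (memory vs. disk) and the replication factor.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    print(rdd.getStorageLevel())  # inspect the effective level

    rdd.unpersist()
    sc.stop()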

@davies (Contributor)

davies commented Dec 2, 2015

This is on purpose; see https://issues.apache.org/jira/browse/SPARK-2014. cc @mateiz

@SparkQA

SparkQA commented Dec 2, 2015

Test build #47041 has finished for PR 10092 at commit d52deb5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author)

@davies Thank you for showing me the original JIRA by @mateiz. It sounds like it does not make sense to keep data as deserialized Java objects, since the data is serialized on the Python side. Is my understanding correct?

I am wondering whether we should automatically convert MEMORY_ONLY to MEMORY_ONLY_SER and MEMORY_AND_DISK to MEMORY_AND_DISK_SER when users choose these two options. So far, the official documentation does not mention this, so users might not realize it when making a decision.
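
For illustration, a rough sketch of what such a conversion could look like; to_serialized_level is a hypothetical helper (not part of this PR) that only relies on the existing StorageLevel attributes:

    from pyspark import StorageLevel

    def to_serialized_level(level):
        # Hypothetical helper: if a level asks for deserialized storage, rebuild
        # it with deserialized=False, since Python records are pickled bytes anyway.
        if level.deserialized:
            return StorageLevel(level.useDisk, level.useMemory,
                                level.useOffHeap, False, level.replication)
        return level

    # e.g. MEMORY_ONLY maps to a counterpart with deserialized=False:
    assert not to_serialized_level(StorageLevel.MEMORY_ONLY).deserialized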

Thank you!

@mateiz (Contributor)

mateiz commented Dec 2, 2015

It might be nice to only expose a smaller number of storage levels in Python, i.e. call them memory_only and memory_and_disk, but always use the serialized ones underneath.
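
Concretely, the suggestion amounts to keeping the short Python names but building every constant with deserialized=False, roughly as in the sketch below (illustrative flag values in the constructor order useDisk, useMemory, useOffHeap, deserialized, replication; not the exact patch that was merged):

    from pyspark.storagelevel import StorageLevel

    # Python keeps the short names, but every level stores serialized (pickled) data.
    StorageLevel.DISK_ONLY = StorageLevel(True, False, False, False)
    StorageLevel.DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
    StorageLevel.MEMORY_ONLY = StorageLevel(False, True, False, False)
    StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
    StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, False, False)
    StorageLevel.MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)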

@gatorsmile (Member Author)

@mateiz Thank you for your answer! Will try to do it soon.

@gatorsmile gatorsmile changed the title [SPARK-12091] [PYSPARK] [Minor] Default storage level of persist/cache [SPARK-12091] [PYSPARK] Removal of the JAVA-specific deserialized storage levels Dec 3, 2015
@gatorsmile (Member Author)

  • Removed all the constants whose deserialized value is true.
  • Updated the comments of StorageLevel.
  • Changed the default Kinesis storage level from MEMORY_AND_DISK_2 to MEMORY_AND_DISK_SER_2.

Please verify if my changes are OK. @mateiz @davies Thank you very much!

@gatorsmile (Member Author)

Just re-read the comments and will change the names soon. Thanks!

@SparkQA

SparkQA commented Dec 3, 2015

Test build #47116 has finished for PR 10092 at commit a6b7dd9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Renaming MEMORY_ONLY_SER_2 to MEMORY_ONLY_2
Renaming MEMORY_AND_DISK_SER to MEMORY_AND_DISK
Renaming MEMORY_AND_DISK_SER_2 to MEMORY_AND_DISK_2
@gatorsmile (Member Author)

Based on the comments from @mateiz, the following extra changes were made:

  • Renaming MEMORY_ONLY_SER to MEMORY_ONLY
  • Renaming MEMORY_ONLY_SER_2 to MEMORY_ONLY_2
  • Renaming MEMORY_AND_DISK_SER to MEMORY_AND_DISK
  • Renaming MEMORY_AND_DISK_SER_2 to MEMORY_AND_DISK_2

Thanks!

Contributor

Removing these will break backward compatibility. I'd like to deprecate them and explain the difference between Python and Java (say that records will always be serialized in Python).

Member Author

Agree! Just updated the code with the deprecation notes, trying to follow the existing PySpark style. Please check if they are good. : )

Not sure whether this will be merged into 1.6; the note still says 1.6. Thank you!
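
One possible shape for the deprecation, sketched here only to show the idea (the old *_SER names survive as plain aliases of the new constants, with notes pointing users at the replacements; not the exact patch):

    from pyspark.storagelevel import StorageLevel

    # Deprecated aliases: identical behavior to the new names, kept so that
    # existing user code does not break.
    StorageLevel.MEMORY_ONLY_SER = StorageLevel.MEMORY_ONLY
    StorageLevel.MEMORY_ONLY_SER_2 = StorageLevel.MEMORY_ONLY_2
    StorageLevel.MEMORY_AND_DISK_SER = StorageLevel.MEMORY_AND_DISK
    StorageLevel.MEMORY_AND_DISK_SER_2 = StorageLevel.MEMORY_AND_DISK_2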

Contributor

It's too late for 1.6, and this change (an API change) is good for 2.0. Sounds good?

Member Author

Sure. Just changed it. : )

@SparkQA

SparkQA commented Dec 3, 2015

Test build #47119 has finished for PR 10092 at commit 0e074b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile gatorsmile changed the title [SPARK-12091] [PYSPARK] Removal of the JAVA-specific deserialized storage levels [SPARK-12091] [PYSPARK] Deprecate the JAVA-specific deserialized storage levels Dec 3, 2015
@SparkQA

SparkQA commented Dec 3, 2015

Test build #47125 has finished for PR 10092 at commit 014a3a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 3, 2015

Test build #47129 has finished for PR 10092 at commit fef7ada.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author)

Hi @davies, will this be merged, or does it need more updates? Thanks! : )

@davies (Contributor)

davies commented Dec 18, 2015

@gatorsmile These changes look good to me. Could you also update docs/ (the configuration and programming guides) to say that the storage levels of Python RDDs are different from the Java/Scala ones?

@gatorsmile (Member Author)

Sure, will do it. Thanks!

@SparkQA

SparkQA commented Dec 19, 2015

Test build #48041 has finished for PR 10092 at commit 3864f87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies (Contributor)

davies commented Dec 19, 2015

@gatorsmile LGTM, merging into master, thanks!

@asfgit asfgit closed this in 499ac3e Dec 19, 2015