[SPARK-12091] [PYSPARK] Deprecate the JAVA-specific deserialized storage levels #10092
Conversation
This is on purpose, see https://issues.apache.org/jira/browse/SPARK-2014 cc @mateiz
Test build #47041 has finished for PR 10092 at commit
@davies Thank you for showing me the original JIRA by @mateiz. It sounds like it does not make sense to keep data as deserialized Java objects, since data is serialized on the Python side. Is my understanding correct? I am wondering if we should automatically convert. Thank you!
It might be nice to only expose a smaller number of storage levels in Python, i.e. call them memory_only and memory_and_disk, but always use the serialized ones underneath.
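The suggestion above could look something like the following simplified sketch. It mirrors the shape of PySpark's `StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)` but is not the actual Spark source: the public names carry no `_SER` suffix, yet every level is declared with `deserialized=False`, since Python records are pickled anyway.

```python
# Simplified sketch (not the real pyspark.storagelevel module): the
# public names hide the serialization detail, but all levels store
# serialized data underneath.

class StorageLevel:
    def __init__(self, useDisk, useMemory, useOffHeap, deserialized, replication=1):
        self.useDisk = useDisk
        self.useMemory = useMemory
        self.useOffHeap = useOffHeap
        self.deserialized = deserialized
        self.replication = replication

    def __repr__(self):
        return ("StorageLevel(%s, %s, %s, %s, %s)" %
                (self.useDisk, self.useMemory, self.useOffHeap,
                 self.deserialized, self.replication))

# Public names with no _SER suffix; deserialized is always False.
StorageLevel.MEMORY_ONLY = StorageLevel(False, True, False, False)
StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, False, False)
StorageLevel.MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)
StorageLevel.DISK_ONLY = StorageLevel(True, False, False, False)
StorageLevel.DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
```

With this layout a Python user never has to reason about serialized versus deserialized variants; the choice simply does not exist in the public API.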
@mateiz Thank you for your answer! Will try to do it soon.
Just re-read the comments and will change the names soon. Thanks!
Test build #47116 has finished for PR 10092 at commit
Based on the comments of @mateiz, the following extra changes are made:
- Renaming MEMORY_ONLY_SER_2 to MEMORY_ONLY_2
- Renaming MEMORY_AND_DISK_SER to MEMORY_AND_DISK
- Renaming MEMORY_AND_DISK_SER_2 to MEMORY_AND_DISK_2
Thanks!
Removing these will break backward compatibility. I'd like to deprecate them and explain the difference between Python and Java (say that records will always be serialized in Python).
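One way to realize "deprecate rather than remove" is to keep the old `_SER` names as aliases that emit a `DeprecationWarning` on access. This is only an illustration of the pattern, not the code merged in the PR, and the placeholder string values stand in for real `StorageLevel` objects.

```python
# Hedged sketch: old Java-style _SER names kept for backward
# compatibility, warning the caller and forwarding to the new names.
import warnings

class _DeprecatedAlias:
    """Descriptor that forwards to a new attribute and warns on access."""
    def __init__(self, old, new):
        self.old, self.new = old, new

    def __get__(self, obj, owner):
        warnings.warn(
            "%s is deprecated since records are always serialized in "
            "Python; use %s instead" % (self.old, self.new),
            DeprecationWarning, stacklevel=2)
        return getattr(owner, self.new)

class StorageLevels:
    # New names (placeholder values standing in for StorageLevel objects).
    MEMORY_ONLY = "memory_only_serialized"
    MEMORY_AND_DISK = "memory_and_disk_serialized"
    # Old names kept so existing user code does not break.
    MEMORY_ONLY_SER = _DeprecatedAlias("MEMORY_ONLY_SER", "MEMORY_ONLY")
    MEMORY_AND_DISK_SER = _DeprecatedAlias("MEMORY_AND_DISK_SER", "MEMORY_AND_DISK")
```

Existing code that reads `StorageLevels.MEMORY_ONLY_SER` keeps working but now sees a warning pointing at the replacement name.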
Agreed! Just updated the code with the deprecation notes, trying to follow the existing PySpark style. Please check if they are good. : )
Not sure if this will be merged into 1.6; the note still says 1.6. Thank you!
It's too late for 1.6, and this change (an API change) is a good fit for 2.0. Sounds good?
Sure. Just changed it. : )
Test build #47119 has finished for PR 10092 at commit
Test build #47125 has finished for PR 10092 at commit
Test build #47129 has finished for PR 10092 at commit
Hi @davies, will this be merged, or does it need more updates? Thanks! : )
@gatorsmile These changes look good to me. Could you also update the docs/ (configuration and programming guide) to say that the storage levels of Python RDDs are different from the Java/Scala ones?
Sure, will do it. Thanks!
Test build #48041 has finished for PR 10092 at commit
@gatorsmile LGTM, merging into master, thanks!
The current default storage level of the Python persist API is MEMORY_ONLY_SER. This differs from the default level MEMORY_ONLY in the official documentation and the RDD APIs.
@davies Is this inconsistency intentional? Thanks!
Updates: Since the data is always serialized on the Python side, the Java-specific deserialized storage levels, such as MEMORY_ONLY, are not removed.
Updates: Based on the reviewers' feedback: since stored objects in Python will always be serialized with the Pickle library, it does not matter whether you choose a serialized level. The available storage levels in Python include MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, DISK_ONLY, DISK_ONLY_2, and OFF_HEAP.
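The point that a serialized level buys nothing in Python can be seen from a plain-pickle round trip. This is only a sketch: PySpark actually uses its own batched Pickle-based serializer, but the shape is the same, and what gets cached for a Python RDD partition is already opaque bytes by the time it reaches the block store.

```python
# Why "deserialized" storage is meaningless for Python RDDs: records
# are turned into pickled bytes before caching, and reading them back
# always goes through deserialization.
import pickle

records = [("a", 1), ("b", 2), ("c", 3)]

# What would be cached for a Python RDD partition: opaque bytes.
cached_bytes = [pickle.dumps(r) for r in records]
assert all(isinstance(b, bytes) for b in cached_bytes)

# Reading the partition back requires unpickling every record.
restored = [pickle.loads(b) for b in cached_bytes]
assert restored == records
```

Since both storage paths would hold the same pickled bytes, exposing separate serialized and deserialized levels in Python would only duplicate names without changing behavior.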