-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-4841] fix zip with textFile() #3706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #24471 has started for PR 3706 at commit
|
python/pyspark/tests.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we reproduce the bug by creating an RDD of string directly? It is simpler than touching disk. It is also helpful to put the JIRA number in the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't generate a RDD with UTF8Deserializer right now, it's only used to read data from JVM.
|
Test build #24472 has started for PR 3706 at commit
|
|
It looks like there are still a few lingering bugs related to text = sc.textFile("README.md")
numbers = text.map(lambda x: 1)
text.zip(numbers).count() # Works fine
text.zip(numbers).count() # A second time, this throws an error:The error is Not sure if this is a new bug, but potentially an indicator that there's other lingering problems in |
|
Test build #24471 has finished for PR 3706 at commit
|
|
Test PASSed. |
|
@JoshRosen Good catch, it's a bug in _reserialize(), introduced in #2920, great thanks! |
|
Test build #24473 has started for PR 3706 at commit
|
|
Test build #24472 has finished for PR 3706 at commit
|
|
Test PASSed. |
|
Test build #24473 has finished for PR 3706 at commit
|
|
Test PASSed. |
|
Thanks for the update. I've looked over this again and tried it out with a few more hand-written test cases and I've been unable to find any more bugs, so this looks good to me. I'm going to merge this into |
|
I've merged this into |
UTF8Deserializer can not be used in BatchedSerializer, so always use PickleSerializer() when change batchSize in zip(). Also, if two RDD have the same batch size already, they did not need re-serialize any more. Author: Davies Liu <[email protected]> Closes #3706 from davies/fix_4841 and squashes the following commits: 20ce3a3 [Davies Liu] fix bug in _reserialize() e3ebf7c [Davies Liu] add comment 379d2c8 [Davies Liu] fix zip with textFile() (cherry picked from commit c246b95) Signed-off-by: Josh Rosen <[email protected]>
UTF8Deserializer can not be used in BatchedSerializer, so always use PickleSerializer() when change batchSize in zip().
Also, if two RDD have the same batch size already, they did not need re-serialize any more.