[SPARK-2334] fix rdd.id() Attribute Error #1276
In some cases self._id is not getting set and calls to id() are therefore resulting in an AttributeError. This change fixes that by returning the id of the underlying jrdd instead. Test case: sc.parallelize([1,2,3]).map(lambda x: x+1).id()
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
All automated tests passed.
@dianacarroll I think it would make sense to also delete the
@PatrickWendell Before we do that... I was doing more testing on this, and I will give that fix a try next week.
On Fri, Jul 4, 2014 at 1:45 AM, Patrick Wendell wrote:
The cause seems to be that when you do operations like map() followed by map(), you get a PipelinedRDD, which does not necessarily have an underlying Java RDD until you access its _jrdd property. Creating a Java RDD for each PipelinedRDD is probably expensive, so we shouldn't do that until we call id() on it. On the other hand, we probably want the IDs to match what will show up in the web UI, so I think we have to return the Java version of the ID, not a new set of numbers we make up in Python.
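The laziness described above can be illustrated with a minimal pure-Python sketch. It does not use PySpark: `FakeJavaRDD`, `LazyPipelinedRDD`, and the `_jrdd_val` attribute are hypothetical stand-ins chosen for illustration, and only the `_jrdd` property and `id()` names come from this discussion.

```python
import itertools

# Hypothetical stand-in for the JVM side: hands out sequential RDD ids.
_next_java_id = itertools.count()


class FakeJavaRDD:
    """Hypothetical stand-in for a Java RDD handle (not a PySpark class)."""

    def __init__(self):
        self._java_id = next(_next_java_id)

    def id(self):
        return self._java_id


class LazyPipelinedRDD:
    """Sketch of an RDD whose Java counterpart is built only on demand."""

    def __init__(self):
        self._jrdd_val = None  # nothing has been created on the "JVM" side yet

    @property
    def _jrdd(self):
        # Build the (assumed expensive) Java RDD the first time it is needed.
        if self._jrdd_val is None:
            self._jrdd_val = FakeJavaRDD()
        return self._jrdd_val

    def id(self):
        # Report the Java-side id rather than a separate Python-side counter.
        return self._jrdd.id()


first = LazyPipelinedRDD()
second = LazyPipelinedRDD()
second_id = second.id()  # materializes `second` only; `first` stays lazy
```

In this sketch the id is assigned when the Java-side object is created, not when the Python object is, which is why the ids follow materialization order rather than construction order.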
self._jrdd.id() requires an RPC through py4j, so it's better to cache the result as _id.
For PipelinedRDD and SchemaRDD, we can override id() to fetch the id from _jrdd (and cache it as well).
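A minimal sketch of the caching suggested here, again without PySpark. `CountingJavaRDD` and `CachedIdRDD` are hypothetical names; the counter simply stands in for the py4j round trip so the effect of caching is visible.

```python
class CountingJavaRDD:
    """Hypothetical stand-in for a py4j proxy; counts calls to id()."""

    def __init__(self, java_id):
        self._java_id = java_id
        self.id_calls = 0

    def id(self):
        self.id_calls += 1  # in the real setting, each call would be an RPC
        return self._java_id


class CachedIdRDD:
    """Sketch of an RDD wrapper that fetches the Java id once and caches it."""

    def __init__(self, jrdd):
        self._jrdd = jrdd
        self._id = None  # always defined, so id() cannot hit an AttributeError

    def id(self):
        # Fetch from the Java side only on the first call, then reuse _id.
        if self._id is None:
            self._id = self._jrdd.id()
        return self._id


jrdd = CountingJavaRDD(java_id=7)
rdd = CachedIdRDD(jrdd)
ids = [rdd.id() for _ in range(3)]
```

Initializing `_id` to `None` in the constructor is one way to guarantee the attribute exists; whether the actual patch does it this way is not stated in this thread.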
rdd.id() was raising an AttributeError in some cases because self._id was not being set. Instead of returning the _id attribute, return the value of id() from the jrdd. Fixes bug SPARK-2334.
Test with: sc.parallelize([1,2,3]).map(lambda x: x+1).id()