Skip to content

Conversation

@dianacarroll
Copy link

rdd.id() was returning an Attribute Error in some cases because self._id is not getting set. So instead of returning the _id attribute, return the value of id() from the jrdd. Fixes bug SPARK-2334.
Test with: sc.parallelize([1,2,3]).map(lambda x: x+1).id()

In some cases self._id is not getting set and calls to id() are therefore resulting in an AttributeError. This change fixes that by returning the id of the underlying jrdd instead.
Test case: sc.parallelize([1,2,3]).map(lambda x: x+1).id()
@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16284/

@pwendell
Copy link
Contributor

pwendell commented Jul 4, 2014

@dianacarroll I think it would make sense to also delete the self._id field from the RDD class since it's never used.

@dianacarroll
Copy link
Author

@PatrickWendell Before we do that...I was doing more testing on this and
found a performance impact. The call to jrdd.id() takes much longer than I
would have expected it to...on the order of .5-1 seconds! In my use case,
that was a big issue because I was doing an iterative process and
displaying the RDD's ID each time through the loop, and it slowed my
process down 10x. So perhaps the real fix is to figure out why _id isn't
getting set properly in some cases...it at least to check if it's set on
each call, and if it is, return the cache value, and only get the
underlying value if it is unset.

I will give that fix a try next week.

On Fri, Jul 4, 2014 at 1:45 AM, Patrick Wendell [email protected]
wrote:

@dianacarroll https://github.com/dianacarroll I think it would make
sense to also delete the self._id field from the RDD class since it's
never used.


Reply to this email directly or view it on GitHub
#1276 (comment).

@mateiz
Copy link
Contributor

mateiz commented Jul 5, 2014

The cause seems to be that when you do operations like map() followed by map(), you get a PipelinedRDD, which does not necessarily have an underlying Java RDD until you access its _jrdd property. Creating a Java RDD for each PipelinedRDD is probably expensive so we shouldn't do that until we call id() on it. On the other hand, we probably want the IDs to match what will show up in the web UI, so I think we have to return the Java version of the ID, not a new set of numbers we make up in Python.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._jrdd.id() will need an RPC in py4j, so it's better to cache it as _id.

For PipelineRDD() and SchemaRDD(), we can override id() to fetch the id from _jrdd (also cache it).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants