[SPARK-2334] fix rdd.id() Attribute Error #1276

dianacarroll · 2014-07-01T17:10:03Z

rdd.id() was returning an Attribute Error in some cases because self._id is not getting set. So instead of returning the _id attribute, return the value of id() from the jrdd. Fixes bug SPARK-2334.
Test with: sc.parallelize([1,2,3]).map(lambda x: x+1).id()

In some cases self._id is not getting set and calls to id() are therefore resulting in an AttributeError. This change fixes that by returning the id of the underlying jrdd instead. Test case: sc.parallelize([1,2,3]).map(lambda x: x+1).id()

AmplabJenkins · 2014-07-01T17:10:41Z

Merged build triggered.

AmplabJenkins · 2014-07-01T17:10:49Z

Merged build started.

AmplabJenkins · 2014-07-01T17:54:54Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-07-01T17:54:54Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16284/

pwendell · 2014-07-04T05:45:23Z

@dianacarroll I think it would make sense to also delete the self._id field from the RDD class since it's never used.

dianacarroll · 2014-07-04T13:18:34Z

@PatrickWendell Before we do that...I was doing more testing on this and
found a performance impact. The call to jrdd.id() takes much longer than I
would have expected it to...on the order of .5-1 seconds! In my use case,
that was a big issue because I was doing an iterative process and
displaying the RDD's ID each time through the loop, and it slowed my
process down 10x. So perhaps the real fix is to figure out why _id isn't
getting set properly in some cases...it at least to check if it's set on
each call, and if it is, return the cache value, and only get the
underlying value if it is unset.

I will give that fix a try next week.

On Fri, Jul 4, 2014 at 1:45 AM, Patrick Wendell [email protected]
wrote:

@dianacarroll https://github.com/dianacarroll I think it would make
sense to also delete the self._id field from the RDD class since it's
never used.

—
Reply to this email directly or view it on GitHub
#1276 (comment).

mateiz · 2014-07-05T16:18:01Z

The cause seems to be that when you do operations like map() followed by map(), you get a PipelinedRDD, which does not necessarily have an underlying Java RDD until you access its _jrdd property. Creating a Java RDD for each PipelinedRDD is probably expensive so we shouldn't do that until we call id() on it. On the other hand, we probably want the IDs to match what will show up in the web UI, so I think we have to return the Java version of the ID, not a new set of numbers we make up in Python.

davies · 2014-09-02T19:56:12Z

python/pyspark/rdd.py

self._jrdd.id() will need an RPC in py4j, so it's better to cache it as _id.

For PipelineRDD() and SchemaRDD(), we can override id() to fetch the id from _jrdd (also cache it).

rdd.id(): return id of underlying jrdd

a69e6a9

In some cases self._id is not getting set and calls to id() are therefore resulting in an AttributeError. This change fixes that by returning the id of the underlying jrdd instead. Test case: sc.parallelize([1,2,3]).map(lambda x: x+1).id()

JoshRosen mentioned this pull request Jul 28, 2014

[SPARK-2141] Adding getPersistentRddIds and unpersistRDD to Java/Python #1082

Closed

davies reviewed Sep 2, 2014
View reviewed changes

JoshRosen mentioned this pull request Sep 6, 2014

[SPARK-2334] fix AttributeError when call PipelineRDD.id() #2296

Closed

asfgit closed this in 110fb8b Sep 6, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-2334] fix rdd.id() Attribute Error #1276

[SPARK-2334] fix rdd.id() Attribute Error #1276

Uh oh!

dianacarroll commented Jul 1, 2014

Uh oh!

AmplabJenkins commented Jul 1, 2014

Uh oh!

AmplabJenkins commented Jul 1, 2014

Uh oh!

AmplabJenkins commented Jul 1, 2014

Uh oh!

AmplabJenkins commented Jul 1, 2014

Uh oh!

pwendell commented Jul 4, 2014

Uh oh!

dianacarroll commented Jul 4, 2014

Uh oh!

mateiz commented Jul 5, 2014

Uh oh!

davies Sep 2, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[SPARK-2334] fix rdd.id() Attribute Error #1276

[SPARK-2334] fix rdd.id() Attribute Error #1276

Uh oh!

Conversation

dianacarroll commented Jul 1, 2014

Uh oh!

AmplabJenkins commented Jul 1, 2014

Uh oh!

AmplabJenkins commented Jul 1, 2014

Uh oh!

AmplabJenkins commented Jul 1, 2014

Uh oh!

AmplabJenkins commented Jul 1, 2014

Uh oh!

pwendell commented Jul 4, 2014

Uh oh!

dianacarroll commented Jul 4, 2014

Uh oh!

mateiz commented Jul 5, 2014

Uh oh!

davies Sep 2, 2014

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants