[SPARK-25591][PySpark][SQL] Avoid overwriting deserialized accumulator #22635

viirya · 2018-10-04T23:49:00Z

What changes were proposed in this pull request?

If we use accumulators in more than one UDF, it is possible to overwrite deserialized accumulators and their values. We should check whether an accumulator was already deserialized before overwriting it in the accumulator registry.
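To make the failure mode concrete, here is a minimal stand-alone sketch. It is a toy model and not the pyspark code: ToyAccumulator, _registry, deserialize_overwriting, deserialize_guarded, and run_two_udfs are hypothetical names, and the assumption is that only the object held in the registry is reported back after the task.

```python
# Toy model, NOT pyspark: a module-level registry keyed by accumulator id.
_registry = {}


class ToyAccumulator:
    """Hypothetical stand-in for an accumulator living on an executor."""

    def __init__(self, aid, value):
        self.aid = aid
        self.value = value

    def add(self, term):
        self.value += term


def deserialize_overwriting(aid, zero_value):
    """Behaviour described as the bug: always install a fresh object."""
    accum = ToyAccumulator(aid, zero_value)
    _registry[aid] = accum
    return accum


def deserialize_guarded(aid, zero_value):
    """Behaviour described as the fix: reuse an already-registered object."""
    if aid in _registry:
        return _registry[aid]
    accum = ToyAccumulator(aid, zero_value)
    _registry[aid] = accum
    return accum


def run_two_udfs(deserialize):
    """Simulate two UDFs that each deserialize accumulator 0 and add to it."""
    _registry.clear()
    first = deserialize(0, 0)
    first.add(1)        # update made by the first UDF
    second = deserialize(0, 0)
    second.add(100)     # update made by the second UDF
    # Assumption: only the registered object is reported back.
    return _registry[0].value


lost = run_two_udfs(deserialize_overwriting)
kept = run_two_udfs(deserialize_guarded)
print(lost, kept)  # 100 101
```

In the overwriting variant the first UDF's update lives on an object that is no longer in the registry, so it is silently dropped. The guarded variant makes both UDFs share one object.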

How was this patch tested?

Added test.

viirya · 2018-10-05T00:16:28Z

cc @HyukjinKwon

SparkQA · 2018-10-05T00:31:55Z

Test build #96960 has finished for PR 22635 at commit db0a583.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AbdealiLoKo · 2018-10-05T04:38:05Z

python/pyspark/accumulators.py

-    _accumulatorRegistry[aid] = accum
-    return accum
+    # If this certain accumulator was deserialized, don't overwrite it.
+    if aid in _accumulatorRegistry:


Should it be if aid in _accumulatorRegistry and _accumulatorRegistry[aid]._deserialized is True, or:

    if aid in _accumulatorRegistry:
        _accumulatorRegistry[aid]._deserialized = True
        return _accumulatorRegistry[aid]

to make doubly sure that this function always returns a deserialized version of the accumulator?

We only save deserialized accumulators (_deserialized is True) into this dict.

That doesn't seem right, because the constructor for Accumulator has:

    ...
    self._deserialized = False
    _accumulatorRegistry[aid] = self

PS: This is the first time I'm looking at this code, so I'm not too familiar with it.

Yeah, but _deserialize_accumulator is only called during deserialization on executors. The constructor saves accumulators in _accumulatorRegistry on the driver.

I see - got it 👍

HyukjinKwon · 2018-10-05T05:31:00Z

Thanks for cc'ing me. Will take a look this week.

viirya · 2018-10-05T13:20:07Z

Since this is a correctness fix, I think we should include it in 2.4 if it can make it in time. cc @cloud-fan

HyukjinKwon · 2018-10-07T15:53:17Z

python/pyspark/sql/tests.py

+        data = data.withColumn("out1", func_udf(data["a"]))
+        data = data.withColumn("out2", func_udf2(data["b"]))
+        data.collect()
+        self.assertEqual(test_accum.value, 101)


@viirya, can we just use int for data and accumulator as well in this test case?

HyukjinKwon · 2018-10-07T15:56:05Z

python/pyspark/accumulators.py

-    _accumulatorRegistry[aid] = accum
-    return accum
+    # If this certain accumulator was deserialized, don't overwrite it.
+    if aid in _accumulatorRegistry:


Ah, so the problem is that this accumulator is serialized and deserialized multiple times, and _deserialize_accumulator modifies the global state each time. I see. LGTM.
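The repeated deserialization described here can be reproduced with plain pickle. The sketch below is a toy under stated assumptions, not the pyspark implementation: ToyAccumulator, _restore, and _registry are hypothetical names, and clearing the registry merely imitates arriving in a fresh worker process. It shows that a guarded restore hook hands back the same object no matter how many times the payload is loaded.

```python
import pickle

_registry = {}  # stand-in for a per-process accumulator registry


def _restore(aid, zero_value):
    """Called by pickle on load; guarded so repeated loads share one object."""
    if aid in _registry:
        return _registry[aid]
    accum = ToyAccumulator(aid, zero_value)
    accum.deserialized = True
    _registry[aid] = accum
    return accum


class ToyAccumulator:
    """Hypothetical accumulator that pickles itself as (id, zero value)."""

    def __init__(self, aid, value):
        self.aid = aid
        self.value = value
        self.deserialized = False

    def __reduce__(self):
        # Ship only the id and a zero value; the receiver rebuilds via _restore.
        return (_restore, (self.aid, 0))


driver_side = ToyAccumulator(7, 0)
payload = pickle.dumps(driver_side)

# Pretend we are now in a worker process with an empty registry.
_registry.clear()
copy_a = pickle.loads(payload)  # e.g. loaded together with the first UDF
copy_b = pickle.loads(payload)  # e.g. loaded together with the second UDF
copy_a.value += 1
copy_b.value += 100

same_object = copy_a is copy_b
total = _registry[7].value
print(same_object, total)  # True 101
```

Without the membership check in _restore, the second load would replace the registry entry and only the update of 100 would remain visible through the registry.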

HyukjinKwon · 2018-10-07T15:56:20Z

Nice catch @viirya LGTM.

viirya · 2018-10-08T05:38:45Z

Thanks @HyukjinKwon

SparkQA · 2018-10-08T06:22:07Z

Test build #97100 has finished for PR 22635 at commit 08c7223.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

## What changes were proposed in this pull request?

If we use accumulators in more than one UDFs, it is possible to overwrite deserialized accumulators and its values. We should check if an accumulator was deserialized before overwriting it in accumulator registry.

## How was this patch tested?

Added test.

Closes #22635 from viirya/SPARK-25591.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
(cherry picked from commit cb90617)
Signed-off-by: hyukjinkwon <[email protected]>

HyukjinKwon · 2018-10-08T07:39:17Z

Merged to master and branch-2.4.

AbdealiLoKo · 2018-10-08T18:36:57Z

@cloud-fan @viirya Any chance of this making it into 2.4 ?

gatorsmile · 2018-10-08T21:33:48Z

How about pandas UDF? How about using RDD APIs? Do we face the same issues?

cloud-fan · 2018-10-08T23:32:28Z

@AbdealiJK since RC3 is not cut, this will be in 2.4.

HyukjinKwon · 2018-10-08T23:33:01Z

Yes, the same issue exists in Pandas UDFs too (I quickly double-checked), and this PR fixes it. FYI, both go through the same code path.

viirya · 2018-10-08T23:53:01Z

@cloud-fan @gatorsmile @HyukjinKwon Thanks. Yes, Pandas UDFs have the same issue, and it is fixed by this PR.

Tagar · 2018-11-19T20:35:49Z

Please review https://issues.apache.org/jira/browse/SPARK-26019
"pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()"

I suspect this change might have introduced the SPARK-26019 regression.

viirya · 2018-11-19T23:14:46Z

@Tagar I will look into it. Thanks.

Tagar · 2018-11-19T23:18:28Z

Thank you @viirya

HyukjinKwon · 2018-11-20T00:03:38Z

How is it related to the JIRA? From a cursory look, it doesn't seem related. Next time, please leave some analysis, or at least test before and after the specific commit. Let me take a look anyway.

HyukjinKwon · 2018-11-20T00:16:17Z

This fix went into 2.4.0, while your issue appears when going from 2.3.1 to 2.3.2. It's not related.

viirya · 2018-11-20T00:19:35Z

Yeah, thanks @HyukjinKwon. I had an initial look, and it doesn't look related.

Tagar · 2018-11-20T05:12:20Z

@viirya I apologize. As I mentioned in my comment on SPARK-26019, it's due to another change,
15fc237#diff-c3339bbf2b850b79445b41e9eecf57c4R249. The error happens in authenticate_and_accum_updates(), which is new code introduced by that commit. Thanks for looking into it anyway!

Avoid overwriting deserialized accumulator.

db0a583

AbdealiLoKo reviewed Oct 5, 2018

HyukjinKwon reviewed Oct 7, 2018

Address comment.

08c7223

asfgit closed this in cb90617 Oct 8, 2018

viirya deleted the SPARK-25591 branch December 27, 2023 18:21
