@viirya
Member

@viirya viirya commented Oct 4, 2018

What changes were proposed in this pull request?

If accumulators are used in more than one UDF, a deserialized accumulator and its value can be overwritten. We should check whether an accumulator has already been deserialized before overwriting it in the accumulator registry.
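
To illustrate the proposed check, here is a minimal sketch in plain Python. It is a toy model under stated assumptions, not the PySpark implementation: `ToyAccumulator`, `toy_registry`, and `deserialize_guarded` are hypothetical stand-ins for `Accumulator`, `_accumulatorRegistry`, and `_deserialize_accumulator`, and the two calls stand in for two UDFs being deserialized on an executor.

```python
# Toy model only: these names are hypothetical stand-ins, not PySpark's API.
toy_registry = {}


class ToyAccumulator:
    def __init__(self, aid, value):
        self.aid = aid
        self.value = value

    def add(self, term):
        self.value += term


def deserialize_guarded(aid, zero_value):
    # If this accumulator was already deserialized, reuse it instead of
    # replacing the registry entry with a fresh zero-valued copy.
    if aid in toy_registry:
        return toy_registry[aid]
    accum = ToyAccumulator(aid, zero_value)
    toy_registry[aid] = accum
    return accum


# Two UDFs referencing the same accumulator each trigger a deserialization.
first = deserialize_guarded(0, 0)
second = deserialize_guarded(0, 0)
first.add(1)
second.add(100)

# Assumption: updates are reported from whatever the registry holds.
reported = {aid: accum.value for aid, accum in toy_registry.items()}
print(reported)  # {0: 101}
```

With the guard, both calls return the same object, so the registry entry carries the updates from both UDFs.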

How was this patch tested?

Added test.

@viirya
Member Author

viirya commented Oct 5, 2018

cc @HyukjinKwon

@SparkQA

SparkQA commented Oct 5, 2018

Test build #96960 has finished for PR 22635 at commit db0a583.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

_accumulatorRegistry[aid] = accum
return accum
# If this certain accumulator was deserialized, don't overwrite it.
if aid in _accumulatorRegistry:

Should it be if aid in _accumulatorRegistry and _accumulatorRegistry[aid]._deserialized is True, or:

if aid in _accumulatorRegistry:
    _accumulatorRegistry[aid]._deserialized = True
    return _accumulatorRegistry[aid]

to make doubly sure that this function always returns a deserialized version of the accumulator?

Member Author

We only save deserialized accumulators (_deserialized is True) into this dict.

@AbdealiLoKo AbdealiLoKo Oct 5, 2018

That doesn't seem right, because the constructor for Accumulator has:

        ...
        self._deserialized = False
        _accumulatorRegistry[aid] = self

PS: This is the first time I'm looking at this code, so I'm not too familiar with it.

Member Author

Yeah, but _deserialize_accumulator is only called during deserialization on executors. The constructor saves accumulators in _accumulatorRegistry on the driver.

I see - got it 👍

@HyukjinKwon
Member

Thanks for cc'ing me. Will take a look this week.

@viirya
Member Author

viirya commented Oct 5, 2018

Since this is a correctness fix, I think we should include it in 2.4 if it can make it in time. cc @cloud-fan

data = data.withColumn("out1", func_udf(data["a"]))
data = data.withColumn("out2", func_udf2(data["b"]))
data.collect()
self.assertEqual(test_accum.value, 101)
Member

@viirya, can we just use int for data and accumulator as well in this test case?

Member Author

Ok.

_accumulatorRegistry[aid] = accum
return accum
# If this certain accumulator was deserialized, don't overwrite it.
if aid in _accumulatorRegistry:
Member

Ah, so the problem is that this accumulator is serialized and deserialized multiple times, and _deserialize_accumulator modifies the global state each time. I see. LGTM.

Member Author

Yes.
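
The overwrite discussed in this thread can be reproduced in a small toy model. This is a hedged sketch in plain Python, not PySpark code: `ToyAccumulator`, `toy_registry`, and `deserialize_unguarded` are hypothetical names, and it assumes that updates are reported from whatever object the registry currently holds.

```python
# Toy model only: hypothetical names, not PySpark's API.
toy_registry = {}


class ToyAccumulator:
    def __init__(self, aid, value):
        self.aid = aid
        self.value = value

    def add(self, term):
        self.value += term


def deserialize_unguarded(aid, zero_value):
    # Always builds a fresh accumulator and replaces any existing entry,
    # so every deserialization modifies the shared registry again.
    accum = ToyAccumulator(aid, zero_value)
    toy_registry[aid] = accum
    return accum


# Each UDF's closure is deserialized separately and gets its own copy.
first_udf_accum = deserialize_unguarded(0, 0)
second_udf_accum = deserialize_unguarded(0, 0)
first_udf_accum.add(1)
second_udf_accum.add(100)

# Only the last copy is still registered; the first UDF's update is lost.
reported = {aid: accum.value for aid, accum in toy_registry.items()}
print(reported)  # {0: 100}
```

In this model the first copy still holds its update, but nothing reads it, because the registry no longer points at it.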

@HyukjinKwon
Member

Nice catch @viirya LGTM.

@viirya
Member Author

viirya commented Oct 8, 2018

Thanks @HyukjinKwon

@SparkQA

SparkQA commented Oct 8, 2018

Test build #97100 has finished for PR 22635 at commit 08c7223.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Oct 8, 2018
## What changes were proposed in this pull request?

If accumulators are used in more than one UDF, a deserialized accumulator and its value can be overwritten. We should check whether an accumulator has already been deserialized before overwriting it in the accumulator registry.

## How was this patch tested?

Added test.

Closes #22635 from viirya/SPARK-25591.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
(cherry picked from commit cb90617)
Signed-off-by: hyukjinkwon <[email protected]>
@asfgit asfgit closed this in cb90617 Oct 8, 2018
@HyukjinKwon
Member

Merged to master and branch-2.4.

@AbdealiLoKo

@cloud-fan @viirya Any chance of this making it into 2.4?

@gatorsmile
Member

How about pandas UDF? How about using RDD APIs? Do we face the same issues?

@cloud-fan
Contributor

@AbdealiJK since RC3 is not cut, this will be in 2.4.

@HyukjinKwon
Member

Yes, the same issue exists in Pandas UDFs too (I quickly double-checked). This PR fixes it. FYI, both go through the same code path.

@viirya
Member Author

viirya commented Oct 8, 2018

@cloud-fan @gatorsmile @HyukjinKwon Thanks. Yes, Pandas UDFs have the same issue, and it is fixed by this PR.

@Tagar

Tagar commented Nov 19, 2018

Please review https://issues.apache.org/jira/browse/SPARK-26019
"pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()"

I suspect this change might have introduced the SPARK-26019 regression.

@viirya
Member Author

viirya commented Nov 19, 2018

@Tagar I will look into it. Thanks.

@Tagar

Tagar commented Nov 19, 2018

Thank you @viirya

@HyukjinKwon
Member

How is this related to the JIRA? It does not look related from a cursory look. Please leave some analysis next time, or at least test before and after the specific commit. Let me take a look anyway.

@HyukjinKwon
Member

This fix went into 2.4.0, and your issue appeared between 2.3.1 and 2.3.2. It's not related.

@viirya
Member Author

viirya commented Nov 20, 2018

Yeah, thanks @HyukjinKwon. I had an initial look, and it does not seem to be related.

@Tagar

Tagar commented Nov 20, 2018

@viirya I apologize. As I mentioned in my comment in SPARK-26019, it's due to another change,
15fc237#diff-c3339bbf2b850b79445b41e9eecf57c4R249. The error happens in authenticate_and_accum_updates(), which is new code introduced by that commit. Thanks for looking at that anyway!

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
@viirya viirya deleted the SPARK-25591 branch December 27, 2023 18:21