-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-25591][PySpark][SQL] Avoid overwriting deserialized accumulator #22635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -109,10 +109,14 @@ | |
|
|
||
| def _deserialize_accumulator(aid, zero_value, accum_param): | ||
| from pyspark.accumulators import _accumulatorRegistry | ||
| accum = Accumulator(aid, zero_value, accum_param) | ||
| accum._deserialized = True | ||
| _accumulatorRegistry[aid] = accum | ||
| return accum | ||
| # If this certain accumulator was deserialized, don't overwrite it. | ||
| if aid in _accumulatorRegistry: | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, so the problem is this accumulator is de/serialized multiple times and
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes. |
||
| return _accumulatorRegistry[aid] | ||
| else: | ||
| accum = Accumulator(aid, zero_value, accum_param) | ||
| accum._deserialized = True | ||
| _accumulatorRegistry[aid] = accum | ||
| return accum | ||
|
|
||
|
|
||
| class Accumulator(object): | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -3603,6 +3603,31 @@ def test_repr_behaviors(self): | |
| self.assertEquals(None, df._repr_html_()) | ||
| self.assertEquals(expected, df.__repr__()) | ||
|
|
||
| # SPARK-25591 | ||
| def test_same_accumulator_in_udfs(self): | ||
| from pyspark.sql.functions import udf | ||
|
|
||
| data_schema = StructType([StructField("a", IntegerType(), True), | ||
| StructField("b", IntegerType(), True)]) | ||
| data = self.spark.createDataFrame([[1, 2]], schema=data_schema) | ||
|
|
||
| test_accum = self.sc.accumulator(0) | ||
|
|
||
| def first_udf(x): | ||
| test_accum.add(1) | ||
| return x | ||
|
|
||
| def second_udf(x): | ||
| test_accum.add(100) | ||
| return x | ||
|
|
||
| func_udf = udf(first_udf, IntegerType()) | ||
| func_udf2 = udf(second_udf, IntegerType()) | ||
| data = data.withColumn("out1", func_udf(data["a"])) | ||
| data = data.withColumn("out2", func_udf2(data["b"])) | ||
| data.collect() | ||
| self.assertEqual(test_accum.value, 101) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @viirya, can we just use int for data and accumulator as well in this test case?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok. |
||
|
|
||
|
|
||
| class HiveSparkSubmitTests(SparkSubmitTests): | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be
if aid in _accumulatorRegistry and _accumulatorRegistry[aid]._deserialized is Trueor:
To make double sure that this function always returns a deserialize version of the accum ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only save deserialized accumulators (
_deserializedisTrue) into this dict.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesnt seem right because the constructor for
Accumulatorhas:PS: First time Im looking at this code, so not too familiar with it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but
_deserialize_accumulatoris only called when doing deserialzation at executors. The constructor saves accumulators in_accumulatorRegistryat driver.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see - got it 👍