[SPARK-27871][SQL] LambdaVariable should use per-query unique IDs instead of globally unique IDs #24735
rednaxelafx left a comment:
In general I like the idea of normalizing lambda variable IDs to make higher-order functions and the codegen cache work better together. The details of the implementation need some polishing, though.
BTW, there are a few changes related to resolvedEnc that don't seem directly related to the topic of lambda variables. Can we separate those out into another PR instead? I like those improvements, but they look a bit confusing in the context of this PR.
My earlier PR that added the whole-stage codegen ID used another metric for the same purpose:
e57f394#diff-0314224342bb8c30143ab784b3805d19R296
Should we try to make them use the exact same logic for checking whether or not codegen cache was hit?
Nit: "starts from"
The two traversals over the plan here certainly make the intent clear: one pass to collect the old-to-new ID mappings, and another pass to actually do the transformation.
But efficiency-wise, these two traversals can easily be combined into one, right? Reducing the number of traversals can save a lot of time when dealing with large plans.
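For illustration, the fused version could look something like this minimal sketch (`Variable` and `reassignInOnePass` are hypothetical stand-ins, not Spark's actual rule API): the old-to-new ID mapping is built lazily during the same traversal that performs the substitution.

```scala
// Minimal sketch, not Spark's actual API: fuse the "collect ID mappings"
// pass and the "substitute IDs" pass into one traversal by allocating a
// fresh negative ID the first time each old ID is seen.
case class Variable(id: Long)

def reassignInOnePass(plan: Seq[Variable]): Seq[Variable] = {
  val mapping = collection.mutable.Map.empty[Long, Long]
  var nextId = 0L
  plan.map { v =>
    // getOrElseUpdate records the old-to-new mapping on first sight,
    // so repeated occurrences of the same old ID map consistently.
    val newId = mapping.getOrElseUpdate(v.id, { nextId -= 1; nextId })
    Variable(newId)
  }
}
```

The same idea carries over to a tree transform: the map lookup happens per node during a single `transform`, so no separate collection pass is needed.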
fixed.
There is an implicit assumption here that within a plan, all the LambdaVariables hold either the original IDs (positive) or the reassigned ones (negative). We should probably add a comment on that, because if positive and negative IDs are mixed together, you can actually get a conflict when you take the abs.
fixed.
I don't like this part. It might make codegen a bit easier to write, but you're unnecessarily hoisting local variables into Java object fields. That doesn't sound like a good idea to me.
Note that this just moves code around: we already blindly put LambdaVariable into the mutable states. I agree that there is room to optimize this part; we can do it in follow-ups.
Actually they are related. When using
Should this rule be applied only Once instead of at fixedPoint? Otherwise, the per-query unique IDs might conflict.
The newly generated unique IDs are negative, so this rule is idempotent: it only touches positive IDs.
But if the rule is applied twice and the second pass somehow includes positive IDs again, the new IDs start from -1 once more and would conflict with the first -1, I think.
The IDs should be all positive or all negative. Let me add some checks to ensure that.
Thanks, the check sounds great.
Probably add a comment about the id? The IDs allocated here can't be used directly; they need to go through ReassignLambdaVariableID to be reassigned. Even if expressions containing LambdaVariable skip the normal query processing steps, we should still make sure ReassignLambdaVariableID is applied.
> The allocated ids here can't be directly used

Yes they can. A globally unique ID is fine here; it just breaks the codegen cache.
There we take the abs of both positive and negative IDs; if a globally unique ID is used, won't it probably conflict?
See the newly added check. If ReassignLambdaVariableID is applied, then all IDs are negative and unique. If the rule is not applied, then all IDs are positive and unique.
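The danger the check guards against can be shown with a tiny sketch (`LambdaVar`, `checkIdInvariant`, and `absCollides` are hypothetical stand-ins, not Spark's actual code): if a plan mixed original (positive) and reassigned (negative) IDs, taking the abs could collide two distinct variables.

```scala
// Sketch using a hypothetical stand-in for LambdaVariable: enforce that a
// plan's IDs are all positive (never reassigned) or all negative
// (reassigned), because abs-normalization on a mix can collide.
case class LambdaVar(id: Long)

def checkIdInvariant(vars: Seq[LambdaVar]): Unit = {
  require(vars.forall(_.id > 0) || vars.forall(_.id < 0),
    "LambdaVariable IDs must be all positive or all negative within a plan")
}

// Two distinct variables whose IDs differ only in sign would be merged
// by abs: e.g. LambdaVar(5) and LambdaVar(-5) both normalize to 5.
def absCollides(a: LambdaVar, b: LambdaVar): Boolean =
  a.id != b.id && math.abs(a.id) == math.abs(b.id)
```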
AFAIK this is safe as we never combine optimized plans.
Should we just add a sanity check that the id is negative?
I've added a check in the rule.
rednaxelafx left a comment:
LGTM
gatorsmile left a comment:
LGTM
Thanks! Merged to master.
### What changes were proposed in this pull request?
This PR fixes a thread-safety bug in `SparkSession.createDataset(Seq)`: if the caller-supplied `Encoder` is used in multiple threads, then createDataset's usage of the encoder may lead to incorrect or corrupt results, because the Encoder's internal mutable state will be updated from multiple threads.
Here is an example demonstrating the problem:
```scala
import org.apache.spark.sql._
val enc = implicitly[Encoder[(Int, Int)]]
val datasets = (1 to 100).par.map { _ =>
  val pairs = (1 to 100).map(x => (x, x))
  spark.createDataset(pairs)(enc)
}
datasets.reduce(_ union _).collect().foreach { pair =>
  require(pair._1 == pair._2, s"Pair elements are mismatched: $pair")
}
```
Before this PR's change, the above example fails because Spark produces corrupted records where different input records' fields have been co-mingled.
This bug is similar to SPARK-22355 / #19577, which fixed an analogous problem in `Dataset.collect()`.
The fix implemented here is based on #24735's updated version of the `Dataset.collect()` bugfix: use `.copy()`. For consistency, I used the same [code comment](https://github.com/apache/spark/blob/d841b33ba3a9b0504597dbccd4b0d11fa810abf3/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3414) / explanation as that PR.
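The failure mode can be sketched without Spark at all. In the sketch below, `ReusedBuffer` is a hypothetical stand-in for an encoder's internal mutable row: returning the shared buffer lets concurrent callers overwrite each other's records, while copying before returning gives each caller an independent snapshot, which is what the `.copy()` fix does.

```scala
// Sketch of the bug class (hypothetical type, not Spark's API): an
// encoder that reuses one internal mutable buffer is unsafe to share
// across threads unless each result is copied out of the buffer.
final class ReusedBuffer {
  private val values = new Array[Int](2)

  // Unsafe under concurrency: every caller gets the same array, so a
  // later set() retroactively corrupts earlier callers' results.
  def set(a: Int, b: Int): Array[Int] = {
    values(0) = a; values(1) = b; values
  }

  // Safe: hand out a copy, analogous to calling .copy() on the row.
  def setAndCopy(a: Int, b: Int): Array[Int] = set(a, b).clone()
}
```

The copy costs an allocation per record, which is the trade-off this class of fix accepts in exchange for correctness under concurrent use.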
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Tested manually using the example listed above.
Thanks to smcnamara-stripe for identifying this bug.
Closes #26076 from JoshRosen/SPARK-29419.
Authored-by: Josh Rosen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit f4499f6)
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?
For simplicity, all `LambdaVariable`s are globally unique, to avoid any potential conflicts. However, this causes a perf problem: we can never hit the codegen cache for encoder expressions that deal with collections (which means they contain `LambdaVariable`).

To overcome this problem, `LambdaVariable` should have per-query unique IDs. This PR does 2 things:
1. Refactor `LambdaVariable` to carry an ID, so that it's easier to change the ID.
2. Add a rule (`ReassignLambdaVariableID`) to reassign `LambdaVariable` IDs, which are per-query unique.

### How was this patch tested?
new tests
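To see why per-query IDs matter for the codegen cache, here is a toy sketch (the `genCode` method and cache-by-source-text model below are simplifications, not Spark's real codegen): generated source is cached by its text, so any ID embedded in that text must be stable across identical queries for the cache to ever hit.

```scala
// Toy sketch, not Spark's real codegen: a codegen cache keyed by the
// generated source text misses for every query when a globally unique ID
// is embedded in that text, but hits when IDs are normalized per query.
import java.util.concurrent.atomic.AtomicLong

case class LambdaVariable(id: Long) {
  def genCode: String = s"Object lambda_$id = null;"
}

val globalIds = new AtomicLong(0)

// Globally unique IDs: two identical queries generate different source,
// so the second query can never reuse the first query's compiled code.
val q1 = LambdaVariable(globalIds.incrementAndGet()).genCode
val q2 = LambdaVariable(globalIds.incrementAndGet()).genCode

// Per-query IDs (e.g. normalized to -1 by a reassignment rule):
// identical queries generate identical source, so the cache can hit.
val n1 = LambdaVariable(-1L).genCode
val n2 = LambdaVariable(-1L).genCode
```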