[SPARK-27871][SQL] LambdaVariable should use per-query unique IDs instead of globally unique IDs #24735
rednaxelafx left a comment:
In general I like the idea of normalizing lambda variable IDs to make higher-order functions and the codegen cache work better together. The details of the implementation need some polishing, though.
BTW, there are a few changes related to resolvedEnc that don't seem directly related to the topic of lambda variables. Can we separate those out into another PR instead? I like those improvements, but they look a bit confusing in the context of this PR.
My earlier PR that added the whole-stage codegen ID used another metric for the same purpose:
e57f394#diff-0314224342bb8c30143ab784b3805d19R296
Should we try to make them use the exact same logic for checking whether or not codegen cache was hit?
Nit: "starts from"
The two traversals over the plan here certainly make the intent clear: one pass to collect the old-to-new ID mappings, and another pass to actually do the transformation.
But efficiency-wise, these two traversals can easily be combined into one, right? Reducing the number of traversals can save a lot of time when dealing with large plans.
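For illustration, the fused version could look something like this minimal sketch (`Variable` and `reassignInOnePass` are hypothetical stand-ins, not Spark's actual rule API): the old-to-new ID mapping is built lazily during the same traversal that performs the substitution.

```scala
// Minimal sketch, not Spark's actual API: fuse the "collect ID mappings"
// pass and the "substitute IDs" pass into one traversal by allocating a
// fresh negative ID the first time each old ID is seen.
case class Variable(id: Long)

def reassignInOnePass(plan: Seq[Variable]): Seq[Variable] = {
  val mapping = collection.mutable.Map.empty[Long, Long]
  var nextId = 0L
  plan.map { v =>
    // getOrElseUpdate records the old-to-new mapping on first sight,
    // so repeated occurrences of the same old ID map consistently.
    val newId = mapping.getOrElseUpdate(v.id, { nextId -= 1; nextId })
    Variable(newId)
  }
}
```

The same idea carries over to a tree transform: the map lookup happens per node during a single `transform`, so no separate collection pass is needed.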
fixed.
There is an implicit assumption here that within a plan, all the LambdaVariables hold either the original IDs (positive) or the reassigned ones (negative). We should probably add a comment on that, because if positive and negative IDs are mixed together, you can actually get a conflict when you take the abs.
fixed.
I don't like this part. It might make codegen a bit easier to write, but you're unnecessarily hoisting local variables into Java object fields. That doesn't sound like a good idea to me.
Note that this just moves code around: we already blindly put LambdaVariable into the mutable states. I agree that there is room to optimize this part; we can do it in follow-ups.
Actually they are related. When using
Should this rule be applied only Once instead of at fixedPoint? Otherwise, the per-query unique IDs might conflict.
The newly generated unique IDs are negative, so this rule is idempotent: it only touches positive IDs.
But if the rule is applied twice and the second pass somehow includes positive IDs again, the new IDs start from -1 once more and would conflict with the first -1, I think.
The IDs should be all positive or all negative. Let me add some checks to ensure that.
Thanks, the check sounds great.
Probably add a comment about the id? The IDs allocated here can't be used directly; they need to go through ReassignLambdaVariableID to be reassigned. Even if expressions containing LambdaVariable skip the normal query processing steps, we should still make sure ReassignLambdaVariableID is applied.
> The allocated ids here can't be directly used

Yes they can. A globally unique ID is fine here; it just breaks the codegen cache.
There we take the abs of both positive and negative IDs; if a globally unique ID is used, won't it probably conflict?
See the newly added check. If ReassignLambdaVariableID is applied, then all IDs are negative and unique. If the rule is not applied, then all IDs are positive and unique.
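The danger the check guards against can be shown with a tiny sketch (`LambdaVar`, `checkIdInvariant`, and `absCollides` are hypothetical stand-ins, not Spark's actual code): if a plan mixed original (positive) and reassigned (negative) IDs, taking the abs could collide two distinct variables.

```scala
// Sketch using a hypothetical stand-in for LambdaVariable: enforce that a
// plan's IDs are all positive (never reassigned) or all negative
// (reassigned), because abs-normalization on a mix can collide.
case class LambdaVar(id: Long)

def checkIdInvariant(vars: Seq[LambdaVar]): Unit = {
  require(vars.forall(_.id > 0) || vars.forall(_.id < 0),
    "LambdaVariable IDs must be all positive or all negative within a plan")
}

// Two distinct variables whose IDs differ only in sign would be merged
// by abs: e.g. LambdaVar(5) and LambdaVar(-5) both normalize to 5.
def absCollides(a: LambdaVar, b: LambdaVar): Boolean =
  a.id != b.id && math.abs(a.id) == math.abs(b.id)
```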
AFAIK this is safe as we never combine optimized plans.
Should we just add a sanity check that the id is negative?
I've added a check in the rule.
rednaxelafx left a comment:
LGTM
gatorsmile left a comment:
LGTM
Thanks! Merged to master.
### What changes were proposed in this pull request?
This PR fixes a thread-safety bug in `SparkSession.createDataset(Seq)`: if the caller-supplied `Encoder` is used in multiple threads, then createDataset's usage of the encoder may lead to incorrect or corrupt results, because the Encoder's internal mutable state will be updated from multiple threads.
Here is an example demonstrating the problem:
```scala
import org.apache.spark.sql._
val enc = implicitly[Encoder[(Int, Int)]]
val datasets = (1 to 100).par.map { _ =>
  val pairs = (1 to 100).map(x => (x, x))
  spark.createDataset(pairs)(enc)
}
datasets.reduce(_ union _).collect().foreach { pair =>
  require(pair._1 == pair._2, s"Pair elements are mismatched: $pair")
}
```
Before this PR's change, the above example fails because Spark produces corrupted records where different input records' fields have been co-mingled.
This bug is similar to SPARK-22355 / #19577, which fixed an analogous problem in `Dataset.collect()`.
The fix implemented here is based on #24735's updated version of the `Dataset.collect()` bugfix: use `.copy()`. For consistency, I used the same [code comment](https://github.com/apache/spark/blob/d841b33ba3a9b0504597dbccd4b0d11fa810abf3/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3414) / explanation as that PR.
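The failure mode can be sketched without Spark at all. In the sketch below, `ReusedBuffer` is a hypothetical stand-in for an encoder's internal mutable row: returning the shared buffer lets concurrent callers overwrite each other's records, while copying before returning gives each caller an independent snapshot, which is what the `.copy()` fix does.

```scala
// Sketch of the bug class (hypothetical type, not Spark's API): an
// encoder that reuses one internal mutable buffer is unsafe to share
// across threads unless each result is copied out of the buffer.
final class ReusedBuffer {
  private val values = new Array[Int](2)

  // Unsafe under concurrency: every caller gets the same array, so a
  // later set() retroactively corrupts earlier callers' results.
  def set(a: Int, b: Int): Array[Int] = {
    values(0) = a; values(1) = b; values
  }

  // Safe: hand out a copy, analogous to calling .copy() on the row.
  def setAndCopy(a: Int, b: Int): Array[Int] = set(a, b).clone()
}
```

The copy costs an allocation per record, which is the trade-off this class of fix accepts in exchange for correctness under concurrent use.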
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Tested manually using the example listed above.
Thanks to smcnamara-stripe for identifying this bug.
Closes #26076 from JoshRosen/SPARK-29419.
Authored-by: Josh Rosen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit f4499f6)
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?
For simplicity, all `LambdaVariable`s are globally unique, to avoid any potential conflicts. However, this causes a perf problem: we can never hit the codegen cache for encoder expressions that deal with collections (which means they contain `LambdaVariable`).

To overcome this problem, `LambdaVariable` should have per-query unique IDs. This PR does 2 things:
1. Refactor `LambdaVariable` to carry an ID, so that it's easier to change the ID.
2. Add a rule (`ReassignLambdaVariableID`) to reassign `LambdaVariable` IDs, which are per-query unique.

### How was this patch tested?
new tests
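To see why per-query IDs matter for the codegen cache, here is a toy sketch (the `genCode` method and cache-by-source-text model below are simplifications, not Spark's real codegen): generated source is cached by its text, so any ID embedded in that text must be stable across identical queries for the cache to ever hit.

```scala
// Toy sketch, not Spark's real codegen: a codegen cache keyed by the
// generated source text misses for every query when a globally unique ID
// is embedded in that text, but hits when IDs are normalized per query.
import java.util.concurrent.atomic.AtomicLong

case class LambdaVariable(id: Long) {
  def genCode: String = s"Object lambda_$id = null;"
}

val globalIds = new AtomicLong(0)

// Globally unique IDs: two identical queries generate different source,
// so the second query can never reuse the first query's compiled code.
val q1 = LambdaVariable(globalIds.incrementAndGet()).genCode
val q2 = LambdaVariable(globalIds.incrementAndGet()).genCode

// Per-query IDs (e.g. normalized to -1 by a reassignment rule):
// identical queries generate identical source, so the cache can hit.
val n1 = LambdaVariable(-1L).genCode
val n2 = LambdaVariable(-1L).genCode
```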