[SPARK-38836][SQL] Improve the performance of ExpressionSet #36121
Conversation
@sigmod @cloud-fan Could you two help review this PR? Thanks.
if (e.deterministic) {
  baseSet --= baseSet.filter(_ == e.canonicalized)
  originals --= originals.filter(_.canonicalized == e.canonicalized)
  baseSet.retain(_ != e.canonicalized)
`baseSet.retain(_ != e.canonicalized)` only needs to traverse the set once, which is much better than the previous implementation.
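As a standalone illustration (plain Scala 2.12 collections, not the Spark classes), the two removal strategies compare like this:

```scala
import scala.collection.mutable

// Previous approach: materialize the matching elements with filter, then remove
// each of them from the set again -- two traversals plus an intermediate collection.
val baseSetOld = mutable.Set(1, 2, 3, 4, 5)
baseSetOld --= baseSetOld.filter(_ == 3)

// New approach: a single in-place pass over the set.
// (`retain` is the Scala 2.12 name; the Scala 2.13 sources use `filterInPlace`.)
val baseSetNew = mutable.Set(1, 2, 3, 4, 5)
baseSetNew.retain(_ != 3)

assert(baseSetOld == baseSetNew)
```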
  baseSet --= baseSet.filter(_ == e.canonicalized)
  originals --= originals.filter(_.canonicalized == e.canonicalized)
  baseSet.retain(_ != e.canonicalized)
  originals = originals.filter(!_.semanticEquals(e))
There are two changes happening in this line.
- Changes to `originals = originals.filter(...)`. This is O(n) whereas the previous implementation is O(mn).
- Uses `semanticEquals` instead of the previous `_.canonicalized == e.canonicalized` as the condition. `semanticEquals` can short circuit and avoid unnecessary access of `canonicalized` for non-deterministic expressions.

By invariant 2, the current implementation should not evaluate `canonicalized` at all.
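A tiny standalone model of the short-circuit (the `Expr` stand-in below is hypothetical; the real check lives in Spark's `Expression.semanticEquals`):

```scala
// Stand-in for an expression whose canonicalized form is expensive to compute.
final case class Expr(deterministic: Boolean, canon: () => String) {
  // Mirrors the shape of the check: the `deterministic` flags are tested first,
  // so `canon()` (standing in for `canonicalized`) may never be evaluated.
  def semanticEquals(other: Expr): Boolean =
    deterministic && other.deterministic && canon() == other.canon()
}

val nonDetOriginal = Expr(deterministic = false, () => sys.error("never canonicalized"))
val removed        = Expr(deterministic = true,  () => "a + 1")

// In `originals.filter(!_.semanticEquals(e))`, a non-deterministic original fails
// the first check, so its canonicalized form is never materialized.
assert(!nonDetOriginal.semanticEquals(removed))
```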
override def filter(p: Expression => Boolean): ExpressionSet = {
  val newBaseSet = baseSet.filter(e => p(e.canonicalized))
  val newBaseSet = baseSet.filter(e => p(e))
By invariant 1, `.canonicalized` is not needed here and can cause a performance issue.
}
override def filterNot(p: Expression => Boolean): ExpressionSet = {
  val newBaseSet = baseSet.filterNot(e => p(e.canonicalized))
By invariant 1, `.canonicalized` is not needed here and can cause a performance issue.
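A standalone sketch of the reasoning with illustrative stand-in types (not the Spark classes): if `baseSet` only ever holds expressions that are already canonical, then `p(e.canonicalized)` and `p(e)` agree, and the extra canonicalization inside `filter`/`filterNot` is pure overhead:

```scala
// Stand-in for an expression that is already in canonical form, so
// canonicalization is a no-op here.
final case class CanonExpr(repr: String) {
  def canonicalized: CanonExpr = this
}

val baseSet = Set(CanonExpr("a + 1"), CanonExpr("b + 2"))
val p: CanonExpr => Boolean = _.repr.startsWith("a")

// filter and filterNot give the same result with or without the redundant call.
assert(baseSet.filter(e => p(e.canonicalized)) == baseSet.filter(e => p(e)))
assert(baseSet.filterNot(e => p(e.canonicalized)) == baseSet.filterNot(e => p(e)))
```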
minyyy
left a comment
Put some review comments in to pin down the key changes.
Can one of the admins verify this patch?
Is there a perf number? FYI, it's for semantics why the
Not yet, but I can create a benchmark for it.
I actually saw that PR. From what I understood, #29598 did two things. The first is that it fixed the insertion stability issue of
What we should do is to also extend
sql/catalyst/src/main/scala-2.12/org/apache/spark/sql/catalyst/expressions/ExpressionSet.scala
sql/catalyst/src/main/scala-2.13/org/apache/spark/sql/catalyst/expressions/ExpressionSet.scala
LGTM, it's better to see some numbers from a microbenchmark though.
I ran the TPC-DS benchmark twice locally:
~/Work/spark (expr_set ✔) cat /tmp/master_result | grep Filter
~/Work/spark (master ✔) cat /tmp/master_result2 | grep Filter
The major improvement is to InferFiltersFromConstraints, where it does lots of unnecessary access to
The benchmark ran the TPC-DS queries. But the
cloud-fan
left a comment
LGTM, please fix test failures.
All tests passed. CI was broken when I updated the PR last time. Re-triggered the CI.
if (e.deterministic) {
  baseSet --= baseSet.filter(_ == e.canonicalized)
  originals --= originals.filter(_.canonicalized == e.canonicalized)
  baseSet.retain(_ != e.canonicalized)
how about:
baseSet.remove(e.canonicalized)
Done.
protected def remove(e: Expression): Unit = {
  if (e.deterministic) {
    baseSet.filterInPlace(_ != e.canonicalized)
How about:
baseSet.remove(e.canonicalized)
Done.
thanks, merging to master!
…nabled
### What changes were proposed in this pull request?
This PR fixes a bug in broadcast handling in `PythonRunner` when encryption is enabled. Due to this bug the following PySpark script:
```
bin/pyspark --conf spark.io.encryption.enabled=true
...
bar = {"a": "aa", "b": "bb"}
foo = spark.sparkContext.broadcast(bar)
spark.udf.register("MYUDF", lambda x: foo.value[x] if x else "")
spark.sql("SELECT MYUDF('a') AS a, MYUDF('b') AS b").collect()
```
fails with:
```
22/10/21 17:14:32 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", line 811, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", line 87, in read_command
command = serializer._read_with_length(file)
File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 173, in _read_with_length
return self.loads(obj)
File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 471, in loads
return cloudpickle.loads(obj, encoding=encoding)
EOFError: Ran out of input
```
The reason for this failure is that multiple Python UDFs reference the same broadcast, and in the current code:
https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L385-L420
the number of broadcasts (`cnt`) is correct (1) but the broadcast id is serialized twice from the JVM to Python, corrupting the next item that Python expects from the JVM side.
Please note that the example above works in Spark 3.3 without this fix. That is because #36121 in Spark 3.4 modified `ExpressionSet` and so `udfs` in `ExtractPythonUDFs`:
https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala#L239-L242
changed from `Stream` to `Vector`. When `broadcastVars` (and so `idsAndFiles`) is a `Stream`, the example accidentally works because the broadcast id is written to `dataOut` only once (`oldBids.add(id)` in `idsAndFiles.foreach` is called before the 2nd item is computed in `broadcastVars.flatMap`). But that doesn't mean that #36121 introduced the regression: `EncryptedPythonBroadcastServer` shouldn't serve the broadcast data twice (which it currently does, unnoticed), as that could fail other cases where more than one broadcast is used in UDFs.
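For illustration, here is a minimal standalone model of that laziness effect (not the actual `PythonRunner` code; the names only mirror the description above):

```scala
import scala.collection.mutable

// With a Stream, the foreach side effect (oldBids.add) runs before the second
// flatMap element is computed, so the duplicate id is filtered out; with a
// Vector, flatMap is fully evaluated first and the same id is written twice.
def writeBroadcastIds(broadcastVars: Seq[Long]): Seq[Long] = {
  val oldBids = mutable.Set.empty[Long]
  val written = mutable.Buffer.empty[Long]
  val idsAndFiles = broadcastVars.flatMap { id =>
    if (oldBids.contains(id)) Nil else Seq(id)
  }
  idsAndFiles.foreach { id =>
    oldBids.add(id)
    written += id // stands in for serializing the broadcast id to dataOut
  }
  written.toSeq
}

// Two UDFs referencing the same broadcast => the same id arrives twice.
assert(writeBroadcastIds(Stream(7L, 7L)) == Seq(7L))     // lazily masked: written once
assert(writeBroadcastIds(Vector(7L, 7L)) == Seq(7L, 7L)) // eager: written twice
```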
### Why are the changes needed?
To fix a bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added new UT.
Closes #38334 from peter-toth/SPARK-40874-fix-broadcasts-in-python-udf.
Authored-by: Peter Toth <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…cala 2.12
### What changes were proposed in this pull request?
Fix `ExpressionSet` performance regression in Scala 2.12.
### Why are the changes needed?
The implementation of the `SetLike.++` method in Scala 2.12 iteratively executes the `+` method. The `ExpressionSet.+` method first clones a new object and then adds the element, which is very expensive.
https://github.com/scala/scala/blob/ceaf7e68ac93e9bbe8642d06164714b2de709c27/src/library/scala/collection/SetLike.scala#L186
After #36121, the `++` and `--` methods in the Scala 2.12 `ExpressionSet` were removed, causing a performance regression.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Benchmark code:
```
object TestBenchmark {
  def main(args: Array[String]): Unit = {
    val count = 300
    val benchmark = new Benchmark("Test ExpressionSetV2 ++ ", count)
    val aUpper = AttributeReference("A", IntegerType)(exprId = ExprId(1))
    var initialSet = ExpressionSet((0 until 300).map(i => aUpper + i))
    val setToAddWithSameDeterministicExpression = ExpressionSet((0 until 300).map(i => aUpper + i))
    benchmark.addCase("Test ++", 10) { _: Int =>
      for (_ <- 0L until count) {
        initialSet ++= setToAddWithSameDeterministicExpression
      }
    }
    benchmark.run()
  }
}
```
before this change:
```
OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-957.el7.x86_64
Intel Core Processor (Skylake, IBRS)
Test ExpressionSetV2 ++ :    Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Test ++                               1577           1691          61         0.0     5255516.0       1.0X
```
after this change:
```
OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-957.el7.x86_64
Intel Core Processor (Skylake, IBRS)
Test ExpressionSetV2 ++ :    Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Test ++                                 14             14           0         0.0       45395.2       1.0X
```
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #46114 from wForget/SPARK-47897.
Authored-by: Zhen Wang <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
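For reference, the fix described above essentially restores overridden bulk operators on the Scala 2.12 `ExpressionSet` so the set is cloned once per call instead of once per added element. A hedged sketch of that shape (methods as they might appear inside `ExpressionSet` in the scala-2.12 source set; not a verbatim copy of the patch):

```scala
import scala.collection.GenTraversableOnce

// Clone once, then mutate the copy, instead of letting SetLike.++ fall back to
// repeated `+` calls that each clone the whole ExpressionSet.
override def ++(elems: GenTraversableOnce[Expression]): ExpressionSet = {
  val newSet = clone()
  elems.foreach(newSet.add)
  newSet
}

// Same idea for bulk removal.
override def --(elems: GenTraversableOnce[Expression]): ExpressionSet = {
  val newSet = clone()
  elems.foreach(newSet.remove)
  newSet
}
```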
What changes were proposed in this pull request?
This PR refactors the `ExpressionSet` class. It does two things:
Why are the changes needed?
It can dramatically improve the performance of ExpressionSet.
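For context, `ExpressionSet` deduplicates expressions by semantic (canonicalized) equality rather than structural equality, which is why the review comments above focus on `canonicalized` and `semanticEquals`. An illustrative snippet, assuming a Spark shell with catalyst's expression DSL on the classpath:

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.IntegerType

// `a + 1` and `1 + a` are structurally different but canonicalize to the same
// expression, so the set keeps only one of them.
val a = AttributeReference("a", IntegerType)(exprId = ExprId(1))
val set = ExpressionSet(Seq(a + Literal(1), Literal(1) + a))
assert(set.size == 1)
```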
Does this PR introduce any user-facing change?
No.
How was this patch tested?
It is a performance improvement; code review should be a safe way to verify it.