Skip to content

Conversation

@wForget
Copy link
Member

@wForget wForget commented Apr 18, 2024

What changes were proposed in this pull request?

Fix ExpressionSet performance regression in scala 2.12.

Why are the changes needed?

The implementation of the SetLike.++ method in scala 2.12 is to iteratively execute the + method. The ExpressionSet.+ method first clones a new object and then adds element, which is very expensive.

https://github.com/scala/scala/blob/ceaf7e68ac93e9bbe8642d06164714b2de709c27/src/library/scala/collection/SetLike.scala#L186

After #36121, the ++ and -- methods in ExpressionSet of scala 2.12 were removed, causing performance regression.

Does this PR introduce any user-facing change?

How was this patch tested?

Benchmark code:

import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, ExpressionSet, ExprId}
import org.apache.spark.sql.types.IntegerType

object TestBenchmark {
  def main(args: Array[String]): Unit = {
    val count = 300
    val benchmark = new Benchmark("Test ExpressionSetV2 ++ ", count)
    val aUpper = AttributeReference("A", IntegerType)(exprId = ExprId(1))

    var initialSet = ExpressionSet((0 until 300).map(i => aUpper + i))
    val setToAddWithSameDeterministicExpression = ExpressionSet((0 until 300).map(i => aUpper + i))

    benchmark.addCase("Test ++", 10) { _: Int =>
      for (_ <- 0L until count) {
        initialSet ++= setToAddWithSameDeterministicExpression
      }
    }
    benchmark.run()
  }
}

before this change:

OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-957.el7.x86_64
Intel Core Processor (Skylake, IBRS)
Test ExpressionSetV2 ++ :                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Test ++                                            1577           1691          61          0.0     5255516.0       1.0X

after this change:

OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-957.el7.x86_64
Intel Core Processor (Skylake, IBRS)
Test ExpressionSetV2 ++ :                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Test ++                                              14             14           0          0.0       45395.2       1.0X

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Apr 18, 2024
@wForget
Copy link
Member Author

wForget commented Apr 18, 2024

cc @minyyy @cloud-fan could you please take a look?

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-47897][SQL] Fix ExpressionSet performance regression in scala 2.12 [SPARK-47897][SQL][3.5] Fix ExpressionSet performance regression in scala 2.12 Apr 18, 2024
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a regression at SPARK-38836, do we need to to fix this at branch-3.4, too, @wForget ?

@dongjoon-hyun
Copy link
Member

cc @LuciferYang , @viirya , too

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

newSet
}

override def ++(elems: GenTraversableOnce[Expression]): ExpressionSet = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment on the method about why we don't use the SetLike.default here? Thanks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, added.

@wForget
Copy link
Member Author

wForget commented Apr 19, 2024

If this is a regression at SPARK-38836, do we need to to fix this at branch-3.4, too, @wForget ?

I think it's needed.

yaooqinn pushed a commit that referenced this pull request Apr 19, 2024
…cala 2.12

### What changes were proposed in this pull request?

Fix `ExpressionSet` performance regression in scala 2.12.

### Why are the changes needed?

The implementation of the `SetLike.++` method in scala 2.12 is to iteratively execute the `+` method. The `ExpressionSet.+` method first clones a new object and then adds element, which is very expensive.

https://github.com/scala/scala/blob/ceaf7e68ac93e9bbe8642d06164714b2de709c27/src/library/scala/collection/SetLike.scala#L186

After #36121, the `++` and `--` methods in ExpressionSet of scala 2.12 were removed, causing performance regression.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Benchmark code:

```
object TestBenchmark {
  def main(args: Array[String]): Unit = {
    val count = 300
    val benchmark = new Benchmark("Test ExpressionSetV2 ++ ", count)
    val aUpper = AttributeReference("A", IntegerType)(exprId = ExprId(1))

    var initialSet = ExpressionSet((0 until 300).map(i => aUpper + i))
    val setToAddWithSameDeterministicExpression = ExpressionSet((0 until 300).map(i => aUpper + i))

    benchmark.addCase("Test ++", 10) { _: Int =>
      for (_ <- 0L until count) {
        initialSet ++= setToAddWithSameDeterministicExpression
      }
    }
    benchmark.run()
  }
}
```

before this change:

```
OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-957.el7.x86_64
Intel Core Processor (Skylake, IBRS)
Test ExpressionSetV2 ++ :                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Test ++                                            1577           1691          61          0.0     5255516.0       1.0X
```

after this change:

```
OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-957.el7.x86_64
Intel Core Processor (Skylake, IBRS)
Test ExpressionSetV2 ++ :                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Test ++                                              14             14           0          0.0       45395.2       1.0X
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46114 from wForget/SPARK-47897.

Authored-by: Zhen Wang <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
yaooqinn pushed a commit that referenced this pull request Apr 19, 2024
…cala 2.12

### What changes were proposed in this pull request?

Fix `ExpressionSet` performance regression in scala 2.12.

### Why are the changes needed?

The implementation of the `SetLike.++` method in scala 2.12 is to iteratively execute the `+` method. The `ExpressionSet.+` method first clones a new object and then adds element, which is very expensive.

https://github.com/scala/scala/blob/ceaf7e68ac93e9bbe8642d06164714b2de709c27/src/library/scala/collection/SetLike.scala#L186

After #36121, the `++` and `--` methods in ExpressionSet of scala 2.12 were removed, causing performance regression.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Benchmark code:

```
object TestBenchmark {
  def main(args: Array[String]): Unit = {
    val count = 300
    val benchmark = new Benchmark("Test ExpressionSetV2 ++ ", count)
    val aUpper = AttributeReference("A", IntegerType)(exprId = ExprId(1))

    var initialSet = ExpressionSet((0 until 300).map(i => aUpper + i))
    val setToAddWithSameDeterministicExpression = ExpressionSet((0 until 300).map(i => aUpper + i))

    benchmark.addCase("Test ++", 10) { _: Int =>
      for (_ <- 0L until count) {
        initialSet ++= setToAddWithSameDeterministicExpression
      }
    }
    benchmark.run()
  }
}
```

before this change:

```
OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-957.el7.x86_64
Intel Core Processor (Skylake, IBRS)
Test ExpressionSetV2 ++ :                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Test ++                                            1577           1691          61          0.0     5255516.0       1.0X
```

after this change:

```
OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-957.el7.x86_64
Intel Core Processor (Skylake, IBRS)
Test ExpressionSetV2 ++ :                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Test ++                                              14             14           0          0.0       45395.2       1.0X
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46114 from wForget/SPARK-47897.

Authored-by: Zhen Wang <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
(cherry picked from commit afd99d1)
Signed-off-by: Kent Yao <[email protected]>
@yaooqinn
Copy link
Member

Thank you @wForget, and @dongjoon-hyun @viirya @minyyy @cloud-fan

Merged to '3.5.2', '3.4.4'

@LuciferYang
Copy link
Contributor

Thanks for your fix @wForget

But for the TestBenchmark shown in the current pr description, there are some compilation errors when I manually copy it for testing:

[ERROR] [Error] /spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/TestBenchmark.scala:29: type mismatch;
 found   : Int
 required: String
[ERROR] [Error] /spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/TestBenchmark.scala:31: type mismatch;
 found   : Int
 required: String
[ERROR] two errors found

Could you please correct it in pr description?

@yaooqinn yaooqinn closed this Apr 19, 2024
@wForget
Copy link
Member Author

wForget commented Apr 19, 2024

But for the TestBenchmark shown in the current pr description, there are some compilation errors when I manually copy it for testing:

I guess you may be missing import org.apache.spark.sql.catalyst.dsl.expressions._

Could you please correct it in pr description?

I have added imports.

@LuciferYang
Copy link
Contributor

But for the TestBenchmark shown in the current pr description, there are some compilation errors when I manually copy it for testing:

I guess you may be missing import org.apache.spark.sql.catalyst.dsl.expressions._

Thanks. Could you please add this import to the benchmark code in the PR description?

@LuciferYang
Copy link
Contributor

Thanks @wForget

turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…cala 2.12 (apache#382)

### What changes were proposed in this pull request?

Fix `ExpressionSet` performance regression in scala 2.12.

### Why are the changes needed?

The implementation of the `SetLike.++` method in scala 2.12 is to iteratively execute the `+` method. The `ExpressionSet.+` method first clones a new object and then adds element, which is very expensive.

https://github.com/scala/scala/blob/ceaf7e68ac93e9bbe8642d06164714b2de709c27/src/library/scala/collection/SetLike.scala#L186

After apache#36121, the `++` and `--` methods in ExpressionSet of scala 2.12 were removed, causing performance regression.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Benchmark code:

```
object TestBenchmark {
  def main(args: Array[String]): Unit = {
    val count = 300
    val benchmark = new Benchmark("Test ExpressionSetV2 ++ ", count)
    val aUpper = AttributeReference("A", IntegerType)(exprId = ExprId(1))

    var initialSet = ExpressionSet((0 until 300).map(i => aUpper + i))
    val setToAddWithSameDeterministicExpression = ExpressionSet((0 until 300).map(i => aUpper + i))

    benchmark.addCase("Test ++", 10) { _: Int =>
      for (_ <- 0L until count) {
        initialSet ++= setToAddWithSameDeterministicExpression
      }
    }
    benchmark.run()
  }
}
```

before this change:

```
OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-957.el7.x86_64
Intel Core Processor (Skylake, IBRS)
Test ExpressionSetV2 ++ :                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Test ++                                            1577           1691          61          0.0     5255516.0       1.0X
```

after this change:

```
OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-957.el7.x86_64
Intel Core Processor (Skylake, IBRS)
Test ExpressionSetV2 ++ :                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Test ++                                              14             14           0          0.0       45395.2       1.0X
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#46114 from wForget/SPARK-47897.

Authored-by: Zhen Wang <[email protected]>

Signed-off-by: Kent Yao <[email protected]>
Co-authored-by: Zhen Wang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants