
[SPARK-32914][SQL] Avoid constructing dataType multiple times #29790

Closed
wangyum wants to merge 5 commits into apache:master from wangyum:SPARK-32914

Conversation

@wangyum
Member

@wangyum wangyum commented Sep 17, 2020

What changes were proposed in this pull request?

Some expressions' data types are not static values; a new object is constructed each time dataType is called, e.g. CaseWhen.
We should avoid constructing the dataType multiple times because it may be called many times, e.g. by HyperLogLogPlusPlus.update.
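
The change can be sketched in isolation (a minimal standalone example with hypothetical types, not the actual Catalyst classes): the body that builds the data type is moved into a private lazy val, so the result object is constructed once and reused on every subsequent call.

```scala
// Minimal sketch of the caching pattern (hypothetical types, not the
// real Catalyst API). Before the change, dataType would build a fresh
// object on every call; after, the result is cached in a lazy val.
sealed trait DataType
case object LongType extends DataType
case class StructType(fields: Seq[(String, DataType)]) extends DataType

class CaseWhenLike(branchTypes: Seq[DataType]) {
  // Constructed once on first access, then reused on every call.
  private lazy val internalDataType: DataType =
    StructType(branchTypes.zipWithIndex.map { case (dt, i) => (s"c$i", dt) })

  def dataType: DataType = internalDataType
}
```

Repeated calls now return the same instance (`e.dataType eq e.dataType`), where previously each call would have allocated a new object.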

Why are the changes needed?

Improve query performance. For example:

spark.range(100000000L).selectExpr("approx_count_distinct(case when id % 400 > 20 then id else 0 end)").show

Profiling result:

-- Execution profile ---
Total samples       : 18365

Frame buffer usage  : 2.6688%

--- 58443254327 ns (31.82%), 5844 samples
  [ 0] GenericTaskQueueSet<OverflowTaskQueue<StarTask, (MemoryType)1, 131072u>, (MemoryType)1>::steal_best_of_2(unsigned int, int*, StarTask&)
  [ 1] StealTask::do_it(GCTaskManager*, unsigned int)
  [ 2] GCTaskThread::run()
  [ 3] java_start(Thread*)
  [ 4] start_thread

--- 6140668667 ns (3.34%), 614 samples
  [ 0] GenericTaskQueueSet<OverflowTaskQueue<StarTask, (MemoryType)1, 131072u>, (MemoryType)1>::peek()
  [ 1] ParallelTaskTerminator::offer_termination(TerminatorTerminator*)
  [ 2] StealTask::do_it(GCTaskManager*, unsigned int)
  [ 3] GCTaskThread::run()
  [ 4] java_start(Thread*)
  [ 5] start_thread

--- 5679994036 ns (3.09%), 568 samples
  [ 0] scala.collection.generic.Growable.$plus$plus$eq
  [ 1] scala.collection.generic.Growable.$plus$plus$eq$
  [ 2] scala.collection.mutable.ListBuffer.$plus$plus$eq
  [ 3] scala.collection.mutable.ListBuffer.$plus$plus$eq
  [ 4] scala.collection.generic.GenericTraversableTemplate.$anonfun$flatten$1
  [ 5] scala.collection.generic.GenericTraversableTemplate$$Lambda$107.411506101.apply
  [ 6] scala.collection.immutable.List.foreach
  [ 7] scala.collection.generic.GenericTraversableTemplate.flatten
  [ 8] scala.collection.generic.GenericTraversableTemplate.flatten$
  [ 9] scala.collection.AbstractTraversable.flatten
  [10] org.apache.spark.internal.config.ConfigEntry.readString
  [11] org.apache.spark.internal.config.ConfigEntryWithDefault.readFrom
  [12] org.apache.spark.sql.internal.SQLConf.getConf
  [13] org.apache.spark.sql.internal.SQLConf.caseSensitiveAnalysis
  [14] org.apache.spark.sql.types.DataType.sameType
  [15] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1
  [16] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1$adapted
  [17] org.apache.spark.sql.catalyst.analysis.TypeCoercion$$$Lambda$1527.1975399904.apply
  [18] scala.collection.IndexedSeqOptimized.prefixLengthImpl
  [19] scala.collection.IndexedSeqOptimized.forall
  [20] scala.collection.IndexedSeqOptimized.forall$
  [21] scala.collection.mutable.ArrayBuffer.forall
  [22] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.haveSameType
  [23] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck
  [24] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$
  [25] org.apache.spark.sql.catalyst.expressions.CaseWhen.dataTypeCheck
  [26] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType
  [27] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$
  [28] org.apache.spark.sql.catalyst.expressions.CaseWhen.dataType
  [29] org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus.update
  [30] org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2
  [31] org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2$adapted
  [32] org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$Lambda$1534.1383512673.apply
  [33] org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateProcessRow$7
  [34] org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateProcessRow$7$adapted
  [35] org.apache.spark.sql.execution.aggregate.AggregationIterator$$Lambda$1555.725788712.apply

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual test and benchmark test:

| Benchmark code | Before this PR (ms) | After this PR (ms) |
| --- | --- | --- |
| `spark.range(100000000L).selectExpr("approx_count_distinct(case when id % 400 > 20 then id else 0 end)").collect()` | 56462 | 3794 |

@SparkQA

SparkQA commented Sep 17, 2020

Test build #128834 has finished for PR 29790 at commit 906d2e0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 17, 2020

Test build #128836 has finished for PR 29790 at commit 6a9c01f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

do we really need such an invasive change? If there is a specific expression that calls dataType many times, let's fix that expression only. Or if this can bring significant end-to-end perf speedup, we can consider accepting it.

@cloud-fan
Contributor

Or if an expression has a very complicated def dataType, can we change it to lazy val dataType?
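
The tradeoff behind this suggestion can be demonstrated with a standalone counter (hypothetical names, nothing Spark-specific): a def re-evaluates its body on every call, while a lazy val evaluates it once on first access and caches the result.

```scala
// Demonstration of def vs. lazy val evaluation (hypothetical example).
class EvalCount {
  var defCalls = 0
  var lazyCalls = 0

  // Re-evaluated on every call.
  def viaDef: List[Int] = { defCalls += 1; List(1, 2, 3) }

  // Evaluated once on first access; the result is cached afterwards.
  lazy val viaLazyVal: List[Int] = { lazyCalls += 1; List(1, 2, 3) }
}
```

Calling viaDef three times runs its body three times; reading viaLazyVal three times runs the body once. The cost is an extra field plus synchronization on first access, which only pays off when the body is non-trivial.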

@SparkQA

SparkQA commented Sep 18, 2020

Test build #128864 has finished for PR 29790 at commit f5f3af5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 22, 2020

Test build #128957 has finished for PR 29790 at commit 5755273.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Sep 22, 2020

retest this please

extends SubqueryExpression(plan, children, exprId) with Unevaluable {
override def dataType: DataType = {

private lazy val internalDataType: DataType = {
Contributor


does this need to be a lazy val? seems a very cheap method.

Member Author


I reverted this change because I did not find an expression that calls this method many times.

@transient private lazy val childDataType: MapType = child.dataType.asInstanceOf[MapType]

override def dataType: DataType = {
private lazy val internalDataType: DataType = {
Contributor


is it expensive? it just creates a few objects.

Member Author

@wangyum wangyum Sep 22, 2020


This is to improve this case:
(profiling screenshot)

| Benchmark code | Before this PR (ms) | After this PR (ms) |
| --- | --- | --- |
| `spark.range(100000000L).selectExpr("approx_count_distinct(map_entries(map(1, id)))").collect()` | 21787 | 15551 |

@SparkQA

SparkQA commented Sep 22, 2020

Test build #128970 has finished for PR 29790 at commit 5755273.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 22, 2020

Test build #128983 has finished for PR 29790 at commit 649d3c2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 22, 2020

Test build #128985 has finished for PR 29790 at commit f2dc664.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum changed the title from [SPARK-32914][SQL] Avoid calling dataType multiple times for each expression to [SPARK-32914][SQL] Avoid constructing dataType multiple times on Sep 28, 2020
@SparkQA

SparkQA commented Sep 28, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33789/

@SparkQA

SparkQA commented Sep 28, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33789/

@SparkQA

SparkQA commented Sep 28, 2020

Test build #129174 has finished for PR 29790 at commit 6a8877d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@transient
lazy val inputTypesForMerging: Seq[DataType] = children.map(_.dataType)

private lazy val internalDataType: DataType = {
Contributor


can we put it right before the line of override def dataType: DataType?

@SparkQA

SparkQA commented Sep 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33855/

@SparkQA

SparkQA commented Sep 29, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33855/

@SparkQA

SparkQA commented Sep 29, 2020

Test build #129238 has finished for PR 29790 at commit 37d0786.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@wangyum wangyum deleted the SPARK-32914 branch October 6, 2020 04:10