
Conversation

@LuciferYang
Contributor

What changes were proposed in this pull request?

To solve the problem @JkSelf reported in SPARK-26155, use LongAdder instead of Long for numKeyLookups and numProbes, to reduce the cost of the add operations. @JkSelf tested this patch in an Intel performance testing environment and ran the TPC-DS queries; with this patch, Spark 2.3 and master are no longer slower than Spark 2.1.
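
For illustration, a minimal sketch of the idea (ProbeMetrics and recordLookup are made-up names for this sketch, not from the patch):

```scala
import java.util.concurrent.atomic.LongAdder

// Illustrative sketch only: swap the plain Long counters for LongAdder,
// which stripes increments across internal cells so concurrent adds
// don't all contend on a single field.
class ProbeMetrics {
  private val numKeyLookups = new LongAdder // was: private var numKeyLookups = 0L
  private val numProbes = new LongAdder     // was: private var numProbes = 0L

  def recordLookup(probes: Int): Unit = {
    numKeyLookups.increment()    // one lookup
    numProbes.add(probes.toLong) // probes needed to find the key
  }

  // The metrics are only read once, when the task finishes, via sum().
  def lookups: Long = numKeyLookups.sum()
  def totalProbes: Long = numProbes.sum()
}
```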

How was this patch tested?

N/A

@LuciferYang
Contributor Author

cc @JkSelf, could you help check this patch?

@LuciferYang
Contributor Author

cc @cloud-fan, could you help review this patch?

@LuciferYang
Contributor Author

ping @viirya

@adrian-wang
Contributor

Maybe add some detailed test results to the description, and explain the reason for this change in a code comment?

@JkSelf
Contributor

JkSelf commented Dec 4, 2018

@LuciferYang the patch works fine in my test environment.
@adrian-wang I will run all the TPC-DS queries on Spark 2.3 and on Spark 2.3 with this patch later.

@LuciferYang
Contributor Author

@JkSelf thx~

@LuciferYang
Contributor Author

@adrian-wang ok~ I will add some comments to explain the reason.

@cloud-fan
Contributor

ok to test

```diff
-private var numKeyLookups = 0L
-private var numProbes = 0L
+private var numKeyLookups = new LongAdder
+private var numProbes = new LongAdder
```
Contributor

I'm surprised. I think LongToUnsafeRowMap is used in a single-thread environment, and multi-thread contention should not be an issue here. Do you have any insights into how this fixes the perf issue?

@LuciferYang
Contributor Author

LuciferYang commented Dec 4, 2018

Initially, I thought these two class-scope variables would affect the JIT's SIMD optimization (Java 8+), so we tried adding -XX:-UseSuperWord to the executor Java opts to verify this theory, but it had no effect with Spark 2.1, although this patch can still improve performance...

@cloud-fan
Contributor

I might know the root cause: LongToUnsafeRowMap is actually accessed by multiple threads.

For broadcast hash join, we copy the broadcasted hash relation to avoid multi-threading problems, via HashedRelation.asReadOnlyCopy. However, this is a shallow copy: the LongToUnsafeRowMap is not copied and is likely shared by multiple HashedRelations.

The metrics are per-task, so I think a better fix is to track the hash probe metrics per HashedRelation, instead of per LongToUnsafeRowMap. It's too costly to copy the LongToUnsafeRowMap, so we should think about how to do it efficiently. cc @hvanhovell
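
A simplified sketch of the sharing pattern described above (abbreviated for illustration, not the exact Spark source):

```scala
// Abbreviated sketch of why the map is shared across tasks.
class LongToUnsafeRowMap {
  var numKeyLookups = 0L // plain Long: racy when tasks share this map
  var numProbes = 0L
}

class LongHashedRelation(val map: LongToUnsafeRowMap) {
  // Shallow copy: a fresh wrapper, but the SAME underlying map.
  def asReadOnlyCopy(): LongHashedRelation = new LongHashedRelation(map)
}

// Each task calls asReadOnlyCopy() on the broadcast value:
//   val rel1 = broadcast.asReadOnlyCopy()
//   val rel2 = broadcast.asReadOnlyCopy()
// rel1.map eq rel2.map, so every task increments the same Long fields,
// causing cache-line contention (and lost updates) across tasks.
```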

@cloud-fan
Contributor

It's easy to track numKeyLookups in HashedRelation, but it's hard to track numProbes. One idea is to pass a MutableInt to LongToUnsafeRowMap.getValue as a parameter, and in the method set the actual numProbes of this lookup on the MutableInt parameter.
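
A hypothetical sketch of that idea (MutableInt is a simple holder defined here just for illustration; HashedRelationSketch and the getValue signature are assumptions, not the real API):

```scala
// Hypothetical sketch of the proposal: the caller owns the metrics,
// the map only reports how many probes one lookup took.
final class MutableInt(var value: Int = 0)

class LongToUnsafeRowMap {
  // Report the probe count through the holder instead of mutating
  // shared fields inside the map.
  def getValue(key: Long, probeCount: MutableInt): Option[Long] = {
    var probes = 1
    // ... probe loop: probes += 1 on each collision ...
    probeCount.value = probes
    None // placeholder result for the sketch
  }
}

class HashedRelationSketch(map: LongToUnsafeRowMap) {
  // Per-relation (hence per-task) metrics: nothing shared across tasks.
  private var numKeyLookups = 0L
  private var numProbes = 0L
  private val holder = new MutableInt

  def get(key: Long): Option[Long] = {
    val result = map.getValue(key, holder)
    numKeyLookups += 1
    numProbes += holder.value
    result
  }
}
```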

@SparkQA

SparkQA commented Dec 4, 2018

Test build #99651 has finished for PR 23214 at commit a267e6b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Dec 4, 2018

Thanks for doing this. I think we are getting closer to the root cause.

@LuciferYang
Contributor Author

> For broadcast hash join, we copy the broadcasted hash relation to avoid multi-threading problems, via HashedRelation.asReadOnlyCopy. However, this is a shallow copy: the LongToUnsafeRowMap is not copied and is likely shared by multiple HashedRelations.

Were there no data correctness problems in the past from using the non-thread-safe Long type?

@cloud-fan
Contributor

I think there is a problem, but no one noticed because it only affects metrics.

@LuciferYang
Contributor Author

On the other hand, if it is only a multi-threading problem, it may not affect performance, because there is no synchronized code here...

@LuciferYang
Contributor Author

As @cloud-fan said, the hash join metrics are wrongly implemented, so we will partially revert SPARK-21052. This patch is no longer needed; closing it ~

@LuciferYang LuciferYang deleted the spark-26155 branch December 21, 2018 03:51