
Conversation

@LuciferYang
Contributor

What changes were proposed in this pull request?

To solve the problem @JkSelf reported in SPARK-26155, use LongAdder instead of Long for numKeyLookups and numProbes, to reduce the cost of the add operations. @JkSelf tested this patch in an Intel performance testing environment and ran the TPC-DS queries; with this patch, Spark 2.3 and master are no longer slower than Spark 2.1.
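
For illustration, a minimal sketch of the idea (ProbeMetrics and recordLookup are made-up names for this sketch, not from the patch):

```scala
import java.util.concurrent.atomic.LongAdder

// Illustrative sketch only: swap the plain Long counters for LongAdder,
// which stripes increments across internal cells so concurrent adds
// don't all contend on a single field.
class ProbeMetrics {
  private val numKeyLookups = new LongAdder // was: private var numKeyLookups = 0L
  private val numProbes = new LongAdder     // was: private var numProbes = 0L

  def recordLookup(probes: Int): Unit = {
    numKeyLookups.increment()    // one lookup
    numProbes.add(probes.toLong) // probes needed to find the key
  }

  // The metrics are only read once, when the task finishes, via sum().
  def lookups: Long = numKeyLookups.sum()
  def totalProbes: Long = numProbes.sum()
}
```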

How was this patch tested?

N/A

@LuciferYang
Contributor Author

cc @JkSelf, could you help check this patch?

@LuciferYang
Contributor Author

cc @cloud-fan, could you help review this patch?

@LuciferYang
Contributor Author

ping @viirya

@adrian-wang
Contributor

Maybe add some detailed test results to the description, and explain the reason for this change in a code comment?

@JkSelf
Contributor

JkSelf commented Dec 4, 2018

@LuciferYang the patch works fine in my test environment.
@adrian-wang I will run all the TPC-DS queries on Spark 2.3 and on Spark 2.3 with this patch later.

@LuciferYang
Contributor Author

@JkSelf thx~

@LuciferYang
Contributor Author

@adrian-wang ok~ I will add some comments to explain the reason.

@cloud-fan
Contributor

ok to test

```diff
-private var numKeyLookups = 0L
-private var numProbes = 0L
+private var numKeyLookups = new LongAdder
+private var numProbes = new LongAdder
```
Contributor

I'm surprised. I think LongToUnsafeRowMap is used in a single-thread environment, and multi-thread contention should not be an issue here. Do you have any insights into how this fixes the perf issue?

@LuciferYang
Contributor Author

LuciferYang commented Dec 4, 2018

Initially, I thought these two class-scope variables would affect the JIT's SIMD optimization (Java 8+), so we tried adding -XX:-UseSuperWord to the executor Java opts to verify this theory, but it had no effect with Spark 2.1, although this patch can still improve performance...

@cloud-fan
Contributor

I might know the root cause: LongToUnsafeRowMap is actually accessed by multiple threads.

For broadcast hash join, we copy the broadcasted hash relation to avoid multi-threading problems, via HashedRelation.asReadOnlyCopy. However, this is a shallow copy: the LongToUnsafeRowMap is not copied and is likely shared by multiple HashedRelations.

The metrics are per-task, so I think a better fix is to track the hash probe metrics per HashedRelation, instead of per LongToUnsafeRowMap. It's too costly to copy the LongToUnsafeRowMap, so we should think about how to do it efficiently. cc @hvanhovell
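
A simplified sketch of the sharing pattern described above (abbreviated for illustration, not the exact Spark source):

```scala
// Abbreviated sketch of why the map is shared across tasks.
class LongToUnsafeRowMap {
  var numKeyLookups = 0L // plain Long: racy when tasks share this map
  var numProbes = 0L
}

class LongHashedRelation(val map: LongToUnsafeRowMap) {
  // Shallow copy: a fresh wrapper, but the SAME underlying map.
  def asReadOnlyCopy(): LongHashedRelation = new LongHashedRelation(map)
}

// Each task calls asReadOnlyCopy() on the broadcast value:
//   val rel1 = broadcast.asReadOnlyCopy()
//   val rel2 = broadcast.asReadOnlyCopy()
// rel1.map eq rel2.map, so every task increments the same Long fields,
// causing cache-line contention (and lost updates) across tasks.
```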

@cloud-fan
Contributor

It's easy to track numKeyLookups in HashedRelation, but it's hard to track numProbes. One idea is to pass a MutableInt to LongToUnsafeRowMap.getValue as a parameter, and in the method set the actual numProbes of this lookup on the MutableInt parameter.
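
A hypothetical sketch of that idea (MutableInt is a simple holder defined here just for illustration; HashedRelationSketch and the getValue signature are assumptions, not the real API):

```scala
// Hypothetical sketch of the proposal: the caller owns the metrics,
// the map only reports how many probes one lookup took.
final class MutableInt(var value: Int = 0)

class LongToUnsafeRowMap {
  // Report the probe count through the holder instead of mutating
  // shared fields inside the map.
  def getValue(key: Long, probeCount: MutableInt): Option[Long] = {
    var probes = 1
    // ... probe loop: probes += 1 on each collision ...
    probeCount.value = probes
    None // placeholder result for the sketch
  }
}

class HashedRelationSketch(map: LongToUnsafeRowMap) {
  // Per-relation (hence per-task) metrics: nothing shared across tasks.
  private var numKeyLookups = 0L
  private var numProbes = 0L
  private val holder = new MutableInt

  def get(key: Long): Option[Long] = {
    val result = map.getValue(key, holder)
    numKeyLookups += 1
    numProbes += holder.value
    result
  }
}
```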

@SparkQA

SparkQA commented Dec 4, 2018

Test build #99651 has finished for PR 23214 at commit a267e6b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Dec 4, 2018

Thanks for doing this. I think we are getting closer to the root cause.

@LuciferYang
Contributor Author

> For broadcast hash join, we copy the broadcasted hash relation to avoid multi-threading problems, via HashedRelation.asReadOnlyCopy. However, this is a shallow copy: the LongToUnsafeRowMap is not copied and is likely shared by multiple HashedRelations.

Were there no data correctness problems in the past from using the non-thread-safe Long type?

@cloud-fan
Contributor

I think there is a problem, but no one noticed because it only affects metrics.

@LuciferYang
Contributor Author

On the other hand, if it is only a multi-threading problem, it may not affect performance, because there is no synchronized code here...

@LuciferYang
Contributor Author

As @cloud-fan said, the hash join metrics are wrongly implemented, so we will partially revert SPARK-21052. This patch is no longer needed; closing it ~

@LuciferYang LuciferYang deleted the spark-26155 branch December 21, 2018 03:51