@peter-toth
Contributor

@peter-toth peter-toth commented Aug 28, 2020

What changes were proposed in this pull request?

LeftSemi and Existence SortMergeJoin should not buffer all matching right-side rows when the bound condition is empty. Doing so is unnecessary and can degrade performance, especially when spilling happens.
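
To illustrate the idea, here is a minimal sketch. It is not Spark's actual scanner code: join keys are plain Ints, both inputs are assumed to be sorted ascending, and the names SemiJoinSketch, leftSemiJoin and the buffered counter are hypothetical. It shows that a left semi join without an extra condition only needs to know whether a match exists, so keeping the first matching right-side row is enough.

```scala
// Simplified model, NOT Spark code: rows are Ints that act as join keys,
// and both inputs are assumed to be sorted ascending.
object SemiJoinSketch {
  /** Returns (left rows that have a match, number of right rows buffered). */
  def leftSemiJoin(
      left: Seq[Int],
      right: Seq[Int],
      onlyBufferFirstMatch: Boolean): (Seq[Int], Int) = {
    val out = Seq.newBuilder[Int]
    var buffered = 0          // stands in for rows added to a match buffer
    var r = 0                 // cursor into the right side
    var lastKey: Option[Int] = None
    var lastKeyMatched = false
    for (l <- left) {
      if (lastKey.contains(l)) {
        // Same key as the previous left row: reuse the earlier result.
        if (lastKeyMatched) out += l
      } else {
        // Skip right rows with smaller keys.
        while (r < right.length && right(r) < l) r += 1
        var matched = false
        // Consume the run of right rows with an equal key.
        while (r < right.length && right(r) == l) {
          // Without a condition, only the first match needs to be kept.
          if (!onlyBufferFirstMatch || !matched) buffered += 1
          matched = true
          r += 1
        }
        if (matched) out += l
        lastKey = Some(l)
        lastKeyMatched = matched
      }
    }
    (out.result(), buffered)
  }
}
```

With left keys 1, 2, 2, 4 and right keys 2, 2, 2, 3, 4, 4 the join result is 2, 2, 4 in both modes, but the number of buffered right-side rows drops from 5 to 2 when onlyBufferFirstMatch is true.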

Why are the changes needed?

Performance improvement.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New UT and TPCDS benchmarks.

@SparkQA

SparkQA commented Aug 29, 2020

Test build #128000 has finished for PR 29572 at commit 1a49356.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait AppendOnlyUnsafeRowArray

@peter-toth peter-toth changed the title [WIP][SPARK-32730][SQL] Improve LeftSemi SortMergeJoin right side buffering [SPARK-32730][SQL] Improve LeftSemi SortMergeJoin right side buffering Aug 29, 2020

// LEFT SEMI JOIN without bound condition does not use [[ExternalAppendOnlyUnsafeRowArray]]
// so should not cause any spill
assertNotSpilled(sparkContext, "left semi join") {
Contributor Author

Without this fix, this UT fails.

@SparkQA

SparkQA commented Aug 29, 2020

Test build #128014 has finished for PR 29572 at commit acc6646.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@peter-toth
Contributor Author

cc @cloud-fan, @maropu, @viirya

}

override def add(row: UnsafeRow): Unit = {
assert(buffer == null)
Contributor

Are you saying that ExternalAppendOnlyUnsafeRowArray will spill even if we only add one row?

Contributor Author

No, its threshold parameters work as expected. ExternalAppendOnlyUnsafeRowArray just looked a bit heavyweight for this case, where we want to store only one row. But we could also use new ExternalAppendOnlyUnsafeRowArray(1, 1) here.

Contributor

How did you test it if ExternalAppendOnlyUnsafeRowArray doesn't spill either?

  pageSizeBytes: Long,
  numRowsInMemoryBufferThreshold: Int,
- numRowsSpillThreshold: Int) extends Logging {
+ numRowsSpillThreshold: Int) extends AppendOnlyUnsafeRowArray with Logging {
Member

For a core component like this, I remember we rarely change its inheritance. It is easy to introduce a performance regression.

Contributor Author

All right, in that case let's drop that new trait and stick to the important part.

@peter-toth peter-toth force-pushed the SPARK-32730-improve-leftsemi-sortmergejoin branch from 689b5b7 to d893580 Compare September 2, 2020 19:15
@peter-toth peter-toth force-pushed the SPARK-32730-improve-leftsemi-sortmergejoin branch from d893580 to 037b876 Compare September 2, 2020 19:17
@SparkQA

SparkQA commented Sep 3, 2020

Test build #128212 has finished for PR 29572 at commit 037b876.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- private[this] val bufferedMatches =
-   new ExternalAppendOnlyUnsafeRowArray(inMemoryThreshold, spillThreshold)
+ private[this] val bufferedMatches: ExternalAppendOnlyUnsafeRowArray =
+   new ExternalAppendOnlyUnsafeRowArray(if (bufferFirstOnly) 1 else inMemoryThreshold,
Contributor

how does this change avoid spilling?

Contributor Author

It is not this change that avoids spilling, but the one in bufferMatchingRows(). This change just avoids creating a buffer larger than 1.

  spillThreshold: Int,
- eagerCleanupResources: () => Unit) {
+ eagerCleanupResources: () => Unit,
+ bufferFirstOnly: Boolean) {
Member

nit: bufferFirstOnly -> matchedBufferFirstOnly? And, please add @param, too.

Member

Use bufferFirstOnly: Boolean = false to avoid the unnecessary changes.

Contributor Author

Thanks, fixed.

  spillThreshold: Int,
- eagerCleanupResources: () => Unit) {
+ eagerCleanupResources: () => Unit,
+ matchedBufferFirstOnly: Boolean = false) {
Contributor

how about onlyBufferFirstMatch?

Contributor Author

Ok, renamed.

@maropu
Member

maropu commented Sep 3, 2020

One more thing: please update the title/description too (this PR is not only for LeftSemi, right?).

inMemoryThreshold,
spillThreshold,
cleanupResources
)
Member

nit: please avoid the unnecessary changes.

Contributor Author

Sorry, reverted them.

@peter-toth peter-toth changed the title [SPARK-32730][SQL] Improve LeftSemi SortMergeJoin right side buffering [SPARK-32730][SQL] Improve LeftSemi and Existence SortMergeJoin right side buffering Sep 3, 2020
@peter-toth
Contributor Author

> One more thing: please update the title/description too (this PR is not only for LeftSemi, right?).

Updated, thanks.

@SparkQA

SparkQA commented Sep 3, 2020

Test build #128238 has finished for PR 29572 at commit 3937e4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 3, 2020

Test build #128243 has finished for PR 29572 at commit f699118.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 3, 2020

Test build #128242 has finished for PR 29572 at commit 5cf3ab3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in ffd5227 Sep 3, 2020
@peter-toth
Contributor Author

Thanks for the review @cloud-fan, @maropu, @viirya.

@gatorsmile
Member

@peter-toth Nice fix! Could you share the perf difference when you run TPC-DS?

  spillThreshold,
- cleanupResources
+ cleanupResources,
+ condition.isEmpty
Contributor

Thanks @peter-toth !
I think this could also be added to LeftAnti join, which is likewise only interested in the existence of a match and doesn't need to buffer the matching rows.
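
As a rough sketch of why LeftAnti has the same property (a simplified model, not Spark code; keys are plain Ints, both sides are assumed to be sorted ascending, and AntiJoinSketch and leftAntiJoin are hypothetical names), the first equal right-side key already decides the outcome for a left row, so nothing has to be buffered:

```scala
// Simplified model, NOT Spark code: keys are Ints and both inputs
// are assumed to be sorted ascending.
object AntiJoinSketch {
  /** Returns the left rows whose key has no equal key on the right side. */
  def leftAntiJoin(left: Seq[Int], right: Seq[Int]): Seq[Int] = {
    var r = 0 // cursor into the right side
    left.filter { l =>
      // Advance past right rows with smaller keys. Rows with an equal key
      // stay in place so duplicate left keys see the same right row.
      while (r < right.length && right(r) < l) r += 1
      // Existence check only: the first equal right row decides the result.
      !(r < right.length && right(r) == l)
    }
  }
}
```

For left keys 1, 2, 2, 4, 5 and right keys 2, 2, 3, 4 this keeps only 1 and 5.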

Contributor Author

Thanks @juliuszsompolski, I think you are right. Shall I open a follow-up PR or a different ticket?

Contributor

I think a followup PR is fine.

Contributor Author

I've opened #29727

@peter-toth
Contributor Author

> @peter-toth Nice fix! Could you share the perf difference when you run TPC-DS?

Thanks @gatorsmile. Yes, I will try to run some benchmarks for this particular change and share the results.

BTW, I have another PR open that brings ~30% improvement to some of the TPCDS queries: #28885

cloud-fan pushed a commit that referenced this pull request Sep 11, 2020
…de buffering

### What changes were proposed in this pull request?

This is a follow-up to #29572.

LeftAnti SortMergeJoin should not buffer all matching right-side rows when the bound condition is empty. Doing so is unnecessary and can degrade performance, especially when spilling happens.

### Why are the changes needed?

Performance improvement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New UT.

Closes #29727 from peter-toth/SPARK-32730-improve-leftsemi-sortmergejoin-followup.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
a0x8o added a commit to a0x8o/spark that referenced this pull request Sep 11, 2020
…de buffering
