
Optimize distinct inner join to use set find instead of retrieve #17278

Merged

Conversation

PointKernel
Member

Description

This PR introduces a minor optimization for distinct inner joins by using the find results to selectively copy matches to the output. This approach eliminates the need for the costly retrieve operation, which relies on expensive atomic operations.
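The idea can be sketched on the host in plain C++, with `std::unordered_map` standing in for the GPU hash set (names here are illustrative, not libcudf's actual API): because the build-side keys are distinct, a single `find` per probe key yields at most one match, so matches can be copied straight to the output without a multi-match `retrieve` pass.

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Host-side sketch of a distinct inner join. Build-side keys are unique,
// so one hash-map find per probe key suffices; each hit is copied
// directly into the result, with no retrieve step.
std::vector<std::pair<std::size_t, std::size_t>> distinct_inner_join(
    std::vector<int> const& probe, std::vector<int> const& build)
{
  std::unordered_map<int, std::size_t> build_map;  // key -> build row index
  for (std::size_t i = 0; i < build.size(); ++i) { build_map.emplace(build[i], i); }

  std::vector<std::pair<std::size_t, std::size_t>> result;  // (probe idx, build idx)
  for (std::size_t i = 0; i < probe.size(); ++i) {
    if (auto it = build_map.find(probe[i]); it != build_map.end()) {
      result.emplace_back(i, it->second);
    }
  }
  return result;
}
```

On the GPU the same pattern runs one `find` per probe row and then selectively copies the hits, avoiding the atomics that `retrieve` relies on.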

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@PointKernel PointKernel added libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 8, 2024
@PointKernel PointKernel self-assigned this Nov 8, 2024
@PointKernel
Member Author

Performance comparison on an RTX 8000

# distinct_inner_join

## [0] Quadro RTX 8000

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
|  I32  |     0      |    1000     |     1000     |  54.068 us |      10.72% |  56.935 us |       8.24% |     2.867 us |   5.30% |   PASS   |
|  I32  |     0      |   100000    |     1000     |  63.800 us |       1.40% |  62.924 us |       2.12% |    -0.875 us |  -1.37% |   PASS   |
|  I32  |     0      |  10000000   |     1000     |   1.264 ms |       1.15% |   1.021 ms |       0.84% |  -242.908 us | -19.22% |   FAIL   |
|  I32  |     0      |   100000    |    100000    |  72.797 us |       1.68% |  72.148 us |       1.23% |    -0.649 us |  -0.89% |   PASS   |
|  I32  |     0      |  10000000   |    100000    |   1.177 ms |       0.94% | 988.159 us |       1.04% |  -188.749 us | -16.04% |   FAIL   |
|  I32  |     0      |  10000000   |   10000000   |  13.695 ms |       0.09% |  12.629 ms |       0.09% | -1066.347 us |  -7.79% |   FAIL   |
|  I32  |     1      |    1000     |     1000     |  57.981 us |       1.93% |  60.755 us |       1.78% |     2.774 us |   4.78% |   FAIL   |
|  I32  |     1      |   100000    |     1000     |  67.446 us |       5.73% |  69.453 us |      19.22% |     2.008 us |   2.98% |   PASS   |
|  I32  |     1      |  10000000   |     1000     | 595.085 us |       0.71% | 466.375 us |       0.37% |  -128.710 us | -21.63% |   FAIL   |
|  I32  |     1      |   100000    |    100000    |  70.074 us |       1.94% |  73.968 us |       7.78% |     3.893 us |   5.56% |   FAIL   |
|  I32  |     1      |  10000000   |    100000    | 658.062 us |       1.84% | 571.187 us |       0.63% |   -86.876 us | -13.20% |   FAIL   |
|  I32  |     1      |  10000000   |   10000000   |   3.172 ms |       0.18% |   3.202 ms |       0.13% |    29.974 us |   0.95% |   FAIL   |
|  I64  |     0      |    1000     |     1000     |  47.797 us |       2.97% |  57.524 us |      21.72% |     9.726 us |  20.35% |   FAIL   |
|  I64  |     0      |   100000    |     1000     |  63.696 us |       1.39% |  66.554 us |      17.85% |     2.858 us |   4.49% |   FAIL   |
|  I64  |     0      |  10000000   |     1000     |   1.340 ms |       0.94% |   1.063 ms |       0.92% |  -277.467 us | -20.70% |   FAIL   |
|  I64  |     0      |   100000    |    100000    |  73.252 us |       1.24% |  73.468 us |       1.71% |     0.216 us |   0.29% |   PASS   |
|  I64  |     0      |  10000000   |    100000    |   1.279 ms |       1.10% |   1.019 ms |       1.40% |  -260.429 us | -20.36% |   FAIL   |
|  I64  |     0      |  10000000   |   10000000   |  13.792 ms |       0.14% |  12.713 ms |       0.19% | -1078.653 us |  -7.82% |   FAIL   |
|  I64  |     1      |    1000     |     1000     |  60.127 us |      16.06% |  64.713 us |      24.12% |     4.587 us |   7.63% |   PASS   |
|  I64  |     1      |   100000    |     1000     |  67.320 us |       1.64% |  66.628 us |       2.15% |    -0.692 us |  -1.03% |   PASS   |
|  I64  |     1      |  10000000   |     1000     | 621.816 us |       0.77% | 496.121 us |       3.41% |  -125.695 us | -20.21% |   FAIL   |
|  I64  |     1      |   100000    |    100000    |  70.121 us |       1.60% |  76.659 us |      13.20% |     6.538 us |   9.32% |   FAIL   |
|  I64  |     1      |  10000000   |    100000    | 675.382 us |       0.65% | 583.454 us |       0.44% |   -91.928 us | -13.61% |   FAIL   |
|  I64  |     1      |  10000000   |   10000000   |   3.234 ms |       0.15% |   3.235 ms |       0.15% |     1.249 us |   0.04% |   PASS   |

@PointKernel PointKernel marked this pull request as ready for review November 8, 2024 17:10
@PointKernel PointKernel requested a review from a team as a code owner November 8, 2024 17:10
@PointKernel PointKernel added the 3 - Ready for Review Ready for review by team label Nov 8, 2024
@vuule
Contributor

vuule commented Nov 8, 2024

Thank you for posting the benchmark results!
Is it okay to make some cases significantly slower? I don't have a sense of which cases are most relevant.

@PointKernel
Member Author

Is it okay to make some cases significantly slower?

Valid concern. In the worst case, the slowdown can reach about 20%, but the added runtime is only a few microseconds. On the other hand, in the cases where the new implementation performs better, the speedups are well over 10 microseconds in most cases, so I believe the optimization is still worthwhile overall.

I don't have a sense of which cases are most relevant.

Good question. It's not obvious from the performance results alone, but the new implementation outperforms the previous one in most cases, except when dealing with small data, such as when both the left and right tables contain no more than 10,000 rows of integers.

@PointKernel PointKernel requested a review from vuule November 13, 2024 17:42
Contributor

@vuule vuule left a comment


💯

Contributor

@karthikeyann karthikeyann left a comment


LGTM 👍

How does retrieve_all compare to retrieve and find?
Is retrieve_all slower than find?

@PointKernel
Member Author

How does retrieve_all compare to retrieve and find?
Is retrieve_all slower than find?

Good question.

The two methods use different algorithms: retrieve_all scans every slot in the hash table and writes the non-empty ones to the output via cub::DeviceSelect::If, whereas find looks up each element of the query keys and returns either a match or a sentinel. Without a specific use case, it's hard to compare their performance directly.
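The two access patterns can be mimicked on the host in plain C++ (the names and sentinel value below are illustrative, not cuco's actual API): one pass is proportional to the table's capacity, the other to the number of query keys.

```cpp
#include <unordered_set>
#include <vector>

constexpr int empty_sentinel = -1;  // stands in for the hash set's empty-slot sentinel

// retrieve_all-style: scan every slot in the table and keep the non-empty
// ones (on the GPU this selection is done with cub::DeviceSelect::If).
// Cost scales with the table's capacity.
std::vector<int> retrieve_all_like(std::vector<int> const& slots)
{
  std::vector<int> out;
  for (int s : slots) {
    if (s != empty_sentinel) { out.push_back(s); }
  }
  return out;
}

// find-style: look up each query key and return the match or the sentinel.
// Cost scales with the number of queries, not the table size.
std::vector<int> find_like(std::unordered_set<int> const& set,
                           std::vector<int> const& queries)
{
  std::vector<int> out;
  for (int q : queries) { out.push_back(set.count(q) ? q : empty_sentinel); }
  return out;
}
```

Which one wins therefore depends on the ratio of table capacity to query count, which is why the comparison is use-case dependent.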

@PointKernel PointKernel added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Nov 19, 2024
@PointKernel
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 56061bd into rapidsai:branch-24.12 Nov 19, 2024
104 checks passed
@PointKernel PointKernel deleted the improve-distinct-inner-join branch November 19, 2024 18:08
@GregoryKimball
Contributor

@abellina have you observed any performance impacts from this change in Spark-RAPIDS?
