
Optimize distinct inner join to use set find instead of retrieve #17278

Merged

Conversation

PointKernel
Member

Description

This PR introduces a minor optimization for distinct inner joins by using the find results to selectively copy matches to the output. This approach eliminates the need for the costly retrieve operation, which relies on expensive atomic operations.
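The idea can be sketched on the host in plain C++, with `std::unordered_map` standing in for the GPU hash set (names here are illustrative, not libcudf's actual API): because the build-side keys are distinct, a single `find` per probe key yields at most one match, so matches can be copied straight to the output without a multi-match `retrieve` pass.

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Host-side sketch of a distinct inner join. Build-side keys are unique,
// so one hash-map find per probe key suffices; each hit is copied
// directly into the result, with no retrieve step.
std::vector<std::pair<std::size_t, std::size_t>> distinct_inner_join(
    std::vector<int> const& probe, std::vector<int> const& build)
{
  std::unordered_map<int, std::size_t> build_map;  // key -> build row index
  for (std::size_t i = 0; i < build.size(); ++i) { build_map.emplace(build[i], i); }

  std::vector<std::pair<std::size_t, std::size_t>> result;  // (probe idx, build idx)
  for (std::size_t i = 0; i < probe.size(); ++i) {
    if (auto it = build_map.find(probe[i]); it != build_map.end()) {
      result.emplace_back(i, it->second);
    }
  }
  return result;
}
```

On the GPU the same pattern runs one `find` per probe row and then selectively copies the hits, avoiding the atomics that `retrieve` relies on.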

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@PointKernel PointKernel added libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 8, 2024
@PointKernel PointKernel self-assigned this Nov 8, 2024
@PointKernel
Member Author

Performance comparison on an RTX 8000

# distinct_inner_join

## [0] Quadro RTX 8000

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
|  I32  |     0      |    1000     |     1000     |  54.068 us |      10.72% |  56.935 us |       8.24% |     2.867 us |   5.30% |   PASS   |
|  I32  |     0      |   100000    |     1000     |  63.800 us |       1.40% |  62.924 us |       2.12% |    -0.875 us |  -1.37% |   PASS   |
|  I32  |     0      |  10000000   |     1000     |   1.264 ms |       1.15% |   1.021 ms |       0.84% |  -242.908 us | -19.22% |   FAIL   |
|  I32  |     0      |   100000    |    100000    |  72.797 us |       1.68% |  72.148 us |       1.23% |    -0.649 us |  -0.89% |   PASS   |
|  I32  |     0      |  10000000   |    100000    |   1.177 ms |       0.94% | 988.159 us |       1.04% |  -188.749 us | -16.04% |   FAIL   |
|  I32  |     0      |  10000000   |   10000000   |  13.695 ms |       0.09% |  12.629 ms |       0.09% | -1066.347 us |  -7.79% |   FAIL   |
|  I32  |     1      |    1000     |     1000     |  57.981 us |       1.93% |  60.755 us |       1.78% |     2.774 us |   4.78% |   FAIL   |
|  I32  |     1      |   100000    |     1000     |  67.446 us |       5.73% |  69.453 us |      19.22% |     2.008 us |   2.98% |   PASS   |
|  I32  |     1      |  10000000   |     1000     | 595.085 us |       0.71% | 466.375 us |       0.37% |  -128.710 us | -21.63% |   FAIL   |
|  I32  |     1      |   100000    |    100000    |  70.074 us |       1.94% |  73.968 us |       7.78% |     3.893 us |   5.56% |   FAIL   |
|  I32  |     1      |  10000000   |    100000    | 658.062 us |       1.84% | 571.187 us |       0.63% |   -86.876 us | -13.20% |   FAIL   |
|  I32  |     1      |  10000000   |   10000000   |   3.172 ms |       0.18% |   3.202 ms |       0.13% |    29.974 us |   0.95% |   FAIL   |
|  I64  |     0      |    1000     |     1000     |  47.797 us |       2.97% |  57.524 us |      21.72% |     9.726 us |  20.35% |   FAIL   |
|  I64  |     0      |   100000    |     1000     |  63.696 us |       1.39% |  66.554 us |      17.85% |     2.858 us |   4.49% |   FAIL   |
|  I64  |     0      |  10000000   |     1000     |   1.340 ms |       0.94% |   1.063 ms |       0.92% |  -277.467 us | -20.70% |   FAIL   |
|  I64  |     0      |   100000    |    100000    |  73.252 us |       1.24% |  73.468 us |       1.71% |     0.216 us |   0.29% |   PASS   |
|  I64  |     0      |  10000000   |    100000    |   1.279 ms |       1.10% |   1.019 ms |       1.40% |  -260.429 us | -20.36% |   FAIL   |
|  I64  |     0      |  10000000   |   10000000   |  13.792 ms |       0.14% |  12.713 ms |       0.19% | -1078.653 us |  -7.82% |   FAIL   |
|  I64  |     1      |    1000     |     1000     |  60.127 us |      16.06% |  64.713 us |      24.12% |     4.587 us |   7.63% |   PASS   |
|  I64  |     1      |   100000    |     1000     |  67.320 us |       1.64% |  66.628 us |       2.15% |    -0.692 us |  -1.03% |   PASS   |
|  I64  |     1      |  10000000   |     1000     | 621.816 us |       0.77% | 496.121 us |       3.41% |  -125.695 us | -20.21% |   FAIL   |
|  I64  |     1      |   100000    |    100000    |  70.121 us |       1.60% |  76.659 us |      13.20% |     6.538 us |   9.32% |   FAIL   |
|  I64  |     1      |  10000000   |    100000    | 675.382 us |       0.65% | 583.454 us |       0.44% |   -91.928 us | -13.61% |   FAIL   |
|  I64  |     1      |  10000000   |   10000000   |   3.234 ms |       0.15% |   3.235 ms |       0.15% |     1.249 us |   0.04% |   PASS   |

@PointKernel PointKernel marked this pull request as ready for review November 8, 2024 17:10
@PointKernel PointKernel requested a review from a team as a code owner November 8, 2024 17:10
@PointKernel PointKernel added the 3 - Ready for Review Ready for review by team label Nov 8, 2024
@vuule
Contributor

vuule commented Nov 8, 2024

Thank you for posting the benchmark results!
Is it okay to make some cases significantly slower? I don't have a sense of which cases are most relevant.

@PointKernel
Member Author

Is it okay to make some cases significantly slower?

Valid concern. In the worst case, the slowdown can reach about 20%, but the added runtime is only a few microseconds. On the other hand, in the cases where the new implementation performs better, the speedups are well over 10 microseconds in most cases, so I believe the optimization is still worthwhile overall.

I don't have a sense of which cases are most relevant.

Good question. It's not obvious from the performance results alone, but the new implementation outperforms the previous one in most cases, except when dealing with small data, such as when both the left and right tables contain no more than 10,000 rows of integers.

@PointKernel PointKernel requested a review from vuule November 13, 2024 17:42
Contributor

@vuule vuule left a comment


💯

Contributor

@karthikeyann karthikeyann left a comment


LGTM 👍

How does retrieve_all compare to retrieve and find?
Is retrieve_all slower than find?

@PointKernel
Member Author

How does retrieve_all compare to retrieve and find?
Is retrieve_all slower than find?

Good question.

The two methods use different algorithms: retrieve_all scans every slot in the hash table and writes the non-empty ones to the output via cub::DeviceSelect::If, whereas find looks up each element of the query keys and returns either a match or a sentinel. Without a specific use case, it's hard to compare their performance directly.
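The two access patterns can be mimicked on the host in plain C++ (the names and sentinel value below are illustrative, not cuco's actual API): one pass is proportional to the table's capacity, the other to the number of query keys.

```cpp
#include <unordered_set>
#include <vector>

constexpr int empty_sentinel = -1;  // stands in for the hash set's empty-slot sentinel

// retrieve_all-style: scan every slot in the table and keep the non-empty
// ones (on the GPU this selection is done with cub::DeviceSelect::If).
// Cost scales with the table's capacity.
std::vector<int> retrieve_all_like(std::vector<int> const& slots)
{
  std::vector<int> out;
  for (int s : slots) {
    if (s != empty_sentinel) { out.push_back(s); }
  }
  return out;
}

// find-style: look up each query key and return the match or the sentinel.
// Cost scales with the number of queries, not the table size.
std::vector<int> find_like(std::unordered_set<int> const& set,
                           std::vector<int> const& queries)
{
  std::vector<int> out;
  for (int q : queries) { out.push_back(set.count(q) ? q : empty_sentinel); }
  return out;
}
```

Which one wins therefore depends on the ratio of table capacity to query count, which is why the comparison is use-case dependent.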

@PointKernel PointKernel added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Nov 19, 2024
@PointKernel
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 56061bd into rapidsai:branch-24.12 Nov 19, 2024
104 checks passed
@PointKernel PointKernel deleted the improve-distinct-inner-join branch November 19, 2024 18:08
@GregoryKimball
Contributor

@abellina have you observed any performance impacts from this change in Spark-RAPIDS?
