Improve thrust::equals #1870

Closed
wants to merge 2 commits into from

@bernhardmgruber bernhardmgruber commented Jun 14, 2024

It's Friday and I am upset with the performance of comparing two thrust vectors. A good number of unit tests are affected by this. This PR:

  1. Adds a benchmark for thrust::equals (done in "Add a benchmark for thrust::equal", #1944)
  2. Reimplements thrust::equals in terms of thrust::all_of
  3. Reimplements thrust::all_of in terms of thrust::count_if, which, according to "thrust::all_of is slower than a naive reduction" (#720), is >10x faster.
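
The layering in items 2 and 3 can be sketched on the host, with standard-library algorithms standing in for their Thrust counterparts. This is only an illustration, not the code from this PR: `all_of_by_count` and `equal_by_all_of` are hypothetical names, and the index vector is an assumed stand-in for iterating both ranges together.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: all_of expressed as "no element fails the predicate".
// Like a reduction, this always visits every element (no early exit).
template <typename It, typename Pred>
bool all_of_by_count(It first, It last, Pred pred)
{
  return std::count_if(first, last, [&](const auto& x) { return !pred(x); }) == 0;
}

// Hypothetical helper: equality of two vectors routed through all_of.
// No "first mismatch" position is computed, so no ordering is implied.
template <typename T>
bool equal_by_all_of(const std::vector<T>& a, const std::vector<T>& b)
{
  if (a.size() != b.size())
  {
    return false;
  }
  // Assumed stand-in for walking both ranges in lockstep: iterate over indices.
  std::vector<std::size_t> idx(a.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  return all_of_by_count(idx.begin(), idx.end(), [&](std::size_t i) { return a[i] == b[i]; });
}
```

With equal inputs, which is the common case in unit tests, every element has to be read anyway, so the missing early exit costs nothing in that case.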

Results of the new thrust::equals benchmark from before and after applying 2. and 3.:

## [0] NVIDIA H100 PCIe

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|---------------|---------|----------|
|   I8    |    2^16    | 101.896 us |       1.75% |  72.727 us |       1.98% |    -29.169 us | -28.63% |   FAIL   |
|   I8    |    2^20    | 111.107 us |       1.16% |  76.573 us |       1.76% |    -34.534 us | -31.08% |   FAIL   |
|   I8    |    2^24    | 936.143 us |       0.35% | 142.694 us |       0.63% |   -793.450 us | -84.76% |   FAIL   |
|   I8    |    2^28    |  13.675 ms |       0.10% |   1.119 ms |       0.15% | -12555.718 us | -91.82% |   FAIL   |
|   I16   |    2^16    | 101.747 us |       1.30% |  73.365 us |       1.66% |    -28.381 us | -27.89% |   FAIL   |
|   I16   |    2^20    | 112.437 us |       1.13% |  80.526 us |       4.80% |    -31.910 us | -28.38% |   FAIL   |
|   I16   |    2^24    | 946.016 us |       0.36% | 195.473 us |       0.66% |   -750.543 us | -79.34% |   FAIL   |
|   I16   |    2^28    |  13.895 ms |       0.14% |   1.872 ms |       0.10% | -12022.463 us | -86.52% |   FAIL   |
|   I32   |    2^16    | 102.312 us |       1.40% |  74.359 us |       1.89% |    -27.953 us | -27.32% |   FAIL   |
|   I32   |    2^20    | 116.969 us |       0.99% |  86.583 us |       1.11% |    -30.386 us | -25.98% |   FAIL   |
|   I32   |    2^24    | 990.614 us |       0.34% | 296.035 us |       0.46% |   -694.579 us | -70.12% |   FAIL   |
|   I32   |    2^28    |  14.445 ms |       0.08% |   3.372 ms |       0.08% | -11073.339 us | -76.66% |   FAIL   |
|   I64   |    2^16    | 103.238 us |       1.78% |  75.475 us |       2.38% |    -27.763 us | -26.89% |   FAIL   |
|   I64   |    2^20    | 129.095 us |       1.17% |  98.486 us |       0.83% |    -30.610 us | -23.71% |   FAIL   |
|   I64   |    2^24    |   1.095 ms |       0.28% | 503.365 us |       0.29% |   -591.867 us | -54.04% |   FAIL   |
|   I64   |    2^28    |  16.248 ms |       0.07% |   6.610 ms |       0.06% |  -9637.774 us | -59.32% |   FAIL   |
|  I128   |    2^16    | 106.196 us |       1.40% |  80.382 us |       2.34% |    -25.814 us | -24.31% |   FAIL   |
|  I128   |    2^20    | 162.204 us |       0.73% | 130.275 us |       0.69% |    -31.929 us | -19.68% |   FAIL   |
|  I128   |    2^24    |   1.327 ms |       0.23% | 913.127 us |       0.20% |   -413.895 us | -31.19% |   FAIL   |
|  I128   |    2^28    |  19.532 ms |       0.10% |  13.122 ms |       0.05% |  -6410.285 us | -32.82% |   FAIL   |
|   F32   |    2^16    | 104.041 us |       1.11% |  74.187 us |       1.53% |    -29.854 us | -28.69% |   FAIL   |
|   F32   |    2^20    | 119.163 us |       1.04% |  86.922 us |       0.95% |    -32.241 us | -27.06% |   FAIL   |
|   F32   |    2^24    |   1.002 ms |       0.44% | 307.830 us |       3.02% |   -693.705 us | -69.26% |   FAIL   |
|   F32   |    2^28    |  14.671 ms |       0.12% |   3.373 ms |       0.13% | -11297.900 us | -77.01% |   FAIL   |
|   F64   |    2^16    | 104.693 us |       1.49% |  75.549 us |       1.86% |    -29.144 us | -27.84% |   FAIL   |
|   F64   |    2^20    | 130.312 us |       0.88% |  98.466 us |       1.44% |    -31.846 us | -24.44% |   FAIL   |
|   F64   |    2^24    |   1.105 ms |       0.30% | 503.757 us |       0.28% |   -601.484 us | -54.42% |   FAIL   |
|   F64   |    2^28    |  16.256 ms |       0.08% |   6.608 ms |       0.06% |  -9648.092 us | -59.35% |   FAIL   |
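
The Diff and %Diff columns can be reproduced from the Ref and Cmp times. The sketch below assumes Diff = Cmp - Ref and %Diff = Diff / Ref, which matches the rows above; `percent_diff` and `speedup` are hypothetical helper names, not part of any benchmark tooling.

```cpp
#include <cassert>
#include <cmath>

// Assumed meaning of the table columns: Diff = Cmp - Ref, %Diff = Diff / Ref.
// Both times must be given in the same unit.
double percent_diff(double ref_time, double cmp_time)
{
  return 100.0 * (cmp_time - ref_time) / ref_time;
}

// Speedup factor of the new implementation (Cmp) over the reference (Ref).
double speedup(double ref_time, double cmp_time)
{
  return ref_time / cmp_time;
}
```

For the I8 rows, `percent_diff(101.896, 72.727)` gives about -28.63 (roughly a 1.4x speedup at 2^16 elements), while `speedup(13.675, 1.119)` gives about 12.2x at 2^28 elements.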

@bernhardmgruber bernhardmgruber added the thrust For all items related to Thrust. label Jun 14, 2024
@bernhardmgruber bernhardmgruber requested review from a team as code owners June 14, 2024 16:58
@bernhardmgruber bernhardmgruber changed the title from "Improv equals" to "Benchmark and improve thrust::equals" Jun 14, 2024
Comment on lines 43 to 52
+  // TODO(bgruber): we could implement this even better using an early exit
+  return thrust::count_if(exec, first, last, thrust::detail::not1(pred)) == 0;
 }

 template <typename ExecutionPolicy, typename InputIterator, typename Predicate>
 _CCCL_HOST_DEVICE bool
 any_of(thrust::execution_policy<ExecutionPolicy>& exec, InputIterator first, InputIterator last, Predicate pred)
 {
-  return thrust::find_if(exec, first, last, pred) != last;
+  // TODO(bgruber): we could implement this even better using an early exit
+  return thrust::count_if(exec, first, last, pred) > 0;
Collaborator

important: using a reduction has performance advantages, but it also introduces significant performance regressions compared to the short-circuiting we have today. If there's a match (or mismatch, depending on the algorithm) early in the sequence, there's no need to read the rest of the vector or evaluate the comparison operator. Here are some performance results:

image

The performance data above compares find vs count on 2^28 int32s. Since int32 is cheap to load and compare, count becomes faster once the match (or mismatch) occurs after the first 4% of the input. But if we compare different vectors, find hits a difference immediately and never loads the remainder of the vector. This makes find ~9x faster in this scenario.

But both the speedup and the inflection point depend on the cost of loading the data plus the cost of evaluating the binary operator. Here is the same comparison for double with a slightly fancier comparison operator (comparing the absolute difference against an epsilon):
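
A comparison operator of that kind might look like the sketch below. The exact operator used for the measurement is not shown in this thread, so the functor name `approx_equal`, the `<=` comparison, and the wrapper `approx_equal_ranges` are all assumptions for illustration; `std::equal` stands in for the Thrust algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical comparison functor: two values count as equal when their
// absolute difference is within a tolerance. This is more work per element
// than a plain ==, which shifts the find-vs-count trade-off.
struct approx_equal
{
  double eps;
  bool operator()(double a, double b) const
  {
    return std::fabs(a - b) <= eps;
  }
};

// Hypothetical wrapper comparing two vectors with the functor above.
bool approx_equal_ranges(const std::vector<double>& a, const std::vector<double>& b, double eps)
{
  return a.size() == b.size() && std::equal(a.begin(), a.end(), b.begin(), approx_equal{eps});
}
```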

image

Here we already see a 16x slowdown from switching to count, and the inflection point shifts further, to 6% of the input.

The performance regressions seem severe enough that we should not rush to merge this PR. We wanted to ask @gonidelis to explore a slightly different reduction algorithm, in which we store an atomic flag once the condition is satisfied, so that later thread blocks exit early instead of loading data. The idea is that this approach would give us more moderate performance regressions in the cases mentioned while providing comparable speedups.
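
The flag idea can be sketched on the host as follows. This is not the proposed GPU implementation: each chunk merely plays the role of a thread block, the chunks run sequentially rather than concurrently, and `scan_result` and `any_of_with_flag` are hypothetical names. The caller is assumed to pass a `chunk_size` greater than zero.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

struct scan_result
{
  bool found;              // whether any element satisfied the predicate
  std::size_t chunks_read; // how many chunks actually loaded their data
};

// Hypothetical host analogue of a reduction with an early-exit flag.
// A chunk checks the shared flag first and skips its loads if it is set.
template <typename T, typename Pred>
scan_result any_of_with_flag(const std::vector<T>& data, std::size_t chunk_size, Pred pred)
{
  std::atomic<bool> found{false};
  std::size_t chunks_read = 0;
  for (std::size_t lo = 0; lo < data.size(); lo += chunk_size)
  {
    if (found.load(std::memory_order_relaxed))
    {
      break; // later "blocks" exit early instead of loading data
    }
    const std::size_t hi = std::min(lo + chunk_size, data.size());
    ++chunks_read;
    // Inside a chunk, scan reduction-style without short-circuiting.
    const bool hit = std::count_if(data.begin() + lo, data.begin() + hi, pred) > 0;
    if (hit)
    {
      found.store(true, std::memory_order_relaxed);
    }
  }
  return {found.load(), chunks_read};
}
```

When the match sits in the first chunk, only that chunk is read; when there is no match, every chunk is read, as with a plain count.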

Contributor Author

This is certainly useful data, and it shows again that different strategies are likely needed for different data types or sequence lengths. This PR by no means solves the performance issues of thrust::equals, or even addresses them satisfactorily, and I am super happy that @gonidelis will address this properly in the future! However, in many CUB/Thrust unit tests I have touched so far, the compile time and run time of thrust::vector_base<...>::operator== are a significant bottleneck, and it depends on thrust::equals for comparison. Also, specific to the tests, we usually expect both ranges inside a CHECK to actually be equal, so we tend to run into the worst-case scenario of find_if. I understand that this is a bias and that users will have different workloads. But I saw low-hanging fruit here to improve test suite compile time and run time, plus practice writing a small benchmark, in about an hour, so it made perfect sense to me to create this changeset.

Btw, the compile time of the benchmark in this PR, basic.cu, changed from 22.611s to 16.241s with the new implementation in this PR. I don't have numbers on the improvements of compiling and running our entire test suite, but I can produce the numbers if you are interested.

Contributor Author

Btw, I took these two issues, #720 and #712, as inspiration to go for count_if, since both of them report a several-fold improvement. It also made a lot of sense to me to route the implementation through all_of, since all_of does not need to compute any kind of "first occurrence" of a predicate match, as find_if does, and does not need to read the entire input, as count_if does. So if you really want to block this PR, I would like to at least merge the reimplementation of equals based on all_of rather than mismatch (which implies an order).

@bernhardmgruber bernhardmgruber marked this pull request as draft June 17, 2024 12:09
@bernhardmgruber bernhardmgruber force-pushed the improv_equals branch 2 times, most recently from 4ff24a4 to ac329d5 Compare June 24, 2024 22:40
@bernhardmgruber bernhardmgruber force-pushed the improv_equals branch 2 times, most recently from abdb32e to 974b0bb Compare July 5, 2024 09:46
@bernhardmgruber bernhardmgruber changed the title from "Benchmark and improve thrust::equals" to "Improve thrust::equals" Jul 17, 2024
@bernhardmgruber
Contributor Author

Closing since most of this PR was merged in #1944 and @gonidelis is following up on this work elsewhere.
