-
Notifications
You must be signed in to change notification settings - Fork 205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve thrust::equals #1870
Improve thrust::equals #1870
Conversation
// TODO(bgruber): we could implement this even better using an early exit | ||
return thrust::count_if(exec, first, last, thrust::detail::not1(pred)) == 0; | ||
} | ||
|
||
template <typename ExecutionPolicy, typename InputIterator, typename Predicate> | ||
_CCCL_HOST_DEVICE bool | ||
any_of(thrust::execution_policy<ExecutionPolicy>& exec, InputIterator first, InputIterator last, Predicate pred) | ||
{ | ||
return thrust::find_if(exec, first, last, pred) != last; | ||
// TODO(bgruber): we could implement this even better using an early exit | ||
return thrust::count_if(exec, first, last, pred) > 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
important: using reduction has performance advantages, but also introduces significant performance regressions compared to short circuiting that we have today. If there's a match (or mismatch, depending on the algorithm) early in the sequence, there's no need to read the remaining vector or perform comparison operator. Here are some performance results:

Performance data above compares find vs count on 2^28 int32s. Since int32 is easy to load and comparison is fast to perform, we see that count becomes faster if the match (mismatch) is after 4% of the input size. But if we compare different vectors, find will bump into difference immediately and we won't load the reminder of the vector. This makes find ~9x faster in this scenario.
But both speedup and inflection point depend on the price of loading data + price of computing the binary operator. Here is the same comparison for double with a bit more fancy comparison operator (comparing absolute difference with an epsilon):

Here we already see 16x slowdown from transitioning to count, and the inflection point shifted much further to 6% of the input.
The performance regressions seem to be severe enough for us not to rush with merging this PR. We wanted to ask @gonidelis to explore a bit different reduction algorithm, where we'd store atomic flag upon satisfying a condition, so that later thread blocks would early exist instead of loading data. The idea is that this approach would get us more moderate performance regressions in mentioned cases while providing comparable speedups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is certainly useful data and shows again that there are likely different strategies to be employed for different data types or sequence lengths. This PR does by no mean solve the performance issues of thrust::equals
or even address it somewhat satisfactory and I am super happy that @gonidelis will address this properly in the future! However, in many CUB/Thrust unit tests I have touched so far, the compile and runtime of thrust::vector_base<...>::operator==
is a significant bottleneck, and it depends on thrust::equals
for comparison. Also specific to the tests, we usually expect both ranges inside a CHECK
to actually be equal, so we tend to run into the worst case scenario of find_if
. I understand this bias and users will also have different workloads. But I saw a low hanging fruit here to improve test suite compile and runtime, plus exercise how to write a small benchmark, in about an hour, so it made perfect sense for me to create this changeset.
Btw, the compile time of the benchmark in this PR, basic.cu
, changed from 22.611s
to 16.241s
with the new implementation in this PR. I don't have numbers on the improvements of compiling and running our entire test suite, but I can produce the numbers if you are interested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, I took these two issues, #720 and #712, as inspiration to go for count_if
, since both of them claim that it showed a several-factor improvement. It also made a lot of sense to me to route the implementation through all_of
, since all_of
does not need to compute any kind of "first occurence" of a predicate match, like find_if
, and all_of
would not require reading the entire input, like count_if
. So if you really want to block this PR, I would like to at least merge the reimplementation of equals
based on all_of
and not mismatch
(which implies an order).
4ff24a4
to
ac329d5
Compare
abdb32e
to
974b0bb
Compare
974b0bb
to
1daea9e
Compare
Closing since most of this PR was merged in #1944 and @gonidelis is following up on this work elsewhere. |
It's Friday and I am upset with the performance of comparing two thrust vectors. A good deal of unit tests is affected by this. This PR:
Adds a benchmark forDone in Add a benchmark for thrust::equal #1944thrust::equals
thrust::equals
in terms ofthrust::all_of
thrust::all_of
in terms ofthrust::count_if
, which according tothrust::all_of
is slower than a naive reduction #720 is >10x faster.Results of the new
thrust::equals
benchmark from before and after applying 2. and 3.: