New hash function alternatives #310

Merged: 28 commits, May 25, 2023

@sleeepyjack (Collaborator) commented May 19, 2023

This PR introduces a set of new hash function alternatives, namely

  • xxhash_32 and xxhash_64: 32- and 64-bit versions of the famous xxHash hash function.
  • fmix_32 and fmix_64: 32- and 64-bit versions of the Murmur3 integer/avalanche finalizer.
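
For readers unfamiliar with the Murmur3 finalizer mentioned above, here is a minimal host-side sketch of the 32- and 64-bit avalanche steps. The constants are those of the reference MurmurHash3 implementation; the function names `fmix32` and `fmix64` are illustrative and are not the cuco functors added in this PR.

```cpp
#include <cstdint>

// Murmur3 32-bit finalizer: alternating xor-shifts and odd multiplications.
// Each step is invertible, so distinct inputs map to distinct outputs.
constexpr std::uint32_t fmix32(std::uint32_t h)
{
  h ^= h >> 16;
  h *= 0x85ebca6bu;
  h ^= h >> 13;
  h *= 0xc2b2ae35u;
  h ^= h >> 16;
  return h;
}

// Murmur3 64-bit finalizer: same structure with 64-bit constants.
constexpr std::uint64_t fmix64(std::uint64_t k)
{
  k ^= k >> 33;
  k *= 0xff51afd7ed558ccdull;
  k ^= k >> 33;
  k *= 0xc4ceb9fe1a85ec53ull;
  k ^= k >> 33;
  return k;
}

// Zero is a fixed point of both finalizers (every step maps 0 to 0).
static_assert(fmix32(0u) == 0u, "fmix32(0) must be 0");
static_assert(fmix64(0ull) == 0ull, "fmix64(0) must be 0");
```

Because the finalizer only mixes a single machine word, it is only applicable to 4-byte and 8-byte keys, which matches the isolated benchmark listing it for `int` and `long` only.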

xxhash_32 is now the default hash function for all cuco data structures, due to its better performance in scenarios where the data structure fits into L1$ or L2$. For larger data structures, the performance advantage over Murmurhash3 is negligible.

Benchmark scenario:
Run cuco::static_set::contains() multiple times in a row to ensure the set is resident in the cache. Target occupancy is set to 50%. We vary the table size to see the effects of different cache levels. Ref Time shows the runtime for murmurhash3_32 and Cmp Time shows the runtime for xxhash_32. All experiments were conducted on an NVIDIA A100-SXM4-80GB.

| Key | Distribution | NumInputs | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff |
|-----|--------------|-----------|----------|-----------|----------|-----------|------|-------|
| I32 | UNIQUE | 8000 | 654.880 us | 0.44% | 658.101 us | 0.40% | 3.220 us | 0.49% |
| I32 | UNIQUE | 80000 | 850.838 us | 0.36% | 813.983 us | 0.40% | -36.854 us | -4.33% |
| I32 | UNIQUE | 800000 | 2.372 ms | 2.98% | 2.319 ms | 0.10% | -52.546 us | -2.22% |
| I32 | UNIQUE | 8000000 | 28.026 ms | 0.16% | 27.976 ms | 0.02% | -50.574 us | -0.18% |
| I32 | UNIQUE | 80000000 | 319.231 ms | 0.03% | 318.746 ms | 0.01% | -484.905 us | -0.15% |
| I64 | UNIQUE | 8000 | 680.446 us | 0.40% | 646.500 us | 0.45% | -33.946 us | -4.99% |
| I64 | UNIQUE | 80000 | 854.186 us | 0.28% | 827.175 us | 0.32% | -27.010 us | -3.16% |
| I64 | UNIQUE | 800000 | 2.628 ms | 0.10% | 2.523 ms | 0.93% | -104.811 us | -3.99% |
| I64 | UNIQUE | 8000000 | 32.144 ms | 0.03% | 31.949 ms | 0.01% | -195.392 us | -0.61% |
| I64 | UNIQUE | 80000000 | 346.017 ms | 0.00% | 344.892 ms | 0.01% | -1124.911 us | -0.33% |
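
The Diff and %Diff columns appear to follow the usual comparison convention, Diff = Cmp − Ref and %Diff = Diff / Ref, so negative values mean xxhash_32 is faster. That reading is an assumption inferred from the numbers rather than something the PR states. A small sketch with hypothetical helper names:

```cpp
// Assumed convention: negative values mean the compared run (xxhash_32) is faster.
constexpr double diff_us(double ref_us, double cmp_us)
{
  return cmp_us - ref_us;
}

constexpr double percent_diff(double ref_us, double cmp_us)
{
  return 100.0 * (cmp_us - ref_us) / ref_us;
}
```

For the I32 row with 80000 inputs, `percent_diff(850.838, 813.983)` comes out at roughly -4.33, in line with the table.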

In addition, here are some isolated benchmarks calling each hash function in a tight loop on the device:

| Hash | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s |
|------|---------|----------|-------|----------|-------|--------|
| cuco::detail::MurmurHash3_32<int> | 332x | 1.520 ms | 0.68% | 1.510 ms | 0.26% | 662.091G |
| cuco::detail::MurmurHash3_32<long> | 174x | 2.891 ms | 0.33% | 2.882 ms | 0.02% | 347.000G |
| cuco::detail::MurmurHash3_32<key_128_bytes> | 39x | 12.910 ms | 0.08% | 12.900 ms | 0.01% | 77.521G |
| cuco::detail::XXHash_32<int> | 405x | 1.245 ms | 0.78% | 1.236 ms | 0.04% | 809.233G |
| cuco::detail::XXHash_32<long> | 206x | 2.443 ms | 0.40% | 2.434 ms | 0.02% | 410.887G |
| cuco::detail::XXHash_32<key_128_bytes> | 48x | 10.444 ms | 0.10% | 10.434 ms | 0.01% | 95.839G |
| cuco::detail::XXHash_64<int> | 193x | 2.614 ms | 0.37% | 2.604 ms | 0.02% | 384.051G |
| cuco::detail::XXHash_64<long> | 156x | 3.231 ms | 0.30% | 3.221 ms | 0.02% | 310.452G |
| cuco::detail::XXHash_64<key_128_bytes> | 21x | 24.833 ms | 0.04% | 24.822 ms | 0.01% | 40.286G |
| cuco::detail::MurmurHash3_fmix32<int> | 464x | 1.089 ms | 0.88% | 1.080 ms | 0.05% | 926.083G |
| cuco::detail::MurmurHash3_fmix64<long> | 277x | 1.819 ms | 0.54% | 1.809 ms | 0.02% | 552.668G |

Closes #290

@sleeepyjack added the labels "helps: rapids", "topic: performance", and "Needs Review" on May 19, 2023

Resolved review threads (outdated unless noted):

  • benchmarks/defaults.hpp
  • benchmarks/hash_bench.cu
  • include/cuco/detail/hash_functions/murmurhash3.cuh (2 threads)
  • include/cuco/detail/hash_functions/xxhash.cuh
  • include/cuco/dynamic_map.cuh
  • include/cuco/hash_functions.cuh (3 threads, one of them not outdated)

@sleeepyjack (Collaborator, Author):

@bdice would you be so kind as to leave a review if you have bandwidth?

@PointKernel (Member) left a comment:

Looks great!

@sleeepyjack merged commit 8e49022 into NVIDIA:dev on May 25, 2023

@bdice (Contributor) left a comment:

Awesome work here @sleeepyjack! I see this PR was already merged but I wanted to look it over just so I am familiar with the implementation. I have a few comments here that you can apply in a follow-up PR if you would like. 👍

@@ -78,3 +78,6 @@ ConfigureBench(DYNAMIC_MAP_BENCH
hash_table/dynamic_map/find_bench.cu
hash_table/dynamic_map/contains_bench.cu
hash_table/dynamic_map/erase_bench.cu)

ConfigureBench(HASH_BENCH
@bdice (Contributor):

Match existing style so this doesn't look like it belongs under the dynamic map benchmarks:

Suggested change
- ConfigureBench(HASH_BENCH
+ ###################################################################################################
+ # - hash benchmarks -------------------------------------------------------------------------------
+ ConfigureBench(HASH_BENCH

Comment on lines +29 to +30
using namespace cuco::benchmark;
using namespace cuco::utility;
@bdice (Contributor):

I'd prefer to avoid using declarations. If you write the full types, that will help readers who are unfamiliar with cuco internals (like me) know what namespaces contain the identifiers being used.

@sleeepyjack (Collaborator, Author):

Right, we consider benchmarks as user-facing code so we should not make these kinds of shortcuts. Since these namespaces are a bit wordy, how about two-character namespace aliases?

@bdice (Contributor):

I don't mind the verbosity at all, honestly. 😄 I'd vote cuco::utility over ut or cu or util any day.

Comment on lines +90 to +91
char const* const bytes = (char const*)&key; ///< per-byte access
uint32_t const* const blocks = (uint32_t const*)&key; ///< 4-byte word access
@bdice (Contributor):

Avoid C-style casts.
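
One way to act on this, sketched here as an illustration only and not as the code that landed in the follow-up PR: use a named `reinterpret_cast` for the byte view, and copy each word out with `std::memcpy` so the read does not depend on the key's alignment or on aliasing rules. The helper name `load_u32` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the index-th 4-byte word of `key`.
// The caller must ensure (index + 1) * 4 <= sizeof(Key).
template <typename Key>
std::uint32_t load_u32(Key const& key, std::size_t index)
{
  // Named-cast equivalent of `(char const*)&key`; viewing an object as bytes is always permitted.
  auto const* bytes = reinterpret_cast<unsigned char const*>(&key);

  // Copy the word instead of dereferencing a `uint32_t const*` into the key.
  std::uint32_t word;
  std::memcpy(&word, bytes + index * sizeof(word), sizeof(word));
  return word;
}
```

The snippet under review reads words through a casted pointer directly; replacing only the cast syntax with `reinterpret_cast<uint32_t const*>(&key)` would be the smallest possible change, while the `memcpy` form above is the more conservative option.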

}
}

// the following loop is only needed if the size of the key is no multiple of the block size
@bdice (Contributor):

Suggested change
- // the following loop is only needed if the size of the key is no multiple of the block size
+ // the following loop is only needed if the size of the key is not a multiple of the block size

template <typename Key>
struct XXHash_32 {
private:
static constexpr uint32_t prime1 = 0x9E3779B1U;
@bdice (Contributor):

I have a weak preference for lowercase hex literals / literal suffixes like 0x9e3779b1u because I find them more readable, but that's totally a matter of taste.

Comment on lines +240 to +242
char const* const bytes = (char const*)&key; ///< per-byte access
uint32_t const* const blocks4 = (uint32_t const*)&key; ///< 4-byte word access
uint64_t const* const blocks8 = (uint64_t const*)&key; ///< 8-byte word access
@bdice (Contributor):

Avoid C-style casts.

char const* const bytes = (char const*)&key; ///< per-byte access
uint32_t const* const blocks = (uint32_t const*)&key; ///< 4-byte word access

uint32_t offset = 0;
@bdice (Contributor):

Probably should be a std::size_t to match nbytes.

uint32_t const* const blocks4 = (uint32_t const*)&key; ///< 4-byte word access
uint64_t const* const blocks8 = (uint64_t const*)&key; ///< 8-byte word access

uint64_t offset = 0;
@bdice (Contributor):

Probably should be a std::size_t to match nbytes.

CHECK(h10(k10) == 2031761887105658523ULL);
}

// TODO SECTION("Check if device-generated hash values match the reference implementation.")
@bdice (Contributor):

Is this still a TODO?

@sleeepyjack (Collaborator, Author):

Ha, good catch! Yes, I'll add them in the next PR.

@sleeepyjack (Collaborator, Author):

@bdice thank you so much! Reading through your comments, they are all valid and I will address them in a follow-up PR.

@bdice (Contributor) commented May 25, 2023

Also, I didn't leave a comment about this but wanted to share my work: in cudf I chose to use an intrinsic to rotate left:

https://github.com/rapidsai/cudf/blob/37f76c820ddf833a80b4ce706b9c3e84908e51ff/cpp/include/cudf/detail/utilities/hash_functions.cuh#L62-L66

However, it looks like this compiles out to shf.l.wrap.b32 either way so it doesn't matter which you choose.
https://godbolt.org/z/8b199szzK

There is no 64-bit rotate left intrinsic (afaik) so I would not recommend a change here (so that 32-bit and 64-bit rotation functions look the same).
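
For reference, the shift-based form that keeps the 32-bit and 64-bit rotations looking the same can be sketched as below. This is a portable host-side illustration with hypothetical names (`rotl32`, `rotl64`), not the code from cuco or cudf; the masking only exists to keep a rotation by zero well defined.

```cpp
#include <cstdint>

// Rotate left by r bits; r is taken modulo the word width.
constexpr std::uint32_t rotl32(std::uint32_t x, unsigned r)
{
  return (x << (r & 31u)) | (x >> ((32u - r) & 31u));
}

constexpr std::uint64_t rotl64(std::uint64_t x, unsigned r)
{
  return (x << (r & 63u)) | (x >> ((64u - r) & 63u));
}

// The top bit wraps around to the bottom.
static_assert(rotl32(0x80000001u, 1) == 0x00000003u, "32-bit rotate wraps");
static_assert(rotl64(1ull, 63) == 0x8000000000000000ull, "64-bit rotate reaches the top bit");
```

As noted in the comment above, compilers typically recognize this pattern and emit a single rotate or funnel-shift instruction, so the portable form costs nothing over an intrinsic.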

Linked issue: [ENHANCEMENT]: Add XXHash