
[FEA] Support XXHash64 hashing #12829

Closed
ttnghia opened this issue Feb 23, 2023 · 18 comments
Labels
0 - Backlog (In queue waiting for assignment) · feature request (New feature or request) · libcudf (Affects libcudf (C++/CUDA) code) · Spark (Functionality that helps Spark RAPIDS)

Comments

@ttnghia
Contributor

ttnghia commented Feb 23, 2023

Many operations/algorithms in Apache Spark use hashing for computation. Most of them (if not all) rely on the XXHash64 hasher, such as Bloom filter, approx_count_distinct and more. We want to port these operations to run on the GPU, but currently we don't yet have a GPU version of XXHash64 implemented in cudf. We could use other (GPU) hashers to implement these operations instead; however, if we want to match the output from Spark on the CPU then we need XXHash64 support in cudf.

Sometimes it is important to match the output of Spark CPU exactly so that the output from cudf can be fed into existing Spark CPU functions without data corruption. In addition, having that function makes it easier to debug and compare results against Spark CPU in some cases, such as when implementing approx_count_distinct. Thus, supporting XXHash64 is necessary.

ttnghia added the feature request (New feature or request), Needs Triage (Need team to review and classify), and Spark (Functionality that helps Spark RAPIDS) labels on Feb 23, 2023
@bdice
Contributor

bdice commented Feb 23, 2023

This has been on my radar before, for different reasons. I believe XXHash32 (note, not 64) might be more efficient than our default MurmurHash3 implementation while providing similar quality guarantees (such as avalanching). According to benchmarks, the CPU performance of XXHash32 is significantly higher than that of MurmurHash3 (of course, we would expect different performance characteristics on the GPU, as it would likely be memory-bound).

However, throughput is not the only characteristic we might care about here. If the register usage is lower for XXHash32 than MurmurHash3_32, it might benefit some of cudf's most complex kernels. Mixed joins, for instance, combine hashing and AST execution. Similar ideas have been mentioned before in issues like #10587. I'm not confident that this would make a positive difference (I expect no significant effect, based on our past explorations of mixed joins), but it would be worth checking.

I am definitely interested in exploring this, but I have a lot on my plate right now. If someone is interested in giving this a shot or wants any guidance, let me know. Here's a helpful reference: https://richardstartin.github.io/posts/xxhash
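
For context (an editorial sketch, not part of the original comment), the XXH32 path for a single 4-byte key is just one multiply-rotate step plus the final avalanche, which is why its register footprint should be small; the constants are the standard XXH32 primes:

```cpp
#include <cstdint>

constexpr uint32_t PRIME32_2 = 0x85EBCA77u;
constexpr uint32_t PRIME32_3 = 0xC2B2AE3Du;
constexpr uint32_t PRIME32_4 = 0x27D4EB2Fu;
constexpr uint32_t PRIME32_5 = 0x165667B1u;

constexpr uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

// XXH32 of a single 4-byte key (the short-input path of the full algorithm).
constexpr uint32_t xxhash32_u32(uint32_t key, uint32_t seed = 0)
{
  uint32_t h = seed + PRIME32_5 + 4u;  // 4 = input length in bytes
  h += key * PRIME32_3;                // consume the one 4-byte lane
  h = rotl32(h, 17) * PRIME32_4;
  h ^= h >> 15;                        // final avalanche
  h *= PRIME32_2;
  h ^= h >> 13;
  h *= PRIME32_3;
  h ^= h >> 16;
  return h;
}
```

Inputs of 16 bytes or more go through four parallel accumulators before this finalization, but fixed-width columns mostly hit the short path shown above.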

@bdice
Contributor

bdice commented Feb 23, 2023

cuCollections might also benefit from this. cc: @PointKernel

GregoryKimball added the 0 - Backlog (In queue waiting for assignment) and libcudf (Affects libcudf (C++/CUDA) code) labels and removed the Needs Triage (Need team to review and classify) label on Apr 2, 2023
@sleeepyjack

FYI I'm going to pick this up from the cuCollections side once we have SWIPAT approval: NVIDIA/cuCollections#290

@ttnghia
Contributor Author

ttnghia commented Apr 6, 2023

@sleeepyjack Please clarify whether you are going to implement both 32-bit and 64-bit hashing. Here we need the 64-bit version.

@sleeepyjack

@ttnghia I was going to implement both XXH32 and XXH64. I'm reading through the code and some blog posts right now, and it doesn't seem to be that hard to do (I know, famous last words). The vectorized version, i.e., XXH3_64, needs a bit more thought on how to map it efficiently onto the CUDA architecture, so I would look at that afterward.

@sleeepyjack

sleeepyjack commented Apr 6, 2023

What is the typical key length libcudf is optimizing for? Knowing this would help generate some isolated benchmarks. For cuco itself, I would like to also have good performance for small keys, e.g., 4-8 bytes, as that fits many use cases outside of libcudf.

@bdice
Contributor

bdice commented Apr 6, 2023

Absent other knowledge, a fixed-width input like int64 would be a great candidate. libcudf does hash strings of arbitrary size, too, but most columns are fixed-width types and we compute hashes of those fixed-width types very frequently.

If you want a variable-size input (like strings), aim for a distribution between 20 and 200 bytes per string. I'm making this up, but it's probably a decent representation.

@ttnghia
Contributor Author

ttnghia commented Apr 6, 2023

@jlowe probably knows more and can answer that for our needs (Spark).

@sleeepyjack

sleeepyjack commented Apr 6, 2023

So, if your input is 64-bit in most cases, using a 32-bit hash value in the current implementation probably generates a lot of otherwise avoidable collisions, I guess?

@sleeepyjack

sleeepyjack commented Apr 6, 2023

For 4/8 byte types I typically just use the murmur3 integer finalizer, which consists of only a handful of xor-shifts. It passes the DieHarder test and gets rid of the expensive loop traversing the input. Also, it's a true permutation of the underlying domain.
Here is how it looks for both 4 and 8 byte types: https://github.com/sleeepyjack/warpcore/blob/ba5a6a84a8797464e2ce8dd03b269907ee70e7c2/include/warpcore/hashers.cuh#L116

If that is of interest, I can create a template specialization for that in cuco.
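
For reference (an editorial addition, not part of the comment above), the standard MurmurHash3 integer finalizers are xor-shifts interleaved with multiplications by odd constants, which makes each one a bijection on its domain:

```cpp
#include <cstdint>

// MurmurHash3 32-bit finalizer (fmix32); mark __host__ __device__ for GPU use.
inline uint32_t fmix32(uint32_t h)
{
  h ^= h >> 16;
  h *= 0x85ebca6bu;
  h ^= h >> 13;
  h *= 0xc2b2ae35u;
  h ^= h >> 16;
  return h;
}

// MurmurHash3 64-bit finalizer (fmix64), the 8-byte analogue.
inline uint64_t fmix64(uint64_t k)
{
  k ^= k >> 33;
  k *= 0xff51afd7ed558ccdULL;
  k ^= k >> 33;
  k *= 0xc4ceb9fe1a85ec53ULL;
  k ^= k >> 33;
  return k;
}
```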

@bdice
Contributor

bdice commented Apr 6, 2023

So, if your input is 64-bit in most cases, using a 32-bit hash value in the current implementation probably generates a lot of otherwise avoidable collisions, I guess?

Yes, that is theoretically possible, but usually this is not an issue at all due to low cardinality (far less than 2^32 unique elements) or limitations of libcudf itself: we only support up to 2^31 rows in a column in libcudf anyway.

@bdice
Contributor

bdice commented Apr 6, 2023

For 4/8 byte types I typically just use the murmur3 integer finalizer

That's pretty awesome. 👍 I'd like to know the avalanching properties, too, if you have knowledge/references on that, i.e., does changing one input bit tend to produce changes in about half of the output bits? That's an important characteristic. (Maybe passing DieHarder covers this? Not sure.)

@sleeepyjack

sleeepyjack commented Apr 6, 2023

I'd like to know the avalanching properties, too, if you have knowledge/references on that.

Of course, that's an important property for hash maps. Let me reach out to @gravitino, who did some experiments on this in the past, which led me to choose this hash func as the default hasher for my previous GPU hash map library.

@ttnghia
Contributor Author

ttnghia commented Apr 6, 2023

So, if your input is 64-bit in most cases, using a 32-bit hash value in the current implementation probably generates a lot of otherwise avoidable collisions, I guess?

Yes, that is theoretically possible, but usually this is not an issue at all due to low cardinality (far less than 2^32 unique elements) or limitations of libcudf itself: we only support up to 2^31 rows in a column in libcudf anyway.

Please remember that this is in the context of just cudf. For others like Spark, we can have very big columns, which will be partitioned into multiple small chunks to fit within the cudf limit before being passed into cudf.

@jlowe
Member

jlowe commented Apr 6, 2023

For the Spark use case, it can hash arbitrary types. For the Bloom-filter-assisted join case, which is one of the use cases for this hash function, the types being hashed would be any type Spark can use as a join key (e.g., strings, arrays, structs, decimals, ints, ...).

Regarding key length, I would recommend the following benchmark cases:

  • 32-bit key
  • 64-bit key
  • "small" (e.g.: 16 bytes or less per row) string column
  • "medium" (e.g.: around 256 bytes per row) string column
  • "large" (e.g.: > 8192 bytes per row) string column
  • array/list column variants of the above types, which is effectively a segmented hash kernel on the underlying data type, using the list column offsets to compute the counts per segment (see the sketch below)
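
To make the last bullet concrete, here is a rough editorial sketch (not from the thread; it uses a placeholder FNV-1a loop standing in for a real XXHash64, and hypothetical names) of a segmented hash kernel driven by list-column offsets:

```cuda
#include <cstdint>

// Placeholder byte-wise hash (FNV-1a); a real kernel would run XXHash64 here.
__device__ uint64_t hash_bytes(uint8_t const* data, int32_t len, uint64_t seed)
{
  uint64_t h = seed ^ 0xcbf29ce484222325ULL;  // FNV-1a offset basis, xor'd with seed
  for (int32_t i = 0; i < len; ++i) {
    h ^= data[i];
    h *= 0x100000001b3ULL;  // FNV-1a prime
  }
  return h;
}

// One thread per list row hashes the byte range [offsets[row], offsets[row + 1])
// of the flattened child data.
__global__ void segmented_hash_kernel(uint8_t const* child_bytes,
                                      int32_t const* offsets,  // num_rows + 1 entries
                                      uint64_t* out,
                                      int32_t num_rows,
                                      uint64_t seed)
{
  auto const row = static_cast<int32_t>(blockIdx.x * blockDim.x + threadIdx.x);
  if (row >= num_rows) return;
  auto const begin = offsets[row];
  auto const end   = offsets[row + 1];
  out[row] = hash_bytes(child_bytes + begin, end - begin, seed);
}
```

A production kernel would also handle nulls and nested children, but the offset-driven segmentation is the part the bullet describes.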

@vyasr
Contributor

vyasr commented May 17, 2024

@bdice @sleeepyjack @ttnghia NVIDIA/cuCollections#310 was merged a while ago, so I think we could move forward here. Is this issue actionable now? No worries if it's just not high priority, but I wanted to bring attention to it again in case it's something that should be prioritized and we just missed that it was unblocked.

@davidwendt
Contributor

xxhash64 has been implemented in libcudf:

std::unique_ptr<column> xxhash_64(
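
(Editorial note, not part of the comment: a minimal usage sketch, assuming the current cudf::hashing::xxhash_64 API takes a table_view with a defaulted seed; check <cudf/hashing.hpp> for the exact parameters.)

```cpp
#include <cudf/column/column.hpp>
#include <cudf/hashing.hpp>
#include <cudf/table/table_view.hpp>

#include <memory>

// Hash every row of `input` with XXHash64, producing one UINT64 hash per row.
// Assumes the defaulted seed/stream/memory-resource arguments.
std::unique_ptr<cudf::column> hash_rows(cudf::table_view const& input)
{
  return cudf::hashing::xxhash_64(input);
}
```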

But I believe Spark had unique requirements, and that @nvdbaranec implemented a Spark-specific version in the Spark RAPIDS repo.

So I think this issue can be closed.

@bdice
Contributor

bdice commented May 18, 2024

@davidwendt is right, I believe this can be closed. Someday I’d still like to try swapping our default hasher from Murmur to XXHash as I alluded to in this comment: #12829 (comment)

That can be done separately from this issue so I’ll go ahead and close this.

@bdice bdice closed this as completed May 18, 2024