
[FEA] Support XXHash64 hashing #12829

Closed
ttnghia opened this issue Feb 23, 2023 · 18 comments
Labels
0 - Backlog (In queue waiting for assignment) · feature request (New feature or request) · libcudf (Affects libcudf (C++/CUDA) code) · Spark (Functionality that helps Spark RAPIDS)

Comments

@ttnghia
Contributor

ttnghia commented Feb 23, 2023

Many operations/algorithms in Apache Spark use hashing for computation. Most of them (if not all) rely on the XXHash64 hasher, such as Bloom filter, approx_count_distinct and more. We want to port these operations to run on the GPU, but currently we don't yet have a GPU version of XXHash64 implemented in cudf. We could use other (GPU) hashers to implement these operations instead; however, if we want to match the output from Spark on the CPU then we need XXHash64 support in cudf.

Sometimes it is important to match the output of Spark CPU exactly so that the output from cudf can be fed into existing Spark CPU functions without data corruption. In addition, having that function makes it easier to debug and compare results against Spark CPU in some cases, such as when implementing approx_count_distinct. Thus, supporting XXHash64 is necessary.

ttnghia added the feature request (New feature or request), Needs Triage (Need team to review and classify), and Spark (Functionality that helps Spark RAPIDS) labels on Feb 23, 2023
@bdice
Contributor

bdice commented Feb 23, 2023

This has been on my radar before, for different reasons. I believe XXHash32 (note, not 64) might be more efficient than our default MurmurHash3 implementation while providing similar quality guarantees (such as avalanching). According to benchmarks, the CPU performance of XXHash32 is significantly higher than that of MurmurHash3 (of course, we would expect different performance characteristics on the GPU, as it would likely be memory-bound).

However, throughput is not the only characteristic we might care about here. If the register usage is lower for XXHash32 than MurmurHash3_32, it might benefit some of cudf's most complex kernels. Mixed joins, for instance, combine hashing and AST execution. Similar ideas have been mentioned before in issues like #10587. I'm not confident that this would make a positive difference (I expect no significant effect, based on our past explorations of mixed joins), but it would be worth checking.

I am definitely interested in exploring this, but I have a lot on my plate right now. If someone is interested in giving this a shot or wants any guidance, let me know. Here's a helpful reference: https://richardstartin.github.io/posts/xxhash
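
For context (an editorial sketch, not part of the original comment), the XXH32 path for a single 4-byte key is just one multiply-rotate step plus the final avalanche, which is why its register footprint should be small; the constants are the standard XXH32 primes:

```cpp
#include <cstdint>

constexpr uint32_t PRIME32_2 = 0x85EBCA77u;
constexpr uint32_t PRIME32_3 = 0xC2B2AE3Du;
constexpr uint32_t PRIME32_4 = 0x27D4EB2Fu;
constexpr uint32_t PRIME32_5 = 0x165667B1u;

constexpr uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

// XXH32 of a single 4-byte key (the short-input path of the full algorithm).
constexpr uint32_t xxhash32_u32(uint32_t key, uint32_t seed = 0)
{
  uint32_t h = seed + PRIME32_5 + 4u;  // 4 = input length in bytes
  h += key * PRIME32_3;                // consume the one 4-byte lane
  h = rotl32(h, 17) * PRIME32_4;
  h ^= h >> 15;                        // final avalanche
  h *= PRIME32_2;
  h ^= h >> 13;
  h *= PRIME32_3;
  h ^= h >> 16;
  return h;
}
```

Inputs of 16 bytes or more go through four parallel accumulators before this finalization, but fixed-width columns mostly hit the short path shown above.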

@bdice
Contributor

bdice commented Feb 23, 2023

cuCollections might also benefit from this. cc: @PointKernel

GregoryKimball added the 0 - Backlog (In queue waiting for assignment) and libcudf (Affects libcudf (C++/CUDA) code) labels and removed the Needs Triage (Need team to review and classify) label on Apr 2, 2023
@sleeepyjack

FYI I'm going to pick this up from the cuCollections side once we have SWIPAT approval: NVIDIA/cuCollections#290

@ttnghia
Contributor Author

ttnghia commented Apr 6, 2023

@sleeepyjack Please clarify whether you are going to implement both 32-bit and 64-bit hashing. Here we need the 64-bit version.

@sleeepyjack

@ttnghia I was going to implement both XXH32 and XXH64. I'm reading through the code and some blog posts right now, and it doesn't seem to be that hard to do (I know, famous last words). The vectorized version, i.e., XXH3_64, needs a bit more thought on how to map it efficiently onto the CUDA architecture, so I would look at that afterward.

@sleeepyjack

sleeepyjack commented Apr 6, 2023

What is the typical key length libcudf is optimizing for? Knowing this would help generate some isolated benchmarks. For cuco itself, I would like to also have good performance for small keys, e.g., 4-8 bytes, as that fits many use cases outside of libcudf.

@bdice
Contributor

bdice commented Apr 6, 2023

Absent other knowledge, a fixed-width input like int64 would be a great candidate. libcudf does hash strings of arbitrary size, too, but most columns are fixed-width types and we compute hashes of those fixed-width types very frequently.

If you want a variable-size input (like strings), aim for a distribution between 20 and 200 bytes per string. I'm making this up, but it's probably a decent representation.

@ttnghia
Contributor Author

ttnghia commented Apr 6, 2023

@jlowe probably knows more and can answer that for our needs (Spark).

@sleeepyjack

sleeepyjack commented Apr 6, 2023

So, if your input is 64-bit in most cases, using a 32-bit hash value in the current implementation probably generates a lot of otherwise avoidable collisions, I guess?

@sleeepyjack

sleeepyjack commented Apr 6, 2023

For 4/8 byte types I typically just use the murmur3 integer finalizer, which consists of only a handful of xor-shifts. It passes the DieHarder test and gets rid of the expensive loop traversing the input. Also, it's a true permutation of the underlying domain.
Here is how it looks for both 4 and 8 byte types: https://github.com/sleeepyjack/warpcore/blob/ba5a6a84a8797464e2ce8dd03b269907ee70e7c2/include/warpcore/hashers.cuh#L116

If that is of interest, I can create a template specialization for that in cuco.
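
For reference (an editorial addition, not part of the comment above), the standard MurmurHash3 integer finalizers are xor-shifts interleaved with multiplications by odd constants, which makes each one a bijection on its domain:

```cpp
#include <cstdint>

// MurmurHash3 32-bit finalizer (fmix32); mark __host__ __device__ for GPU use.
inline uint32_t fmix32(uint32_t h)
{
  h ^= h >> 16;
  h *= 0x85ebca6bu;
  h ^= h >> 13;
  h *= 0xc2b2ae35u;
  h ^= h >> 16;
  return h;
}

// MurmurHash3 64-bit finalizer (fmix64), the 8-byte analogue.
inline uint64_t fmix64(uint64_t k)
{
  k ^= k >> 33;
  k *= 0xff51afd7ed558ccdULL;
  k ^= k >> 33;
  k *= 0xc4ceb9fe1a85ec53ULL;
  k ^= k >> 33;
  return k;
}
```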

@bdice
Contributor

bdice commented Apr 6, 2023

So, if your input is 64-bit in most cases, using a 32-bit hash value in the current implementation probably generates a lot of otherwise avoidable collisions, I guess?

Yes, that is theoretically possible, but usually this is not an issue at all due to low cardinality (far less than 2^32 unique elements) or limitations of libcudf itself: we only support up to 2^31 rows in a column in libcudf anyway.

@bdice
Contributor

bdice commented Apr 6, 2023

For 4/8 byte types I typically just use the murmur3 integer finalizer

That's pretty awesome. 👍 I'd like to know the avalanching properties, too, if you have knowledge/references on that, i.e., does changing one input bit tend to produce changes in about half of the output bits? That's an important characteristic. (Maybe passing DieHarder covers this? Not sure.)

@sleeepyjack

sleeepyjack commented Apr 6, 2023

I'd like to know the avalanching properties, too, if you have knowledge/references on that.

Of course, that's an important property for hash maps. Let me reach out to @gravitino, who did some experiments on this in the past, which led me to choose this hash func as the default hasher for my previous GPU hash map library.

@ttnghia
Contributor Author

ttnghia commented Apr 6, 2023

So, if your input is 64-bit in most cases, using a 32-bit hash value in the current implementation probably generates a lot of otherwise avoidable collisions, I guess?

Yes, that is theoretically possible, but usually this is not an issue at all due to low cardinality (far less than 2^32 unique elements) or limitations of libcudf itself: we only support up to 2^31 rows in a column in libcudf anyway.

Please remember that this is in the context of just cudf. For others like Spark, we can have very big columns, which will be partitioned into multiple small chunks to fit within the cudf limit before being passed into cudf.

@jlowe
Member

jlowe commented Apr 6, 2023

For the Spark use case, it can hash arbitrary types. For the Bloom-filter-assisted join case, which is one of the use cases for this hash function, the types being hashed would be any type Spark can use as a join key (e.g., strings, arrays, structs, decimals, ints, ...).

Regarding key length, I would recommend the following benchmark cases:

  • 32-bit key
  • 64-bit key
  • "small" (e.g.: 16 bytes or less per row) string column
  • "medium" (e.g.: around 256 bytes per row) string column
  • "large" (e.g.: > 8192 bytes per row) string column
  • array/list column variants of the above types, which is effectively a segmented hash kernel on the underlying data type, using the list column offsets to compute the counts per segment (see the sketch below)
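
To make the last bullet concrete, here is a rough editorial sketch (not from the thread; it uses a placeholder FNV-1a loop standing in for a real XXHash64, and hypothetical names) of a segmented hash kernel driven by list-column offsets:

```cuda
#include <cstdint>

// Placeholder byte-wise hash (FNV-1a); a real kernel would run XXHash64 here.
__device__ uint64_t hash_bytes(uint8_t const* data, int32_t len, uint64_t seed)
{
  uint64_t h = seed ^ 0xcbf29ce484222325ULL;  // FNV-1a offset basis, xor'd with seed
  for (int32_t i = 0; i < len; ++i) {
    h ^= data[i];
    h *= 0x100000001b3ULL;  // FNV-1a prime
  }
  return h;
}

// One thread per list row hashes the byte range [offsets[row], offsets[row + 1])
// of the flattened child data.
__global__ void segmented_hash_kernel(uint8_t const* child_bytes,
                                      int32_t const* offsets,  // num_rows + 1 entries
                                      uint64_t* out,
                                      int32_t num_rows,
                                      uint64_t seed)
{
  auto const row = static_cast<int32_t>(blockIdx.x * blockDim.x + threadIdx.x);
  if (row >= num_rows) return;
  auto const begin = offsets[row];
  auto const end   = offsets[row + 1];
  out[row] = hash_bytes(child_bytes + begin, end - begin, seed);
}
```

A production kernel would also handle nulls and nested children, but the offset-driven segmentation is the part the bullet describes.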

@vyasr
Contributor

vyasr commented May 17, 2024

@bdice @sleeepyjack @ttnghia NVIDIA/cuCollections#310 was merged a while ago, so I think we could move forward here. Is this issue actionable now? No worries if it's just not high priority, but I wanted to bring attention to it again in case it's something that should be prioritized and we just missed that it was unblocked.

@davidwendt
Contributor

xxhash64 has been implemented in libcudf:

std::unique_ptr<column> xxhash_64(
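
(Editorial note, not part of the comment: a minimal usage sketch, assuming the current cudf::hashing::xxhash_64 API takes a table_view with a defaulted seed; check <cudf/hashing.hpp> for the exact parameters.)

```cpp
#include <cudf/column/column.hpp>
#include <cudf/hashing.hpp>
#include <cudf/table/table_view.hpp>

#include <memory>

// Hash every row of `input` with XXHash64, producing one UINT64 hash per row.
// Assumes the defaulted seed/stream/memory-resource arguments.
std::unique_ptr<cudf::column> hash_rows(cudf::table_view const& input)
{
  return cudf::hashing::xxhash_64(input);
}
```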

But I believe Spark had unique requirements, and that @nvdbaranec implemented a Spark-specific version in the Spark RAPIDS repo.

So I think this issue can be closed.

@bdice
Contributor

bdice commented May 18, 2024

@davidwendt is right, I believe this can be closed. Someday I’d still like to try swapping our default hasher from Murmur to XXHash as I alluded to in this comment: #12829 (comment)

That can be done separately from this issue so I’ll go ahead and close this.

@bdice bdice closed this as completed May 18, 2024