[FEA] Support XXHash64 (#12829)
Label: hashing

Comments
This has been on my radar before, for different reasons. I believe XXHash32 (note, not 64) might be more efficient than our default MurmurHash3 implementation while providing similar quality guarantees (such as avalanching). According to benchmarks, the CPU performance of XXHash32 is significantly higher than MurmurHash3 (of course, we would expect different performance characteristics on GPU, as it would likely be memory-bound). However, throughput is not the only characteristic we might care about here. If the register usage is lower for XXHash32 than MurmurHash3_32, it might benefit some of cudf's most complex kernels. Mixed joins, for instance, combine hashing and AST execution. Similar ideas have been mentioned before in issues like #10587. I'm not confident that this would make a positive difference (I expect no significant effect, based on our past explorations of mixed joins), but it would be worth checking. I am definitely interested in exploring this, but I have a lot on my plate right now. If someone is interested in giving this a shot or wants any guidance, let me know. Here's a helpful reference: https://richardstartin.github.io/posts/xxhash
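For anyone picking this up: the XXHash32 algorithm under discussion can be sketched in pure Python, following the public xxHash specification. This is an illustrative reimplementation, not cudf or cuco code; the constants and steps are from the spec, and it can be sanity-checked against the published empty-input test vector (0x02CC5D05).

```python
# Illustrative pure-Python XXH32 per the public xxHash spec (not cudf/cuco code).
# All arithmetic is modulo 2**32.
P1, P2, P3, P4, P5 = 2654435761, 2246822519, 3266489917, 668265263, 374761393
M = 0xFFFFFFFF

def rotl(x, r):
    return ((x << r) | (x >> (32 - r))) & M

def xxh32(data: bytes, seed: int = 0) -> int:
    n, i = len(data), 0
    if n >= 16:
        # Four parallel accumulators over 16-byte stripes.
        vs = [(seed + P1 + P2) & M, (seed + P2) & M, seed & M, (seed - P1) & M]
        while i + 16 <= n:
            for j in range(4):
                lane = int.from_bytes(data[i:i + 4], "little")
                vs[j] = (rotl((vs[j] + lane * P2) & M, 13) * P1) & M
                i += 4
        h = (rotl(vs[0], 1) + rotl(vs[1], 7) + rotl(vs[2], 12) + rotl(vs[3], 18)) & M
    else:
        h = (seed + P5) & M
    h = (h + n) & M
    # Remaining 4-byte words, then trailing bytes.
    while i + 4 <= n:
        lane = int.from_bytes(data[i:i + 4], "little")
        h = (rotl((h + lane * P3) & M, 17) * P4) & M
        i += 4
    while i < n:
        h = (rotl((h + data[i] * P5) & M, 11) * P1) & M
        i += 1
    # Final avalanche mix.
    h ^= h >> 15
    h = (h * P2) & M
    h ^= h >> 13
    h = (h * P3) & M
    h ^= h >> 16
    return h
```

The short stripe loop (one rotate and two multiplies per 4-byte lane) is what the benchmarks above attribute the CPU speedup to; on GPU the interesting question is whether it also needs fewer registers than MurmurHash3_32.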
cuCollections might also benefit from this. cc: @PointKernel
FYI I'm going to pick this up from the cuCollections side once we have SWIPAT approval: NVIDIA/cuCollections#290
@sleeepyjack Please clarify if you are going to implement both 32- and 64-bit hashing. Here we need the 64-bit version.
@ttnghia I was going to implement both XXH32 and XXH64. I'm reading through the code and some blog posts right now, and it doesn't seem to be that hard to do (I know, famous last words). The vectorized version, i.e. XXH3_64, needs a bit more thought on how to map it efficiently onto the CUDA architecture, so I would look at this afterward.
What is the typical key length libcudf is optimizing for? Knowing this would help generate some isolated benchmarks. For cuco itself, I would like to also have good performance for small keys, e.g., 4-8 bytes, as that fits many use cases outside of libcudf.
Absent other knowledge, assume a fixed-width input. If you wanted a variable-size input (like strings), aim for a distribution between 20 and 200 bytes per string. I'm making this up, but it's probably a decent representation.
@jlowe probably knows more about that and can answer for our needs (Spark).
So, if your input is 64-bit in most cases, using a 32-bit hash value in the current implementation probably generates a lot of otherwise avoidable collisions, I guess?
For 4/8-byte types I typically just use the murmur3 integer finalizer, which consists of only a handful of xor-shifts. It passes the DieHarder test and gets rid of the expensive loop traversing the input. Also, it's a true permutation of the underlying domain. If that is of interest, I can create a template specialization for that in cuco.
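For reference, the murmur3 integer finalizer mentioned here (often called fmix32 in the MurmurHash3 source) is just three xor-shifts and two odd-constant multiplies, so it is a bijection on 32-bit values; a sketch:

```python
def fmix32(h: int) -> int:
    """MurmurHash3's 32-bit finalizer (fmix32): xor-shifts and two
    odd-constant multiplies mod 2**32. Every step is invertible, so
    this is a true permutation of the 32-bit domain (no collisions
    for 32-bit keys)."""
    h &= 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h
```

Because it has no input loop, it compiles to a handful of instructions, which is why it is attractive for small fixed-width keys.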
Yes, that is theoretically possible, but usually this is not an issue at all due to low cardinality (far fewer than 2^32 unique elements) or limitations of libcudf itself: we only support up to 2^31 rows in a column in libcudf anyway.
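To put numbers on the collision question: under an ideal 32-bit hash, the expected number of colliding key pairs among n distinct keys is roughly C(n, 2) / 2^32 (the birthday approximation). A quick check shows this stays negligible at low cardinality but grows quadratically toward the 2^31 row limit:

```python
def expected_collisions(n: int, hash_bits: int = 32) -> float:
    """Expected number of colliding pairs for n distinct keys under an
    ideal hash_bits-bit hash: C(n, 2) / 2**hash_bits (birthday bound)."""
    return n * (n - 1) / 2 / 2 ** hash_bits

# A million distinct keys into a 32-bit space: only ~116 expected
# colliding pairs. Near the 2^31 row limit the same formula gives
# hundreds of millions, which is where a 64-bit hash starts to matter.
```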
That's pretty awesome. 👍 I'd like to know the avalanching properties too, if you have knowledge/references on that. I.e., does changing 1 input bit tend to produce changes in half of the output bits? That's an important characteristic. (Maybe passing DieHarder covers this? Not sure.)
Of course, that's an important property for hash maps. Let me reach out to @gravitino, who did some experiments on this in the past, which led me to choose this hash function as the default hasher for my previous GPU hash map library.
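The avalanche question can also be checked empirically: flip each input bit and count how many output bits change, averaged over random inputs. An ideal 32-bit mixer averages 16 changed bits per flipped bit. A sketch, using the fmix32 constants from MurmurHash3 as the mixer under test:

```python
import random

def fmix32(h: int) -> int:
    # MurmurHash3's 32-bit finalizer, used here as the mixer under test.
    h &= 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def avalanche_score(f, trials: int = 2000, bits: int = 32, seed: int = 1) -> float:
    """Average number of output bits that change when one input bit is
    flipped; ideal is bits / 2 (i.e. 16 for a 32-bit mixer)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x = rng.getrandbits(bits)
        base = f(x)
        for b in range(bits):
            total += bin(base ^ f(x ^ (1 << b))).count("1")
    return total / (trials * bits)
```

A score near 16 with low variance is the property asked about here; DieHarder tests statistical quality of the output stream, which overlaps with but does not directly measure per-bit avalanche.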
Please remember that this is in the context of just cudf. For others like Spark, we can have very big columns, which will be partitioned into multiple small chunks to fit within the cudf limit before being passed into cudf.
For the Spark use case, it can hash arbitrary types, and for the bloom-filter-assisted join case, which is one of the use cases for this hash function, the types being hashed would be any type Spark can use as a join key (e.g., strings, arrays, structs, decimals, ints, ...). Regarding key length, I would recommend the following benchmark cases:
@bdice @sleeepyjack @ttnghia NVIDIA/cuCollections#310 was merged a while ago, so I think we could move forward here. Is this issue actionable now? No worries if it's just not high priority, but I wanted to bring attention to it again in case it's something that should be prioritized and we just missed that it was unblocked.
The xxhash64 hasher has been implemented in libcudf: cudf/cpp/include/cudf/hashing.hpp, line 181 (as of commit e6e6761).
But I believe Spark has unique requirements, and so I believe @nvdbaranec implemented a Spark-specific version in the Spark RAPIDS repo. So I think this issue can be closed.
@davidwendt is right, I believe this can be closed. Someday I'd still like to try swapping our default hasher from Murmur to XXHash, as I alluded to in this comment: #12829 (comment) That can be done separately from this issue, so I'll go ahead and close this.
Many operations/algorithms in Apache Spark use hashing for computation. Most of them (if not all) rely on the `XXHash64` hasher, such as Bloom filter, `approx_count_distinct`, and more. We want to port these operations to run on the GPU, but currently we don't yet have a GPU version of `XXHash64` implemented in cudf. We could use other alternative (GPU) hashers to implement these operations instead; however, if we want to match the output from Spark CPU, then we need `XXHash64` support in cudf. Sometimes it is important to match the output of Spark CPU exactly, so that the output from cudf can be fed into existing Spark CPU functions without data corruption. In addition, having that function allows us to debug and compare results with Spark CPU more easily in some cases, such as when implementing `approx_count_distinct`. Thus, supporting `XXHash64` is necessary.