Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Allow more than 2^32 entries #16

Closed
bigerl opened this issue Aug 1, 2022 · 6 comments · Fixed by #17
Closed

RFC: Allow more than 2^32 entries #16

bigerl opened this issue Aug 1, 2022 · 6 comments · Fixed by #17
Assignees
Labels
enhancement New feature or request

Comments

@bigerl
Copy link

bigerl commented Aug 1, 2022

Thank you for sharing this hash-map implementation. It's a joy to read through the code.

Is your feature request related to a problem? Please describe.
Currently, because of Bucket.value_idx being a uint32_t the map/set can store at max 2^32 entries. That is quite tight. Many applications will hit that limit. In java, where collections have a similar limitation, I run into it regularly.

Describe the solution you'd like
The readme says that only 1 Byte + 3 Bits of Bucket.dist_and_fingerprintare payload. If the remaining 21 Bits would be used to extend Bucket.value_idx, much more entries could be stored (up to 2^(32+21) ~ 64 PB of uint64_t). I would suggest something like:

struct __attribute__((__packed__)) Bucket {
			uint8_t dist:3;
			uint8_t fingerprint:8;
			uint64_t value_idx:53;
		};

Describe alternatives you've considered
Instead of non-standard attribute packing, standard bit masks and shifts can be used.

@bigerl bigerl added the enhancement New feature or request label Aug 1, 2022
@martinus
Copy link
Owner

martinus commented Aug 1, 2022

Is 2^32 entries really not enough? For uint64_t key & value plus bucket overhead filling up all 2^32 entries will require at least 24*2^32 ~ 100 gigabyte of RAM. Is that really too small for your application?

@bigerl
Copy link
Author

bigerl commented Aug 1, 2022

I agree that it is enough for most applications.

An example for a problematic use-case would be storing an index for nodes in Wikidata. Wikidata contains more than 2^32 distinct nodes. So, storing a mapping from uint64_t ID to some RDF node object wouldn't work.

@Philippe91
Copy link

We can't win on all counts. Extreme cases should not decrease the performance of the common case.

@martinus
Copy link
Owner

martinus commented Aug 1, 2022

We can't win on all counts. Extreme cases should not decrease the performance of the common case.

Right, but I still think it would be nice to at least optionally allow such extreme use cases.

@bigerl by the way, only allowing 3 bits for dist is certainly not enough. If dist gets an overflow the behavior is undefined. That's why it's as large as 24bit.

martinus added a commit that referenced this issue Aug 1, 2022
@martinus
Copy link
Owner

martinus commented Aug 1, 2022

I have a version that should work for you: https://github.com/martinus/unordered_dense/blob/2022-08-custom-bucket-types/include/ankerl/unordered_dense.h

the bucket's type can now be customized. E.g. this is how you can use the big bucket type:

using MapBig = ankerl::unordered_dense::map<std::string,
                                            size_t,
                                            ankerl::unordered_dense::hash<std::string>,
                                            std::equal_to<std::string>,
                                            std::allocator<std::pair<std::string, size_t>>,
                                            ankerl::unordered_dense::bucket_type::big>;

Bucket size will be 12 byte, and max_size is 2^64-1. Would this work for you?

@bigerl
Copy link
Author

bigerl commented Aug 2, 2022

That works. Thanks for the fast solution. I like the idea of making it configurable.

Just a final note: I would expect the big variant to perform better with clang than with gcc. I had in the past several cases where clang handled structs with sizes that are not multitudes of a machine word better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants