Was swap-remove behavior ever considered when removing entries? #503
Comments
Comment from joaquintides, general hash map guru, and more recently one of the authors of boost's new hash table:
So, as expected, my idea isn't original, though it could potentially be beneficial.
I'm hesitant to add this because it adds an extra cost when removing an element, and the benefit is not obvious. I would be more inclined to accept this if you have a benchmark that shows a clear performance benefit.
I submitted #507 as an attempt to measure the performance loss from leaving holes behind. I can't be certain that the benchmark is doing the right thing. In particular, I attempted to construct benchmarks which place elements precisely where I'd want them for maximum effect by exploiting the particulars of the implementation, and it's a bit complicated to judge whether I was successful. The benchmark results on my machine (11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz) are as follows:
Where:
Due to the structure of the benchmark, there are simply more elements when the element size is smaller, in order to reach 16 times the L2 cache size, so benchmarks cannot be compared across element sizes. Within an element size, however, the trend is fairly clear:
I would appreciate a review of the strategy and implementation, if you have time. Unless I made a mistake -- which is not all that unlikely -- there doesn't seem to be any reason to try to improve the locality of elements within a group any further, however, and thus this issue is moot.
I don't expect cache locality to have much of an impact in practice because the access pattern of hash table lookup is effectively random due to the hash function. Additionally, due to the way we scan control bytes with SIMD instructions, we will immediately "seek" to the first entry with a matching H2 hash (matching top 7 bits of the hash value) without looking at other elements. This means that unless we get a false positive on the H2 value (<1% chance per lookup) then the layout of elements within a group is irrelevant because we are only ever accessing a single element in that group. The only case where this could conceivably make a difference might be iterating over all the elements of a hash table, but this is a relatively rare use case that we shouldn't be optimizing for: removing an entry is a much more common operation than iteration.
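To make the H2 point concrete, here is a minimal sketch of a SwissTable-style lookup split of the hash into H1 (which group to probe) and H2 (top 7 bits compared against the control bytes). Names and the exact shift are illustrative, not hashbrown's actual internals:

```rust
// Minimal sketch (illustrative, not hashbrown's actual code) of the
// control-byte scan described above: H1 picks the group to probe, H2 is the
// top 7 bits stored in each control byte, and a lookup compares all 16
// control bytes of a group against H2 before touching any key/value slot.

const GROUP_WIDTH: usize = 16;

fn h1(hash: u64, bucket_mask: usize) -> usize {
    (hash as usize) & bucket_mask // which slot/group to start probing at
}

fn h2(hash: u64) -> u8 {
    (hash >> 57) as u8 // top 7 bits of the hash, stored in the control byte
}

/// Indices within one group whose control byte matches `h2`.
/// A real implementation does this with a single SIMD compare + movemask.
fn match_h2(group_ctrl: &[u8; GROUP_WIDTH], h2: u8) -> Vec<usize> {
    group_ctrl
        .iter()
        .enumerate()
        .filter(|&(_, &ctrl)| ctrl == h2)
        .map(|(i, _)| i)
        .collect()
}
```

Unless two keys in the same group happen to share the same 7-bit H2 (the rare false-positive case), `match_h2` yields at most the one slot actually holding the key, which is why the placement of elements within a group does not matter for lookups.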
It may actually be worth looking into the design of boost's new hash table to see if it's worth adopting it instead of SwissTable for hashbrown. They claim that it's faster in benchmarks.
I don't think iteration should be prioritized either.
I disagree with the assessment. In essence, we can think of hash-table access as randomly accessing elements of an array. Let's imagine two access patterns:
Both access patterns access the same number of elements, but the first touches half as many cache lines as the second (for example, with 64-byte cache lines and 32-byte entries, eight contiguous entries occupy four cache lines, while the same eight entries interleaved with empty slots span eight). If the accesses are infrequent, then it shouldn't matter indeed: whatever cache line was cached is most likely long gone. If accesses are very frequent, I would expect the cache lines to stay cached, unless there are too many of them for the cache, and thus I'd expect an access pattern which minimizes the number of touched cache lines to also minimize the number of cache misses. Benchmarks seem to disprove the effect -- so either I'm bad at intuiting, or I'm bad at designing benchmarks.
Despite being random, the accesses are clustered in time, which should lead to cache hits for a certain proportion of elements.
The effect of H2 is quite different depending on the scenario. In particular, the scenario benchmarked in the above PR was for known-present elements, with distinct H2 values within a group, in which case the only benefit of H2 is in helping locate the exact offset of the element. No benefit from "skipping" elements -- as seen when looking up absent elements, or "colliding" present elements -- should be expected in such a case.
There are two key differences, from what I can tell, which I would guess are independent from one another. See slide 24 (Mechanics):
I have no idea which of the two differences could have the bigger impact. With that said, the first is easy to try, the second, not so much. The logic for the first can be found at https://github.com/boostorg/unordered/blob/develop/include/boost/unordered/detail/foa/core.hpp#L335-L390 for SSE2, and a bit further down for Neon. In short:
The 8 seems fairly arbitrary, though I do note that choosing a multiple of 8 for the overflow bit-marker preserves the "group" the residual would fall in, so that all 256 values are split evenly across the 8 bits of the overflow marker.
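For context, a minimal sketch of the overflow-marker idea, assuming the scheme is roughly "when an insertion overflows past its home group, set bit hash % 8 of that group's extra byte"; the names and details are illustrative rather than lifted from boost::unordered:

```rust
// Illustrative sketch of an overflow marker: each group carries one extra
// byte with one bit per hash class (hash % 8). The bit is set when an
// element of that class overflows past this group during insertion; an
// unsuccessful lookup can stop probing at the first group whose bit for
// this hash is clear.

struct GroupMeta {
    overflow: u8,
}

impl GroupMeta {
    fn mark_overflow(&mut self, hash: u64) {
        self.overflow |= 1u8 << (hash % 8);
    }

    /// If this returns false, no element with this hash ever probed past
    /// this group, so a lookup for an absent key can stop here.
    fn may_have_overflowed(&self, hash: u64) -> bool {
        (self.overflow & (1u8 << (hash % 8))) != 0
    }
}
```

This is where the benefit for absent-key lookups would come from: probe sequences for missing elements terminate early without needing tombstones.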
boost's flat hash map is quite a bit faster than Swiss Table in more than 10 different third-party benchmarks.
@ktprime Yes, unfortunately that's not that helpful. Firstly, hashbrown is inspired by Swiss Table, but it is not an exact port as far as I can tell. This means that while at a high level it roughly follows Swiss Table, in a benchmark its performance is not necessarily quite the same, so benchmarks run against the C++ implementation of Swiss Table are not an exact match.

Still, I've been investigating the most obvious differences (254-value residuals and overflow tracking) in an attempt to check whether they could improve performance. Overall, though, the patches are not that great performance-wise. That's not that unexpected, as both only offer a performance boost in relatively rare situations, while having a potential cost for every element. Unfortunately, my rough attempts did not show any great improvement: performance is roughly on par, often with more regressions than improvements. This suggests that while in theory they could improve performance, in practice they may require significant micro-optimization effort, or larger-scale integration changes, for those improvements to kick in. It's hard to guess which path would be more fruitful, and costly (time-wise) to investigate them all. Still, the PRs are out there for anyone who wishes to investigate them further:
I agree with you. I have also ported a Swiss table implementation, which has slightly slower performance compared to the Boost version. Most benchmarks consist of simple loops that focus on testing find (hit and miss) and insertion operations. However, in real-world code the scenario is different, especially when metadata is frequently accessed and needs to be kept hot in the cache. In such cases, I prefer to use a flat hash map where the key-value pair and the metadata are located in the same slot, even if its performance may not be as good as the benchmark results suggest. Based on my benchmark results, if the data fully fits in the CPU cache, or if the metadata cannot fully reside in the cache (e.g., when the size of the hash map is less than 10,000 or greater than 500,000,000), a SIMD-based hash map does not have any advantage over other hash map designs.
Motivation
One of the scenarios in which Swiss Tables tend NOT to fare too well involves deletions.
A Swiss Table in which entries are only inserted -- a fairly common use case, for example dumping all the HTTP parameters or headers of an incoming request -- will have contiguous entries in each group of 16. This leads to good cache-line density, and helps in reducing cache misses.
On the other hand, once an entry is deleted, a hole is left in the run of entries of the group it was deleted from -- unless it was last, of course -- and in the presence of a good hash, further insertions into the table will likely end up in different groups for a while.
This leads to tables in which a significant number of deletions have been performed looking like... Swiss Cheese. Appropriate, I suppose, but not cache-friendly.
Backward Shifting Deletion
Prior to hashbrown, the standard hash map in Rust used Robin Hood hashing with Backward Shifting Deletion: upon deletion, any entry immediately following the deleted entry which had been displaced from its preferred slot was (recursively) shifted back into the freed slot. This helped avoid tombstones in the probe sequence, and led to good cache density.
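As a point of reference, a minimal sketch of backward-shift deletion on a plain linear-probing table (illustrative only; the pre-hashbrown std map combined this with Robin Hood probing and was more involved):

```rust
// Illustrative sketch of backward-shift deletion on a simple linear-probing
// table: after emptying slot `i`, displaced entries that follow are shifted
// back one position until an empty slot, or an entry already sitting in its
// preferred (home) slot, is reached, so no tombstone is left behind.

fn backward_shift_delete<K, V>(
    slots: &mut [Option<(K, V)>],
    home: impl Fn(&K) -> usize, // preferred slot of a key, e.g. hash % capacity
    mut i: usize,
) {
    slots[i] = None;
    loop {
        let next = (i + 1) % slots.len();
        // Stop at an empty slot, or at an entry that is already home.
        let stop = match &slots[next] {
            None => true,
            Some((k, _)) => home(k) == next,
        };
        if stop {
            break;
        }
        // Otherwise shift the displaced entry back into the hole.
        let moved = slots[next].take();
        slots[i] = moved;
        i = next;
    }
}
```

The shifting loop at the end is exactly the non-constant cost discussed just below.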
The cost of shifting all those entries back is, however, none too good. Deletion is not really O(1) any longer when an unspecified number of elements may need to be shifted back.
So I would not recommend backward shifting deletion...
Swap Remove
But what about swap_remove?

The idea would be that if the entry deleted from a group is not the last entry of the group, then that last entry is swapped into the now-available spot. Just like Vec::swap_remove, this is O(1), while preserving the contiguity of the entries in the group.

This is a slight additional cost for deletion, in exchange for better cache density for the entire period following the deletion, up until a new element is inserted that would have plugged the hole.
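A minimal sketch of what this could look like at the group level, assuming (as SwissTable lookups allow) that an entry may be moved to any slot within its group without affecting lookups; the names, the EMPTY sentinel, and the layout are illustrative, and the DELETED/tombstone state is ignored for simplicity:

```rust
// Illustrative sketch of the proposed group-level swap-remove: when the
// deleted entry is not the last occupied slot of its group, the last
// occupied entry (and its control byte) is moved into the hole, so the
// group's occupied slots stay contiguous. This relies on lookups comparing
// all of a group's control bytes at once, making within-group position free.

const GROUP_WIDTH: usize = 16;
const EMPTY: u8 = 0xFF; // stand-in for an "empty" control byte

fn swap_remove_in_group<T>(
    ctrl: &mut [u8; GROUP_WIDTH],          // control bytes of one group
    slots: &mut [Option<T>; GROUP_WIDTH],  // entries of the same group
    i: usize,                              // slot being deleted
) {
    ctrl[i] = EMPTY;
    slots[i] = None;

    // Find the last occupied slot in the group.
    let last = (0..GROUP_WIDTH).rev().find(|&j| ctrl[j] != EMPTY);
    if let Some(last) = last {
        if last > i {
            // Move the last entry and its control byte into the hole.
            ctrl[i] = ctrl[last];
            ctrl[last] = EMPTY;
            let moved = slots[last].take();
            slots[i] = moved;
        }
    }
}
```

The extra cost on deletion is the scan for the last occupied slot plus one move of a key/value pair, which is the trade-off discussed above.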
Is it worth it?
I don't know! I've never heard of it being used.
Before I try to implement the behavior, however, I thought I'd ask if:
I am quite keen on hearing folks' thoughts on the matter.