HashSet[uint64] slow insertion depending on values #11764
In general I have to say that you can always construct a sequence of keys that works very badly for the hash set implementation, no matter what the hashing function does. But this does indeed look strange, as these keys don't look as if they would cause particularly many collisions. Did you measure what part exactly causes this slowdown?

EDIT: I did the benchmark, and this loop is where all the time is wasted: Nim/lib/pure/collections/hashcommon.nim Line 40 in 210988c
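(For context on that loop: Nim's set/table probing at the time was plain linear probing. The following is a simplified sketch of the idea only, with a hypothetical `rawGetSketch` and empty-slot convention, not the verbatim stdlib source:)

```nim
import hashes  # for the Hash type

proc nextTry(h, maxHash: Hash): Hash {.inline.} =
  (h + 1) and maxHash                  # step to the adjacent bucket

proc rawGetSketch[K](data: seq[tuple[hcode: Hash, key: K]],
                     key: K, hc: Hash): int =
  # Sketch only: hcode == 0 marks an empty slot here.
  var h = hc and data.high             # mask the hash down to a bucket index
  while data[h].hcode != 0:            # this is the loop that gets hot
    if data[h].hcode == hc and data[h].key == key:
      return h                         # found the key
    h = nextTry(h, data.high)          # consecutive keys form long filled
  -1                                   # runs, so these walks get very long
```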
I added 3 more experiments, with resizing the hashSet first:

```nim
import sets
import times

when isMainModule:
  var hs1: HashSet[uint64]
  var hs2: HashSet[uint64]
  var hs3: HashSet[uint64]
  var hs4 = initHashSet[uint64](rightSize(100_000))
  var hs5 = initHashSet[uint64](rightSize(200_000))
  var hs6 = initHashSet[uint64](rightSize(1_100_000))

  # insert 0..200k
  var time = cpuTime()
  for i in 0..200_000:
    let k1 = uint64(i)
    hs1.incl(k1)
  echo "time ", (cpuTime() - time)

  # interleave insert 0..100k and 100k..200k
  time = cpuTime()
  for i in 0..100_000:
    let k1 = uint64(i)
    let k2 = uint64(i + 100_000)
    hs2.incl(k1)
    hs2.incl(k2)
  echo "time ", (cpuTime() - time)

  # interleave insert 0..100k and 1.0M..1.1M
  time = cpuTime()
  for i in 0..100_000:
    let k1 = uint64(i)
    let k2 = uint64(i + 1_000_000)
    hs3.incl(k1)
    hs3.incl(k2)
  echo "time ", (cpuTime() - time)

  # interleave insert 0..100k and 1.0M..1.1M,
  # but insert into a hashSet with space for 100k
  time = cpuTime()
  for i in 0..100_000:
    let k1 = uint64(i)
    let k2 = uint64(i + 1_000_000)
    hs4.incl(k1)
    hs4.incl(k2)
  echo "time ", (cpuTime() - time)

  # interleave insert 0..100k and 1.0M..1.1M,
  # but insert into a hashSet with space for 200k
  time = cpuTime()
  for i in 0..100_000:
    let k1 = uint64(i)
    let k2 = uint64(i + 1_000_000)
    hs5.incl(k1)
    hs5.incl(k2)
  echo "time ", (cpuTime() - time)

  # interleave insert 0..100k and 1.0M..1.1M,
  # but insert into a hashSet with space for 1.1M
  time = cpuTime()
  for i in 0..100_000:
    let k1 = uint64(i)
    let k2 = uint64(i + 1_000_000)
    hs6.incl(k1)
    hs6.incl(k2)
  echo "time ", (cpuTime() - time)
```
By resizing the hashSet large enough to cover the largest expected integer, we recover all the performance (loop 6) and then some.

It's been a while since I took my CS algorithms course, but is it possible that the logic which controls when to resize the HashSet as it grows has some problems?
Hi, now that 1.0.0 has come out I re-ran this: https://github.com/cwpearson/nim-issue-11764

I'd expect 1-3 to be roughly the same performance, and 4-6 to be somewhat better than 1-3 due to the pre-allocation. In practice, that is still not what we see.
Looks like you crafted the numbers to produce collisions; that is always a danger when using hashing.
Thanks! It looks like the hash for int is just that int value: Lines 115 to 117 in e9f7455
It seems like the hashset implementation tries the next consecutive bucket when there is a collision: Nim/lib/pure/collections/hashcommon.nim Lines 29 to 30 in e9f7455
The general wisdom for open addressing sets seems to be that the hash function needs to avoid placing consecutive values in consecutive buckets (Wikipedia): "Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent."
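(A quick illustration of why that matters here, assuming the identity int hash shown above; the table size `mask` is an arbitrary choice of mine:)

```nim
import hashes

let mask = 16384 - 1            # pretend table with 2^14 slots
for key in [1_000_000, 1_000_001, 1_000_002, 1_000_003]:
  echo key, " -> bucket ", hash(key) and mask
# Under an identity int hash this prints adjacent buckets 576..579:
# consecutive keys claim consecutive slots, so filled runs snowball
# and every nearby insertion must walk the whole run.
```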
As an aside, I gave IntSet from the Nim lib a try and the performance is much better (case (8) in https://github.com/cwpearson/nim-issue-11764).
But my point is that
is not common. Usually IDs are mostly consecutive. May I ask where your integers come from?
I noticed the same problem with the sets package and with tables. In order not to open a new issue, I decided to comment here. For testing I used the development version 2019-11-14 of Nim. I compiled with gcc, clang and vcc. I created a repository with demo code and a problem report: https://github.com/rockcavera/nim-problem. I believe the issue should be reopened.
I believe the C# HashSet implementation is good and could be used as an example for Nim; I believe this is the source code. Here the highu64.txt file was processed in 0.220885200 seconds, against 88.57006502151489 seconds for my Nim implementation with sets.
I'm already working on it, and Python's way of probing (link) seems to work nicely with the current Nim implementation. The remaining problem: I was not able to make Nim bootstrap anymore with the new probing.
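(For reference, CPython's probing mixes the high bits of the hash into the step instead of scanning adjacent buckets. Below is a rough Nim sketch of that recurrence, with constants from CPython's dictobject.c and a hypothetical `probe` iterator of my own; the actual merged Nim code may differ:)

```nim
const PERTURB_SHIFT = 5

iterator probe(hc: uint64, mask: int): int =
  ## CPython-style probe sequence: i = 5*i + 1 + perturb, where perturb
  ## drains the high hash bits so nearby hashes diverge quickly.
  var perturb = hc
  var i = int(hc and uint64(mask))
  while true:
    yield i
    perturb = perturb shr PERTURB_SHIFT
    i = (5*i + 1 + int(perturb and uint64(mask))) and mask

# usage: for b in probe(myHash, capacity - 1): inspect bucket b, break when done
```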
Well, inlining is crucial for hash functions, so they need to be small. wyhash is not really that small. Swiss tables and friends do make a huge difference, especially for such VMs. The only problem is that all 3 good implementations need C++ so far. In my VMs I was not yet comfortable with that requirement. But it's only a matter of time until someone ports one over to C. Improvements in C++ have been amazing; I used the best one for SMHasher, and it is even thread-safe.
@rurban, the fallback is big, but the have-128-bit-double-product-on-CPU path is not. It's not Rotate-Multiply small, but it's less than 3x more. You may be thinking of the Wang Yi string hash, which is much more ornate through much manual unrolling activity, not my specialized 8-byte int adaptation. Re: Swiss tables, it's all about "compared to what, along what dimension?" Time/memory/etc. I get L1 int-keyed hot-everything lookups of like 1/2 a pipeline depth with vanilla LP, soup-to-nuts whole lookup time. On the giant side, time is dominated by the initial uncached memory load of 65-120 ns, unless you "cook the books" with hot-loop benchmarks. In the middle is a muddle of partial prediction/partial caching. Work prediction/prefetching in general creates an enormous amount of confusion in benchmarkers, even among pros, and is a big conversation best had elsewhere (e.g. over at adix).
(I do realize I was kind of talking about "both" string hashes & int hashes and so no offense, and that conversation going afield is what makes me keep saying to take the topic elsewhere from this |
Unless you know, as you well might!, of an 8-byte integer hash that passes SMHasher distro tests and is smaller/faster than roughly:

```nim
proc hiXorLo(a, b: uint64): uint64 {.inline.} =
  {.emit: "__uint128_t r=a; r*=b; `result` = (r>>64)^r;".}

proc hashWangYi1*(x: int64|uint64|Hash): Hash {.inline.} =
  const P0 = 0xa0761d6478bd642f'u64
  const P1 = 0xe7037ed1a0b428db'u64
  const P5x8 = 0xeb44accab455d165'u64 xor 8'u64
  Hash(hiXorLo(hiXorLo(P0, uint64(x) xor P1), P5x8))
```
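(Where `__uint128_t` is unavailable, the same multiply-and-fold can be done portably. This variant is my own illustration via 32-bit halves, not part of the proposal above:)

```nim
proc hiXorLoPortable(a, b: uint64): uint64 =
  ## Same fold as hiXorLo, without __uint128_t: compute the full 128-bit
  ## product by schoolbook 32-bit halves, then xor high half onto low.
  const M = 0xffffffff'u64
  let
    ll = (a and M) * (b and M)
    lh = (a and M) * (b shr 32)
    hl = (a shr 32) * (b and M)
    hh = (a shr 32) * (b shr 32)
    mid = (ll shr 32) + (lh and M) + (hl and M)          # carry column
    lo  = (mid shl 32) or (ll and M)                     # low 64 bits of a*b
    hi  = hh + (lh shr 32) + (hl shr 32) + (mid shr 32)  # high 64 bits
  hi xor lo
```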
I'd definitely do some double-checking and make sure that hashWangYi1 is a bijection; if it isn't a bijection but has 64-bit input and 64-bit output, then some outputs are not possible. Most testing suites have a hard time detecting gaps (or repeats) in 64-bit output ranges, and it may not be an issue because the 64-bit number space is so vast. I usually estimate functions like hiXorLo as being unable to produce about 1/3 of all possible results, based on testing with a 32-bit version of a similar wyhash-derived function that can be fully explored. If that's the case here, then two hiXorLo calls would not be able to produce roughly 5/9 of all possible outputs, and the remaining 4/9 of possible outputs would be unevenly distributed. This may be acceptable for some usages.

EDIT: It is definitely not a bijection. I ran a test by M.E. O'Neill meant for detecting random number generators that can't produce duplicates, which is a problem for some PRNGs but a necessity for hash functions that should be able to return any hash. After about 45 minutes of testing, it had generated 50916406 64-bit values given unique inputs to hashWangYi1. Of those, there were 8 repeats, which is twice as many as it expected for a fairly-distributed PRNG (it gave a p-value of just over 0.95 when treating it as a PRNG), but as a unary hash it should have none.
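(One way to check a coverage claim like that 1/3 estimate concretely is to shrink the construction to 32 bits, where the whole input space can be enumerated. This sketch is mine, with arbitrary odd 32-bit constants rather than wyhash's, and needs ~512 MiB for the bitmap:)

```nim
import bitops

proc hiXorLo32(a, b: uint32): uint32 =
  ## 32-bit analog of hiXorLo: multiply, fold high half onto low.
  let r = uint64(a) * uint64(b)
  uint32(r shr 32) xor uint32(r and 0xffffffff'u64)

var seen = newSeq[uint8](1 shl 29)       # one bit per possible 32-bit output
for x in 0'u64 .. 0xffffffff'u64:        # exhaustive: takes a few minutes
  let h = hiXorLo32(0x9e3779b9'u32, uint32(x) xor 0x85ebca6b'u32)
  seen[int(h shr 3)] = seen[int(h shr 3)] or (1'u8 shl int(h and 7))

var reachable = 0
for b in seen: reachable += countSetBits(b)
echo "reachable: ", reachable, " of ", 1 shl 32  # a gap near 1/3 would match
```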
@tommyettinger - Solid point. The xor folding definitely makes it non-invertible. { Psychology Replication Crisis-wise, I'm obligated to request a statistically independent second trial to confirm that p=0.95 result. :-) :-) What was your input set? 1..50916406? } A hot spot in the table only 2x overloaded isn't bad if those whole-64-bit repeats don't translate into >10s of times the expected collisions in just the lower bits. It's not Cuckoo hashing, where slightly weak hashes wreak havoc. ;-) But, of course, they could be collide-y. We do and-mask reduction of hash outputs to table addresses. So, we care most about the lower bits. Anyway, I absolutely see how invertibility would add safety if we don't have to give up the SMHasher avalanche-y/high-sensitivity-to-input property or much speed. Do you (or anyone else) know of a comparably small, fast, mixing/sensitive integer hash that is invertible? I have always thought the right answer was to provide a few alternatives with easy switching. So, if you have a better recommendation I am all ears. Thanks!
(And I know about the Thomas Wang one already mentioned in this thread, but that seems like an awful lot of ALU operations, like 15-22 compared to the 5 of that WangYi1, and I don't know that anyone has ever even run the relevant SMHasher 8-byte pattern tests on it or otherwise really vetted it. Someone should do an SMHasher/BigCrush suite specialized for integer hashes. Maybe you!)
Pelle Evensen has an incredibly hard-core routine he evaluates unary hashes with, based on many runs of PractRand (for testing random number generators) with altered input patterns. Anything that passes it is probably more expensive than 5 operations (although I counted more in the wyhash-based hash; some may not run on the ALU: 3 xor, 2 shift, 2 128-bit multiply). There is a nice middle ground with Moremur, which is invertible and only requires 3 xor, 3 shift, 2 64-bit multiply, although they're all sequential. Moremur doesn't pass Evensen's full test battery, but it passes a lot more of it than similar hashes, like MurmurHash3's mixer.

```c
#include <stdint.h>

uint64_t moremur(uint64_t x) {
  x ^= x >> 27;
  x *= 0x3C79AC492BA7B653UL;
  x ^= x >> 33;
  x *= 0x1C69B3F74AC4AE35UL;
  x ^= x >> 27;
  return x;
}
```

One that does pass the full several-day battery of tests is NASAM, which looks like

```c
#include <stdint.h>

/* ror64(a, r) is a 64-bit right rotation of a by r bits
   (definition added here so the snippet compiles stand-alone). */
static inline uint64_t ror64(uint64_t a, int r) {
  return (a >> r) | (a << (64 - r));
}

uint64_t nasam(uint64_t x) {
  // the xor of an odd number of rotations of the same term is invertible.
  // here, one of the rotations is effectively a rotation by 0.
  x ^= ror64(x, 25) ^ ror64(x, 47);
  x *= 0x9E6C63D0676A9A99UL;
  // all xorshifts in the same direction, even with multiple xors and
  // shifts, are invertible.
  x ^= x >> 23 ^ x >> 51;
  x *= 0x9E6D62D06F6A9A9BUL;
  x ^= x >> 23 ^ x >> 51;
  return x;
}
```

Unless you anticipate extremely strange input sets, Moremur probably provides more than enough speed and collision resistance for int hashing.
In this respect, I am aware of the Hash Prospector (by @skeeto), which does a computer search for good hash functions that follow a given sequence of reversible operations. See also the author's article "Prospecting for Hash Functions".
@tommyettinger, hiXorLo is just 1 multiply (with a double-register output) and one xor. In C you "spell register access" with a 64-bit shift and a mask. So, I was counting it as 1 initial xor with P0, then two products and two xors in the machine code. Just reading these other two makes them seem quite a bit more costly, like Thomas Wang's, but maybe more scrambling/uniform. (There are insn latency/dependency questions, too.) As per @peteroupc's concern (thanks also for those links), spending too much time to get safety can undermine overall goals and be a false sense of safety against intelligent attack. True attacks remain possible until you "mix in" information an attacker cannot access (which I try to address in
Also, @tommyettinger, that "only 4/9 of outputs" effect you mention may be "mostly canceled" by the two rounds of hiXorLo, to get 8/9 with avalanche and low bias, no? If SMHasher is not detecting this bias you suspect, that is probably an issue to report. Also, what was your sequence giving p just under .05? (A p-value is conventionally P(by chance|null), not 1 minus that.)
For example/food for thought, this program:

```nim
# Just writes 8 byte binary hashes of 0..<n to stdout for input to PractRand.
import althash, bitops, cligen, cligen/osUt

type HashFun = enum WangYi, MoreMur

proc writeHash(n=99, r=0, fun=WangYi) =
  var h: Hash
  for i in 0 ..< n:
    let x = rotateLeftBits(uint64(i), r)
    case fun
    of WangYi: h = hashWangYi1(x)
    of MoreMur: h = hashMoreMur(x)
    discard stdout.uriteBuffer(cast[cstring](h.addr), Hash.sizeof.Natural)

dispatch(writeHash, help = { "n": "samples", "r": "rotation",
                             "fun": "hash function"})
```

and this command (with PractRand-pre0.95 from sourceforge):
Produce "no anomalies" every time for the first 4 GiNumbers. A good initial indicator, anyway. Just something I ran in 7 minutes. I realize it is not as thorough Pelle Evensen's tests who does like 1000x more than that and then with a bunch of input pattern transforms on top. I am working on reproducing those and tracking down Melissa O'Neill's code (Harvey Mudd's entire CS website was down for me). |
So, using that same program and doing

does detect 7 "anomalies", but no actual "failures" (not even "mildly suspicious"), at ROTATIONs 5, 6, 19, 20, 34, 58, and 60:
Meanwhile, your

```nim
proc hashMoreMur*(x: int64|uint64|Hash): Hash {.inline.} =
  ## This is from https://github.com/tommyettinger
  var x = uint64(x)
  x = x xor (x shr 27)
  x = x * 0x3C79AC492BA7B653'u64
  x = x xor (x shr 33)
  x = x * 0x1C69B3F74AC4AE35'u64
  x = x xor (x shr 27)
  result = Hash(x)
```

and this:

produces 247 warnings and errors, too many to list outright, but a summary is:

I would have to say that WangYi seems both faster, smaller, and stronger than MoreMur from this preliminary evaluation (unless I screwed up translating to Nim. I get
Actually, that
Also, I'm not sure if any of these PractRand tests (and possibly Melissa's) are Bonferroni corrected. I did
With "Melissa O'Neill's tests", I think you may be referring to the Birthday Problem test at https://github.com/imneme/birthday-problem-test (see also: https://www.pcg-random.org/posts/birthday-test.html ; @imneme). |
I'm referring to whatever @tommyettinger was referring to. :-) I suspect it was a false positive for badness. It sounded borderline at worst, and the multiple-comparisons point definitely matters. But thanks for those links, Peter; I'll take a look later! Cheers.
I realize I made a performance comment I should maybe have backed up. So, here is a little table I have for 1e6 numbers on 3 computers. Time is min(5 runs) in ns/hash. Compiled with Nim ...d0942d42a1; gcc-10.0.1 with PGO. Source code is in

(That time actually also counts the bit rotation that is part of the test for output randomness, and whatever loop overheads, if you look at
Is there a reason you're not comparing splitmix? That is quite fast for my tests (which include the hashtable insert+lookup). I'm curious how it performs in yours.
Nope, no good reason. :-) { Well, other than that there are a lot of hashes out there and I had no ambition of ever doing an exhaustive comparison... just finding one quick, well-mixing one that passed a lot of statistical tests. }
Speed-wise, splitmix64 (http://xorshift.di.unimi.it/splitmix64.c) is slower than NASAM on that Skylake/i7-6700k (1.7123 ns = 8 cycles). Not sure I see the point in running stat tests on it since it's the slowest of the bunch. OTOH, you might find some modest 2-cycle gain (in some amortized sense) by moving to hashWangYi1. (EDIT: I got 12 "unusual" PractRand reports for that splitmix64. So, the same as NASAM and more than WangYi1, but none of them are even at the "mildly suspicious" level, except Moremur, which breaks all over the place.)

I say "amortized" because, of course, a hot loop like that is really overlaying as much parallel execution as the CPU can discover/use, which could be quite a lot. It may be that, e.g., it takes 12 cycles to do one Wang Yi hash, but on average there are enough CPU execution units to be doing 2 at a time. I am very aware that almost all CPU/memory-system engineering since the early 1990s has made interpreting microbenchmarks harder. So, I try to use careful language. The best benchmark is always your actual application.
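(For readers following along, the referenced splitmix64 is tiny; a direct Nim transcription of Vigna's C code at that URL looks roughly like this:)

```nim
proc splitmix64(state: var uint64): uint64 =
  ## Vigna's splitmix64: add the "gamma" increment, then mix.
  state += 0x9e3779b97f4a7c15'u64
  var z = state
  z = (z xor (z shr 30)) * 0xbf58476d1ce4e5b9'u64
  z = (z xor (z shr 27)) * 0x94d049bb133111eb'u64
  z xor (z shr 31)
```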
Moremur is meant to be an improved set of constants for the same algo as SplitMix64; it just doesn't include the state increment by 0x9e3779b97f4a7c15 that Vigna's code uses. SplitMix64 has known PractRand failures when that state increment (AKA gamma) is too small (1 and 3 are known problem gammas, as is

The set of numbers I used for the birthday problem test doesn't really matter in a test for the presence of repeated outputs, as long as the inputs are distinct. It does matter that it found any repeats, because it means there are fewer outputs than inputs. I'm trying to find the non-cryptographic hash construction guidelines I've been going by, and I think this was it, from Bob Jenkins: https://burtleburtle.net/bob/hash/evahash.html#Funneling (I'm considering hashWangYi1 a mixing function). I'll have to check on a different computer to see what initial seed the birthday problem test used (it was 32-bit), but it incremented that seed repeatedly by
I cannot generate a single repeat using that same

Also, this follow-up makes your initial report:

hard to understand: 8 repeats being 2x expected, but now you are saying we want zero repeats, which is not at all 8/2 = 4. So, I feel like that test must be "repeats when range-reduced/masked", which is, of course, a different story than the full 64-bit range, and definitely requires repeated trials. And definitely be careful about Melissa's confusing (or enlightened?!?!) tendency to write p-values as their complements (p-value here, 1 - p-value there, etc.).
Also, for a non-repeating sequence of inputs mod 2**anyPower, can't we simply use an odd number increment mod 2**thatPower? So, for 2**64, a set of large
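(The underlying fact: any odd step is coprime to 2^k, so repeated addition mod 2^k visits every residue before repeating. A quick self-contained check at k = 16, with an arbitrarily chosen odd step:)

```nim
# Any odd step generates the full cycle mod 2^16: no repeats until wrap.
var seen = newSeq[bool](65536)
var x = 0'u16
for i in 0 ..< 65536:
  doAssert not seen[int(x)]   # would fail if the sequence repeated early
  seen[int(x)] = true
  x += 0x9E37'u16             # odd, hence coprime to 2^16
echo "all 65536 values visited exactly once"
```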
Ok. I added a couple lines to

(Related to the above - if you have an 8B-int binary external merge sort, it will use 8/17 the RAM/disk space/bandwidth, but coreutils sort does not do this, AFAIK, and you want "." in the

So, on average for 2**32 entries, (2+5+0)/3 = 2.33 slots with a collision, when I believe one expects k**2/(2N) = 1/2 for the full 64-bit range (worked out just below). So, about 4x too many in this very limited measurement (which took 3 hours, but I think is about a 4-sigma result to @tommyettinger's 2-sigma result). So, this slight elevation of 2-5x is likely real. Maybe that 4/9 gets squared (from nested application) and it's 81/16 =~ 5x. Anyhow, I don't think it's disqualifying unless it's the tip of a very nasty iceberg, which seems unlikely given how it passes SMHasher and PractRand (and how many hours it takes to even measure the defect reliably). If we were targeting Cuckoo hashing with 32-bit hash codes then it might be catastrophic, but it should be a fine default for linear probing/Robin Hood LP, which are robust to collisions in hash codes. It still seems better in a vague/omnibus sense than Tommy's MoreMur recommendation, and I'm very sympathetic (before he ever mentioned it) to @peteroupc's concern about unnecessary randomness optimization. If there weren't 64-bit CPUs where NASAM was 4x slower then that could be the default, but as-is, it makes the most sense (to me) to provide NASAM as a fallback and a few less-random-on-purpose "fall forwards" like identity & RoMu1. (As mentioned, I'm not personally against RoMu1 or identity as defaults if there is an auto-fallback mechanism, as I am working on in

Thanks for everyone's participation/help on this, btw, especially @tommyettinger (whom I hope to have saved a little time/work by following up like this). Open source is too often thankless.
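(The expectation referenced above is the usual birthday bound; writing out the arithmetic for k = 2^32 distinct inputs into N = 2^64 possible outputs:)

```latex
E[\text{collisions}] \approx \binom{k}{2}\frac{1}{N}
  \approx \frac{k^2}{2N}
  = \frac{(2^{32})^2}{2\cdot 2^{64}}
  = \frac{2^{64}}{2^{65}}
  = \frac{1}{2}
```

so a measured average of 2.33 collisions is the "about 4x too many" figure quoted above.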
My concern was less about "over-optimization" of hash functions and more about using alternative approaches to the all-too-fashionable SipHash and other keyed hash functions to mitigate security attacks, as long as performance is maintained by using one or more of these approaches. In this sense, I favor the OpenJDK strategy of switching a hash bucket to a binary tree (which has logarithmic time complexity) when it detects a high number of collisions in that bucket.
Well, we agree 100%, and adix is all about such auto/staged mitigations. I should stop trying to characterize the opinions of others; sorry. It's often hard to voice support without some pithy reference. :-)
Since #13823 is merged, this can be closed.
The rate of inserting uint64s into a hash set varies wildly with the order and the range of the integers inserted.

Example
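(The example program's source was lost in this view; judging from the loop descriptions under "Expected Output" and the extended benchmark earlier in the thread, it was presumably equivalent to:)

```nim
# Presumed reconstruction of slow_set.nim (matches loops 1-3 of the
# expanded benchmark in the comments above).
import sets, times

var hs1, hs2, hs3: HashSet[uint64]

var time = cpuTime()
for i in 0..200_000:          # loop 1: sequential 0..200k
  hs1.incl(uint64(i))
echo "time ", (cpuTime() - time)

time = cpuTime()
for i in 0..100_000:          # loop 2: interleave 0..100k with 100k..200k
  hs2.incl(uint64(i))
  hs2.incl(uint64(i + 100_000))
echo "time ", (cpuTime() - time)

time = cpuTime()
for i in 0..100_000:          # loop 3: interleave 0..100k with 1.0M..1.1M
  hs3.incl(uint64(i))
  hs3.incl(uint64(i + 1_000_000))
echo "time ", (cpuTime() - time)
```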
Current Output

Compiled with `nim c -d:release -r slow_set.nim`.

Expected Output
I would expect the three different insertion loops to take roughly the same amount of time. They are all inserting the same number of unique values. In the second loop, I just interleave the insertion order, and in the third loop, I insert some larger numbers.
Possible Solution
Additional Information
Issue #10097 talks about integer hashing and collisions, but all of my uint64s are well below the max uint32.

This also happens with hash table keys, presumably for similar reasons?

In my actual project, I am deduplicating an edge list for a graph, where the source and destination vertices for an edge are sort of similar to loop 3, so the performance is not good.