fix #13393 better hash for primitive types, avoiding catastrophic (1000x) slowdowns for certain input distributions #13410

timotheecour · 2020-02-13T12:25:25Z

revives [superseded] Different Hashing to avoid Collisions. #11767, same idea but further improved
fixes Slow insertion of uint in Tables and Sets Collections depending on the value #13393

using the tests provided by @rockcavera in https://github.com/rockcavera/nim-problem:

nim c -r -d:danger nim-problem/codesetsu64.nim

now completes in 0.075 seconds instead of 75.4 seconds prior to this PR, a speedup of 1000X for this example, which is exactly consistent with what I had already observed in #11767 on my own benchmark, noting a 100 to 1000 speedup (https://github.com/timotheecour/vitanim/blob/master/testcases/tests/t0129b.nim)

before PR

before fix #11764, faster hashing of (u)int #12407, the hash for primitive types was hash(x)=x bitand 2^n-1 which is a terrible hash resulting in lots of (trivial) collisions, ignoring all high order bits.
after fix #11764, faster hashing of (u)int #12407 (which aimed at addressing HashSet[uint64] slow insertion depending on values #11764) the hash was cast[Hash](cast[uint](x) * 11) which is still not a good hash, resulting in lots of collisions depending on input distributions, as demonstrated by Slow insertion of uint in Tables and Sets Collections depending on the value #13393; for example when input distributions only vary in high order bits, the hash function collides to a single point
floats were problematic because of formula used (1.0 + x), causing all values to collide if x was small enough (x+1.0 would collapse to 1.0 for x small enough due to FP semantics)

after PR

all primitive types (including pointer etc) with sizeof(x) in {4,8} pass through either a hashUInt64 or hashUInt32 specialized hash that has good hashing properties (cascading effect), as commonly prescribed in articles discussing hashing functions
sizeof(x)<4 was faster using the preexisting identity hash so the fix checks for that sizeof(x)>=4 criterion; indeed in this case the number of unique values is limited so collisions are less of an issue.
float is now properly handled; in particular denormalization is checked so that hash(0.0)==hash(-0.0), and also removes the previous hack based on x+1.0 which caused collisions
performance regression test is added, testing various input distributions, and for nim c, cpp, js, as well as VM for those modes.
It's a tricky problem to get it right: it has to work on 32 bit, 64 bit, with js, c, cpp, with VM during defined(js) (where sizeof(int)=4 even on 64bit), etc; and it has be be performant in all cases.

lib/pure/hashes.nim

Clyybber · 2020-02-13T12:46:15Z

@timotheecour Does this also work for the JS backend?

timotheecour

(can't delete this comment, was a duplicate of #13410 (comment))

timotheecour · 2020-02-13T13:21:01Z

@timotheecour Does this also work for the JS backend?

yes the js backend also improves hashes for primitive types, with improved randomness:

import hashes, strformat

template hash2(x) =
  block:
    let s {.inject.} = astToStr(x)
    let j {.inject.} = hash(x)
    echo &"""{s:>8} => {j}"""

hash2(0.3)
hash2("foo")
hash2(123)
hash2(124)
hash2(125)

import os,strformat
proc main() =
  let file = currentSourcePath / "../t10210.nim"
  for nim in @["/Users/timothee/git_clone//nim//Nim_devel//bin/nim", "/Users/timothee/git_clone//nim//Nim_prs//bin/nim.pr_fix_11764_hash_revived"]:
    for mode in @["c", "js"]:
      let cmd = fmt"{nim} {mode} -r --hints:off {file}"
      echo cmd
      doAssert execShellCmd(cmd) == 0
main()

prints:

/Users/timothee/git_clone//nim//Nim_devel//bin/nim c -r --hints:off /Users/timothee/git_clone/nim/timn/tests/nim/all/t10210.nim
     0.3 => 4608533498688228557
   "foo" => 4138058784
     123 => 1353
     124 => 1364
     125 => 1375
/Users/timothee/git_clone//nim//Nim_devel//bin/nim js -r --hints:off /Users/timothee/git_clone/nim/timn/tests/nim/all/t10210.nim
     0.3 => 1.3
   "foo" => -156908512
     123 => 1353
     124 => 1364
     125 => 1375
/Users/timothee/git_clone//nim//Nim_prs//bin/nim.pr_fix_11764_hash_revived c -r --hints:off /Users/timothee/git_clone/nim/timn/tests/nim/all/t10210.nim
     0.3 => -7272369355435918195
   "foo" => 4138058784
     123 => -508526132817443961
     124 => -1530243182942418667
     125 => -6335688243927128111
/Users/timothee/git_clone//nim//Nim_prs//bin/nim.pr_fix_11764_hash_revived js -r --hints:off /Users/timothee/git_clone/nim/timn/tests/nim/all/t10210.nim
     0.3 => 743435112
   "foo" => -156908512
     123 => -1632341525
     124 => -773518108
     125 => -163873810

note that hashes for nim js are not guaranteed to be same as hashes for nim c, but hashes for vm (whether running in nim js or nim c) are same as hashes for nim c; this behavior is pre-existing to this PR and is really hard/impossible to fix until nim-lang/RFCs#187 is addressed (if anything, because js uses a float type for numbers and can't represent int64)

lib/pure/hashes.nim

lib/pure/collections/sets.nim

timotheecour · 2020-02-14T05:18:36Z

PTAL:

added performance regression test that takes into account different input distributions, and on different targets (c, cpp, vm, js)
improved handling of float and simplified code a bit
introduced hashUInt64, hashUInt32 which is justified by thashes_perf.nim

lib/pure/hashes.nim

testament/important_packages.nim

tests/vm/tcompiletimetable.nim

…gression test

Araq · 2020-02-14T10:39:14Z

Yet there will be other distributions out there that are now much worse than before. The real solution is to add some logic to tables.nim to compensate for the case of very many hash collisions. That's what Python does and Python keeps its hash function for integers really simple.

timotheecour · 2020-02-14T11:48:20Z

Yet there will be other distributions out there that are now much worse than before

seems like a purely theoretical argument; while you can in theory find a (contrived, adversarial) distribution that would be made worse by a given hash function, in practice this shouldn't happen, at least much less often than with current hash (but please prove me wrong with a realistic example that causes lots of collisions with new scheme)

Good hash properties are commonly agreed upon, see for eg https://www.sparknotes.com/cs/searching/hashtables/section2/ and https://gist.github.com/badboy/6267743. Of particular importance is avalanche property:

Avalanche means that a single bit of difference in the input will make about 1/2 of the output bits be different

which the previous hashing scheme we had (bottom_n_bits(x * 11)) did not have. It's very easy to find random input distributions that would cause many (or even only) collisions: for example, inputs where only high order bits differ (eg proc toHighOrderBits(a: int): uint64 = cast[uint64](a) shl 32), see perf test I've included in this PR.

Another one was small floats, which before this PR would all collapse. These cases do occur in practice, which prompted #13393 and #11764 in the first place.

The real solution is to add some logic to tables.nim to compensate for the case of very many hash collisions. That's what Python does and Python keeps its hash function for integers really simple.

we can definitely improve logic for handling collisions, that's not mutually exclusive. But at least with this PR we can avoid those 100x-1000X catastrophic slowdowns from that bug report.

Araq · 2020-02-14T12:00:46Z

while you can in theory find a (contrived, adversarial) distribution

Well you said the relevant word here, adversarial. This is a problem with hash tables in a stdlib, you don't know what it will be used for but inevitably it will end up for critical software and we need to prevent hash collision attacks. Just coming up with a "better hash" function isn't gonna cut it.

we can definitely improve logic for handling collisions, that's not mutually exclusive

Yet Python does not need these complex hash functions for integers because its other mechanisms work out.

Look, I agree that you made collisions less likely for many realworld distributions. But you also made the hashing of integers slower and the bad security is hardly improved.

timotheecour · 2020-02-14T12:20:51Z

Look, I agree that you made collisions less likely for many realworld distributions. But you also made the hashing of integers slower.

actually that's incorrect, only slower for 64 bit integers:

for x with sizeof(T)=8 (eg int64), the hashing computation is now 2.2X slower; in practice I doubt the hash computation would be a bottleneck in a real application, where something needs to be done with those keys anyways; the 100x-1000x however is very likely to make a difference.
for x with sizeof(T)=4 (eg int32), the hash computation is now 1.24X faster (you can verify by adapting thashes_perf.nim); I obtained this by carefully distinguishing between hashUInt64 vs hashUInt32. Thus, that case should be a no-brainer.
for x with sizeof(T)<4, no difference

so for performance minded applications, either you have < 4 billions keys and you can benefit from faster hashing via int32/uint32 (and much less chances of catastrophic collisions) [simplifying a bit the argument but you get the idea], or you have > 4 billion keys and need int64, and in that case yes, pay a price for slower key computations but other factors like memory issues (or collisions) would likely be much more important

Well you said the relevant word here, adversarial.

plenty of applications don't have to deal with adversarial input. And even for the ones that do, there are options to deal with those, eg adding a runtime random seed in hash function, as well as better strategy for dealing with collisions in tables implementation. But reducing chances of collisions in the 1st place should improve performance regardless.

Araq · 2020-02-19T18:29:01Z

Your alternative PR was merged, closing.

* Unwind just the "pseudorandom probing" (whole hash-code-keyed variable stride double hashing) part of recent sets & tables changes (which has still been causing bugs over a month later (e.g., two days ago #13794) as well as still having several "figure this out" implementation question comments in them (see just diffs of this PR). This topic has been discussed in many places: #13393 #13418 #13440 #13794 Alternative/non-mandatory stronger integer hashes (or vice-versa opt-in identity hashes) are a better solution that is more general (no illusion of one hard-coded sequence solving all problems) while retaining the virtues of linear probing such as cache obliviousness and age-less tables under delete-heavy workloads (still untested after a month of this change). The only real solution for truly adversarial keys is a hash keyed off of data unobservable to attackers. That all fits better with a few families of user-pluggable/define-switchable hashes which can be provided in a separate PR more about `hashes.nim`. This PR carefully preserves the better (but still hard coded!) probing of the `intsets` and other recent fixes like `move` annotations, hash order invariant tests, `intsets.missingOrExcl` fixing, and the move of `rightSize` into `hashcommon.nim`. * Fix `data.len` -> `dataLen` problem. * This is an alternate resolution to #13393 (which arguably could be resolved outside the stdlib). Add version1 of Wang Yi's hash specialized to 8 byte integers. This gives simple help to users having trouble with overly colliding hash(key)s. I.e., A) `import hashes; proc hash(x: myInt): Hash = hashWangYi1(int(x))` in the instantiation context of a `HashSet` or `Table` or B) more globally, compile with `nim c -d:hashWangYi1`. No hash can be all things to all use cases, but this one is A) vetted to scramble well by the SMHasher test suite (a necessarily limited but far more thorough test than prior proposals here), B) only a few ALU ops on many common CPUs, and C) possesses an easy via "grade school multi-digit multiplication" fall back for weaker deployment contexts. Some people might want to stampede ahead unbridled, but my view is that a good plan is to A) include this in the stdlib for a release or three to let people try it on various key sets nim-core could realistically never access/test (maybe mentioning it in the changelog so people actually try it out), B) have them report problems (if any), C) if all seems good, make the stdlib more novice friendly by adding `hashIdentity(x)=x` and changing the default `hash() = hashWangYi1` with some `when defined` rearranging so users can `-d:hashIdentity` if they want the old behavior back. This plan is compatible with any number of competing integer hashes if people want to add them. I would strongly recommend they all *at least* pass the SMHasher suite since the idea here is to become more friendly to novices who do not generally understand hashing failure modes. * Re-organize to work around `when nimvm` limitations; Add some tests; Add a changelog.md entry. * Add less than 64-bit CPU when fork. * Fix decl instead of call typo. * First attempt at fixing range error on 32-bit platforms; Still do the arithmetic in doubled up 64-bit, but truncate the hash to the lower 32-bits, but then still return `uint64` to be the same. So, type correct but truncated hash value. Update `thashes.nim` as well. * A second try at making 32-bit mode CI work. * Use a more systematic identifier convention than Wang Yi's code. * Fix test that was wrong for as long as `toHashSet` used `rightSize` (a very long time, I think). `$a`/`$b` depend on iteration order which varies with table range reduced hash order which varies with range for some `hash()`. With 3 elements, 3!=6 is small and we've just gotten lucky with past experimental `hash()` changes. An alternate fix here would be to not stringify but use the HashSet operators, but it is not clear that doesn't alter the "spirit" of the test. * Fix another stringified test depending upon hash order. * Oops - revert the string-keyed test. * Fix another stringify test depending on hash order. * Add a better than always zero `defined(js)` branch. * It turns out to be easy to just work all in `BigInt` inside JS and thus guarantee the same low order bits of output hashes (for `isSafeInteger` input numbers). Since `hashWangYi1` output bits are equally random in all their bits, this means that tables will be safely scrambled for table sizes up to 2**32 or 4 gigaentries which is probably fine, as long as the integer keys are all < 2**53 (also likely fine). (I'm unsure why the infidelity with C/C++ back ends cut off is 32, not 53 bits.) Since HashSet & Table only use the low order bits, a quick corollary of this is that `$` on most int-keyed sets/tables will be the same in all the various back ends which seems a nice-to-have trait. * These string hash tests fail for me locally. Maybe this is what causes the CI hang for testament pcat collections? * Oops. That failure was from me manually patching string hash in hashes. Revert. * Import more test improvements from #13410 * Fix bug where I swapped order when reverting the test. Ack. * Oh, just accept either order like more and more hash tests. * Iterate in the same order. * `return` inside `emit` made us skip `popFrame` causing weird troubles. * Oops - do Windows branch also. * `nimV1hash` -> multiply-mnemonic, type-scoped `nimIntHash1` (mnemonic resolutions are "1 == identity", 1 for Nim Version 1, 1 for first/simplest/fastest in a series of possibilities. Should be very easy to remember.) * Re-organize `when nimvm` logic to be a strict `when`-`else`. * Merge other changes. * Lift constants to a common area. * Fall back to identity hash when `BigInt` is unavailable. * Increase timeout slightly (probably just real-time perturbation of CI system performance).

timotheecour force-pushed the pr_fix_11764_hash_revived branch from e618c78 to e937b25 Compare February 13, 2020 12:26

Clyybber reviewed Feb 13, 2020

View reviewed changes

lib/pure/hashes.nim Outdated Show resolved Hide resolved

timotheecour commented Feb 13, 2020

View reviewed changes

timotheecour changed the title ~~fix #13393 better hash for primitive types~~ fix #13393 better hash for primitive types, resulting in large speedups Feb 13, 2020

Clyybber reviewed Feb 13, 2020

View reviewed changes

lib/pure/hashes.nim Show resolved Hide resolved

Clyybber reviewed Feb 13, 2020

View reviewed changes

lib/pure/collections/sets.nim Show resolved Hide resolved

timotheecour force-pushed the pr_fix_11764_hash_revived branch from 90f5b71 to 92f4f76 Compare February 13, 2020 14:12

timotheecour added a commit to timotheecour/itertools that referenced this pull request Feb 13, 2020

fix for new hashes, refs nim-lang/Nim#13410

c4ea1ea

timotheecour mentioned this pull request Feb 13, 2020

fix for new hashes, refs https://github.com/nim-lang/Nim/pull/13410 narimiran/itertools#9

Merged

narimiran pushed a commit to narimiran/itertools that referenced this pull request Feb 13, 2020

fix for new hashes, refs nim-lang/Nim#13410 (#9)

e41453c

timotheecour force-pushed the pr_fix_11764_hash_revived branch 2 times, most recently from 9ea79a5 to 4cec93a Compare February 14, 2020 02:09

timotheecour marked this pull request as ready for review February 14, 2020 05:16

timotheecour force-pushed the pr_fix_11764_hash_revived branch 3 times, most recently from e58c324 to 698a13f Compare February 14, 2020 07:16

timotheecour changed the title ~~fix #13393 better hash for primitive types, resulting in large speedups~~ fix #13393 better hash for primitive types, avoiding catastrophic (1000x) slowdowns for certain input distributions Feb 14, 2020

timotheecour force-pushed the pr_fix_11764_hash_revived branch from 698a13f to e0c65bf Compare February 14, 2020 07:53

narimiran reviewed Feb 14, 2020

View reviewed changes

lib/pure/hashes.nim Outdated Show resolved Hide resolved

timotheecour added a commit to timotheecour/itertools that referenced this pull request Feb 14, 2020

fix for new hashes, refs nim-lang/Nim#13410

cde9d9f

narimiran reviewed Feb 14, 2020

View reviewed changes

testament/important_packages.nim Outdated Show resolved Hide resolved

Araq reviewed Feb 14, 2020

View reviewed changes

tests/vm/tcompiletimetable.nim Outdated Show resolved Hide resolved

narimiran pushed a commit to narimiran/itertools that referenced this pull request Feb 14, 2020

fix for new hashes, refs nim-lang/Nim#13410 (#10)

cade511

timotheecour added 4 commits February 14, 2020 01:27

fix nim-lang#13393 better hash for primitive types

c83b5d7

fix tests

305c357

remove comments that showed the rationale for the change

adcc328

fix for js

e2ef1ff

timotheecour added 9 commits February 14, 2020 01:27

fix tests

19b917e

hotfix for itertools

41f3f40

fixup

35b179c

simplify code for handling float 0/-0

fdf8df6

no need for hashBiggestIntVM, just overload in vmops instead

e1867da

better implementation, distinguish int32 vs int64; add performance re…

e0918b3

…gression test

fixup

6acb7c8

hotfix for itertools

c9755a4

remove hotfix after it got merged

326d1c0

timotheecour force-pushed the pr_fix_11764_hash_revived branch from d456421 to 326d1c0 Compare February 14, 2020 09:27

timotheecour mentioned this pull request Feb 16, 2020

pseudorandom probing for hash collision #13418

Merged

5 tasks

Araq closed this Feb 19, 2020

timotheecour mentioned this pull request Feb 20, 2020

hashing collision: improve performance using linear/pseudorandom probing mix #13440

Closed

demotomohiro mentioned this pull request Feb 28, 2020

Slow insertion of uint in Tables and Sets Collections depending on the value #13393

Closed

timotheecour mentioned this pull request Mar 17, 2020

fix hash(HashSet) which was incorrect; fix hash(OrderedTable[string, JsonNode]) which was bad #13649

Closed

c-blake mentioned this pull request Apr 4, 2020

Add hashWangYi1 #13823

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix #13393 better hash for primitive types, avoiding catastrophic (1000x) slowdowns for certain input distributions #13410

fix #13393 better hash for primitive types, avoiding catastrophic (1000x) slowdowns for certain input distributions #13410

timotheecour commented Feb 13, 2020 •

edited

Loading

Clyybber commented Feb 13, 2020

timotheecour left a comment •

edited

Loading

timotheecour commented Feb 13, 2020 •

edited

Loading

timotheecour commented Feb 14, 2020 •

edited

Loading

Araq commented Feb 14, 2020

timotheecour commented Feb 14, 2020 •

edited

Loading

Araq commented Feb 14, 2020 •

edited

Loading

timotheecour commented Feb 14, 2020 •

edited

Loading

Araq commented Feb 19, 2020

fix #13393 better hash for primitive types, avoiding catastrophic (1000x) slowdowns for certain input distributions #13410

fix #13393 better hash for primitive types, avoiding catastrophic (1000x) slowdowns for certain input distributions #13410

Conversation

timotheecour commented Feb 13, 2020 • edited Loading

before PR

after PR

Clyybber commented Feb 13, 2020

timotheecour left a comment • edited Loading

Choose a reason for hiding this comment

timotheecour commented Feb 13, 2020 • edited Loading

timotheecour commented Feb 14, 2020 • edited Loading

Araq commented Feb 14, 2020

timotheecour commented Feb 14, 2020 • edited Loading

Araq commented Feb 14, 2020 • edited Loading

timotheecour commented Feb 14, 2020 • edited Loading

Araq commented Feb 19, 2020

timotheecour commented Feb 13, 2020 •

edited

Loading

timotheecour left a comment •

edited

Loading

timotheecour commented Feb 13, 2020 •

edited

Loading

timotheecour commented Feb 14, 2020 •

edited

Loading

timotheecour commented Feb 14, 2020 •

edited

Loading

Araq commented Feb 14, 2020 •

edited

Loading

timotheecour commented Feb 14, 2020 •

edited

Loading