
fix hash(HashSet) which was incorrect; fix hash(OrderedTable[string, JsonNode]) which was bad #13649

Closed
wants to merge 6 commits

Conversation

timotheecour
Member

@timotheecour timotheecour commented Mar 14, 2020

  • any hash should satisfy:
    if a == b then hash(a) == hash(b)
    (the converse is desirable but not always possible)
    this PR fixes hash(HashSet), which didn't satisfy that property (see tests) because of tombstones.

  • Furthermore, xor doesn't have good hash properties, so I replaced it with !& (see the PR content for the full explanation); this was discovered while investigating "Improve deque, tables, and set modules" #13620 (comment) /cc @PMunch

If for some reason sort is unacceptable, I can change to using the commented-out xor code, but keep the !& + sort code as documentation in case someone runs into the issue mentioned.

  • likewise, this PR fixes hash*(n: OrderedTable[string, JsonNode]): Hash, which used xor and therefore had bad mixing properties: any two (key, val) pairs for which (hash(key) !& hash(val)) evaluates to the same value would cancel each other out; instead I used the standard approach of folding in each element via !& (no sort is needed here because OrderedTable is ordered).

A particular case of this type of collision is this:

import sets
import tables
import json
var a: OrderedTable[string, JsonNode]
a.add("foo", newJString("1"))
a.add("foo", newJString("1")) # with the xor-based hash, this cancels the previous (key, value) pair
doAssert hash(a) == 0

note

the JSON spec allows duplicate keys (see https://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object), but std/json silently ignores duplicate keys:

  var b = %* {
    "foo": "1",
    "foo": "1",
    "bar": "2",
    "bar": "2",
  }
  echo b
{"foo":"1","bar":"2"}

that's a separate issue though
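The sort-based approach mentioned in the bullets above can be sketched as follows. This is a Python illustration of the idea only, not the PR's actual Nim code; the multiplicative fold constant is a hypothetical stand-in for Nim's `!&`:

```python
def hash_unordered_sorted(elems):
    """Order-insensitive hash: sort the element hashes first, then
    fold them in with an order-sensitive mixing step (the `!&` idea).
    Sorting makes insertion order irrelevant, at the cost of an
    allocation for the sorted list."""
    h = 0
    for eh in sorted(hash(e) for e in elems):
        # multiplicative fold standing in for Nim's `!&`
        h = (h + eh) * 0x9E3779B1 % 2**32
    return h

# same elements, different order: identical hash
assert hash_unordered_sorted([1, 2, 3]) == hash_unordered_sorted([3, 1, 2])
```

Because the fold itself is order-sensitive, two distinct multisets are unlikely to collide, unlike a plain xor or sum of element hashes.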

@timotheecour timotheecour changed the title fix hash(HashSet) which was incorrect fix hash(HashSet) which was incorrect; fix hash(OrderedTable[string, JsonNode]) which was bad Mar 14, 2020
@timotheecour timotheecour reopened this Mar 15, 2020
@krux02
Contributor

krux02 commented Mar 15, 2020

Now hashing does allocate memory. That is pretty bad. I really prefer the xor operator over that.

lib/pure/collections/sets.nim (outdated review thread, resolved)
@timotheecour
Member Author

timotheecour commented Mar 17, 2020

PTAL, this is a tricky problem and I think I've now found a good solution:

  • that solution is also more general: it works for any unordered collection of elements (duplicates allowed), so I've added hashes.hashUnordered, which works with any iterator (including a seq).
  • no alloc/sort needed; it should be almost as efficient as the xor approach (the other costs should dwarf the difference)
  • this reuses parts of the defunct #13410 ("fix #13393: better hash for primitive types, avoiding catastrophic (1000x) slowdowns for certain input distributions"), which introduced "good hashes" for uint32/uint64; I'm using those for nonlinearHash, see below.
  • as an extra non-linear mixing step, I also mix in the number of elements at the end.

The comments in code should explain the fine print, but in short:

  • sort approach would give the best hash, but yes, had downside of requiring allocations
  • hash(ai) xor hash(aj) has downsides, e.g. two identical hashes would cancel to 0 (as also noted in [1])
  • hash(ai)+hash(aj) has similar downsides, eg 2 opposite hashes would result in 0 (or simply, hash(@[10, 20]) would equal hash(@[10+a, 20-a]))
  • hash(ai)*hash(aj) ditto, e.g. with hash(@[10 div a, 20 * a])

So I'm instead combining hashes via nonlinearHash(hash(ai)) + nonlinearHash(hash(aj)), which is both simple and removes common collisions (though it is not robust to adversarial inputs). This is especially important given that hash in the stdlib is now the plain identity function for integral types.
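The combining scheme can be sketched in Python like this. The scrambler below is a splitmix64-style finalizer used purely as a stand-in; the actual nonlinearHash constants and structure in the PR may differ:

```python
MASK = (1 << 64) - 1

def nonlinear(h):
    """Splitmix64-style finalizer (stand-in for the PR's nonlinearHash):
    scrambles the bits so that nearby inputs map to distant outputs."""
    h &= MASK
    h = (h ^ (h >> 30)) * 0xBF58476D1CE4E5B9 & MASK
    h = (h ^ (h >> 27)) * 0x94D049BB133111EB & MASK
    return h ^ (h >> 31)

def hash_unordered(elems):
    """Commutative combine: sum the scrambled element hashes, then
    mix in the element count as an extra non-linear step."""
    elems = list(elems)
    acc = sum(nonlinear(hash(e)) for e in elems) & MASK
    return nonlinear(acc + len(elems))

# order-insensitive, yet [10, 20] no longer collides with [11, 19]
assert hash_unordered([10, 20]) == hash_unordered([20, 10])
assert hash_unordered([10, 20]) != hash_unordered([11, 19])
```

Scrambling each element hash before summing is what breaks the linear cancellations described above: sums of scrambled values no longer track sums of the raw inputs.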

links

lib/pure/json.nim (outdated review thread, resolved)
@PMunch
Contributor

PMunch commented Mar 17, 2020

Shouldn't xor be fine for sets, since by definition you can't have two values that are exactly the same?

@timotheecour
Member Author

timotheecour commented Mar 18, 2020

PTAL

Shouldn't xor be fine for sets as you by definition can't have two values that are exactly the same?

No, xor suffers from exactly the same problem as + and * that I described above; prior to this PR, these all give the same hash:

import sets, hashes
echo toHashSet(@[15, 8]).hash
echo toHashSet(@[14, 9]).hash
echo toHashSet(@[13, 10]).hash
echo toHashSet(@[11, 12]).hash

more generally, for any a,b,c:

toHashSet(@[a,b]).hash == toHashSet(@[a xor c, b xor c]).hash

which gives tons of trivial collisions (easy to see by looking at the bit patterns).
xor fails one of the key desirable properties of a hash: small differences in input should result in large differences in the output hash.
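The trivial collisions are easy to check directly; the bit-level argument is language-independent, so plain Python suffices here:

```python
# with identity hashing, a two-element set effectively hashes to the
# xor of its members -- and all of these pairs xor to the same value
pairs = [(15, 8), (14, 9), (13, 10), (11, 12)]
assert len({a ^ b for a, b in pairs}) == 1  # every pair gives 7

# more generally, xor-ing both elements with any c preserves the hash
a, b, c = 15, 8, 42
assert (a ^ b) == (a ^ c) ^ (b ^ c)
```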

after this PR, I'm mitigating this to a large extent (at minimal cost) using nonlinearHash

@PMunch
Contributor

PMunch commented Mar 18, 2020

Ah, I forgot that hashes of numbers in Nim simply output the number itself. IMO this also breaks the desirable property of hashes that "small differences in input should result in large differences in output hash".

@timotheecour
Member Author

timotheecour commented Mar 18, 2020

Ah, I forgot that hashes of numbers in Nim simply output the number itself. IMO this also breaks the desirable property of hashes that "small differences in input should result in large differences in output hash".

there's a reason for that, and it follows Python's precedent: make the common case fast. See #13440 for a comparison (the "better hash" vs "mix (this PR)" columns); the identity hash for ordinal types is (in isolation) 2.2x faster than nonlinearHash from #13410 (which I'm re-introducing here just for the specific case of hashUnordered). As shown in #13440, we can still get the best of both worlds between:
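The Python precedent is easy to verify: CPython's integer hash is the identity (up to a reduction modulo a large Mersenne prime exposed via sys.hash_info), so the common case costs essentially nothing:

```python
import sys

# CPython hashes small ints to themselves; the modular reduction only
# matters for values at or beyond sys.hash_info.modulus
assert hash(12345) == 12345
assert sys.hash_info.modulus in (2**61 - 1, 2**31 - 1)
assert hash(-1) == -2  # -1 is reserved as an internal error code
```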

@timotheecour
Member Author

rebased after 3f1a85b; I'll try tomorrow to also reuse hcode for HashSet (avoiding recomputing the hash) while keeping the benefits of nonlinearHash (vs xor, which introduces lots of trivial collisions, as explained in #13649 (comment))

@Araq
Member

Araq commented Mar 18, 2020

I don't understand why all of this is needed, and btw hardcoding implementations via callbacks in the VM could cause trouble in the future if the compiler ever relies on compile-time hash tables.

Why not simply add combineHashesCommutatively to hashes.nim and use that instead of xor?

@timotheecour
Member Author

timotheecour commented Mar 19, 2020

I don't understand why all of this is needed

performance (via reducing trivial collisions)

and btw hardcoding implementations via callbacks in the VM could cause trouble in the future if the compiler ever relies on compile-time hash tables

we're already doing it:

  registerCallback c, "stdlib.hashes.hashVmImplByte", hashVmImplByte
  # + similar hashes

and it's always possible to fall back to the classic bootstrapping approach under when defined(nimHasHashVersion2)

Why not simply add combineHashesCommutatively to hashes.nim and use that instead of xor

not sure what your definition of combineHashesCommutatively would be; I'm guessing something like this:

# hashes.nim
proc combineHashesCommutatively*(result: var Hash, h: Hash) =
  result += nonlinearHash(h)

but that proc doesn't really simplify much for defining the other hashes.

@Araq
Member

Araq commented Apr 20, 2020

Too complex for what it accomplishes, and the underlying issues have been fixed via other means. A hashes.combineHashesIgnoreOrder proc will still be accepted.

@Araq Araq closed this Apr 20, 2020