
fix hash(HashSet) which was incorrect; fix hash(OrderedTable[string, JsonNode]) which was bad #13649

Closed
wants to merge 6 commits

Conversation

timotheecour
Member

@timotheecour timotheecour commented Mar 14, 2020

  • any hash should satisfy:
    if a == b then hash(a) == hash(b)
    (the converse is desirable but not always possible)
    this PR fixes hash(HashSet), which didn't satisfy that property (see tests) because of tombstones.

  • Furthermore, xor doesn't have good hash properties, so I replaced it with !& (see the PR content for the full explanation); this was discovered while investigating "Improve deque, tables, and set modules" #13620 (comment) /cc @PMunch

If for some reason sort is unacceptable, I can change to using the commented-out xor code, but keep the !& + sort code as documentation in case someone runs into the issue mentioned.

  • likewise, this PR fixes hash*(n: OrderedTable[string, JsonNode]): Hash, which used xor and therefore had bad mixing properties: any two (key, val) pairs for which (hash(key) !& hash(val)) evaluates to the same value would cancel each other out; instead I used the standard approach of folding in each element via !& (no sort is needed here because OrderedTable is ordered).

A particular case of this type of collision is this:

import sets
import tables
import json
var a: OrderedTable[string, JsonNode]
a.add("foo", newJString("1"))
a.add("foo", newJString("1")) # with the xor-based hash, this cancels the previous (key, value) pair
doAssert hash(a) == 0

note

the JSON spec allows duplicate keys (see https://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object), but std/json silently ignores duplicate keys:

  var b = %* {
    "foo": "1",
    "foo": "1",
    "bar": "2",
    "bar": "2",
  }
  echo b
{"foo":"1","bar":"2"}

that's a separate issue though
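The sort-based approach mentioned in the bullets above can be sketched as follows. This is a Python illustration of the idea only, not the PR's actual Nim code; the multiplicative fold constant is a hypothetical stand-in for Nim's `!&`:

```python
def hash_unordered_sorted(elems):
    """Order-insensitive hash: sort the element hashes first, then
    fold them in with an order-sensitive mixing step (the `!&` idea).
    Sorting makes insertion order irrelevant, at the cost of an
    allocation for the sorted list."""
    h = 0
    for eh in sorted(hash(e) for e in elems):
        # multiplicative fold standing in for Nim's `!&`
        h = (h + eh) * 0x9E3779B1 % 2**32
    return h

# same elements, different order: identical hash
assert hash_unordered_sorted([1, 2, 3]) == hash_unordered_sorted([3, 1, 2])
```

Because the fold itself is order-sensitive, two distinct multisets are unlikely to collide, unlike a plain xor or sum of element hashes.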

@timotheecour timotheecour changed the title fix hash(HashSet) which was incorrect fix hash(HashSet) which was incorrect; fix hash(OrderedTable[string, JsonNode]) which was bad Mar 14, 2020
@timotheecour timotheecour reopened this Mar 15, 2020
@krux02
Contributor

krux02 commented Mar 15, 2020

Now hashing does allocate memory. That is pretty bad. I really prefer the xor operator over that.

lib/pure/collections/sets.nim (outdated review thread, resolved)
@timotheecour
Member Author

timotheecour commented Mar 17, 2020

PTAL, this is a tricky problem and I think I've now found a good solution:

  • that solution is also more general: it works for any unordered collection of elements (duplicates allowed), so I've added hashes.hashUnordered, which works with any iterator (including a seq).
  • no alloc/sort needed; it should be almost as efficient as the xor approach (the other costs should dwarf the difference)
  • this reuses parts of the defunct #13410 ("fix #13393: better hash for primitive types, avoiding catastrophic (1000x) slowdowns for certain input distributions"), which introduced "good hashes" for uint32/uint64; I'm using those for nonlinearHash, see below.
  • as an extra non-linear mixing step, I also mix in the number of elements at the end.

The comments in code should explain the fine print, but in short:

  • sort approach would give the best hash, but yes, had downside of requiring allocations
  • hash(ai) xor hash(aj) has downsides, e.g. two identical hashes would cancel to 0 (as also noted in [1])
  • hash(ai)+hash(aj) has similar downsides, eg 2 opposite hashes would result in 0 (or simply, hash(@[10, 20]) would equal hash(@[10+a, 20-a]))
  • hash(ai)*hash(aj) ditto, e.g. with hash(@[10 div a, 20 * a])

So I'm instead combining hashes via nonlinearHash(hash(ai)) + nonlinearHash(hash(aj)), which is both simple and removes common collisions (though it is not robust to adversarial inputs). This is especially important given that hash in the stdlib is now the plain identity function for integral types.
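The combining scheme can be sketched in Python like this. The scrambler below is a splitmix64-style finalizer used purely as a stand-in; the actual nonlinearHash constants and structure in the PR may differ:

```python
MASK = (1 << 64) - 1

def nonlinear(h):
    """Splitmix64-style finalizer (stand-in for the PR's nonlinearHash):
    scrambles the bits so that nearby inputs map to distant outputs."""
    h &= MASK
    h = (h ^ (h >> 30)) * 0xBF58476D1CE4E5B9 & MASK
    h = (h ^ (h >> 27)) * 0x94D049BB133111EB & MASK
    return h ^ (h >> 31)

def hash_unordered(elems):
    """Commutative combine: sum the scrambled element hashes, then
    mix in the element count as an extra non-linear step."""
    elems = list(elems)
    acc = sum(nonlinear(hash(e)) for e in elems) & MASK
    return nonlinear(acc + len(elems))

# order-insensitive, yet [10, 20] no longer collides with [11, 19]
assert hash_unordered([10, 20]) == hash_unordered([20, 10])
assert hash_unordered([10, 20]) != hash_unordered([11, 19])
```

Scrambling each element hash before summing is what breaks the linear cancellations described above: sums of scrambled values no longer track sums of the raw inputs.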

links

lib/pure/json.nim (outdated review thread, resolved)
@PMunch
Contributor

PMunch commented Mar 17, 2020

Shouldn't xor be fine for sets, since by definition you can't have two values that are exactly the same?

@timotheecour
Member Author

timotheecour commented Mar 18, 2020

PTAL

Shouldn't xor be fine for sets as you by definition can't have two values that are exactly the same?

No, xor suffers from exactly the same problem as + and * that I described above; prior to this PR, these all give the same hash:

import sets, hashes
echo toHashSet(@[15, 8]).hash
echo toHashSet(@[14, 9]).hash
echo toHashSet(@[13, 10]).hash
echo toHashSet(@[11, 12]).hash

more generally, for any a,b,c:

toHashSet(@[a,b]).hash == toHashSet(@[a xor c, b xor c]).hash

which gives tons of trivial collisions (easy to see by looking at the bit patterns).
xor fails one of the key desirable properties of a hash: small differences in input should result in large differences in the output hash.
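The trivial collisions are easy to check directly; the bit-level argument is language-independent, so plain Python suffices here:

```python
# with identity hashing, a two-element set effectively hashes to the
# xor of its members -- and all of these pairs xor to the same value
pairs = [(15, 8), (14, 9), (13, 10), (11, 12)]
assert len({a ^ b for a, b in pairs}) == 1  # every pair gives 7

# more generally, xor-ing both elements with any c preserves the hash
a, b, c = 15, 8, 42
assert (a ^ b) == (a ^ c) ^ (b ^ c)
```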

after this PR, I'm mitigating this to a large extent (at minimal cost) using nonlinearHash

@PMunch
Contributor

PMunch commented Mar 18, 2020

Ah, I forgot that hashes of numbers in Nim simply output the number itself. IMO this also breaks the desirable property of hashes that "small differences in input should result in large differences in output hash".

@timotheecour
Member Author

timotheecour commented Mar 18, 2020

Ah, I forgot that hashes of numbers in Nim simply output the number itself. IMO this also breaks the desirable property of hashes that "small differences in input should result in large differences in output hash".

there's a reason for that, and it follows Python's precedent: make the common case fast. See #13440 for a comparison (the "better hash" vs "mix (this PR)" columns); the identity hash for ordinal types is (in isolation) 2.2x faster than nonlinearHash from #13410 (which I'm re-introducing here just for the specific case of hashUnordered). As shown in #13440, we can still get the best of both worlds between:
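The Python precedent is easy to verify: CPython's integer hash is the identity (up to a reduction modulo a large Mersenne prime exposed via sys.hash_info), so the common case costs essentially nothing:

```python
import sys

# CPython hashes small ints to themselves; the modular reduction only
# matters for values at or beyond sys.hash_info.modulus
assert hash(12345) == 12345
assert sys.hash_info.modulus in (2**61 - 1, 2**31 - 1)
assert hash(-1) == -2  # -1 is reserved as an internal error code
```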

@timotheecour
Member Author

rebased after 3f1a85b; I'll try tomorrow to also reuse hcode for HashSet (avoiding recomputing the hash) while keeping the benefits of nonlinearHash (vs xor, which introduces lots of trivial collisions, as explained in #13649 (comment))

@Araq
Member

Araq commented Mar 18, 2020

I don't understand why all of this is needed, and btw hardcoding implementations via callbacks in the VM could cause trouble in the future if the compiler ever relies on compile-time hash tables.

Why not simply add combineHashesCommutatively to hashes.nim and use that instead of xor?

@timotheecour
Member Author

timotheecour commented Mar 19, 2020

I don't understand why all of this is needed

performance (via reducing trivial collisions)

and btw hardcoding implementations via callbacks in the VM could cause trouble in the future if the compiler ever relies on compile-time hash tables

we're already doing it:

  registerCallback c, "stdlib.hashes.hashVmImplByte", hashVmImplByte
  # + similar hashes

and it's always possible to fall back to the classic bootstrapping approach under when defined(nimHasHashVersion2)

Why not simply add combineHashesCommutatively to hashes.nim and use that instead of xor

not sure what your definition of combineHashesCommutatively would be; I'm guessing something like this:

# hashes.nim
proc combineHashesCommutatively*(result: var Hash, h: Hash) =
  result += nonlinearHash(h)

but that proc doesn't really simplify much for defining the other hashes.

@Araq
Member

Araq commented Apr 20, 2020

Too complex for what it accomplishes, and the underlying issues have been fixed via other means. A hashes.combineHashesIgnoreOrder proc will still be accepted.

@Araq Araq closed this Apr 20, 2020