fix hash(HashSet) which was incorrect; fix hash(OrderedTable[string, JsonNode]) which was bad #13649
Force-pushed from 79d73c0 to a373165.
Now hashing does allocate memory. That is pretty bad. I really prefer the |
Force-pushed from a373165 to 71ec6c2.
PTAL, this is a tricky problem and I think I've now found a good solution:
The comments in code should explain the fine print, but in short:
So I'm instead combining hashes via links
|
Shouldn't |
Force-pushed from 71ec6c2 to 36c726e.
PTAL
No,
toHashSet(@[15,8]).hash
toHashSet(@[14,9]).hash
toHashSet(@[13,10]).hash
toHashSet(@[11,12]).hash
more generally, for any a,b,c:
which gives tons of trivial collisions (easy to see by looking at the bit pattern). After this PR, I'm mitigating this to a large extent (at minimal cost) using |
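The collisions in the four sets above can be checked by hand. The sketch below assumes that hash(int) returns the integer itself (as discussed in this thread) and that the set hash is a plain xor of element hashes; `xorCombine` and `shifted` are hypothetical helper names, not part of the stdlib.

```nim
# Assumption: hash(int) is the identity and the set hash is an xor of element hashes.
proc xorCombine(a, b: int): int =
  ## Hypothetical stand-in for an xor-based hash of a two-element set.
  a xor b

let pairs = [(15, 8), (14, 9), (13, 10), (11, 12)]
for p in pairs:
  # Each pair xors to the same value, so all four sets collide.
  doAssert xorCombine(p[0], p[1]) == 7

proc shifted(a, b, c: int): int =
  ## Xoring the same c into both elements leaves the combined value unchanged.
  (a xor c) xor (b xor c)

doAssert shifted(15, 8, 1) == xorCombine(15, 8)  # (14, 9) collides with (15, 8)
```

The second helper shows one general family of such collisions: flipping the same bits in both elements never changes an xor-combined hash.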
Ah, I forgot that hashes of numbers in Nim are simply outputting the number itself. IMO this also breaks the desirable property of hashes that "small differences in input should result in large differences in output hash". |
there's a reason for that, and it follows precedent of python: make the common case fast. See #13440 for a comparison (
|
Force-pushed from 36c726e to ae73e27.
rebased after 3f1a85b; I'll try tomorrow to also reuse |
I don't understand why all of this is needed and btw hardcoding implementations via callbacks in the VM could cause trouble in the future if the compiler ever relies on compile-time hash tables. Why not simply add |
performance (via reducing trivial collisions)
we're already doing it:
registerCallback c, "stdlib.hashes.hashVmImplByte", hashVmImplByte
# + similar hashes
and it's always possible to use the classic bootstrap approach
not sure what would be your definition of
# hashes.nim
proc combineHashesCommutatively*(result: var Hash, hash: Hash) =
  result += nonlinearHash(hash)
but that proc doesn't really simplify much for defining the other hashes. |
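For reference, the commutative combination mentioned above could look roughly like the following. This is only a sketch: `mix` is a hypothetical stand-in for `nonlinearHash` (built here from the existing `!&` and `!$` operators in std/hashes), and `hashCommutative` is an illustrative name, not an API from this PR.

```nim
import std/hashes

proc mix(h: Hash): Hash =
  ## Hypothetical nonlinear mixer, standing in for `nonlinearHash`.
  !$(0 !& h)

proc combineCommutatively(acc: var Hash, h: Hash) =
  ## Wrapping addition (`+%`) avoids overflow errors and is order-independent.
  acc = acc +% mix(h)

proc hashCommutative(elemHashes: openArray[Hash]): Hash =
  ## Combines element hashes without sorting and without allocating.
  for h in elemHashes:
    combineCommutatively(result, h)

# The same elements in a different order give the same hash.
doAssert hashCommutative([Hash(15), Hash(8)]) == hashCommutative([Hash(8), Hash(15)])
```

Because each element hash is mixed before being summed, flipping the same bits in two elements no longer cancels as it does with a plain xor.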
Too complex for what it accomplishes and the underlying issues have been fixed via other means. A |
Any hash should satisfy: if a == b then hash(a) == hash(b) (the reverse being desirable but not always possible).
This PR fixes hash(HashSet), which didn't satisfy that property (see tests) because of tombstones.
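A minimal check of that property, assuming a Nim version where hash(HashSet) no longer depends on deletion history: two sets with the same elements but different insert/delete histories should compare equal and hash equal. The set contents are arbitrary illustration values.

```nim
import std/[sets, hashes]

let a = toHashSet([1, 2, 3])
var b = toHashSet([1, 2, 3, 4])
b.excl(4)  # b now holds the same elements as a, but went through a deletion

doAssert a == b
# Expected to hold once the set hash ignores leftover deletion state.
doAssert hash(a) == hash(b)
```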
Furthermore, xor doesn't have good hash properties, so I replaced it with !& (see PR content for full explanation); this was discovered while investigating "Improve deque, tables, and set modules" #13620 (comment). /cc @PMunch

If for some reason sort is unacceptable, I can change to using the commented out xor code but keep the !& + sort code for documentation in case someone runs into the issue mentioned.

Also fixes hash*(n: OrderedTable[string, JsonNode]): Hash, which used xor, which again has bad mixing properties: any 2 (key,val) pairs whose (hash(key) !& hash(val)) have the same value would cancel each other out. Instead I used the standard approach based on adding each element via !& (no sort needed here because OrderedTable is ordered).

A particular case of this type of collision is this:
Note: the JSON spec allows duplicate keys (see https://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object), but std/json silently ignores duplicate keys:
that's a separate issue though.
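To make the xor cancellation and the two alternatives described in this PR concrete, here is a small sketch. It is not the code from the PR: `hashOrdered` and `hashUnordered` are hypothetical names, and the entry hashes are made-up values. It only relies on `!&` and `!$` from std/hashes and `sort` from std/algorithm.

```nim
import std/[hashes, algorithm]

proc hashOrdered(entryHashes: seq[Hash]): Hash =
  ## Feeds each entry hash through `!&` in sequence, then finalizes with `!$`.
  for h in entryHashes:
    result = result !& h
  result = !$result

proc hashUnordered(entryHashes: seq[Hash]): Hash =
  ## Sorts a copy first so iteration order does not matter (this allocates).
  var sortedHashes = entryHashes
  sortedHashes.sort()
  hashOrdered(sortedHashes)

let e = Hash(1)  # hypothetical combined hash of one (key, val) entry

# With xor, two entries with the same combined hash cancel to zero.
doAssert (e xor e) == 0
# With `!&` applied in order, the same two entries still contribute.
doAssert hashOrdered(@[e, e]) != 0
# For unordered containers, sorting makes the result order-independent.
doAssert hashUnordered(@[Hash(15), Hash(8)]) == hashUnordered(@[Hash(8), Hash(15)])
```

The in-order variant fits an ordered container such as OrderedTable; the sorted variant trades an allocation for order independence, which is the cost raised earlier in the review.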