RFC: new approach to efficiently hashing 1, 1.0, big(1), the same. #6624
Conversation
Oh, I should also note that this is almost certainly really broken on 32-bit systems. I just didn't worry about that while developing this, but should be fixable with a little attention.
:100: Oh boy have I been waiting for this to land. This is a really nice piece of work. Questions about comparing collections should not get in the way of this, since those issues exist anyway. Now we have
```julia
## hashing rational values ##

#=
`decompose(x)`: non-canonical decomposition of rational values as `den*2^pow/num`.
```
Should this say `num*2^pow/den`?
Uh, yes. Duh. Fortunately, only the documentation is wrong :-)
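For illustration, here is a minimal sketch of the decomposition contract under discussion (these method definitions are hypothetical stand-ins, not the PR's actual implementation): `decompose(x)` returns integers `(num, pow, den)` such that `x == num*2^pow/den`.

```julia
# Hypothetical sketch of the decomposition contract: decompose(x) returns
# integers (num, pow, den) such that x == num * 2^pow / den.
decompose(x::Rational{Int}) = (numerator(x), 0, denominator(x))

function decompose(x::Float64)       # finite x only in this sketch
    s, e = frexp(x)                  # x == s * 2^e with |s| in [0.5, 1)
    (Int64(s * 2^53), e - 53, 1)     # scale the significand to an integer
end
```

Under this contract, mathematically equal values of different types decompose to comparable integer triples, which is what lets them hash the same.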
Would it suffice to have a single definition?
Good point. That probably saves typing overall.
So does this mean
would work?
Yes, @IainNZ, that's exactly what it means – but I have to fix
The major change is that numeric values that are mathematically equal hash the same. This is tricky to do in a way that allows the hashing of `Int64`, `Float64`, `Uint64` and such common numeric types to all be fast – nearly as fast as just applying the core hash to them as raw bits.

Although tests pass, this is an inherently half-baked state since `isequal` and `hash` are now badly out of sync. Many decisions still need to be made about how to hash collections: does the collection type matter or just the element type of the collection? Or neither?

This also does away with the `bitmix` function, instead favoring a Merkle-Damgård style of combining arbitrary amounts of data into a single, fixed-size hash value output. In a sense the hash function with two arguments replaces the `bitmix` function – you can give the result of hashing previous values as the second argument to `hash` and the result will depend on both in a difficult-to-predict way.
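The chaining scheme described above can be sketched as follows (the helper name `hash_all` is hypothetical; the combining step is the two-argument `hash(x, h)`):

```julia
# Merkle–Damgård-style chaining: feed the previous digest back in as the
# second argument of the two-argument hash.
function hash_all(xs, h::UInt = zero(UInt))
    for x in xs
        h = hash(x, h)   # result depends on x and on everything hashed before
    end
    return h
end
```

Each step's output becomes the next step's seed, so the final digest depends on every element and on their order, without any separate `bitmix` primitive.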
@JeffBezanson, this code could really benefit from specialization on default arguments since many performance-critical cases have the zero default second argument. How hard would that be to do?
I suppose it is really just inlining; if the function called by
but I doubt we want to do that in all cases.
In this case, I suspect we might really want to do it in all cases – the hash function is expected to be called with just one argument – the second argument is for Merkle-Damgård chaining.
@JeffBezanson, I can't for the life of me figure out why
This may be due to LLVM's Demonic Constant Folding™.
0 for a constant NaN, 0x8000000000000000 for non-constant. A good fix might be to move the
Go home, LLVM, you're drunk.
This fucking LLVM undefined bullshit completely trashes the performance of float hashing. Pardon the language, but this is just so infuriating. I had this down to almost matching integer hashing performance and this COMPLETELY FUCKING POINTLESS UNDEFINED VALUES CRAP IS SABOTAGING IT.
LLVM's fptosi intrinsic is undefined for NaN, so LLVM obnoxiously and pointlessly does different things when it gets NaN as a run-time value than as a compile-time value. To avoid this shitty, pointless trap, we have to avoid calling fptosi on NaN by introducing a branch into the hashing function – even though for hashing, we don't care *what* value is produced, just as long as it's consistent. Unfortunately, this affects the performance of Float64 hashing pretty badly. I was not able to figure out any way to recover this lost performance. LLVM really needs to stop doing this.
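The workaround described above can be sketched like this (the function name is a hypothetical stand-in; `unsafe_trunc` is what lowers to LLVM's `fptosi`):

```julia
# Guard NaN before the float-to-int truncation: fptosi (unsafe_trunc) is
# undefined for NaN, so return any fixed value for it -- for hashing, all
# that matters is that the result is consistent.
function int_key(x::Float64)
    isnan(x) && return typemin(Int64)   # arbitrary but consistent stand-in
    # Safe here: x is not NaN (out-of-range overflow is out of scope for
    # this sketch).
    unsafe_trunc(Int64, x)
end
```

The extra branch is exactly the cost being complained about: it exists only to keep NaN away from an intrinsic whose result we would otherwise be happy to accept as-is.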
Apparently the type assertion after the `decompose` call was sabotaging the ability to inline the call to `decompose` for `Rational{Int}`. Removing the type assert gives a 10x boost, bringing the speed of generic `hash(Rational)` to within 10x of `hash(Int)`.
But hey, this way LLVM lets you get the wrong answer really fast! You can save one whole instruction!
I don't think this is actually a performance optimization on their part – I think this is intentionally inconsistent behavior for your own good, so you don't rely on the undefined behaviors, gasp!
Time to write a LuaJIT bytecode backend? :p
No, it's to make it easier and faster to map languages like C that have undefined behavior. The problem is that the behavior is undefined in the first place. If you're going to have undefined behavior then ok, better make uses of it as obvious as possible. IMO the fix would be to give everything defined behavior by default, and use intrinsics or attributes of some kind to mark that e.g. a denominator can't be zero.
This actually provides less gain over the generic real hashing function than you would think, but it is slightly faster.
While messing around with generic equality checking based on the new `decompose` function introduced in the hashing work, I discovered that LLVM seems to be much better able to analyze expressions that use signbit when it's boolean and explicitly defined as `x < 0` for integer values. Since `true == 1` and `false == 0` this is a pretty benign change, although technically it is breaking. I've wanted to do this for a while and this seems like as good a time as any.
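The change can be sketched like this (using a standalone name to avoid clashing with `Base.signbit`):

```julia
# Boolean signbit for integers: since true == 1 and false == 0, code that
# relied on the old 0/1 integer result keeps working, while the return
# type is now Bool.
signbit_int(x::Integer) = x < 0
```

Arithmetic uses survive the change, e.g. `1 - 2*signbit_int(x)` still yields ±1, which is why the breakage is only technical.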
Defining isfinite in terms of decompose is simple. It seems good to insist that only floating-point reals can be NaN – NaN in hardware is simply an unavoidable reality. For any user-defined type that is not implemented in hardware, NaN should not exist since operations that would produce NaNs should raise immediate errors instead.
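One plausible shape for that definition, assuming (as an illustration, not the PR's actual convention) that non-finite values decompose with a zero denominator; the `decomp` helper here is a made-up stand-in so the sketch is self-contained:

```julia
# Assumed convention for this sketch: non-finite values (Inf, NaN) get a
# zero denominator in the (num, pow, den) decomposition.
decomp(x::Float64) =
    isnan(x) ? (0, 0, 0) :
    isinf(x) ? (ifelse(x > 0, 1, -1), 0, 0) :
    (Int64(frexp(x)[1] * 2^53), frexp(x)[2] - 53, 1)

# isfinite then falls out of the decomposition directly.
myisfinite(x::Float64) = decomp(x)[3] != 0
```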
I don't think stashing the new key on assignment is going to have a big performance impact. The biggest implication that I could think of was that it might be slightly worse for gc, but that's highly speculative.
This patch works around the error, but I suspect it is more of a bandaid:

```diff
diff --git a/src/gf.c b/src/gf.c
index 6d3a168..9fd0bae 100644
--- a/src/gf.c
+++ b/src/gf.c
@@ -1262,7 +1262,10 @@ static jl_tuple_t *arg_type_tuple(jl_value_t **args, size_t nar
     size_t i;
     for(i=0; i < nargs; i++) {
         jl_value_t *a;
-        if (jl_is_type(args[i])) {
+        if (args[i] == (jl_value_t*)jl_null) {
+            a = args[i];
+        }
+        else if (jl_is_type(args[i])) {
             a = (jl_value_t*)jl_wrap_Type(args[i]);
         }
         else if (!jl_is_tuple(args[i])) {
```
Ok – want to do the merge? There's an obvious conflict (you probably resolved it already), and then just apply your bandaid.
Conflicts: base/profile.jl
Ok, the bandaid turns out to cause problems. I'll keep working, and I'll merge when I have something decent.
I'm worried about
Maybe we can do something using
We don't actually need
I don't understand ---
Yes, but none of
I was thinking of
ex:
The sorted order is, unsurprisingly, not consistent with the partial order.
Yay!
In LLVM (inherited from C), fptosi has undefined behavior if the result does not fit the integer size after rounding down. But by using the same strategy as generic hashing of Real values, we actually can end up with a situation that is faster for the CPU to deal with and avoids the UB. Refs #6624 (3696968) Fixes #37800