Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: new approach to efficiently hashing 1, 1.0, big(1), the same. #6624

Merged
merged 20 commits into from
May 7, 2014

Conversation

StefanKarpinski
Copy link
Member

The major change is that numeric values that are mathematically equal hash the same. This is tricky to do in a way that allows the hashing of Int64, Float64, Uint64 and such common numeric types to all be fast – nearly as fast as just applying the core hash to them as raw bits.

Although tests pass, this is an inherently half-baked state since isequal and hash are now badly out of sync. Many decisions still need to be made about how to hash collections: does the collection type matter or just the element type of the collection? Or neither?

This also does away with the bitmix function, instead favoring a Merkle-Damgård style of combining arbitrary amounts of data into a single, fixed-size hash value output. In a sense the hash function with two arguments replaces the bitmix function – you can give the result of hashing previous values as the second argument to hash and the result will depend on both in a difficult-to-predict way.

@StefanKarpinski
Copy link
Member Author

Oh, I should also note that this is almost certainly really broken on 32-bit systems. I just didn't worry about that while developing this, but should be fixable with a little attention.

@JeffBezanson
Copy link
Member

+:100:

Oh boy have I been waiting for this to land. This is a really nice piece of work.

Questions about comparing collections should not get in the way of this, since those issues exist anyway. Now we have

=== - our old friend egal
== - equality with respect to the abstraction implemented by the arguments
isequal - hasn't been changed yet, but should be the same as == except for -0.0 and NaN

## hashing rational values ##

#=
`decompose(x)`: non-canonical decomposition of rational values as `den*2^pow/num`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this say num*2^pow/den?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, yes. Duh. Fortunately, only the documentation is wrong :-)

@JeffBezanson
Copy link
Member

Would it suffice to have a single definition

hash(x) = hash(x, uint(0))

?

@StefanKarpinski
Copy link
Member Author

Good point. That probably saves typing overall.

@IainNZ
Copy link
Member

IainNZ commented Apr 24, 2014

So does this mean

mydict = Dict{Number,Int}()
mydict[1.0] = 100
@assert mydict[BigInt(1)] == 100
@assert mydict[1] = 100

would work?

@StefanKarpinski
Copy link
Member Author

Yes, @IainNZ, that's exactly what it means – but I have to fix isequal first.

The major change is that numeric values that are mathematically
equal hash the same. This is tricky to do in a way that allows the
hashing of Int64, Float64, Uint64 and such common numeric types to
all be fast – nearly as fast as just applying the core hash to them
as raw bits.

Although tests pass, this is an inherently half-baked state since
isequal and hash are now badly out of sync. Many decisions still
need to be made about how to hash collections: does the collection
type matter or just the element type of the collection? Or neither?

This also does away with the bitmix function, instead favoring a
Merkle-Damgård style of combining arbitrary amounts of data into a
single, fixed-size hash value output. In a sense the hash function
with two arguments replaces the bitmix function – you can give the
result of hashing previous values as the second argument to hash
and the result will depend on both in a difficult-to-predict way.
@StefanKarpinski
Copy link
Member Author

@JeffBezanson, this code could really benefit from specialization on default arguments since many performance-critical cases have the zero default second argument. How hard would that be to do?

@JeffBezanson
Copy link
Member

I suppose it is really just inlining; if the function called by hash(x,0) passes inlining heuristics then it should happen already, to the extent that LLVM's constant folding helps. If we could force inlining at call sites, then default arguments could generate

hash(x) = @inline hash(x,uint(0))

but I doubt we want to do that in all cases.

@StefanKarpinski
Copy link
Member Author

In this case, I suspect we might really want to do it in all cases – the hash function is expected to be called with just one argument – the second argument is for Merkle-Damgård chaining.

@StefanKarpinski
Copy link
Member Author

@JeffBezanson, I can't for the life of me figure out why hash(NaN) != hash(big(NaN)). I've verified that the generic hash(::Real, ::Uint) is returning at this line, but the wrong value is somehow getting returned. I moved code around to try to make sure it wasn't #265 again but it didn't help this situation at all (although I like having all the generic code later so I left it there). Any ideas?

@JeffBezanson
Copy link
Member

This may be due to LLVM's Demonic Constant Folding™. fptosi is undefined on NaN, so it gives a different answer when the argument is a constant NaN (in which case LLVM knows it is an undefined operation):

julia> bar() = box(Uint64,fptosi(unbox(Float64,NaN)))
bar (generic function with 2 methods)

julia> code_llvm(bar,())

define i64 @julia_bar16866() {
top:
  ret i64 0, !dbg !1324
}

julia> bar(x) = box(Uint64,fptosi(unbox(Float64,x)))
bar (generic function with 2 methods)

julia> bar(NaN)
0x8000000000000000

0 for a constant NaN, 0x8000000000000000 for non-constant. A good fix might be to move the ifelse(isnan(x) to protect the fptosi in hashing.jl:36.

@StefanKarpinski
Copy link
Member Author

Go home, LLVM, you're drunk.

@StefanKarpinski
Copy link
Member Author

This fucking LLVM undefined bullshit completely trashes the performance of float hashing. Pardon the language, but this is just so infuriating. I had this down to almost matching integer hashing performance and this COMPLETELY FUCKING POINTLESS UNDEFINED VALUES CRAP IS SABOTAGING IT.

LLVM's fptosi intrinsic is undefined for NaN, so LLVM obnoxiously and
pointlessly does different things when it gets NaN as a run-time value
than as a compile-time value. To avoid this shitty, pointless trap, we
have to avoid calling fptosi on NaN by introducing a branch into the
hashing function – even though for hashing, we don't care *what* value
is produced, just as long as it's consistent. Unfortunately, this
affects the performance of Float64 hashing pretty badly. I was not
able to figure out any way to recover this lost performance. LLVM
really needs to stop doing this.
Apparently the type assertion after the decompose call was sabotaging
the ability to inline the call to decompose for Rational{Int}.
Removing the type assert gives a 10x boost, bringing the speed of
generic hash(Rational) to withing 10x of hash(Int).
@JeffBezanson
Copy link
Member

But hey, this way LLVM lets you get the wrong answer really fast! You can save one whole instruction!

@StefanKarpinski
Copy link
Member Author

I don't think this is actually a performance optimization on their part – I think this is intentionally inconsistent behavior for your own good, so you don't rely on the undefined behaviors, gasp!

@ihnorton
Copy link
Member

Time to write a LuaJIT bytecode backend? :p

@JeffBezanson
Copy link
Member

No, it's to make it easier and faster to map languages like C that have undefined behavior. The problem is that the behavior is undefined in the first place. If you're going to have undefined behavior then ok, better make uses of it as obvious as possible. IMO the fix would be to give everything defined behavior by default, and use intrinsics or attributes of some kind to mark that e.g. a denominator can't be zero.

This actually provides less gain over the generic real hashing
function than you would think, but it is slightly faster.
While messing around with generic equality checking based on the
new decode function introduced in the hashing work, I discovered
that LLVM seems to be much better able to analyze expressions that
use signbit when it's boolean and explicitly defined as `x < 0` for
integer values. Since `true == 1` and `false == 0` this is a pretty
benign change, although technically it is breaking. I've wanted to
do this for a while and this seems like as good a time as any.
Defining isfinite in terms of decompose is simple. It seems good to
insist that only floating-point reals can be NaN – NaN in hardware
is simply an unavoidable reality. For any user-defined type that is
not implemented in hardware, NaN should not exist since operations
that would produce NaNs should raise immediate errors instead.
@ihnorton ihnorton added this to the 0.3 milestone Apr 29, 2014
@StefanKarpinski
Copy link
Member Author

I don't think stashing the new key on assignment is going to have a big performance impact. The biggest implication that I could think of was that it might be slightly worse for gc, but that's highly speculative.

@JeffBezanson
Copy link
Member

This patch works around the error, but I suspect it is more of a bandaid:

diff --git a/src/gf.c b/src/gf.c
index 6d3a168..9fd0bae 100644
--- a/src/gf.c
+++ b/src/gf.c
@@ -1262,7 +1262,10 @@ static jl_tuple_t *arg_type_tuple(jl_value_t **args, size_t nar
     size_t i;
     for(i=0; i < nargs; i++) {
         jl_value_t *a;
-        if (jl_is_type(args[i])) {
+        if (args[i] == (jl_value_t*)jl_null) {
+            a = args[i];
+        }
+        else if (jl_is_type(args[i])) {
             a = (jl_value_t*)jl_wrap_Type(args[i]);
         }
         else if (!jl_is_tuple(args[i])) {

@StefanKarpinski
Copy link
Member Author

Ok – want to do the merge? There's an obvious conflict (you probably resolved it already), and then just apply your bandaid.

Conflicts:
	base/profile.jl
@JeffBezanson
Copy link
Member

Ok, the bandaid turns out to cause problems. I'll keep working, and I'll merge when I have something decent.

@JeffBezanson
Copy link
Member

I'm worried about isless. I fully understand wanting to synchronize isequal and isless, but it is really odd to have a function that is a total order on floating point numbers but not a total order in general. Order is quite different from equality, since it doesn't apply to all values, and there are prominent examples of types with partial, total, and both kinds of order.

@StefanKarpinski
Copy link
Member Author

Maybe we can do something using cmp for the fallback definition of isless to avoid relying on < directly. Since cmp always has to make a decision, as it were, it avoids some issues. The ordering still need not be total since cmp could return 0 for elements that aren't isequal.

@StefanKarpinski
Copy link
Member Author

We don't actually need isless to be a total order for sorting not to break. What we do need is for exactly one of isless(x,y) or isequal(x,y) or isless(y,x) to be true. In other words, the behavior of -0.0 and 0.0 under < is ok for a custom sort – it just means that -0.0 and 0.0 are unordered, which is not what one wants by default, but sorting works correctly. NaN under < is a different story because none of isless(x,NaN) or isequal(x,NaN) or isless(NaN,x) are ever true for any x.

@JeffBezanson
Copy link
Member

I don't understand --- isless(x,NaN) is true for some x. If one of isless(x,y), isequal(x,y), or isless(y,x) is always true, doesn't that imply that isless is a total order?

@StefanKarpinski
Copy link
Member Author

Yes, but none of x < NaN, x == NaN and NaN < x are true, which is why using < to sort floats breaks comparison sorts which assume that one of these has to be the case. There's some confusion going on here because there are many notions of equality in play and the one that is relevant to the mathematical concept of a total ordering doesn't actually make sense because these aren't mathematical objects. Which notion of equality are you using in the antisymmetry axiom?

@JeffBezanson
Copy link
Member

I was thinking of isequal as defined in this PR.
I don't feel any of this addresses the core issue: what does isless mean? It could mean "an order that works for some implementations of sorting", but I find that odd.
As it is, sort will now happily sort arrays of sets. Is that what we want?

@JeffBezanson
Copy link
Member

ex:

julia> c
Set{Int64}({2,3})

julia> a
Set{Int64}({4,5})

julia> b
Set{Int64}({2})

julia> sort([c,a,b])
3-element Array{Set{Int64},1}:
 Set{Int64}({2,3})
 Set{Int64}({4,5})
 Set{Int64}({2})  

The sorted order is, unsurprisingly, not consistent with the partial order.

JeffBezanson added a commit that referenced this pull request May 7, 2014
RFC: new approach to efficiently hashing 1, 1.0, big(1), the same.
@JeffBezanson JeffBezanson merged commit 5b20688 into master May 7, 2014
@Keno
Copy link
Member

Keno commented May 7, 2014

Yay!

simonster added a commit to JuliaData/DataFrames.jl that referenced this pull request May 11, 2014
@StefanKarpinski StefanKarpinski deleted the sk/hashing branch June 20, 2014 19:32
vtjnash added a commit that referenced this pull request Oct 14, 2020
In LLVM (inherited from C), fptosi has undefined behavior if the result
does not fit the integer size after rounding down. But by using the same
strategy as generic hashing of Real values, we actually can end up with
a sitatuion that is faster for the CPU to deal with and avoids the UB.

Refs #6624 (3696968)
Fixes #37800
vtjnash added a commit that referenced this pull request Oct 15, 2020
In LLVM (inherited from C), fptosi has undefined behavior if the result
does not fit the integer size after rounding down. But by using the same
strategy as generic hashing of Real values, we actually can end up with
a sitatuion that is faster for the CPU to deal with and avoids the UB.

Refs #6624 (3696968)
Fixes #37800
vtjnash added a commit that referenced this pull request Oct 26, 2020
In LLVM (inherited from C), fptosi has undefined behavior if the result
does not fit the integer size after rounding down. But by using the same
strategy as generic hashing of Real values, we actually can end up with
a sitatuion that is faster for the CPU to deal with and avoids the UB.

Refs #6624 (3696968)
Fixes #37800
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants