Make arrays and ranges hash and compare equal #16401

nalimilan · 2016-05-17T09:52:40Z

Instead of hashing the values themselves, hash the first value and
differences between subsequent elements using run-length encoding.
This allows for O(1) hashing of ranges consistent with AbstractArrays,
which means they can now compare equal.

Elements for which the - operator is not defined are hashed directly.
This assumes that types which can be used for ranges implement -.

This is based on the suggestion by @StefanKarpinski and @mbauman at #12226 (comment). It fixes two issues I repeatedly filed about the same problem (#13565 and #16364).

The hashing tests currently pass except for sparse matrices, for which I didn't adapt the algorithm yet. I'd like to get some feedback first. I'm a total newbie as regards hashing, so don't trust me on this -- though the code was already surprisingly close to what I needed, so changes are limited.

The main limitation is the handling of the case where the difference between two subsequent elements cannot be computed. Since the goal is only to apply the same hashing to arrays and ranges, it doesn't really matter what we do when - doesn't work, since ranges shouldn't work in that case either. But we still need to hash Any[1,2,3], [1,2,3] and 1:3 the same, which means we cannot rely on eltype to find out whether - works. I guess the best solution would be to have a fast path for when eltype is concrete (or precise enough to know whether - is applicable or not), and the current slow path for other cases. Suggestions welcome.

Fixes #16364.

timholy · 2016-05-17T10:48:39Z

base/hashing.jl

+    h = hash(x1, h)
+    done(a, state) && return h
+
+    # Then hash the difference between two subsequent elements when - is supported,


You could have a fast-path version wrapped in isleaftype(eltype(a)), which gets elided at compile-time (see #7088). Would be lovely to have applicable work the same way, whenever all passed types are leaf-types.

Demo for isleaftype:

julia> function testelt{T}(a::Array{T}) if isleaftype(T) return "leaf" end return "nonleaf" end testelt (generic function with 1 method) julia> @code_warntype testelt(Int[1]) Variables: #self#::#testelt a::Array{Int64,1} Body: begin # REPL[1], line 2: $(QuoteNode(true)) # REPL[1], line 3: return "leaf" end::String julia> @code_warntype testelt(Integer[1]) Variables: #self#::#testelt a::Array{Integer,1} Body: begin # REPL[1], line 2: $(QuoteNode(false)) goto 4 4: # REPL[1], line 5: return "nonleaf" end::String

Yeah, that's more or less what I have in mind. Thanks for the precisions!

nalimilan · 2016-05-17T21:18:15Z

OK, I've updated the PR with an adapted hash method for sparse matrices, and with an attempt at using isleaftype to make the common case efficient. Not sure whether that's the best solution. Anyway, the tests now pass!

There's still an issue with some float ranges like 1.0:0.3:2.0 and linspace(1, 3, 10) (which are not in the tests currently). @StefanKarpinski Do you think we can do something about this? Looks like == is smarter than my hash.

EDIT: for now I had to comment out a test where a sparse matrix starts with -0.0. Needs some tweaking.

mbauman · 2016-05-17T21:32:35Z

base/hashing.jl

+            val = (x1, x2) -> x2
+        end
+    else
+        val = (x1, x2) -> applicable(-, x2, x1) ? x2 - x1 : x2


I wonder if this would be a little better defined if we just did RLE of the diff for Number subtypes. If a number doesn't support subtraction, I think it's pretty broken. But I could imagine a non-numeric type whose subtraction method doesn't definitively identify distinct elements.

That way you could simply express this as regular old dispatch. It'll automatically be fast for well-typed collections… and you won't need to worry about type stability yourself.

I'd thought about that, but one also needs to support non-Numbers like Char, Date, etc. Some kind of trait indicating whether a type can be used with ranges (i.e. supports subtraction and addition) would definitely help.

But there's also the problem that you need to return the same hash for untyped arrays, and I'm not sure how you'd dispatch on that...

I'm just imagining a generic function:

diffhash(a, b) = b diffhash(a::Number, b::Number) = a-b diffhash(a::DateTime, b::DateTime) = a-b # … etc …

Custom types can directly hook into it themselves. And you don't need to worry about container types since Julia will just do the dispatch on the actual elements; well-typed containers will automatically be type-stable.

Good idea. But without a trait to identify types for which ranges work, authors of custom types could easily forget to add that definition, since hashing issues aren't always glaring. So I guess the Any fallback should still call applicable to be safe. Anyway with @timholy's PR we might not even need this.

StefanKarpinski · 2016-05-17T22:25:34Z

How does this handle floating-point range like 0:0.1:3?

StefanKarpinski · 2016-05-17T22:26:45Z

Eh, sorry, I just saw that you asked me about that in the above comment. I think that's the really hard part and I can't really see a way to solve it – I suspect something really clever would be required.

StefanKarpinski · 2016-05-17T22:30:46Z

I guess one approach would be to just punt on ranges where the actual steps are not exactly equal to the requested step. However, I'm not entirely sure what that criterion should be. Conservatively, I would guess that if the endpoints and the step are all integral and have epsilons ≤ 1 it should be safe to do this. Otherwise, you'd have to just enumerate the values and compute the full hash like you would for a vector.

nalimilan · 2016-05-18T14:10:43Z

Yeah, that would be a possible fallback if we don't find any other strategy. Though getting an O(n) operation for some float ranges and not others isn't ideal.

Another strategy I tried is to change the hashing of arrays to consider subsequent v2-v1 steps as equal when they are below a few epsilon (but how many exactly?). That works great when the first step is equal to the range's step, but of course not in other cases like 0.3:-0.1:0.0. I wonder whether we could reliably compute the "real" step from its approximations. Doesn't sound easy: do we have any guaranty that it would be e.g. the most frequent of the approximations? or to their average when considering a particular number of approximations?

toivoh · 2016-05-23T19:48:47Z

About float ranges: could we infer for a given vector which range it would be if it were a range, eg from the first and last element and the length, and then hash the differences between that range's elements and the ones given?

StefanKarpinski · 2016-05-23T20:06:56Z

Ah, that's a clever idea, @toivoh. That seems like a feasible solution.

nalimilan · 2016-05-23T20:36:15Z

I don't get it. "Which range it would be if it were a range" is more or less what I was asking, but so far I don't know how to do that unfortunately. It would be great if you have a solution...

StefanKarpinski · 2016-05-23T20:48:42Z

You hash v::Vector by comparing it to the range, r with start v[1] and end v[end] with as many elements as v has (i.e. r = v[1]:(v[end]-v[1])/(length(v)-1):v[end]). That way if v is equal to r they hash the same – namely by hashing the difference which is all zeros.

StefanKarpinski · 2016-05-23T20:51:36Z

I think this requires RLE hashing of the differences between v and r. So to hash an actual range, we know that we can just hash the start and stop and the number of elements between and then zero.

nalimilan · 2016-05-23T21:02:49Z

Ah, that's clever indeed. So we could compute these differences, and then do the hashing on the result.

Though we also need to hash sparse arrays efficiently. For these, the difference between a range and actual values will be equal to a range for each sequence of zero elements; in order to hash these efficiently, we need to hash the difference/step between two subsequent elements IIUC. This is getting pretty complex, but could work.

StefanKarpinski · 2016-05-23T21:35:28Z

Yeah, I'm not sure how to make it handle sparse arrays efficiently too...

toivoh · 2016-05-24T04:48:28Z

For range-like arrays, the only hashing case that is really important is when the array turns out to match a range exactly.

So another way could be to encode the number of initial elements that match the range, and then to go on and hash the rest as normal. That should allow to hash sparse vectors efficiently as well.

StefanKarpinski · 2016-05-24T14:14:27Z

Also a clever idea.

nalimilan · 2016-05-25T15:19:28Z

I have something almost ready locally. This approach gives a much simpler code, with only quite limited changes.

But it still doesn't work in corner cases. For example:

julia> r=0.1:0.1:0.3;

julia> first(r):(last(r)-first(r))/(length(r)-1):last(r)
0.1:0.09999999999999999:0.3

julia> (last(r)-first(r))/(length(r)-1) - step(r)
-1.3877787807814457e-17

Since this is exactly eps, can something be done about it?

StefanKarpinski · 2016-05-25T15:33:31Z

Ugh. Ranges are the worst.

nalimilan · 2016-05-25T21:16:37Z

Maybe the same kind of trick you use to choose the best bounds could be used here?

StefanKarpinski · 2016-05-25T21:29:49Z

Yeah, it needs to be something similar.

StefanKarpinski · 2016-09-13T19:17:24Z

This is close but not quite there. We should figure this out for 0.6.

nalimilan · 2016-09-13T19:32:21Z

I really need help to make progress here. If somebody can find a way to get the examples form my last comment to work, then I can work out the rest. I've just push the last state of my work from May25th (which implements the alternative strategy suggested by @toivoh) to the nl/hash2 branch (beware, it's not clean yet).

timholy · 2017-12-21T20:43:36Z

I didn't think we were going to get this one. What a wonderful gift! Thanks, @nalimilan and all those who helped.

nalimilan · 2017-12-21T20:47:55Z

I didn't either, so thank @StefanKarpinski for deciding that had to be possible. :-) I just hope it won't turn out to be too complex/annoying, as you never know whether another ugly corner case might pop up...

yurivish · 2017-12-21T21:06:59Z

Could somebody explain the practical benefits of this one? I've skimmed the thread and am finding it a bit tricky to wrap my mind around.

Edit: Aha:

julia> 1:3 == [1,2,3]
true

So this is effectively a sort of (very subtle and nuanced) bugfix that corrects the behavior of == to correspond to the definition in the docs?

"Should be implemented for all types with a notion of equality, based on the abstract value that an instance represents. For example, all numeric types are compared by numeric value, ignoring type."

nalimilan · 2017-12-21T21:16:00Z

The "only" benefit of this change is this:

julia> 1:3 == [1,2,3]
true

The drawbacks are more complex to explain... 😈

StefanKarpinski · 2017-12-21T21:21:53Z

The main drawbacks are: O(n) hashing for float ranges and needing to implement the widen operation for a type in order to hash ranges with that element type. That and a lot of blood, sweat and tears (mostly @nalimilan's).

vchuravy · 2017-12-22T09:59:31Z

This could have warranted another nanosoldier run.
cf: https://github.com/JuliaCI/BaseBenchmarkReports/blob/f82f6a38256ebe75f738602db635f6579e34f076/daily_2017_12_22/report.md

nalimilan · 2017-12-22T10:07:43Z

Indeed. Of course, equality tests between arrays and ranges got 686 times slower, because they previously compared pointers and returned false immediately. Most other changes in that report appear to be noise (especially all the unrelated improvements), so I'm not sure what to do with that. :-/

vchuravy · 2017-12-22T10:09:24Z

:) Well that's fair enough then. Expected regressions are better than unexpected regressions.

JeffBezanson · 2018-01-04T06:45:02Z

I really think this is too complex and brittle. Could we, say, restrict it to Integers? I would rather have hashing of ranges be O(n) more often than have cases like

julia> hash(Any[Date("2018-1-03"), Date("2018-1-04"), Date("2018-1-05")])
ERROR: MethodError: no method matching widen(::Type{Date})

nalimilan · 2018-01-04T11:35:04Z

I agree it would make sense to use the O(N) algorithm by default. The difficulty with that is that all types which can compare equal need to use the same algorithm: if we restrict the O(1) algorithm to Integer, 1:2 would not hash the same as 1.0:2.0.

EDIT: we could probably restrict the O(1) hashing to Number by default, and make it opt-in for other types.

StefanKarpinski · 2018-01-04T15:35:47Z

The approach that could work here is to have a generic O(n) algorithm for ranges in general, which happens to collapse to an O(1) algorithm for integer ranges. I can try to show what I mean, but it won't be this week, so I propose that we leave this as is and we can try to improve it by making it fail in fewer cases but potentially be slower in more cases.

StefanKarpinski · 2018-01-04T15:37:20Z

Just to clarify: this is the inverse design: it detects cases where the O(1) algorithm may be switched to, rather than using the same algorithm in different cases with different performance.

nalimilan · 2018-01-15T13:52:57Z

Without more details, I don't see how it could be implemented (but maybe I lack imagination). For 0.7, it wouldn't be unreasonable to restrict the O(1) algorithm to Number types, and allow custom types to opt-in if needed. The only assumption this would make is that no non-Number type should compare equal to a Number type (but that sounds like a good rule anyway?).

nalimilan · 2018-01-30T21:29:50Z

#25822 restricts the O(1) algorithm to Number types to limit problems.

nalimilan mentioned this pull request May 17, 2016

Get rid of special-casing of ranges in == and isequal() for AbstractArrays #16364

Closed

timholy reviewed May 17, 2016
View reviewed changes

nalimilan force-pushed the nl/hash branch from 97c3bb3 to 73c056e Compare May 17, 2016 21:18

nalimilan force-pushed the nl/hash branch from 73c056e to 080849b Compare May 17, 2016 21:20

mbauman reviewed May 17, 2016
View reviewed changes

nalimilan force-pushed the nl/hash branch from 080849b to 0e6e2ce Compare May 17, 2016 21:56

timholy mentioned this pull request May 18, 2016

Static evaluation of method_exists (when true) #16422

Closed

StefanKarpinski added the breaking This change will break code label Sep 13, 2016

StefanKarpinski added this to the 0.6.0 milestone Sep 13, 2016

tkelman added this to the 1.0 milestone Jan 5, 2017

Use a single loop

bba5070

nalimilan force-pushed the nl/hash branch from b17c0c5 to bba5070 Compare December 21, 2017 14:05

nalimilan added the triage This should be discussed on a triage call label Dec 21, 2017

StefanKarpinski removed the triage This should be discussed on a triage call label Dec 21, 2017

StefanKarpinski merged commit 8872f90 into master Dec 21, 2017

nalimilan deleted the nl/hash branch December 21, 2017 19:58

This was referenced Dec 22, 2017

Fix hash(::AbstractArray) failure in when some types support - and others do not #25250

Merged

Old method incorrectly called after redefining it #25252

Closed

nalimilan mentioned this pull request Jan 2, 2018

add a function to test sets equality? #25318

Closed

mbauman mentioned this pull request Jan 12, 2018

Arraypocalypse Now and Then #13157

Closed

27 tasks

This was referenced Jan 24, 2018

Move BigFloat to stdlib #25716

Open

Add ArrayHashingStyle trait for O(1) range hashing and use it only for Numbers #25822

Closed

This was referenced Feb 13, 2018

RFC: Simpler array hashing #26022

Merged

Array hashing breaks equality on 0.7 #26034

Closed

nalimilan added the hashing label Feb 18, 2018

nalimilan mentioned this pull request Sep 29, 2018

Equality of empty ranges #29421

Closed

Make arrays and ranges hash and compare equal #16401

Make arrays and ranges hash and compare equal #16401

Conversation

nalimilan commented May 17, 2016 • edited Loading

timholy May 17, 2016

Choose a reason for hiding this comment

nalimilan May 17, 2016

Choose a reason for hiding this comment

nalimilan commented May 17, 2016 • edited Loading

mbauman May 17, 2016

Choose a reason for hiding this comment

nalimilan May 17, 2016

Choose a reason for hiding this comment

mbauman May 18, 2016 • edited Loading

Choose a reason for hiding this comment

nalimilan May 18, 2016

Choose a reason for hiding this comment

StefanKarpinski commented May 17, 2016

StefanKarpinski commented May 17, 2016

StefanKarpinski commented May 17, 2016

nalimilan commented May 18, 2016

toivoh commented May 23, 2016

StefanKarpinski commented May 23, 2016

nalimilan commented May 23, 2016

StefanKarpinski commented May 23, 2016 • edited Loading

StefanKarpinski commented May 23, 2016

nalimilan commented May 23, 2016

StefanKarpinski commented May 23, 2016

toivoh commented May 24, 2016

StefanKarpinski commented May 24, 2016

nalimilan commented May 25, 2016

StefanKarpinski commented May 25, 2016

nalimilan commented May 25, 2016

StefanKarpinski commented May 25, 2016

StefanKarpinski commented Sep 13, 2016

nalimilan commented Sep 13, 2016

timholy commented Dec 21, 2017

nalimilan commented Dec 21, 2017

yurivish commented Dec 21, 2017 • edited Loading

nalimilan commented Dec 21, 2017

StefanKarpinski commented Dec 21, 2017

vchuravy commented Dec 22, 2017

nalimilan commented Dec 22, 2017

vchuravy commented Dec 22, 2017

JeffBezanson commented Jan 4, 2018

nalimilan commented Jan 4, 2018 • edited Loading

StefanKarpinski commented Jan 4, 2018

StefanKarpinski commented Jan 4, 2018

nalimilan commented Jan 15, 2018

nalimilan commented Jan 30, 2018

nalimilan commented May 17, 2016 •

edited

Loading

nalimilan commented May 17, 2016 •

edited

Loading

mbauman May 18, 2016 •

edited

Loading

StefanKarpinski commented May 23, 2016 •

edited

Loading

yurivish commented Dec 21, 2017 •

edited

Loading

nalimilan commented Jan 4, 2018 •

edited

Loading