-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make arrays and ranges hash and compare equal #16401
Conversation
base/hashing.jl
Outdated
h = hash(x1, h) | ||
done(a, state) && return h | ||
|
||
# Then hash the difference between two subsequent elements when - is supported, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could have a fast-path version wrapped in isleaftype(eltype(a))
, which gets elided at compile-time (see #7088). Would be lovely to have applicable
work the same way, whenever all passed types are leaf-types.
Demo for isleaftype
:
julia> function testelt{T}(a::Array{T})
if isleaftype(T)
return "leaf"
end
return "nonleaf"
end
testelt (generic function with 1 method)
julia> @code_warntype testelt(Int[1])
Variables:
#self#::#testelt
a::Array{Int64,1}
Body:
begin # REPL[1], line 2:
$(QuoteNode(true)) # REPL[1], line 3:
return "leaf"
end::String
julia> @code_warntype testelt(Integer[1])
Variables:
#self#::#testelt
a::Array{Integer,1}
Body:
begin # REPL[1], line 2:
$(QuoteNode(false))
goto 4
4: # REPL[1], line 5:
return "nonleaf"
end::String
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's more or less what I have in mind. Thanks for the precisions!
OK, I've updated the PR with an adapted There's still an issue with some float ranges like EDIT: for now I had to comment out a test where a sparse matrix starts with |
base/hashing.jl
Outdated
val = (x1, x2) -> x2 | ||
end | ||
else | ||
val = (x1, x2) -> applicable(-, x2, x1) ? x2 - x1 : x2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this would be a little better defined if we just did RLE of the diff for Number subtypes. If a number doesn't support subtraction, I think it's pretty broken. But I could imagine a non-numeric type whose subtraction method doesn't definitively identify distinct elements.
That way you could simply express this as regular old dispatch. It'll automatically be fast for well-typed collections… and you won't need to worry about type stability yourself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd thought about that, but one also needs to support non-Numbers like Char
, Date
, etc. Some kind of trait indicating whether a type can be used with ranges (i.e. supports subtraction and addition) would definitely help.
But there's also the problem that you need to return the same hash for untyped arrays, and I'm not sure how you'd dispatch on that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just imagining a generic function:
diffhash(a, b) = b
diffhash(a::Number, b::Number) = a-b
diffhash(a::DateTime, b::DateTime) = a-b
# … etc …
Custom types can directly hook into it themselves. And you don't need to worry about container types since Julia will just do the dispatch on the actual elements; well-typed containers will automatically be type-stable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. But without a trait to identify types for which ranges work, authors of custom types could easily forget to add that definition, since hashing issues aren't always glaring. So I guess the Any
fallback should still call applicable
to be safe. Anyway with @timholy's PR we might not even need this.
How does this handle floating-point range like |
Eh, sorry, I just saw that you asked me about that in the above comment. I think that's the really hard part and I can't really see a way to solve it – I suspect something really clever would be required. |
I guess one approach would be to just punt on ranges where the actual steps are not exactly equal to the requested step. However, I'm not entirely sure what that criterion should be. Conservatively, I would guess that if the endpoints and the step are all integral and have epsilons ≤ 1 it should be safe to do this. Otherwise, you'd have to just enumerate the values and compute the full hash like you would for a vector. |
Yeah, that would be a possible fallback if we don't find any other strategy. Though getting an O(n) operation for some float ranges and not others isn't ideal. Another strategy I tried is to change the hashing of arrays to consider subsequent |
About float ranges: could we infer for a given vector which range it would be if it were a range, eg from the first and last element and the length, and then hash the differences between that range's elements and the ones given? |
Ah, that's a clever idea, @toivoh. That seems like a feasible solution. |
I don't get it. "Which range it would be if it were a range" is more or less what I was asking, but so far I don't know how to do that unfortunately. It would be great if you have a solution... |
You hash |
I think this requires RLE hashing of the differences between |
Ah, that's clever indeed. So we could compute these differences, and then do the hashing on the result. Though we also need to hash sparse arrays efficiently. For these, the difference between a range and actual values will be equal to a range for each sequence of zero elements; in order to hash these efficiently, we need to hash the difference/step between two subsequent elements IIUC. This is getting pretty complex, but could work. |
Yeah, I'm not sure how to make it handle sparse arrays efficiently too... |
For range-like arrays, the only hashing case that is really important is when the array turns out to match a range exactly. So another way could be to encode the number of initial elements that match the range, and then to go on and hash the rest as normal. That should allow to hash sparse vectors efficiently as well. |
Also a clever idea. |
I have something almost ready locally. This approach gives a much simpler code, with only quite limited changes. But it still doesn't work in corner cases. For example: julia> r=0.1:0.1:0.3;
julia> first(r):(last(r)-first(r))/(length(r)-1):last(r)
0.1:0.09999999999999999:0.3
julia> (last(r)-first(r))/(length(r)-1) - step(r)
-1.3877787807814457e-17 Since this is exactly |
Ugh. Ranges are the worst. |
Maybe the same kind of trick you use to choose the best bounds could be used here? |
Yeah, it needs to be something similar. |
This is close but not quite there. We should figure this out for 0.6. |
I really need help to make progress here. If somebody can find a way to get the examples form my last comment to work, then I can work out the rest. I've just push the last state of my work from May25th (which implements the alternative strategy suggested by @toivoh) to the |
I didn't think we were going to get this one. What a wonderful gift! Thanks, @nalimilan and all those who helped. |
I didn't either, so thank @StefanKarpinski for deciding that had to be possible. :-) I just hope it won't turn out to be too complex/annoying, as you never know whether another ugly corner case might pop up... |
Could somebody explain the practical benefits of this one? I've skimmed the thread and am finding it a bit tricky to wrap my mind around. Edit: Aha:
So this is effectively a sort of (very subtle and nuanced) bugfix that corrects the behavior of "Should be implemented for all types with a notion of equality, based on the abstract value that an instance represents. For example, all numeric types are compared by numeric value, ignoring type." |
The "only" benefit of this change is this: julia> 1:3 == [1,2,3]
true The drawbacks are more complex to explain... 😈 |
The main drawbacks are: O(n) hashing for float ranges and needing to implement the |
This could have warranted another nanosoldier run. |
Indeed. Of course, equality tests between arrays and ranges got 686 times slower, because they previously compared pointers and returned false immediately. Most other changes in that report appear to be noise (especially all the unrelated improvements), so I'm not sure what to do with that. :-/ |
:) Well that's fair enough then. Expected regressions are better than unexpected regressions. |
I really think this is too complex and brittle. Could we, say, restrict it to Integers? I would rather have hashing of ranges be O(n) more often than have cases like
|
I agree it would make sense to use the O(N) algorithm by default. The difficulty with that is that all types which can compare equal need to use the same algorithm: if we restrict the O(1) algorithm to EDIT: we could probably restrict the O(1) hashing to |
The approach that could work here is to have a generic O(n) algorithm for ranges in general, which happens to collapse to an O(1) algorithm for integer ranges. I can try to show what I mean, but it won't be this week, so I propose that we leave this as is and we can try to improve it by making it fail in fewer cases but potentially be slower in more cases. |
Just to clarify: this is the inverse design: it detects cases where the O(1) algorithm may be switched to, rather than using the same algorithm in different cases with different performance. |
Without more details, I don't see how it could be implemented (but maybe I lack imagination). For 0.7, it wouldn't be unreasonable to restrict the O(1) algorithm to |
#25822 restricts the O(1) algorithm to |
Instead of hashing the values themselves, hash the first value and
differences between subsequent elements using run-length encoding.
This allows for O(1) hashing of ranges consistent with AbstractArrays,
which means they can now compare equal.
Elements for which the
-
operator is not defined are hashed directly.This assumes that types which can be used for ranges implement
-
.This is based on the suggestion by @StefanKarpinski and @mbauman at #12226 (comment). It fixes two issues I repeatedly filed about the same problem (#13565 and #16364).
The hashing tests currently pass except for sparse matrices, for which I didn't adapt the algorithm yet. I'd like to get some feedback first. I'm a total newbie as regards hashing, so don't trust me on this -- though the code was already surprisingly close to what I needed, so changes are limited.
The main limitation is the handling of the case where the difference between two subsequent elements cannot be computed. Since the goal is only to apply the same hashing to arrays and ranges, it doesn't really matter what we do when
-
doesn't work, since ranges shouldn't work in that case either. But we still need to hashAny[1,2,3]
,[1,2,3]
and1:3
the same, which means we cannot rely oneltype
to find out whether-
works. I guess the best solution would be to have a fast path for wheneltype
is concrete (or precise enough to know whether-
isapplicable
or not), and the current slow path for other cases. Suggestions welcome.Fixes #16364.