Chained hash pipelining in array hashing #58252
Changes from 8 commits
28cd8fb
d6c90bd
8525bcc
1fb79bf
2de255c
ef6a236
7930179
4f001b9
1c1fbff
ccef362
548ca93
b526d76
cfebc9d
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1999,3 +1999,105 @@ end | |
|
|
||
| getindex(b::Ref, ::CartesianIndex{0}) = getindex(b) | ||
| setindex!(b::Ref, x, ::CartesianIndex{0}) = setindex!(b, x) | ||
|
|
||
| ## hashing AbstractArray ## can't be put in abstractarray.jl due to bootstrapping problems with the use of @nexprs | ||
|
|
||
| function _hash_fib(A::AbstractArray, h::UInt) | ||
| # Goal: Hash approximately log(N) entries with a higher density of hashed elements | ||
| # weighted towards the end and special consideration for repeated values. Colliding | ||
| # hashes will often subsequently be compared by equality -- and equality between arrays | ||
| # works elementwise forwards and is short-circuiting. This means that a collision | ||
| # between arrays that differ by elements at the beginning is cheaper than one where the | ||
| # difference is towards the end. Furthermore, choosing `log(N)` arbitrary entries from a | ||
| # sparse array will likely only choose the same element repeatedly (zero in this case). | ||
|
|
||
| # To achieve this, we work backwards, starting by hashing the last element of the | ||
| # array. After hashing each element, we skip `fibskip` elements, where `fibskip` | ||
| # is pulled from the Fibonacci sequence -- Fibonacci was chosen as a simple | ||
| # ~O(log(N)) algorithm that ensures we don't hit a common divisor of a dimension | ||
| # and only end up hashing one slice of the array (as might happen with powers of | ||
| # two). Finally, we find the next distinct value from the one we just hashed. | ||
|
|
||
| # This is a little tricky since skipping an integer number of values inherently works | ||
| # with linear indices, but `findprev` uses `keys`. Hoist out the conversion "maps": | ||
| ks = keys(A) | ||
| key_to_linear = LinearIndices(ks) # Index into this map to compute the linear index | ||
| linear_to_key = vec(ks) # And vice-versa | ||
|
|
||
| # Start at the last index | ||
| keyidx = last(ks) | ||
| linidx = key_to_linear[keyidx] | ||
| fibskip = prevfibskip = oneunit(linidx) | ||
| first_linear = first(LinearIndices(linear_to_key)) | ||
| @nexprs 8 i -> p_i = h | ||
|
|
||
| n = 0 | ||
| while true | ||
| n += 1 | ||
| # Hash the element | ||
| elt = A[keyidx] | ||
|
|
||
| stream_idx = mod1(n, 8) | ||
| @nexprs 8 i -> stream_idx == i && (p_i = hash(keyidx => elt, p_i)) | ||
|
|
||
| # Skip backwards a Fibonacci number of indices -- this is a linear index operation | ||
| linidx = key_to_linear[keyidx] | ||
| linidx < fibskip + first_linear && break | ||
| linidx -= fibskip | ||
| keyidx = linear_to_key[linidx] | ||
|
|
||
| # Only increase the Fibonacci skip once every N iterations. This was chosen | ||
| # to be big enough that all elements of small arrays get hashed while | ||
| # obscenely large arrays are still tractable. With a choice of N=4096, an | ||
| # entirely-distinct 8000-element array will have ~75% of its elements hashed, | ||
| # with every other element hashed in the first half of the array. At the same | ||
| # time, hashing a `typemax(Int64)`-length Float64 range takes about a second. | ||
| if rem(n, 4096) == 0 | ||
| fibskip, prevfibskip = fibskip + prevfibskip, fibskip | ||
| end | ||
|
|
||
| # Find a key index with a value distinct from `elt` -- might be `keyidx` itself | ||
| keyidx = findprev(!isequal(elt), A, keyidx) | ||
| keyidx === nothing && break | ||
| end | ||
|
|
||
| @nexprs 8 i -> h = hash_mix_linear(p_i, h) | ||
| return hash_uint(h) | ||
| end | ||
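
The coverage figure quoted in the comment on the skip schedule can be sanity-checked with a small simulation. This is only a sketch: `fib_visits` and its `every` keyword are hypothetical names, not part of the PR, and it assumes one-based linear indices and entirely distinct elements, so the `findprev` step never moves the index.

```julia
# Count how many elements the backward Fibonacci walk would hash for an
# array of `len` entirely distinct elements (hypothetical helper).
function fib_visits(len::Int; every::Int = 4096)
    linidx = len                      # start at the last linear index
    fibskip = prevfibskip = 1
    n = 0
    while true
        n += 1                        # one element hashed at `linidx`
        linidx < fibskip + 1 && break # cannot skip further back
        linidx -= fibskip
        # Advance the Fibonacci skip only once every `every` iterations
        if rem(n, every) == 0
            fibskip, prevfibskip = fibskip + prevfibskip, fibskip
        end
    end
    return n
end

fib_visits(8000) / 8000   # 0.756, i.e. 6048 of 8000 elements
```

Under these assumptions the first 4096 iterations visit consecutive indices and the remaining 1952 visit every other index, which matches the "~75%" quoted in the comment.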
|
|
||
| const hash_abstractarray_seed = UInt === UInt64 ? 0x7e2d6fb6448beb77 : 0xd4514ce5 | ||
| function hash(A::AbstractArray, h::UInt) | ||
| h ⊻= hash_abstractarray_seed | ||
| # Axes are themselves AbstractArrays, so hashing them directly would cause a stack overflow. | ||
| # Instead, hash the tuples of firsts and lasts along each dimension. | ||
| h = hash(map(first, axes(A)), h) | ||
| h = hash(map(last, axes(A)), h) | ||
|
Review comment: Will this cause collisions for 2x1 vs 1x2 matrices?
Reply: I don't think so since
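
For reference, a quick check of what the two `map` calls feed into the hash for these two shapes. This is a sketch for illustration only; it shows the tuples being hashed and does not claim anything about the resulting hash values.

```julia
A = zeros(2, 1)
B = zeros(1, 2)

# Tuples of first and last indices along each dimension
firsts_A, lasts_A = map(first, axes(A)), map(last, axes(A))  # (1, 1), (2, 1)
firsts_B, lasts_B = map(first, axes(B)), map(last, axes(B))  # (1, 1), (1, 2)

lasts_A == lasts_B   # false: the shape information differs before any element is hashed
```

The firsts agree, but the lasts are `(2, 1)` versus `(1, 2)`, so the two matrices enter the element loop with different inputs already mixed into `h`.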
||
|
|
||
| len = length(A) | ||
|
|
||
| if len < 8 | ||
| # for the shortest arrays we chain directly | ||
| for elt in A | ||
| h = hash(elt, h) | ||
| end | ||
| return h | ||
| elseif len < 65536 | ||
|
||
| # separate accumulator streams, unrolled | ||
| @nexprs 8 i -> p_i = h | ||
| n = 1 | ||
| limit = len - 7 | ||
| while n <= limit | ||
| @nexprs 8 i -> p_i = hash(A[n + i - 1], p_i) | ||
|
||
| n += 8 | ||
| end | ||
| while n <= len | ||
| p_1 = hash(A[n], p_1) | ||
| n += 1 | ||
| end | ||
| # fold all streams back together | ||
| @nexprs 8 i -> h = hash_mix_linear(p_i, h) | ||
| return hash_uint(h) | ||
| else | ||
| return _hash_fib(A, h) | ||
| end | ||
| end | ||
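
The multi-stream branch above can be tried outside of Base with a reduced sketch. Everything named here is an assumption made for illustration: `hash_streams` is a hypothetical helper, it uses 4 streams instead of 8, and `fold` is a stand-in for the PR's `hash_mix_linear` (it simply chains with `hash`). The idea it illustrates is that each accumulator forms its own dependency chain, so consecutive `hash` calls do not have to wait on each other.

```julia
using Base.Cartesian: @nexprs

# Hypothetical stand-in for `hash_mix_linear`; any mixer works for the sketch.
fold(p::UInt, h::UInt) = hash(p, h)

function hash_streams(v::AbstractVector, h::UInt)
    # Four independent accumulators, all seeded with `h`
    @nexprs 4 i -> p_i = h
    n = firstindex(v)
    stop = lastindex(v)
    # Main loop: one element per stream per iteration
    while n + 3 <= stop
        @nexprs 4 i -> p_i = hash(v[n + i - 1], p_i)
        n += 4
    end
    # Leftover elements are chained onto the first stream
    while n <= stop
        p_1 = hash(v[n], p_1)
        n += 1
    end
    # Fold all streams back into a single value
    @nexprs 4 i -> h = fold(p_i, h)
    return h
end

hash_streams(collect(1:10), zero(UInt)) == hash_streams(1:10, zero(UInt))  # true: same elements, same order
```

Because the result depends on which stream each element lands in, changing the stream count changes the hash values, so the count has to stay fixed once chosen.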