
Conversation

@N5N3 (Member) commented Dec 22, 2021

This performance difference was found when working on #42736.
Currently, our ReshapedArray uses stride-based MultiplicativeInverses to speed up index transformation.
For example, for a::AbstractArray{T,3} and b = vec(a), the index transformation is equivalent to:

offset = i - 1                        # b[i]
d1, r1 = divrem(offset, stride(a, 3)) # stride(a, 3) = size(a, 1) * size(a, 2)
d2, r2 = divrem(r1, stride(a, 2))     # stride(a, 2) = size(a, 1)
CartesianIndex(r2 + 1, d2 + 1, d1 + 1) # a has one-based axes

(Each stride is replaced with a MultiplicativeInverse to accelerate the divrem.)

This PR replaces the above machinery with:

offset = i - 1               
d1, r1 = divrem(offset, size(a, 1))
d2, r2 = divrem(d1, size(a, 2))
CartesianIndex(r1 + 1, r2 + 1, d2 + 1)
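
As a quick sanity check (a standalone sketch, not code from this PR; the array A and the index choices below are just for illustration), both schemes recover the same index as CartesianIndices:

A = rand(99, 100, 101)
n1, n2, _ = size(A)
s2, s3 = n1, n1 * n2                          # stride(A, 2), stride(A, 3)
for i in (1, 12345, length(A))                # a few linear indices into vec(A)
    offset = i - 1
    d1, r1 = divrem(offset, s3)               # stride-based (current)
    d2, r2 = divrem(r1, s2)
    old_ci = CartesianIndex(r2 + 1, d2 + 1, d1 + 1)
    q1, t1 = divrem(offset, n1)               # size-based (this PR)
    q2, t2 = divrem(q1, n2)
    new_ci = CartesianIndex(t1 + 1, t2 + 1, q2 + 1)
    @assert old_ci == new_ci == CartesianIndices(A)[i]
end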

For random access the two should have the same computational cost, but for sequential access, like sum(b), the size-based transformation seems faster.
To avoid a bottleneck from memory/IO, use reshape(::CartesianIndices, x...) to benchmark:

using BenchmarkTools   # for @btime

f(x) = let r = 0
           for i in eachindex(x)
               @inbounds r |= +(x[i].I...)
           end
           r
       end
a = CartesianIndices((99,100,101));
@btime f(vec($a));                 # 2.766 ms  -->  2.591 ms
@btime f(reshape($a,990,1010));    # 3.412 ms  -->  2.626 ms
@btime f(reshape($a,33,300,101));  # 3.422 ms  -->  2.342 ms

I haven't looked into the reason for this performance difference.

Besides the acceleration, this also makes it possible to reuse the MultiplicativeInverse in some cases (like #42736).
So I think it might be useful?

@N5N3 changed the title from "Use sized based mi to speedup sequential access of ReshapedArray" to "Use size based MultiplicativeInverse to speedup sequential access of ReshapedArray" on Dec 22, 2021
@dkarrasch added the arrays label on Dec 22, 2021
@adienes (Member) commented Jun 21, 2025

This still appears to be a decent performance improvement (& non-conflicting); on my machine it's 2 ms --> 1.6 ms. Any interest in picking it back up?

@adienes added the performance label on Jun 21, 2025
@adienes (Member) commented Sep 9, 2025

> I haven't looked into the reason for this performance difference.

I believe it is because the size-based sequence of divrems has a more favorable data dependency. In q, r = divrem(x, d), the quotient q is computed by a multiplication + shift (à la MultiplicativeInverse), and the remainder r then requires an additional multiplication + subtraction. If r is required for the next divrem call, the CPU has to wait, but if only q is required it can do these out of order.
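
A minimal sketch of the two dependency chains for the 3-d case from the PR description (the helper names here are mine, not Base's):

# Stride-based (old): the *remainder* feeds the next divrem, so each step
# must finish both q and r before the next division can start.
function ind2sub_stride(i, s3, s2)       # s3 = size1*size2, s2 = size1
    offset = i - 1
    d1, r1 = divrem(offset, s3)          # r1 = offset - d1*s3 needed below
    d2, r2 = divrem(r1, s2)              # waits on r1
    (r2 + 1, d2 + 1, d1 + 1)
end

# Size-based (new): only the *quotient* feeds the next divrem; the remainders
# (one extra multiply + subtract each) can retire out of order.
function ind2sub_size(i, n1, n2)         # n1 = size(a, 1), n2 = size(a, 2)
    offset = i - 1
    d1, r1 = divrem(offset, n1)          # d1 is ready after mul + shift
    d2, r2 = divrem(d1, n2)              # does not wait on r1
    (r1 + 1, r2 + 1, d2 + 1)
end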

Note also that the improvement can get quite large when the dimensionality is high; e.g. on

const a = CartesianIndices((2,3,4,5,1,1,1,2,3,4,1,1,1,2,1,2,1,1,3))
const va = vec(a)
@btime f($va);

this PR is 3x faster than master

@oscardssmith (Member) commented:

seems reasonable to me!

@adienes (Member) commented Sep 9, 2025

The test error is real, but weird. I reduced it to this:

using LinearAlgebra

function test_triangular(elty1)
    X = randn(elty1, 3, 3)
    psd_matrix = X' * X                  # symmetric positive-definite matrix
    data = cholesky(psd_matrix).U        # upper-triangular Cholesky factor

    A1 = UpperTriangular(data)
    exp(Matrix(log(A1))) ≈ A1            # round-trip through log/exp
end

test_triangular(Float32)

which works on master but segfaults on this PR

I doubt this is technically this PR's fault... but I will keep trying to understand why it happens anyway

@adienes (Member) commented Sep 9, 2025

fixed after #59525

@adienes (Member) commented Sep 10, 2025

I hope it is OK that I pushed directly here. I also changed the _ind2sub method for AbstractArray to match a similar form. I don't think there will be much performance improvement in that method since it's not using MultiplicativeInverse, but it seems like a more consistent and clearer implementation anyway.
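
For readers who want the flavor of a size-based _ind2sub, a recursion in that spirit might look roughly like this (a sketch with my own naming, not the exact code pushed to the PR):

# Sketch: peel off one size at a time; each remainder becomes that dimension's
# (one-based) index and the quotient is threaded forward to the next dimension.
ind2sub_sizes(::Tuple{}, offset) = ()
function ind2sub_sizes(dims::Tuple, offset)
    d, r = divrem(offset, dims[1])
    (r + 1, ind2sub_sizes(Base.tail(dims), d)...)
end

ind2sub_sizes((99, 100, 101), 12344) == Tuple(CartesianIndices((99, 100, 101))[12345])  # true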

@N5N3 merged commit 7de5585 into JuliaLang:master on Sep 11, 2025 (7 checks passed)
@N5N3 deleted the size_reshape branch on Sep 11, 2025
@N5N3 (Member, Author) commented Sep 11, 2025

Thanks so much for picking this up and exploring the root cause of the improvements, @adienes!
