Are these representative? The arrays being passed in are exactly the same array after all, so it's not unlikely that there is some special casing going on with |
|
That's a good point! I've re-run the benchmarks, and some of these do hold up in more general cases:

```julia
julia> @btime A + B setup=(A = rand(3,3); B = rand(3,3));
  39.452 ns (2 allocations: 144 bytes) # v"1.13.0-DEV.1387"
  27.789 ns (2 allocations: 144 bytes) # this PR

julia> @btime A + B setup=(A = rand(3,3000); B = rand(3,3000));
  10.130 μs (3 allocations: 70.40 KiB) # v"1.13.0-DEV.1387"
  5.026 μs (3 allocations: 70.40 KiB) # this PR
```

The difference in the

The main benefit comes in the wide matrix case, where the first dimension is too small for vectorization to kick in. Using linear indexing offers a significant speed-up. This was suggested in #47873 (comment).
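To illustrate why linear indexing can help in the wide matrix case, here is a minimal sketch contrasting the two iteration patterns. The names `add_cartesian!` and `add_linear!` are hypothetical helpers written for this comparison; they are not the methods changed in this PR, and actual timings will depend on the machine and Julia version.

```julia
# Hypothetical helpers for illustration only; not the implementation in this PR.

# Cartesian iteration: the inner loop runs over the first dimension,
# which has only 3 elements for a 3×3000 matrix.
function add_cartesian!(C, A, B)
    axes(C) == axes(A) == axes(B) || throw(DimensionMismatch("axes must match"))
    for j in axes(A, 2), i in axes(A, 1)
        @inbounds C[i, j] = A[i, j] + B[i, j]
    end
    return C
end

# Linear iteration: a single flat loop over all elements, so the loop length
# does not depend on the size of the first dimension.
function add_linear!(C, A, B)
    # `eachindex` with several arrays checks that their indices agree.
    @inbounds @simd for i in eachindex(C, A, B)
        C[i] = A[i] + B[i]
    end
    return C
end

A = rand(3, 3000); B = rand(3, 3000)
C1 = add_cartesian!(similar(A), A, B)
C2 = add_linear!(similar(A), A, B)
```

Both functions compute the same result; timing them with `@btime` from BenchmarkTools on a wide input such as the 3×3000 case above would be one way to compare the two loops.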
0ecb681 to
6cd702b
|
This seems to have broken |
|
maybe |
This was broken in #59961, as `map` deals with trailing singleton axes differently from broadcasting:

```julia
julia> map(+, ones(1), ones(1,1)) |> size
(1,)

julia> broadcast(+, ones(1), ones(1,1)) |> size
(1, 1)
```

This PR limits the new method to the case where the ndims match, in which case there are no trailing axes and the two are equivalent. The alternate approach suggested in #59961 (comment) is to reshape the arrays, but this adds overhead that nullifies the performance improvement for small arrays.
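The idea of restricting the fast path to matching `ndims` can be sketched with dispatch on a shared dimension parameter. This is only an illustration under stated assumptions: `myadd` is a hypothetical function, and using `broadcast` as the fallback is a stand-in chosen for the example, not necessarily what the PR does.

```julia
# `myadd` is a hypothetical name used only for illustration.

# Same number of dimensions: there are no trailing axes to reconcile,
# so `map` can be used once the axes are checked to match.
function myadd(A::Array{<:Number,N}, B::Array{<:Number,N}) where {N}
    axes(A) == axes(B) || throw(DimensionMismatch("axes must match"))
    return map(+, A, B)
end

# Different number of dimensions: fall back to broadcasting (an assumption
# made for this sketch), which keeps the trailing singleton axis.
myadd(A::Array{<:Number}, B::Array{<:Number}) = broadcast(+, A, B)

same = myadd(ones(2, 2), ones(2, 2))    # takes the `map` path
mixed = myadd(ones(1), ones(1, 1))      # takes the `broadcast` fallback
```

Because the first method requires both arguments to share `N`, a call with a vector and a matrix never reaches the `map` path, so the differing treatment of trailing singleton axes does not come into play.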
`map` is a simpler operation and uses linear indexing for `Array`s. This often improves performance (occasionally enabling vectorization) and improves TTFX in common cases. It also automatically returns the correct result for 0-D arrays, unlike broadcasting, which returns a scalar.

Performance:

Similarly for `-`.

TTFX:

These are measured on