Slow broadcast addition of matrices #6041

Closed
fjarri opened this issue Mar 4, 2014 · 28 comments
Labels: linear algebra, performance

Comments

@fjarri (Contributor) commented Mar 4, 2014

Consider the following benchmark. It compares the performance of broadcast addition (.+) for matrices of different sizes in Julia and in Python. In Python, + was used, because it broadcasts automatically behind the scenes.
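
The original benchmark script is not reproduced here; a minimal sketch of the Julia side of such a timing loop (function names and iteration counts are illustrative, not the original gist) would be:

# Illustrative timing loop: repeatedly evaluate the broadcasted addition
# and report the total elapsed time for iters repetitions.
function time_dotplus(A, B, iters)
    t0 = time()
    for k = 1:iters
        C = A .+ B          # broadcasted elementwise addition
    end
    time() - t0
end

A = rand(10, 10); B = rand(10, 10)
println("add(float, float), 10 x 10: ", time_dotplus(A, B, 100_000), " s")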

My configuration:
OSX 10.9.2
Python 3.3.4, numpy 1.8.0
Julia HEAD (as of 4 Mar 2014, commit d36bb08), compiled with Accelerate

The results of the .+ tests (combined side by side for convenience):

Python | Julia

add(float, float), 10 x 10: 0.22297 s | 0.51618 s
add(float, float), 100 x 100: 0.013359 s | 0.094047 s
add(float, float), 1000 x 1000: 0.068224 s | 0.10046 s
add(complex, float), 10 x 10: 0.69337 s | 1.0949 s
add(complex, float), 100 x 100: 0.049746 s | 0.27063 s
add(complex, float), 1000 x 1000: 0.12786 s | 0.19651 s
add(complex, complex), 10 x 10: 0.22025 s | 1.0902 s
add(complex, complex), 100 x 100: 0.030946 s | 0.27640 s
add(complex, complex), 1000 x 1000: 0.13954 s | 0.20228 s

In all cases there is a 2x-10x slowdown. Is there something special that broadcasting does, and should it be avoided in performance-critical code? It does not seem to have such a prominent effect in Python.

@ivarne (Member) commented Mar 4, 2014

I do not think Julia is slow at broadcasting, but it gets very slow when the code uses anonymous functions. I'll write a revised version of your benchmark to see whether I'm right about that.

@ivarne (Member) commented Mar 4, 2014

I'm sorry, none of the usual suspects for Julia performance seems to make a difference. Your benchmark does not have the huge overhead that I expected. I have a revised benchmark that accomplishes the same as yours in far fewer lines, but the time seems to be spent in the BLAS functions.

I saw some improvement for the 10x10 matrices by calling blas_set_num_threads(1).

@timholy (Member) commented Mar 4, 2014

The entire problem is garbage collection: if I preallocate the output, the performance of Julia and Python is identical:

function rundotplus_prealloc(A, B, iters)
    C = similar(A)
    for k = 1:iters
        broadcast!(.+, C, A, B)
    end
    nothing
end
A = rand(1000,1000);
B = rand(1000,1000);
@time rundotplus_prealloc(A, B, 100)

You may be interested in trying the branch on #5227.

@simonster (Member)

Related: #4784

@timholy (Member) commented Mar 4, 2014

Since this is "just" a GC issue, and since we have other open GC issues, I'm closing. Feel free to reopen if something unique crops up.

timholy closed this as completed Mar 4, 2014

@fjarri (Contributor, Author) commented Mar 9, 2014

Edit: an earlier version used the "pure" broadcast instead of the in-place broadcast!, thus producing incorrect results.

Tim,

Sorry for bringing this up again, but I have run this updated benchmark to compare "preallocated" and "dynamic" cases (and compare them both to the regular + operator), and while preallocation does help a lot, broadcasted addition is still much slower than the pure elementwise one:

add, 10 x 10: 0.106823909 s
add, 100 x 100: 0.084994372 s
add, 1000 x 1000: 0.096064399 s
add_bc, 10 x 10: 0.523851014 s
add_bc, 100 x 100: 0.08918326 s
add_bc, 1000 x 1000: 0.088193045 s
add_bc_prealloc, 10 x 10: 0.327231519 s
add_bc_prealloc, 100 x 100: 0.016404025 s
add_bc_prealloc, 1000 x 1000: 0.055645974 s

Broadcasting for small arrays shows a 3x slowdown compared to just +.

@carlobaldassi (Member)

With the latest master, I get these timings from your gist:

add, 10 x 10: 0.148591762 s
add, 100 x 100: 0.138836403 s
add, 1000 x 1000: 0.079722051 s
add_bc, 10 x 10: 0.591883451 s
add_bc, 100 x 100: 0.142932361 s
add_bc, 1000 x 1000: 0.078317679 s
add_bc_prealloc, 10 x 10: 0.3482838 s
add_bc_prealloc, 100 x 100: 0.017104865 s
add_bc_prealloc, 1000 x 1000: 0.037650405 s

So yes, broadcasting with preallocation is still slower than plain + for small matrices (by a factor of ~2.3), while performance improves dramatically with preallocation for large matrices.

@fjarri (Contributor, Author) commented Mar 9, 2014

So, to reiterate my initial question: is there something special that broadcasting does, and should it be avoided in performance-critical code? Even with preallocation, the result for 10x10 matrices is 1.5 times slower than the Python version. I would expect .+ to check for equality of sizes and fall back to the regular +, but that alone does not seem to be enough to explain the difference in performance.
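
Such a fallback might look like this (a hypothetical sketch, not what Base actually does):

# Hypothetical fallback: use the plain elementwise method when the shapes
# already match, and only go through the general broadcast machinery otherwise.
dotplus(A::AbstractArray, B::AbstractArray) =
    size(A) == size(B) ? A + B : broadcast(+, A, B)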

@carlobaldassi (Member)

Profiling the 10x10 case shows that most of the time is spent in the function lookup from the cache, i.e. the line:

func = broadcast_cache[key]

where key is a Tuple, which in turn points to line 565 in dict.jl:

index = ht_keyindex(h, key)

after which profiling information stops.

@timholy (Member) commented Mar 10, 2014

Argh. I suppose we could create special-purpose versions for all kinds of operators for the binary case. A general fix is #5395.

@carlobaldassi (Member)

In fact, we can fix it!
But it requires:

  1. a multi-level Dict, so you'd have broadcast_cache[f][nd][narray] instead of broadcast_cache[(f,nd,narray)];
  2. a macro which I believe at some point was called @get!, i.e. the same as the get! function for Dicts, except that it does not evaluate the provided default value unless needed.

See this gist where I wrote everything explicitly (a rough sketch of the idea follows the timings below). It would look much better with a macro, but note that the full, ugly signatures of the nested Dicts are actually needed to achieve maximum performance.

Timings:

julia> include("perf.jl")
add, 10 x 10: 0.071660699 s
add, 100 x 100: 0.061634806 s
add, 1000 x 1000: 0.066237116 s
add_bc, 10 x 10: 0.176661078 s
add_bc, 100 x 100: 0.064063408 s
add_bc, 1000 x 1000: 0.065880186 s
add_bc_prealloc, 10 x 10: 0.038968927 s
add_bc_prealloc, 100 x 100: 0.014158759 s
add_bc_prealloc, 1000 x 1000: 0.037325413 s
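
For orientation, a rough sketch of the two ingredients described above, with illustrative names (the actual code is in the gist and in the later PR; gen_broadcast_func is a hypothetical stand-in for the real kernel generator):

# Nested cache plus a get!-like macro whose default expression is only
# evaluated on a cache miss.
const broadcast_cache = Dict{Function,Dict{Int,Dict{Int,Function}}}()

macro get!(d, key, default)
    quote
        local h = $(esc(d))
        local k = $(esc(key))
        # two lookups on a hit; the real implementation avoids the second one
        haskey(h, k) ? h[k] : (h[k] = $(esc(default)))
    end
end

# hypothetical stand-in for the generated broadcasting kernel
gen_broadcast_func(f, nd, narrays) = (C, As...) -> broadcast!(f, C, As...)

function lookup_broadcast_func(f::Function, nd::Int, narrays::Int)
    l1 = @get!(broadcast_cache, f, Dict{Int,Dict{Int,Function}}())
    l2 = @get!(l1, nd, Dict{Int,Function}())
    @get!(l2, narrays, gen_broadcast_func(f, nd, narrays))
end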

@StefanKarpinski (Member)

Is there perhaps a better data structure that would make this easier / more elegant / faster?

@timholy (Member) commented Mar 10, 2014

@carlobaldassi, that's a nice speedup!

@StefanKarpinski, this would not be easier or more elegant (the Dict is pretty much the easiest way to look up a cached method when you have to maintain your own cache), but it could be even faster: while the function would probably have to live in a Dict, the Dict could hold a vector-of-vectors in which you look up nargs and then nd. You'd have to grow them as needed, and check whether an element is defined. I don't know whether the latter hammers performance or not.

Or, we could decree the following: anytime you compile a method for nd, you also have to compile any missing methods for n < nd. Then we could skip the isdefined check.

@carlobaldassi (Member)

Maybe a Dict{Function,Vector{Vector{Function}}}? After all, the integer keys are "number of dimensions" and "number of arguments", which are likely to be small. I wouldn't know how to avoid the Dict for Functions though. And I'm not sure it would be more elegant to deal with Vectors with potentially missing entries and potentially variable length. See also an improved function at the bottom of the previous gist.
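
A sketch of that layout (illustrative names; make_broadcast_func is a hypothetical stand-in for the kernel generator):

# Dict keyed only on the function; the two small integer keys become vector
# indices, grown on demand and checked with isassigned.
const vcache = Dict{Function,Vector{Vector{Function}}}()

make_broadcast_func(f, narrays, nd) = (C, As...) -> broadcast!(f, C, As...)  # stand-in

function vlookup(f::Function, narrays::Int, nd::Int)
    outer = get!(vcache, f) do
        Vector{Function}[]
    end
    length(outer) < narrays && resize!(outer, narrays)
    isassigned(outer, narrays) || (outer[narrays] = Function[])
    inner = outer[narrays]
    length(inner) < nd && resize!(inner, nd)
    isassigned(inner, nd) || (inner[nd] = make_broadcast_func(f, narrays, nd))
    inner[nd]
end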

@timholy (Member) commented Mar 10, 2014

Hmm, great minds think alike 😄.

@StefanKarpinski (Member)

The isdefined check is just a null pointer check, so that's not a huge performance issue. This might be a faster way to do this. Using arrays instead of dicts wherever possible certainly is good. Would it be possible to avoid the function key by explicitly generating one of these caches for each function that needs it?

@StefanKarpinski (Member)

Even though the isdefined check isn't expensive, it may be reasonable to pre-generate methods, since you're already taking a performance hit in those cases anyway. If you're doing n-d, aren't you likely to need ≤ n-d soon anyway?

@timholy (Member) commented Mar 10, 2014

Yes, I think pre-defining up to nd is a good idea.

And yes, maybe we should just generate separate caches for .+, .-, .*, ./, and .==.
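
A sketch of what per-operator caches could look like, generated with @eval to avoid duplicating the definitions (the cache names are made up for the example):

# One const cache per dotted operator, so the first-level Function lookup
# disappears; the remaining keys are the two small integers.
for (cachename, op) in ((:dotadd_cache, :+), (:dotsub_cache, :-),
                        (:dotmul_cache, :*), (:dotdiv_cache, :/))
    @eval const $cachename = Dict{Int,Dict{Int,Function}}()
end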

@carlobaldassi (Member)

Hmm, great minds think alike 😄.

Haha, it seems we posted the exact same comment (well, almost) at the same moment.

Anyway, I experimented a bit with the solutions which have come up so far. My results (based on an extended version of the benchmark by @Manticore) are that

  1. Using vectors of vectors with isdefined check is (very slightly, less than 5% on 10x10 matrices) worse than using nested Dicts
  2. Pre-defining up to nd and avoiding the isdefined check (leaving only the length check) brings performance back on par with nested Dicts.
  3. A quick experiment with function-specific cache shows some performance improvement, on the order of 10% for 10x10 matrices. I'm not sure it's worth the trouble, but it's an orthogonal issue with respect to the other one anyway.

Points 1 and 2 are surprising, but I tried to make it as fast as I could, and that's the result.
So I'll submit a PR with the nested Dicts version, and maybe we can decide whether to add the specialized caches or not on top of that.

@timholy (Member) commented Mar 10, 2014

That is quite surprising.

@carlobaldassi (Member)

Just to add to the confusion, I tried to do more thorough experiments with specialized broadcast functions for +, -, etc., but I stumbled upon something I'm not able to explain: while timings seem consistent within a given Julia session, restarting Julia gives slightly different results. This is not the first time I've seen this when benchmarking very fast, repeatedly called functions and trying to detect small differences.
In any case, the net result is that I can't reliably measure any performance difference with respect to the nested Dict version.

carlobaldassi added a commit that referenced this issue Mar 10, 2014
carlobaldassi added a commit that referenced this issue Mar 11, 2014
Change broadcast! cache to use nested Dict instead of a Dict with Tuple
keys; use get! macro to access its fields.
Related to #6041.

Also reduces code duplication in broadcast! definitions via eval macro.
@timholy (Member) commented Mar 11, 2014

I have seen the same thing. Never figured out what is going on. I've even seen a function suddenly start running more slowly partway through a REPL session, with no obvious reason for the switch.

@kmsquire (Member)

Just FYI, there is a do-block version of get!, proposed by Stefan, which does not evaluate its default unless it's needed (see http://docs.julialang.org/en/latest/stdlib/base/#associative-collections). I'm not sure how useful it is here, though, since it involves anonymous functions.
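
A toy example of that form (unrelated to the broadcast cache itself), showing that the block only runs on a miss:

# get! with a do-block: the body is executed only when the key is absent,
# so an expensive default is not constructed on a cache hit.
cache = Dict{String,Vector{Float64}}()
v = get!(cache, "weights") do
    zeros(10^6)   # stand-in for an expensive-to-build default
end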

@kmsquire (Member)

FWIW, I noticed differences in timings especially on a machine with many cores, and I suspect that there might be subtle timing differences when code gets run on different processors/cores.

@toivoh (Contributor) commented Mar 11, 2014

Regarding fast broadcasting on small matrices, I created the functions broadcast_function and broadcast!_function to allow generating the broadcast function for a given function once, storing it in a const, and then applying it many times. That should save a dict lookup.

Apart from that dict lookup, the intent was that the lookup based on the number and type of arguments should be plain method lookup. But I understand that that doesn't quite work until we have recompilation of functions.
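
A sketch of the usage described above, for the Julia of that era (these helpers were deprecated and removed once dot-call syntax arrived):

# Build the broadcasting wrappers once and store them in consts, so the
# per-call cache lookup for the function is avoided (Julia <= 0.4).
const dotplus  = broadcast_function(+)
const dotplus! = broadcast!_function(+)

A = rand(10, 10); B = rand(10, 10)
C = similar(A)
dotplus!(C, A, B)    # in-place broadcasted addition
D = dotplus(A, B)    # allocating version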

@stevengj (Member) commented Dec 2, 2016

This issue seems to be moot in 0.5 with fast higher-order functions. If anything, the broadcast version is now faster than + on my machine:

julia> using BenchmarkTools

julia> dotplus(A,B) = broadcast(+, A, B)
dotplus (generic function with 1 method)

julia> A = rand(1000,1000); B = copy(A);

julia> @benchmark $A + $B
BenchmarkTools.Trial: 
  memory estimate:  7.63 mb
  allocs estimate:  3
  --------------
  minimum time:     2.151 ms (0.00% GC)
  median time:      2.579 ms (0.00% GC)
  mean time:        2.678 ms (16.03% GC)
  maximum time:     5.043 ms (43.44% GC)
  --------------
  samples:          1866
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dotplus($A, $B)
BenchmarkTools.Trial: 
  memory estimate:  7.63 mb
  allocs estimate:  14
  --------------
  minimum time:     1.681 ms (0.00% GC)
  median time:      1.922 ms (0.00% GC)
  mean time:        2.075 ms (20.73% GC)
  maximum time:     3.763 ms (56.19% GC)
  --------------
  samples:          2407
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

@simonster (Member)

At least for me, broadcast is still substantially slower for very small arrays, probably because it's doing more allocations:

julia> A = rand(10,10); B = copy(A);

julia> @benchmark $A + $B
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     530
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  928.00 bytes
  allocs estimate:  2
  minimum time:     212.00 ns (0.00% GC)
  median time:      224.00 ns (0.00% GC)
  mean time:        255.32 ns (6.28% GC)
  maximum time:     2.27 μs (76.42% GC)

julia> @benchmark dotplus($A, $B)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     137
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  1.23 kb
  allocs estimate:  13
  minimum time:     720.00 ns (0.00% GC)
  median time:      736.00 ns (0.00% GC)
  mean time:        840.92 ns (8.36% GC)
  maximum time:     17.67 μs (90.57% GC)

@boubalos

The entire problem is garbage collection: if I preallocate the output, the performance of Julia and Python is identical:

function rundotplus_prealloc(A, B, iters)
    C = similar(A)
    for k = 1:iters
        broadcast!(.+, C, A, B)
    end
    nothing
end
A = rand(1000,1000);
B = rand(1000,1000);
@time rundotplus_prealloc(A, B, 100)

You may be interested in trying the branch on #5227.

This is more elegant nowadays (see the Julia docs on dot syntax):
@time @. C = A + B
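
Spelled out with the preallocation step included (@. rewrites the expression to C .= A .+ B, which writes into the existing C without allocating temporaries):

A = rand(1000, 1000)
B = rand(1000, 1000)
C = similar(A)       # preallocate the output once
@time @. C = A + B   # fused, in-place broadcast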
