Slow broadcast addition of matrices #6041

Closed
fjarri opened this issue Mar 4, 2014 · 28 comments
Labels: linear algebra, performance

Comments

@fjarri (Contributor) commented Mar 4, 2014

Consider the following benchmark. It compares the performance of broadcast addition (.+) for matrices of different sizes in Julia and in Python. In Python, + was used, because it broadcasts automatically behind the scenes.
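
The original benchmark script is not reproduced here; a minimal sketch of the Julia side of such a timing loop (function names and iteration counts are illustrative, not the original gist) would be:

# Illustrative timing loop: repeatedly evaluate the broadcasted addition
# and report the total elapsed time for iters repetitions.
function time_dotplus(A, B, iters)
    t0 = time()
    for k = 1:iters
        C = A .+ B          # broadcasted elementwise addition
    end
    time() - t0
end

A = rand(10, 10); B = rand(10, 10)
println("add(float, float), 10 x 10: ", time_dotplus(A, B, 100_000), " s")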

My configuration:
OSX 10.9.2
Python 3.3.4, numpy 1.8.0
Julia HEAD (as of 4 Mar 2014, commit d36bb08), compiled with Accelerate

The results of the .+ tests (combined side by side for convenience):

Python | Julia

add(float, float), 10 x 10: 0.22297 s | 0.51618 s
add(float, float), 100 x 100: 0.013359 s | 0.094047 s
add(float, float), 1000 x 1000: 0.068224 s | 0.10046 s
add(complex, float), 10 x 10: 0.69337 s | 1.0949 s
add(complex, float), 100 x 100: 0.049746 s | 0.27063 s
add(complex, float), 1000 x 1000: 0.12786 s | 0.19651 s
add(complex, complex), 10 x 10: 0.22025 s | 1.0902 s
add(complex, complex), 100 x 100: 0.030946 s | 0.27640 s
add(complex, complex), 1000 x 1000: 0.13954 s | 0.20228 s

In all cases there is a 2x-10x slowdown. Is there something special that broadcasting does, and should it be avoided in performance-critical code? It does not seem to have such a prominent effect in Python.

@ivarne (Member) commented Mar 4, 2014

I do not think Julia is slow at broadcasting, but it gets very slow when the code uses anonymous functions. I'll write a revised version of your benchmark to see whether I'm right about that.

@ivarne (Member) commented Mar 4, 2014

I'm sorry, none of the usual suspects for Julia performance seems to make a difference. Your benchmark does not have the huge overhead that I expected. I have a revised benchmark that accomplishes the same as yours in far fewer lines, but the time seems to be spent in the BLAS functions.

I saw some improvement for the 10x10 matrices by calling blas_set_num_threads(1).

@timholy (Member) commented Mar 4, 2014

The entire problem is garbage collection: if I preallocate the output, the performance of Julia and Python is identical:

function rundotplus_prealloc(A, B, iters)
    C = similar(A)
    for k = 1:iters
        broadcast!(.+, C, A, B)
    end
    nothing
end
A = rand(1000,1000);
B = rand(1000,1000);
@time rundotplus_prealloc(A, B, 100)

You may be interested in trying the branch on #5227.

@simonster (Member)

Related: #4784

@timholy (Member) commented Mar 4, 2014

Since this is "just" a GC issue, and since we have other open GC issues, I'm closing. Feel free to reopen if something unique crops up.

timholy closed this as completed Mar 4, 2014

@fjarri (Contributor, Author) commented Mar 9, 2014

Edit: an earlier version used the "pure" broadcast instead of the in-place broadcast!, thus producing incorrect results.

Tim,

Sorry for bringing this up again, but I have run this updated benchmark to compare "preallocated" and "dynamic" cases (and compare them both to the regular + operator), and while preallocation does help a lot, broadcasted addition is still much slower than the pure elementwise one:

add, 10 x 10: 0.106823909 s
add, 100 x 100: 0.084994372 s
add, 1000 x 1000: 0.096064399 s
add_bc, 10 x 10: 0.523851014 s
add_bc, 100 x 100: 0.08918326 s
add_bc, 1000 x 1000: 0.088193045 s
add_bc_prealloc, 10 x 10: 0.327231519 s
add_bc_prealloc, 100 x 100: 0.016404025 s
add_bc_prealloc, 1000 x 1000: 0.055645974 s

Broadcasting for small arrays shows a 3x slowdown compared to just +.

@carlobaldassi (Member)

With the latest master, I get these timings from your gist:

add, 10 x 10: 0.148591762 s
add, 100 x 100: 0.138836403 s
add, 1000 x 1000: 0.079722051 s
add_bc, 10 x 10: 0.591883451 s
add_bc, 100 x 100: 0.142932361 s
add_bc, 1000 x 1000: 0.078317679 s
add_bc_prealloc, 10 x 10: 0.3482838 s
add_bc_prealloc, 100 x 100: 0.017104865 s
add_bc_prealloc, 1000 x 1000: 0.037650405 s

So yes, broadcasting with preallocation is still slower than plain + for small matrices (by a factor of ~2.3), while performance improves dramatically with preallocation for large matrices.

@fjarri (Contributor, Author) commented Mar 9, 2014

So, to reiterate my initial question: is there something special that broadcasting does, and should it be avoided in performance-critical code? Even with preallocation, the result for 10x10 matrices is 1.5 times slower than the Python version. I would expect .+ to check for equality of sizes and fall back to the regular +, but that alone does not seem to be enough to explain the difference in performance.
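
Such a fallback might look like this (a hypothetical sketch, not what Base actually does):

# Hypothetical fallback: use the plain elementwise method when the shapes
# already match, and only go through the general broadcast machinery otherwise.
dotplus(A::AbstractArray, B::AbstractArray) =
    size(A) == size(B) ? A + B : broadcast(+, A, B)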

@carlobaldassi (Member)

Profiling the 10x10 case shows that most of the time is spent in the function lookup from the cache, i.e. the line:

func = broadcast_cache[key]

where key is a Tuple, which in turn points to line 565 in dict.jl:

index = ht_keyindex(h, key)

after which profiling information stops.

@timholy (Member) commented Mar 10, 2014

Argh. I suppose we could create special-purpose versions for all kinds of operators for the binary case. A general fix is #5395.

@carlobaldassi (Member)

In fact, we can fix it!
But it requires:

  1. a multi-level Dict, so you'd have broadcast_cache[f][nd][narray] instead of broadcast_cache[(f,nd,narray)];
  2. a macro which I believe at some point was called @get!, i.e. the same as the get! function for Dicts, except that it does not evaluate the provided default value unless needed.

See this gist where I wrote everything explicitly (a rough sketch of the idea follows the timings below). It would look much better with a macro, but note that the full, ugly signatures of the nested Dicts are actually needed to achieve maximum performance.

Timings:

julia> include("perf.jl")
add, 10 x 10: 0.071660699 s
add, 100 x 100: 0.061634806 s
add, 1000 x 1000: 0.066237116 s
add_bc, 10 x 10: 0.176661078 s
add_bc, 100 x 100: 0.064063408 s
add_bc, 1000 x 1000: 0.065880186 s
add_bc_prealloc, 10 x 10: 0.038968927 s
add_bc_prealloc, 100 x 100: 0.014158759 s
add_bc_prealloc, 1000 x 1000: 0.037325413 s
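
For orientation, a rough sketch of the two ingredients described above, with illustrative names (the actual code is in the gist and in the later PR; gen_broadcast_func is a hypothetical stand-in for the real kernel generator):

# Nested cache plus a get!-like macro whose default expression is only
# evaluated on a cache miss.
const broadcast_cache = Dict{Function,Dict{Int,Dict{Int,Function}}}()

macro get!(d, key, default)
    quote
        local h = $(esc(d))
        local k = $(esc(key))
        # two lookups on a hit; the real implementation avoids the second one
        haskey(h, k) ? h[k] : (h[k] = $(esc(default)))
    end
end

# hypothetical stand-in for the generated broadcasting kernel
gen_broadcast_func(f, nd, narrays) = (C, As...) -> broadcast!(f, C, As...)

function lookup_broadcast_func(f::Function, nd::Int, narrays::Int)
    l1 = @get!(broadcast_cache, f, Dict{Int,Dict{Int,Function}}())
    l2 = @get!(l1, nd, Dict{Int,Function}())
    @get!(l2, narrays, gen_broadcast_func(f, nd, narrays))
end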

@StefanKarpinski (Member)

Is there perhaps a better data structure that would make this easier / more elegant / faster?

@timholy (Member) commented Mar 10, 2014

@carlobaldassi, that's a nice speedup!

@StefanKarpinski, this would not be easier or more elegant (the Dict is pretty much the easiest way to look up a cached method when you have to maintain your own cache), but it could be even faster: while the function would probably have to live in a Dict, the Dict could hold a vector-of-vectors in which you look up nargs and then nd. You'd have to grow them as needed, and check whether an element is defined. I don't know whether the latter hammers performance or not.

Or, we could decree the following: anytime you compile a method for nd, you also have to compile any missing methods for n < nd. Then we could skip the isdefined check.

@carlobaldassi (Member)

Maybe a Dict{Function,Vector{Vector{Function}}}? After all, the integer keys are "number of dimensions" and "number of arguments", which are likely to be small. I wouldn't know how to avoid the Dict for Functions though. And I'm not sure it would be more elegant to deal with Vectors with potentially missing entries and potentially variable length. See also an improved function at the bottom of the previous gist.
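
A sketch of that layout (illustrative names; make_broadcast_func is a hypothetical stand-in for the kernel generator):

# Dict keyed only on the function; the two small integer keys become vector
# indices, grown on demand and checked with isassigned.
const vcache = Dict{Function,Vector{Vector{Function}}}()

make_broadcast_func(f, narrays, nd) = (C, As...) -> broadcast!(f, C, As...)  # stand-in

function vlookup(f::Function, narrays::Int, nd::Int)
    outer = get!(vcache, f) do
        Vector{Function}[]
    end
    length(outer) < narrays && resize!(outer, narrays)
    isassigned(outer, narrays) || (outer[narrays] = Function[])
    inner = outer[narrays]
    length(inner) < nd && resize!(inner, nd)
    isassigned(inner, nd) || (inner[nd] = make_broadcast_func(f, narrays, nd))
    inner[nd]
end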

@timholy (Member) commented Mar 10, 2014

Hmm, great minds think alike 😄.

@StefanKarpinski (Member)

The isdefined check is just a null pointer check, so that's not a huge performance issue. This might be a faster way to do this. Using arrays instead of dicts wherever possible certainly is good. Would it be possible to avoid the function key by explicitly generating one of these caches for each function that needs it?

@StefanKarpinski (Member)

Even though the isdefined check isn't expensive, it may be reasonable to pre-generate methods, since you're already taking a performance hit in those cases anyway. If you're doing n-d, aren't you likely to need ≤ n-d soon anyway?

@timholy (Member) commented Mar 10, 2014

Yes, I think pre-defining up to nd is a good idea.

And yes, maybe we should just generate separate caches for .+, .-, .*, ./, and .==.
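
A sketch of what per-operator caches could look like, generated with @eval to avoid duplicating the definitions (the cache names are made up for the example):

# One const cache per dotted operator, so the first-level Function lookup
# disappears; the remaining keys are the two small integers.
for (cachename, op) in ((:dotadd_cache, :+), (:dotsub_cache, :-),
                        (:dotmul_cache, :*), (:dotdiv_cache, :/))
    @eval const $cachename = Dict{Int,Dict{Int,Function}}()
end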

@carlobaldassi (Member)

Hmm, great minds think alike 😄.

Haha, it seems we posted the exact same comment (well, almost) at the same moment.

Anyway, I experimented a bit with the solutions which have come up so far. My results (based on an extended version of the benchmark by @Manticore) are that

  1. Using vectors of vectors with isdefined check is (very slightly, less than 5% on 10x10 matrices) worse than using nested Dicts
  2. Pre-defining up to nd and avoiding the isdefined check (leaving only the length check) brings performance back on par with nested Dicts.
  3. A quick experiment with function-specific cache shows some performance improvement, on the order of 10% for 10x10 matrices. I'm not sure it's worth the trouble, but it's an orthogonal issue with respect to the other one anyway.

Points 1 and 2 are surprising, but I tried to make it as fast as I could, and that's the result.
So I'll submit a PR with the nested Dicts version, and maybe we can decide whether to add the specialized caches or not on top of that.

@timholy (Member) commented Mar 10, 2014

That is quite surprising.

@carlobaldassi (Member)

Just to add to the confusion, I tried to do more thorough experiments with specialized broadcast functions for +, -, etc., but I stumbled upon something I'm not able to explain: while timings seem consistent within a given Julia session, restarting Julia gives slightly different results. This is not the first time I've seen this when benchmarking very fast, repeatedly called functions and trying to detect small differences.
In any case, the net result is that I can't reliably measure any performance difference with respect to the nested Dict version.

carlobaldassi added a commit that referenced this issue Mar 10, 2014
carlobaldassi added a commit that referenced this issue Mar 11, 2014
Change broadcast! cache to use nested Dict instead of a Dict with Tuple
keys; use get! macro to access its fields.
Related to #6041.

Also reduces code duplication in broadcast! definitions via eval macro.
@timholy (Member) commented Mar 11, 2014

I have seen the same thing. Never figured out what is going on. I've even seen a function suddenly start running more slowly partway through a REPL session, with no obvious reason for the switch.

@kmsquire (Member)

Just FYI, there is a do-block version of get!, proposed by Stefan, which does not evaluate its default unless it's needed (see http://docs.julialang.org/en/latest/stdlib/base/#associative-collections). I'm not sure how useful it is here, though, since it involves anonymous functions.
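
A toy example of that form (unrelated to the broadcast cache itself), showing that the block only runs on a miss:

# get! with a do-block: the body is executed only when the key is absent,
# so an expensive default is not constructed on a cache hit.
cache = Dict{String,Vector{Float64}}()
v = get!(cache, "weights") do
    zeros(10^6)   # stand-in for an expensive-to-build default
end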

@kmsquire (Member)

FWIW, I noticed differences in timings especially on a machine with many cores, and I suspect that there might be subtle timing differences when code gets run on different processors/cores.

@toivoh (Contributor) commented Mar 11, 2014

Regarding fast broadcasting on small matrices, I created the functions broadcast_function and broadcast!_function to allow generating the broadcast function for a given function once, storing it in a const, and then applying it many times. That should save a dict lookup.

Apart from that dict lookup, the intent was that the lookup based on the number and type of arguments should be plain method lookup. But I understand that that doesn't quite work until we have recompilation of functions.
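
A sketch of the usage described above, for the Julia of that era (these helpers were deprecated and removed once dot-call syntax arrived):

# Build the broadcasting wrappers once and store them in consts, so the
# per-call cache lookup for the function is avoided (Julia <= 0.4).
const dotplus  = broadcast_function(+)
const dotplus! = broadcast!_function(+)

A = rand(10, 10); B = rand(10, 10)
C = similar(A)
dotplus!(C, A, B)    # in-place broadcasted addition
D = dotplus(A, B)    # allocating version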

@stevengj (Member) commented Dec 2, 2016

This issue seems to be moot in 0.5 with fast higher-order functions. If anything, the broadcast version is now faster than + on my machine:

julia> using BenchmarkTools

julia> dotplus(A,B) = broadcast(+, A, B)
dotplus (generic function with 1 method)

julia> A = rand(1000,1000); B = copy(A);

julia> @benchmark $A + $B
BenchmarkTools.Trial: 
  memory estimate:  7.63 mb
  allocs estimate:  3
  --------------
  minimum time:     2.151 ms (0.00% GC)
  median time:      2.579 ms (0.00% GC)
  mean time:        2.678 ms (16.03% GC)
  maximum time:     5.043 ms (43.44% GC)
  --------------
  samples:          1866
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark dotplus($A, $B)
BenchmarkTools.Trial: 
  memory estimate:  7.63 mb
  allocs estimate:  14
  --------------
  minimum time:     1.681 ms (0.00% GC)
  median time:      1.922 ms (0.00% GC)
  mean time:        2.075 ms (20.73% GC)
  maximum time:     3.763 ms (56.19% GC)
  --------------
  samples:          2407
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

@simonster (Member)

At least for me, broadcast is still substantially slower for very small arrays, probably because it's doing more allocations:

julia> A = rand(10,10); B = copy(A);

julia> @benchmark $A + $B
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     530
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  928.00 bytes
  allocs estimate:  2
  minimum time:     212.00 ns (0.00% GC)
  median time:      224.00 ns (0.00% GC)
  mean time:        255.32 ns (6.28% GC)
  maximum time:     2.27 μs (76.42% GC)

julia> @benchmark dotplus($A, $B)
BenchmarkTools.Trial: 
  samples:          10000
  evals/sample:     137
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  1.23 kb
  allocs estimate:  13
  minimum time:     720.00 ns (0.00% GC)
  median time:      736.00 ns (0.00% GC)
  mean time:        840.92 ns (8.36% GC)
  maximum time:     17.67 μs (90.57% GC)

@boubalos

The entire problem is garbage collection: if I preallocate the output, the performance of Julia and Python is identical:

function rundotplus_prealloc(A, B, iters)
    C = similar(A)
    for k = 1:iters
        broadcast!(.+, C, A, B)
    end
    nothing
end
A = rand(1000,1000);
B = rand(1000,1000);
@time rundotplus_prealloc(A, B, 100)

You may be interested in trying the branch on #5227.

This is more elegant nowadays (see the Julia docs on dot syntax):
@time @. C = A + B
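
Spelled out with the preallocation step included (@. rewrites the expression to C .= A .+ B, which writes into the existing C without allocating temporaries):

A = rand(1000, 1000)
B = rand(1000, 1000)
C = similar(A)       # preallocate the output once
@time @. C = A + B   # fused, in-place broadcast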
