
broadcasting operations could be faster #13603

Closed
rennis250 opened this issue Oct 14, 2015 · 22 comments
Labels
performance Must go faster

Comments

@rennis250
Contributor

Dear all,

I can't post the exact code or data just yet, but I am trying to work with arrays in a manner equivalent to the following:

a = rand(800,400,300)
b = rand(800,400)
c = rand(800,400)

a = (a .- b) .* c

This is usually the smallest data size I have to work with. Normally, a is 800x400xT, where T can be 600 or greater; b and c are always 800x400. Anyway, I read in the manual that with arrays of this size, broadcasting should give significant performance gains. I do indeed get gains compared to hand-written for loops: the operation takes ~15 secs in a function from a precompiled module, versus ~24 secs with for loops. I have also tried parallel implementations using @spawnat and @sync, but these actually run slower (I've tested with 2 up to 14 cores), which I presume is due to the amount of data transfer going on.
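
For reference, here is a minimal sketch of roughly how I time this; the function name update is just illustrative:

function update(a, b, c)
    return (a .- b) .* c
end

a = rand(800,400,300); b = rand(800,400); c = rand(800,400)
update(a, b, c)            # warm up so compilation isn't included in the timing
@time a = update(a, b, c)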

On a single core, @time gives the following response:

38.776429 seconds (907.15 M allocations: 20.682 GB, 6.09% gc time)

Using @inbounds gives a few seconds of extra speed, but, as expected, assigning into a different variable rather than saving directly back into a makes no difference.

As a comparison, a Go 1.5 implementation finishes this process in ~1.5 secs (as measured by pprof) and gives the same answer as Julia. It steps through these matrices element by element with three nested for loops via linear indexing and performs the same operation. (Note: I haven't yet tested the speed of linear indexing in Julia.) That Go implementation uses one core; expanded to run in parallel on 24 cores, with a blind, brute-force "distribute the data evenly across all the cores" approach, it completes in ~0.5 secs.
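
For the curious, here is a rough, untested sketch of what that linear-indexing scheme would look like in Julia (the function name is just illustrative):

function linearkernel!(a, b, c)
    npage = length(b)              # elements in one 800x400 slice of a
    @inbounds for i in eachindex(a)
        j = mod1(i, npage)         # matching linear index into b and c
        a[i] = (a[i] - b[j]) * c[j]
    end
    return a
end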

Here is my versioninfo() output:

Julia Version 0.5.0-dev+720
Commit ef46074* (2015-10-13 10:56 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.0.0)
  CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Anyway, I have been away from the Julia community for some time, so I don't know what practices are currently recommended for speed. Regardless, I was very surprised by this. As much as I like Go, I would be happy to know what I can do to improve this, as I could move along much faster with my work by writing everything in Julia.

Best,
Rob

@rennis250
Contributor Author

Ah, I should say that the Julia operation takes ~40 seconds at global scope. Within the function that I actually use, which is found within a pre-compiled module, it takes ~15 secs.

Sorry for that. I will correct the above.

Best,
Rob

@KristofferC
Member

This is a regression compared to 0.4.0. Running a = (a .- b) .* c in local scope:

0.4:

0.437902 seconds (66 allocations: 1.431 GB, 17.99% gc time)

In Julia Version 0.5.0-dev+765:

25.237447 seconds (907.03 M allocations: 20.678 GB, 4.77% gc time)

@rennis250
Contributor Author

Really? I found similar performance issues on 0.4 also... Let me double check.

Thanks.

@rennis250
Contributor Author

Wow, you're right. I don't know why I never noticed that, since I actually developed this code on 0.4.0-dev and had similar speed issues then. Anyway, thanks for catching that.

@rennis250
Contributor Author

As an additional comparison, MATLAB R2015b apparently performs the following in ~0.2 secs:

>> a = rand(800,400,300);
>> b = repmat(rand(800,400),1,1,300);
>> c = repmat(rand(800,400),1,1,300);
>> tic();a = (a - b) .* c;toc()
Elapsed time is 0.200628 seconds.

Anyway, now it is time to get back to work. :)

Thanks again, @KristofferC - that saves me a lot of frustration.

Best,
Rob

@rennis250 rennis250 changed the title Unexpectedly slow behavior when broadcasting operations (0.5 Performance regression) Unexpectedly slow behavior when broadcasting operations Oct 14, 2015
@rennis250 rennis250 changed the title (0.5 Performance regression) Unexpectedly slow behavior when broadcasting operations (0.5-dev performance regression) Unexpectedly slow behavior when broadcasting operations Oct 14, 2015
@rennis250 rennis250 changed the title (0.5-dev performance regression) Unexpectedly slow behavior when broadcasting operations (0.5.0-dev performance regression) Unexpectedly slow behavior when broadcasting operations Oct 14, 2015
@tkelman
Contributor

tkelman commented Oct 14, 2015

Ouch. Worth a bisect. Does #13553 help?

(Was the MATLAB timing taken with the single-threaded command-line flag?)

@rennis250
Contributor Author

Ah, good question. No, it was not. Here it is with the CLI flag:

>> a = rand(800,400,300);
>> b = repmat(rand(800,400),1,1,300);
>> c = repmat(rand(800,400),1,1,300);
>> tic();a = (a - b) .* c;toc()
Elapsed time is 1.746552 seconds.
>> tic();a = (a - b) .* c;toc()
Elapsed time is 0.244584 seconds.
>> tic();a = (a - b) .* c;toc()
Elapsed time is 0.218467 seconds.
>>

On Julia 0.4.0, with the function in a pre-compiled module, I now get a timing of anywhere from ~1.8 secs to ~7.0 secs. However, in this case, the arrays are fields of a type, so perhaps accessing them from the type and saving back to the type introduces some overhead.
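
Roughly, the pattern looks like the sketch below (the type and field names are made up, and I've written it with current struct syntax); I suspect hoisting the fields into locals before the loop would remove most of that overhead:

struct Frames                      # hypothetical container type
    a::Array{Float64,3}
    b::Matrix{Float64}
    c::Matrix{Float64}
end

function process!(f::Frames)
    a, b, c = f.a, f.b, f.c        # hoist fields into locals once
    for k in 1:size(a, 3), j in 1:size(a, 2)
        @inbounds for i in 1:size(a, 1)
            a[i, j, k] = (a[i, j, k] - b[i, j]) * c[i, j]
        end
    end
    return f
end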

I will check out that commit at some point, @tkelman, but I have to go.

@KristofferC
Member

Oh, I thought #13553 was merged. That one will most likely fix it.

@KristofferC
Member

I can now confirm that the regression is fixed by #13553.

@KristofferC
Member

This can be closed once #13553 is merged.

@rennis250
Contributor Author

Great, thanks! This really saves some time.

However, I am not sure if this should be closed yet. On my machine, it still takes ~3 secs to finish the broadcasting operation at local scope (sometimes ~7 secs when the GC interrupts), while MATLAB can consistently do it in ~0.24 secs after warmup. Only once or twice in my testing so far did Julia reach ~1.2 secs.

@jakebolewski
Member

@rennis250 could you provide some reproducible benchmarks, using the Benchmarks.jl package? Otherwise it will be impossible to objectively reproduce the behavior you are seeing.
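
Something along these lines would do (sketched here with BenchmarkTools.jl, the successor to Benchmarks.jl; note the $ interpolation, which keeps global-variable access out of the measurement):

using BenchmarkTools

a = rand(800, 400, 300); b = rand(800, 400); c = rand(800, 400)
@benchmark ($a .- $b) .* $c    # reports min/median times and allocations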

@jakebolewski jakebolewski reopened this Oct 14, 2015
@timholy
Member

timholy commented Oct 14, 2015

@rennis250, please also check with matlab -singleCompThread if you aren't already. Here's what I get:

Matlab:

>> tic(); bsxfun(@times, bsxfun(@minus, a, b), c); toc()
Elapsed time is 0.823640 seconds.
>> tic(); bsxfun(@times, bsxfun(@minus, a, b), c); toc()
Elapsed time is 0.749950 seconds.
>> tic(); bsxfun(@times, bsxfun(@minus, a, b), c); toc()
Elapsed time is 0.736875 seconds.
>> tic(); bsxfun(@times, bsxfun(@minus, a, b), c); toc()
Elapsed time is 0.745596 seconds.
>> tic(); bsxfun(@times, bsxfun(@minus, a, b), c); toc()
Elapsed time is 0.745860 seconds.

Julia:

julia> @time (a .- b) .* c;
  1.919276 seconds (105 allocations: 1.431 GB, 0.40% gc time)

julia> @time (a .- b) .* c;
  1.275899 seconds (70 allocations: 1.431 GB, 3.82% gc time)

julia> @time (a .- b) .* c;
  1.136543 seconds (70 allocations: 1.431 GB, 12.12% gc time)

julia> @time (a .- b) .* c;
  1.215858 seconds (70 allocations: 1.431 GB, 12.14% gc time)

julia> @time (a .- b) .* c;
  1.180650 seconds (70 allocations: 1.431 GB, 12.00% gc time)

julia> @time (a .- b) .* c;
  1.126926 seconds (70 allocations: 1.431 GB, 12.05% gc time)

julia> @time (a .- b) .* c;
  1.166782 seconds (70 allocations: 1.431 GB, 12.21% gc time)

julia> @time (a .- b) .* c;
  1.190176 seconds (70 allocations: 1.431 GB, 12.50% gc time)

Matlab is just slightly faster.

@KristofferC
Member

I am confused. On my home computer I was getting results around 0.4 seconds, and on my work computer (Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz) I get:

julia> @time (a .- b) .* c;
  4.111516 seconds (70 allocations: 1.431 GB, 2.45% gc time)

Maybe I messed up at home.

@timholy
Member

timholy commented Oct 14, 2015

This benchmark depends a lot on memory pressure. If the machine isn't stressed, it's very consistent; if it varies a lot from trial to trial, try closing some other programs.

I don't think 0.4 seconds is unreasonable, if your home machine is snappy. My laptop is slow, and I was getting 1.2s.

@timholy
Member

timholy commented Oct 14, 2015

Here's the (single-threaded) upper limit of performance for julia on my machine:

julia> function minustimes2!(dest, a, b, c)
           for k = 1:size(dest,3)
               for j = 1:size(dest,2)
                   @simd for i = 1:size(dest, 1)
                       @inbounds dest[i,j,k] = (a[i,j,k] - b[i,j]) * c[i,j]
                   end
               end
           end
           dest
       end
minustimes2! (generic function with 1 method)

julia> minustimes2!(d, a, b, c);

julia> @time minustimes2!(d, a, b, c);
  0.222532 seconds (4 allocations: 160 bytes)

julia> @time minustimes2!(d, a, b, c);
  0.221468 seconds (4 allocations: 160 bytes)

So it's ~6x faster than using broadcasting. At least some of that is due to creating a temporary from a .- b, but perhaps not all of it.

Note that it's much faster than MATLAB, too.

@simonster
Member

I get times around 0.41 seconds for this as written. If I write:

g(a, b, c) = (a - b)*c;
@time broadcast(g, a, b, c);

Then I get times around 0.28 seconds. MATLAB takes ~0.18 seconds for the expression with repmat instead of bsxfun (although this doesn't include the time to actually perform the repmat), so this is still slower than that. Tim's code gives me times around 0.21 seconds if I call it as minustimes2!(similar(a), a, b, c), or 0.15 seconds if I call it as minustimes2!(x, a, b, c) with a preallocated x. It seems there are a few things going on here:

  • MATLAB appears to be capable of devectorizing this on its own, but we aren't. (MATLAB timing goes up to ~0.29 seconds if I write tic(); d = a - b; e = d .* c; toc().)
  • There is some overhead to broadcasting versus a manually written loop.
  • First accesses to the output array are slow. (If memory pressure is high, they could be much slower.)
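
Putting those observations together, something like this sketch avoids both the a .- b temporary and the first-touch cost of a fresh output array:

g(a, b, c) = (a - b) * c           # scalar kernel, applied elementwise below
d = similar(a)                     # preallocate the output once, outside the timing
broadcast!(g, d, a, b, c)          # writes into d with no intermediate arrays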

@kshyatt kshyatt added performance Must go faster regression Regression in behavior compared to a previous version labels Oct 14, 2015
@JeffBezanson JeffBezanson removed the regression Regression in behavior compared to a previous version label Oct 21, 2015
@JeffBezanson JeffBezanson changed the title (0.5.0-dev performance regression) Unexpectedly slow behavior when broadcasting operations broadcasting operations could be faster Oct 21, 2015
@jrevels jrevels added the potential benchmark Could make a good benchmark in BaseBenchmarks label Nov 17, 2015
@stevengj
Member

stevengj commented Aug 2, 2016

Note that #17623 may mostly fix this. In my tests, e.g. x .+ 3 .* x .* x .* x .+ 4 .* x .* x is pretty much indistinguishable in performance from an explicit in-place loop.
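
For example (a small sketch using the fused dot syntax from #17623; every dotted operation in the expression lowers to a single loop):

x = rand(10^6)
y = similar(x)
y .= x .+ 3 .* x .* x .* x .+ 4 .* x .* x   # one fused kernel, zero temporaries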

@oscardssmith
Member

Quick related question: why is 3x different from 3 .* x? Should they be the same?

@stevengj
Member

stevengj commented Dec 8, 2016

In general, α .* x tells Julia at parse time that it is a fusing broadcast(*, α, x) operation. If you do α * x, on the other hand, Julia doesn't know until compile-time (at best) what * does (since it needs to know the types of α and x to dispatch to the correct * method).

Of course, there is room for future compile-time optimizations that do additional fusion, but that is separate from #17623.

For the specific case of numeric literal juxtaposition à la 3x, it would be possible to parse it as 3 .* x rather than 3 * x. I actually experimented with that in #17623 but reverted it to be conservative. It is still something we can consider, but I would keep it separate from #17623.
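
A quick illustration of the difference:

x = rand(10^6); y = similar(x)
y .= 3 .* x .+ 1        # fully fused: one loop, no temporaries
y .= 3 * x .+ 1         # 3 * x allocates a temporary array before the broadcast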

@stevengj
Member

This will be fixed once #17623 is merged:

julia> a = rand(800,400,300);  b = rand(800,400); c = rand(800,400);

julia> f(a,b,c) = (a .= (a .- b) .* c)
f (generic function with 1 method)

julia> @time f(a,b,c);
  0.116997 seconds (5 allocations: 192 bytes)

@stevengj
Member

Closed by #17623

@KristofferC KristofferC removed the potential benchmark Could make a good benchmark in BaseBenchmarks label Aug 16, 2018