
broadcasting operations could be faster #13603

Closed
rennis250 opened this issue Oct 14, 2015 · 22 comments
Labels
performance Must go faster

Comments

@rennis250
Contributor

Dear all,

I can't post the exact code or data just yet, but I am trying to work with arrays in a manner equivalent to the following:

a = rand(800,400,300)
b = rand(800,400)
c = rand(800,400)

a = (a .- b) .* c

This is usually the smallest data size I have to work with. Normally, a is 800x400xT, where T can be 600 or greater; b and c are always 800x400. Anyway, I read in the manual that with arrays of this size, broadcasting should give significant performance gains. I do indeed get gains compared to hand-written for loops: the operation takes ~15 secs in a function from a precompiled module, versus ~24 secs with for loops. I have also tried parallel implementations using @spawnat and @sync, but these actually run slower (I've tested with 2 up to 14 cores), which I presume is due to the amount of data transfer going on.
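
For reference, here is a minimal sketch of roughly how I time this; the function name update is just illustrative:

function update(a, b, c)
    return (a .- b) .* c
end

a = rand(800,400,300); b = rand(800,400); c = rand(800,400)
update(a, b, c)            # warm up so compilation isn't included in the timing
@time a = update(a, b, c)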

On a single core, @time gives the following response:

38.776429 seconds (907.15 M allocations: 20.682 GB, 6.09% gc time)

Using @inbounds gives a few seconds of extra speed, but, as expected, assigning into a different variable rather than saving directly back into a makes no difference.

As a comparison, a Go 1.5 implementation finishes this process in ~1.5 secs (as measured by pprof) and gives the same answer as Julia. It steps through these matrices element by element with three nested for loops via linear indexing and performs the same operation. (Note: I haven't yet tested the speed of linear indexing in Julia.) That Go implementation uses one core; expanded to run in parallel on 24 cores, with a blind, brute-force "distribute the data evenly across all the cores" approach, it completes in ~0.5 secs.
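
For the curious, here is a rough, untested sketch of what that linear-indexing scheme would look like in Julia (the function name is just illustrative):

function linearkernel!(a, b, c)
    npage = length(b)              # elements in one 800x400 slice of a
    @inbounds for i in eachindex(a)
        j = mod1(i, npage)         # matching linear index into b and c
        a[i] = (a[i] - b[j]) * c[j]
    end
    return a
end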

Here is my versioninfo() output:

Julia Version 0.5.0-dev+720
Commit ef46074* (2015-10-13 10:56 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.0.0)
  CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Anyway, I have been away from the Julia community for some time, so I don't know what practices are currently recommended for speed. Regardless, I was very surprised by this. As much as I like Go, I would be happy to know what I can do to improve this, as I could move along much faster with my work by writing everything in Julia.

Best,
Rob

@rennis250
Contributor Author

Ah, I should say that the Julia operation takes ~40 seconds at global scope. Within the function that I actually use, which is found within a pre-compiled module, it takes ~15 secs.

Sorry for that. I will correct the above.

Best,
Rob

@KristofferC
Member

This is a regression compared to 0.4.0. Running a = (a .- b) .* c in local scope:

0.4:

0.437902 seconds (66 allocations: 1.431 GB, 17.99% gc time)

In Julia Version 0.5.0-dev+765:

25.237447 seconds (907.03 M allocations: 20.678 GB, 4.77% gc time)

@rennis250
Contributor Author

Really? I found similar performance issues on 0.4 also... Let me double check.

Thanks.

@rennis250
Contributor Author

Wow, you're right. I don't know why I never noticed that, since I actually developed this code on 0.4.0-dev and had similar speed issues then. Anyway, thanks for catching that.

@rennis250
Contributor Author

As an additional comparison, MATLAB R2015b apparently performs the following in ~0.2 secs:

>> a = rand(800,400,300);
>> b = repmat(rand(800,400),1,1,300);
>> c = repmat(rand(800,400),1,1,300);
>> tic();a = (a - b) .* c;toc()
Elapsed time is 0.200628 seconds.

Anyway, now it is time to get back to work. :)

Thanks again, @KristofferC - that saves me a lot of frustration.

Best,
Rob

@rennis250 rennis250 changed the title Unexpectedly slow behavior when broadcasting operations (0.5 Performance regression) Unexpectedly slow behavior when broadcasting operations Oct 14, 2015
@rennis250 rennis250 changed the title (0.5 Performance regression) Unexpectedly slow behavior when broadcasting operations (0.5-dev performance regression) Unexpectedly slow behavior when broadcasting operations Oct 14, 2015
@rennis250 rennis250 changed the title (0.5-dev performance regression) Unexpectedly slow behavior when broadcasting operations (0.5.0-dev performance regression) Unexpectedly slow behavior when broadcasting operations Oct 14, 2015
@tkelman
Contributor

tkelman commented Oct 14, 2015

Ouch. Worth a bisect. Does #13553 help?

(Was the MATLAB timing taken with the single-threaded command-line flag?)

@rennis250
Contributor Author

Ah, good question. No, it was not. Here it is with the CLI flag:

>> a = rand(800,400,300);
>> b = repmat(rand(800,400),1,1,300);
>> c = repmat(rand(800,400),1,1,300);
>> tic();a = (a - b) .* c;toc()
Elapsed time is 1.746552 seconds.
>> tic();a = (a - b) .* c;toc()
Elapsed time is 0.244584 seconds.
>> tic();a = (a - b) .* c;toc()
Elapsed time is 0.218467 seconds.
>>

On Julia 0.4.0, with the function in a pre-compiled module, I now get a timing of anywhere from ~1.8 secs to ~7.0 secs. However, in this case, the arrays are fields of a type, so perhaps accessing them from the type and saving back to the type introduces some overhead.
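
Roughly, the pattern looks like the sketch below (the type and field names are made up, and I've written it with current struct syntax); I suspect hoisting the fields into locals before the loop would remove most of that overhead:

struct Frames                      # hypothetical container type
    a::Array{Float64,3}
    b::Matrix{Float64}
    c::Matrix{Float64}
end

function process!(f::Frames)
    a, b, c = f.a, f.b, f.c        # hoist fields into locals once
    for k in 1:size(a, 3), j in 1:size(a, 2)
        @inbounds for i in 1:size(a, 1)
            a[i, j, k] = (a[i, j, k] - b[i, j]) * c[i, j]
        end
    end
    return f
end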

I will check out that commit at some point, @tkelman, but I have to go.

@KristofferC
Member

Oh, I thought #13553 was merged. That one will most likely fix it.

@KristofferC
Member

I can now confirm that the regression is fixed by #13553.

@KristofferC
Member

This can be closed once #13553 is merged.

@rennis250
Contributor Author

Great, thanks! This really saves some time.

However, I am not sure if this should be closed yet. On my machine, it still takes ~3 secs to finish the broadcasting operation at local scope (sometimes ~7 secs when the GC interrupts), while MATLAB can consistently do it in ~0.24 secs after warmup. Only once or twice in my testing so far did Julia reach ~1.2 secs.

@jakebolewski
Member

@rennis250 could you provide some reproducible benchmarks, using the Benchmarks.jl package? Otherwise it will be impossible to objectively reproduce the behavior you are seeing.
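
Something along these lines would do (sketched here with BenchmarkTools.jl, the successor to Benchmarks.jl; note the $ interpolation, which keeps global-variable access out of the measurement):

using BenchmarkTools

a = rand(800, 400, 300); b = rand(800, 400); c = rand(800, 400)
@benchmark ($a .- $b) .* $c    # reports min/median times and allocations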

@jakebolewski jakebolewski reopened this Oct 14, 2015
@timholy
Member

timholy commented Oct 14, 2015

@rennis250, please also check with matlab -singleCompThread if you aren't already. Here's what I get:

Matlab:

>> tic(); bsxfun(@times, bsxfun(@minus, a, b), c); toc()
Elapsed time is 0.823640 seconds.
>> tic(); bsxfun(@times, bsxfun(@minus, a, b), c); toc()
Elapsed time is 0.749950 seconds.
>> tic(); bsxfun(@times, bsxfun(@minus, a, b), c); toc()
Elapsed time is 0.736875 seconds.
>> tic(); bsxfun(@times, bsxfun(@minus, a, b), c); toc()
Elapsed time is 0.745596 seconds.
>> tic(); bsxfun(@times, bsxfun(@minus, a, b), c); toc()
Elapsed time is 0.745860 seconds.

Julia:

julia> @time (a .- b) .* c;
  1.919276 seconds (105 allocations: 1.431 GB, 0.40% gc time)

julia> @time (a .- b) .* c;
  1.275899 seconds (70 allocations: 1.431 GB, 3.82% gc time)

julia> @time (a .- b) .* c;
  1.136543 seconds (70 allocations: 1.431 GB, 12.12% gc time)

julia> @time (a .- b) .* c;
  1.215858 seconds (70 allocations: 1.431 GB, 12.14% gc time)

julia> @time (a .- b) .* c;
  1.180650 seconds (70 allocations: 1.431 GB, 12.00% gc time)

julia> @time (a .- b) .* c;
  1.126926 seconds (70 allocations: 1.431 GB, 12.05% gc time)

julia> @time (a .- b) .* c;
  1.166782 seconds (70 allocations: 1.431 GB, 12.21% gc time)

julia> @time (a .- b) .* c;
  1.190176 seconds (70 allocations: 1.431 GB, 12.50% gc time)

Matlab is just slightly faster.

@KristofferC
Member

I am confused. On my home computer I was getting results around 0.4 seconds, and on my work computer (Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz) I get:

julia> @time (a .- b) .* c;
  4.111516 seconds (70 allocations: 1.431 GB, 2.45% gc time)

Maybe I messed up at home.

@timholy
Member

timholy commented Oct 14, 2015

This benchmark depends a lot on memory pressure. If the machine isn't stressed, it's very consistent; if it varies a lot from trial to trial, try closing some other programs.

I don't think 0.4 seconds is unreasonable, if your home machine is snappy. My laptop is slow, and I was getting 1.2s.

@timholy
Member

timholy commented Oct 14, 2015

Here's the (single-threaded) upper limit of performance for julia on my machine:

julia> function minustimes2!(dest, a, b, c)
           for k = 1:size(dest,3)
               for j = 1:size(dest,2)
                   @simd for i = 1:size(dest, 1)
                       @inbounds dest[i,j,k] = (a[i,j,k] - b[i,j]) * c[i,j]
                   end
               end
           end
           dest
       end
minustimes2! (generic function with 1 method)

julia> minustimes2!(d, a, b, c);

julia> @time minustimes2!(d, a, b, c);
  0.222532 seconds (4 allocations: 160 bytes)

julia> @time minustimes2!(d, a, b, c);
  0.221468 seconds (4 allocations: 160 bytes)

So it's ~6x faster than using broadcasting. At least some of that is due to creating a temporary from a .- b, but perhaps not all of it.

Note that it's much faster than MATLAB, too.

@simonster
Member

I get times around 0.41 seconds for this as written. If I write:

g(a, b, c) = (a - b)*c;
@time broadcast(g, a, b, c);

Then I get times around 0.28 seconds. MATLAB takes ~0.18 seconds for the expression with repmat instead of bsxfun (although this doesn't include the time to actually perform the repmat), so this is still slower than that. Tim's code gives me times around 0.21 seconds if I call it as minustimes2!(similar(a), a, b, c), or 0.15 seconds if I call it as minustimes2!(x, a, b, c) with a preallocated x. It seems there are a few things going on here:

  • MATLAB appears to be capable of devectorizing this on its own, but we aren't. (MATLAB timing goes up to ~0.29 seconds if I write tic(); d = a - b; e = d .* c; toc().)
  • There is some overhead to broadcasting versus a manually written loop.
  • First accesses to the output array are slow. (If memory pressure is high, they could be much slower.)
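
Putting those observations together, something like this sketch avoids both the a .- b temporary and the first-touch cost of a fresh output array:

g(a, b, c) = (a - b) * c           # scalar kernel, applied elementwise below
d = similar(a)                     # preallocate the output once, outside the timing
broadcast!(g, d, a, b, c)          # writes into d with no intermediate arrays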

@kshyatt kshyatt added performance Must go faster regression Regression in behavior compared to a previous version labels Oct 14, 2015
@JeffBezanson JeffBezanson removed the regression Regression in behavior compared to a previous version label Oct 21, 2015
@JeffBezanson JeffBezanson changed the title (0.5.0-dev performance regression) Unexpectedly slow behavior when broadcasting operations broadcasting operations could be faster Oct 21, 2015
@jrevels jrevels added the potential benchmark Could make a good benchmark in BaseBenchmarks label Nov 17, 2015
@stevengj
Member

stevengj commented Aug 2, 2016

Note that #17623 may mostly fix this. In my tests, e.g. x .+ 3 .* x .* x .* x .+ 4 .* x .* x is pretty much indistinguishable in performance from an explicit in-place loop.
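
For example (a small sketch using the fused dot syntax from #17623; every dotted operation in the expression lowers to a single loop):

x = rand(10^6)
y = similar(x)
y .= x .+ 3 .* x .* x .* x .+ 4 .* x .* x   # one fused kernel, zero temporaries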

@oscardssmith
Member

Quick related question: why is 3x different from 3 .* x? Should they be the same?

@stevengj
Member

stevengj commented Dec 8, 2016

In general, α .* x tells Julia at parse time that it is a fusing broadcast(*, α, x) operation. If you do α * x, on the other hand, Julia doesn't know until compile-time (at best) what * does (since it needs to know the types of α and x to dispatch to the correct * method).

Of course, there is room for future compile-time optimizations that do additional fusion, but that is separate from #17623.

For the specific case of numeric literal juxtaposition à la 3x, it would be possible to parse it as 3 .* x rather than 3 * x. I actually experimented with that in #17623 but reverted it to be conservative. It is still something we can consider, but I would keep it separate from #17623.
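
A quick illustration of the difference:

x = rand(10^6); y = similar(x)
y .= 3 .* x .+ 1        # fully fused: one loop, no temporaries
y .= 3 * x .+ 1         # 3 * x allocates a temporary array before the broadcast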

@stevengj
Member

This will be fixed once #17623 is merged:

julia> a = rand(800,400,300);  b = rand(800,400); c = rand(800,400);

julia> f(a,b,c) = (a .= (a .- b) .* c)
f (generic function with 1 method)

julia> @time f(a,b,c);
  0.116997 seconds (5 allocations: 192 bytes)

@stevengj
Member

Closed by #17623

@KristofferC KristofferC removed the potential benchmark Could make a good benchmark in BaseBenchmarks label Aug 16, 2018