broadcasting operations could be faster #13603
Ah, I should say that the Julia operation takes ~40 seconds at global scope. Within the function that I actually use, which is found within a pre-compiled module, it takes ~15 secs. Sorry for that; I will correct the above. Best,
This is a regression compared to 0.4.0; running the same operation on 0.4 is significantly faster.
Really? I found similar performance issues on 0.4 also... Let me double-check. Thanks.
Wow, you're right. I don't know why I never noticed that, since I actually developed this code on 0.4.0-dev and had similar speed issues then. Anyway, thanks for catching that.
As an additional comparison, MATLAB R2015b apparently performs the following in ~0.2 secs:

```matlab
>> a = rand(800,400,300);
>> b = repmat(rand(800,400),1,1,300);
>> c = repmat(rand(800,400),1,1,300);
>> tic();a = (a - b) .* c;toc()
Elapsed time is 0.200628 seconds.
```

Anyway, now it is time to get back to work. :) Thanks again, @KristofferC - that saves me a lot of frustration. Best,
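For comparison, the direct Julia analogue needs no `repmat`, since broadcasting virtually extends `b` and `c` along the third dimension. A sketch using the sizes from this thread:

```julia
a = rand(800, 400, 300);
b = rand(800, 400);   # broadcast extends the 2-D arrays over dimension 3
c = rand(800, 400);
@time a = (a .- b) .* c;
```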
Ouch. Worth a bisect. Does #13553 help? (Was the MATLAB timing with the single-threaded command-line flag?)
Ah, good question. No, it was not. Here it is with the CLI flag:

```matlab
>> a = rand(800,400,300);
>> b = repmat(rand(800,400),1,1,300);
>> c = repmat(rand(800,400),1,1,300);
>> tic();a = (a - b) .* c;toc()
Elapsed time is 1.746552 seconds.
>> tic();a = (a - b) .* c;toc()
Elapsed time is 0.244584 seconds.
>> tic();a = (a - b) .* c;toc()
Elapsed time is 0.218467 seconds.
```

On Julia 0.4.0, with the function in a pre-compiled module, I now get a timing of anywhere from ~1.8 secs to ~7.0 secs. However, in this case the arrays are fields of a type, so perhaps accessing them from the type and saving back to the type introduces some overhead. I will check out that commit at some point, @tkelman, but I have to go.
Oh, I thought #13553 was merged. That one will most likely fix it.
I can now confirm that the regression is fixed by #13553.
This can be closed once #13553 is merged.
Great, thanks! This really saves some time. However, I am not sure if this should be closed yet. On my machine, it still takes ~3 secs to finish the broadcasting operation at local scope (sometimes ~7 secs when the GC interrupts), while MATLAB can consistently do it in ~0.24 secs after warmup. Only once or twice in my testing so far did Julia reach ~1.2 secs.
@rennis250 could you provide some reproducible benchmarks, using the Benchmarks.jl package? Otherwise it will be impossible to objectively reproduce the behavior you are seeing.
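Benchmarks.jl has since been superseded by BenchmarkTools.jl; a minimal reproducible benchmark along the requested lines might look like this sketch (with `a`, `b`, and `c` defined as elsewhere in the thread):

```julia
using BenchmarkTools

# $-interpolation avoids timing global-variable dispatch overhead.
@btime ($a .- $b) .* $c;
```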
@rennis250, please also check with MATLAB:

Julia:

```julia
julia> @time (a .- b) .* c;
1.919276 seconds (105 allocations: 1.431 GB, 0.40% gc time)
julia> @time (a .- b) .* c;
1.275899 seconds (70 allocations: 1.431 GB, 3.82% gc time)
julia> @time (a .- b) .* c;
1.136543 seconds (70 allocations: 1.431 GB, 12.12% gc time)
julia> @time (a .- b) .* c;
1.215858 seconds (70 allocations: 1.431 GB, 12.14% gc time)
julia> @time (a .- b) .* c;
1.180650 seconds (70 allocations: 1.431 GB, 12.00% gc time)
julia> @time (a .- b) .* c;
1.126926 seconds (70 allocations: 1.431 GB, 12.05% gc time)
julia> @time (a .- b) .* c;
1.166782 seconds (70 allocations: 1.431 GB, 12.21% gc time)
julia> @time (a .- b) .* c;
1.190176 seconds (70 allocations: 1.431 GB, 12.50% gc time)
```

MATLAB is just slightly faster.
I am confused. On my home computer I was getting ~0.4-second results, and on my work computer (Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz) I get noticeably slower timings. Maybe I messed up at home.
This benchmark depends a lot on memory pressure. If the machine isn't stressed, it's very consistent; if it varies a lot from trial to trial, try closing some stuff down. I don't think 0.4 seconds is unreasonable if your home machine is snappy. My laptop is slow, and I was getting 1.2s.
Here's the (single-threaded) upper limit of performance for Julia on my machine:

```julia
julia> function minustimes2!(dest, a, b, c)
           for k = 1:size(dest, 3)
               for j = 1:size(dest, 2)
                   @simd for i = 1:size(dest, 1)
                       @inbounds dest[i,j,k] = (a[i,j,k] - b[i,j]) * c[i,j]
                   end
               end
           end
           dest
       end
minustimes2! (generic function with 1 method)

julia> minustimes2!(d, a, b, c);

julia> @time minustimes2!(d, a, b, c);
  0.222532 seconds (4 allocations: 160 bytes)

julia> @time minustimes2!(d, a, b, c);
  0.221468 seconds (4 allocations: 160 bytes)
```

So it's ~6x faster than using broadcasting. At least some of that is due to creating a temporary from `a .- b`. Note, it's much faster than MATLAB, too.
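For completeness, the transcript above presumably uses arrays like these (sizes taken from the issue; the preallocated destination `d` is inferred from the call):

```julia
a = rand(800, 400, 300);
b = rand(800, 400);
c = rand(800, 400);
d = similar(a);   # preallocated output, so the kernel itself allocates nothing
```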
I get times around 0.41 seconds for this as written. If I write:

```julia
g(a, b, c) = (a - b) * c;
@time broadcast(g, a, b, c);
```

then I get times around ~0.28 seconds. MATLAB is ~0.18 seconds for the expression with
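A self-contained version of that comparison, for anyone reproducing it (array sizes assumed from the thread; timings will of course vary by machine):

```julia
g(a, b, c) = (a - b) * c      # scalar kernel; broadcast applies it elementwise

a = rand(800, 400, 300); b = rand(800, 400); c = rand(800, 400);
broadcast(g, a, b, c);        # warm-up call to compile
@time broadcast(g, a, b, c);
```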
Note that #17623 may mostly fix this, based on my tests.
Quick related question: why is `3x` different from `3 .* x`? Should they be the same?
In general, `3x` is just the ordinary call `3*x`, whereas `3 .* x` means `broadcast(*, 3, x)`; they are different function calls, and after #17623 only the latter participates in dot-syntax fusion. Of course, there is room for future compile-time optimizations that do additional fusion, but that is separate from #17623. For the specific case of numeric literal juxtaposition à la `3x`, the juxtaposition simply parses as `3*x`, so it cannot fuse.
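A quick way to see the difference (a sketch assuming Julia 0.6 or later, i.e. with dot-syntax fusion):

```julia
x = rand(10^6);
y = similar(x);

# `3x` parses as the ordinary call `3 * x`, which allocates a full temporary
# array before the fused `.+`/`.=` loop runs.
@time y .= 3x .+ 1;

# `3 .* x` is broadcast(*, 3, x); inside a dot expression it fuses into a
# single in-place loop, with no array temporary.
@time y .= 3 .* x .+ 1;
```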
This will be fixed once #17623 is merged:

```julia
julia> a = rand(800,400,300); b = rand(800,400); c = rand(800,400);

julia> f(a,b,c) = (a .= (a .- b) .* c)
f (generic function with 1 method)

julia> @time f(a,b,c);
  0.116997 seconds (5 allocations: 192 bytes)
```
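After #17623, the same fused in-place loop can also be written with the `@.` macro, which dots every operation and turns the `=` into `.=` (a small additional sketch, not from the original thread):

```julia
# Equivalent to f above: the whole right-hand side fuses into one
# in-place loop over a, with no array temporaries.
h(a, b, c) = @. a = (a - b) * c
```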
Closed by #17623 |
Dear all,
I can't post the exact code or data just yet, but I am trying to work with arrays in a manner equivalent to the following:
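Presumably it was along these lines, judging from the sizes and the `(a .- b) .* c` expression used throughout the thread (a reconstructed sketch, not the author's exact code):

```julia
a = rand(800, 400, 300);   # 800x400 frames at 300 time points
b = rand(800, 400);
c = rand(800, 400);
a = (a .- b) .* c;         # subtract b from every frame, then scale by c
```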
This is usually the smallest size data I have to work with. Normally, `a` is 800x400xT, where T can be 600 elements or greater. `b` and `c` are always 800x400. Anyway, I read in the manual that with arrays of this size, broadcasting will give significant performance gains. I do indeed get gains compared to hand-written `for` loops, but the operation still takes ~15 secs with a function from a pre-compiled module, whereas with `for` loops it takes ~24 secs. I have tried parallel implementations, using `@spawnat` and `@sync`, but these actually run slower (I've tested using from 2 up to 14 cores). However, I would presume that this is due to the amount of data transfer that is going on.

On a single core, `@time` gives the following response:

```
38.776429 seconds (907.15 M allocations: 20.682 GB, 6.09% gc time)
```
Using `@inbounds` can give some extra speed, by about a few seconds, but assigning into a different variable, rather than saving directly back into `a`, does not, as expected.

As a comparison, a Go 1.5 implementation finishes this process in ~1.5 secs (as measured by `pprof`) and gives the same answer as Julia. It uses three nested `for` loops to step through these matrices element by element via linear indexing and performs the same operation. (Note, I haven't yet tested the speed of linear indexing in Julia; a sketch of what that might look like appears below.) The Go implementation that I mentioned just now uses 1 core, but it was expanded to run in parallel on 24 cores, where it takes about ~0.5 secs to complete (a blind, brute-force "distribute the data evenly across all the cores" approach was used here).

Here is my `versioninfo()` output:

Anyway, I have been away from the Julia community for some time, so I don't know what new practices are recommended for speed. Regardless, I was very surprised by this. As much as I like Go, I would be happy to know what I can do to improve this, as I could be moving along much faster with my work by writing everything in Julia.
Best,
Rob