Slow broadcast addition of matrices #6041
Comments
I'm sorry. None of the usual suspects for Julia performance seems to make a difference. Your benchmark does not seem to have the huge overhead that I expected. I have a revised benchmark that accomplishes the same as yours in much fewer lines, but the time seems to be spent in the BLAS functions. I had some improvement for the 10x10 matrices to call ...
The entire problem is garbage collection: if I preallocate the output, the performance of Julia and Python is identical:
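A minimal sketch of the preallocation idea (illustrative only; the variable names and sizes are made up, and this is not the code the comment above refers to):

A = rand(1000, 1000)
B = rand(1000, 1000)
C = similar(A)            # preallocate the output once

broadcast!(+, C, A, B)    # fills C with A .+ B in place: no new array, no GC pressure

Reusing C across calls removes the allocation that A .+ B performs on every call, which is what the garbage collector was paying for.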
You may be interested in trying the branch on #5227.
Related: #4784
Since this is "just" a GC issue, and since we have other open GC issues, I'm closing. Feel free to reopen if something unique crops up.
Tim, sorry for bringing this up again, but I have run this updated benchmark to compare the "preallocated" and "dynamic" cases (and compare them both to the regular +). (Edit: used the "pure" ...)
Broadcasting for small arrays shows a 3x slowdown compared to just +.
With the latest master, I get these timings from your gist:
So yes, broadcasting with preallocation is still slower than plain +.
So, to reiterate my initial question: is there something special that broadcasting does, and should it be avoided in performance-critical code? Because even with preallocation, the result for 10x10 matrices is 1.5 times slower than the Python version. I would expect ...
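To make the terminology concrete, a small sketch of the three cases being compared (an assumption about the benchmark's structure; the actual sizes and repetition counts may differ):

A = rand(10, 10); B = rand(10, 10); C = similar(A)

A + B                     # plain addition
A .+ B                    # "dynamic" broadcast: allocates the result on every call
broadcast!(+, C, A, B)    # "preallocated" broadcast: writes into C in place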
Profiling the 10x10 case shows that most of the time is spent in the function lookup from the cache, i.e. the line ..., after which profiling information stops.
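To illustrate the kind of lookup being described, here is a simplified, self-contained sketch of a function cache keyed by a Tuple (purely illustrative; the names are invented and this is not the actual Base broadcast code):

const tuple_cache = Dict{Any,Function}()

function make_kernel(f)
    # stand-in for the specialized elementwise loop that broadcast generates
    return (out, x, y) -> begin
        for i in eachindex(out)
            out[i] = f(x[i], y[i])
        end
        out
    end
end

function cached_kernel(f, x, y)
    key = (f, typeof(x), typeof(y))                               # Tuple key, hashed on every call
    haskey(tuple_cache, key) || (tuple_cache[key] = make_kernel(f))
    return tuple_cache[key]
end

A = rand(10, 10); B = rand(10, 10); C = similar(A)
cached_kernel(+, A, B)(C, A, B)

For large matrices the hashing and lookup are amortized by the elementwise work; for 10x10 matrices they are not, which matches the profile above.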
Argh. I suppose we could create special-purpose versions for all kinds of operators for the binary case. A general fix is #5395.
In fact, we can fix it!
See this gist where I wrote everything explicitly. It would look much better with a macro, but note that the full ugly signatures of the nested Dicts are actually needed to achieve maximum performance. Timings:
Is there perhaps a better data structure that would make this easier / more elegant / faster?
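For concreteness, a rough sketch of the nested-Dict idea (illustrative; not the gist code, and the types below are only an example of the "full ugly signatures" mentioned above):

make_binary_kernel(f) = (out, x, y) -> (map!(f, out, x, y); out)   # stand-in kernel generator

const nested_cache = Dict{Function,Dict{Int,Function}}()           # explicitly typed at each level

function lookup_kernel(f::Function, nargs::Int)
    inner = get!(nested_cache, f) do
        Dict{Int,Function}()
    end
    return get!(inner, nargs) do
        make_binary_kernel(f)
    end
end

Because each level of the nested Dict has known key and value types, the lookups are type-stable and avoid hashing a Tuple, which is where the time went in the profile above.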
@carlobaldassi, that's a nice speedup! @StefanKarpinski, this would not be easier or more elegant (the ...). Or, we could decree the following: anytime you compile a method for ...
Maybe a ...
Hmm, great minds think alike 😄.
The ...
Even though the ...
Yes, I think pre-defining up to ... And yes, maybe we should just generate separate caches for ...
Haha, it seems we posted the exact same comment (well, almost) at the same moment. Anyway, I experimented a bit with the solutions which have come up so far. My results (based on an extended version of the benchmark by @Manticore) are that ...
Points 1 and 2 are surprising, but I tried to make it as fast as I could and that's the result.
That is quite surprising.
Just to add to the confusion, I tried to do more thorough experiments with specialized broadcast functions for ...
Change broadcast! cache to use nested Dict instead of a Dict with Tuple keys; use get! macro to access its fields. Related to #6041. Also reduces code duplication in broadcast! definitions via eval macro.
I have seen the same thing. Never figured out what is going on. I've even seen a function suddenly start running more slowly partway through a REPL session, with no obvious reason for the switch.
Just FYI, there is a ...
FWIW, I noticed differences in timings especially on a machine with many cores, and I suspect that there might be subtle timing differences when code gets run on different processors/cores.
Regarding fast broadcasting on small matrices, I created the functions ... Apart from that dict lookup, the intent was that the lookup based on the number ...
This issue seems to be moot in 0.5 with fast higher-order functions. If anything, the broadcast version is now faster:
julia> using BenchmarkTools
julia> dotplus(A,B) = broadcast(+, A, B)
dotplus (generic function with 1 method)
julia> A = rand(1000,1000); B = copy(A);
julia> @benchmark $A + $B
BenchmarkTools.Trial:
memory estimate: 7.63 mb
allocs estimate: 3
--------------
minimum time: 2.151 ms (0.00% GC)
median time: 2.579 ms (0.00% GC)
mean time: 2.678 ms (16.03% GC)
maximum time: 5.043 ms (43.44% GC)
--------------
samples: 1866
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark dotplus($A, $B)
BenchmarkTools.Trial:
memory estimate: 7.63 mb
allocs estimate: 14
--------------
minimum time: 1.681 ms (0.00% GC)
median time: 1.922 ms (0.00% GC)
mean time: 2.075 ms (20.73% GC)
maximum time: 3.763 ms (56.19% GC)
--------------
samples: 2407
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
At least for me, broadcasting is still noticeably slower than plain + for small matrices:
julia> A = rand(10,10); B = copy(A);
julia> @benchmark $A + $B
BenchmarkTools.Trial:
samples: 10000
evals/sample: 530
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 928.00 bytes
allocs estimate: 2
minimum time: 212.00 ns (0.00% GC)
median time: 224.00 ns (0.00% GC)
mean time: 255.32 ns (6.28% GC)
maximum time: 2.27 μs (76.42% GC)
julia> @benchmark dotplus($A, $B)
BenchmarkTools.Trial:
samples: 10000
evals/sample: 137
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 1.23 kb
allocs estimate: 13
minimum time: 720.00 ns (0.00% GC)
median time: 736.00 ns (0.00% GC)
mean time: 840.92 ns (8.36% GC)
maximum time: 17.67 μs (90.57% GC)
This is more elegant; see the Julia docs.
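Assuming this refers to the fused dot-broadcast syntax (an assumption, not confirmed by the comment above), a minimal example:

A = rand(10, 10); B = rand(10, 10); C = similar(A)
C .= A .+ B       # dot fusion: one in-place loop, no temporary array
@. C = A + B      # equivalent, with @. adding the dots automatically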
Consider the following benchmark. It compares the performance of broadcast addition (.+) for matrices of different sizes in Julia and in Python. In Python, + was used because it broadcasts automatically behind the scenes.
My configuration:
OSX 10.9.2
Python 3.3.4, numpy 1.8.0
Julia HEAD (as of 4 Mar 2014, commit d36bb08), compiled with Accelerate
The results for the dot tests are (combined side-by-side for convenience)
Python | Julia
In all cases there is a 2x-10x slowdown. Is there something special that broadcasting does, and should it be avoided in performance-critical code? It does not seem to have such a prominent effect in Python.
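For readers without the linked benchmark, a rough sketch of the Julia side of such a size sweep (the actual sizes, repetition counts, and timing method in the original benchmark may differ):

for n in (10, 100, 1000)
    A = rand(n, n); B = rand(n, n)
    t_plus  = @elapsed for _ in 1:100; A + B; end
    t_bcast = @elapsed for _ in 1:100; A .+ B; end
    println("n = $n:  +  $(t_plus) s,   .+  $(t_bcast) s")
end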