improve performance of complex numbers #323
This depends on dealing with packed aggregates (structs) and either immutability or escape analysis. Or some kind of "value types", which is how people usually punt on this. |
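For concreteness, here is a minimal sketch of the kind of packed, immutable value type being discussed, written in present-day Julia syntax (where `struct` is immutable by default, and `Complex{Float64}` is already such a type); the names are illustrative only, not part of any proposal in this thread:

```julia
# A packed, immutable pair of Float64s: the compiler can keep it in registers
# or store it inline in arrays, with no heap allocation or pointer chasing.
struct MyComplex
    re::Float64
    im::Float64
end

# Arithmetic just constructs new values; because they are immutable there is
# no aliasing to track, so no escape analysis is needed to keep this cheap.
Base.:+(a::MyComplex, b::MyComplex) = MyComplex(a.re + b.re, a.im + b.im)
Base.:*(a::MyComplex, b::MyComplex) =
    MyComplex(a.re*b.re - a.im*b.im, a.re*b.im + a.im*b.re)

abs2c(z::MyComplex) = z.re*z.re + z.im*z.im
```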
If these optimizations are some ways away, then should we have a complex array implementation that stores the real and imaginary parts separately? |
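A rough sketch of that split-storage ("structure of arrays") idea, purely for illustration; this type is hypothetical, not an actual Julia API:

```julia
# Keep the real and imaginary parts in two separate dense vectors instead of
# interleaving them element by element.
struct SplitComplexVector
    re::Vector{Float64}
    im::Vector{Float64}
end

SplitComplexVector(n::Integer) = SplitComplexVector(zeros(n), zeros(n))

Base.length(v::SplitComplexVector) = length(v.re)

# Reassemble an ordinary Complex on access...
Base.getindex(v::SplitComplexVector, i::Int) = complex(v.re[i], v.im[i])

# ...and split it back apart on assignment.
function Base.setindex!(v::SplitComplexVector, z::Complex, i::Int)
    v.re[i] = real(z)
    v.im[i] = imag(z)
    return v
end
```

As the reply below notes, this layout cannot be handed directly to native libraries (BLAS, FFTW, and friends expect interleaved real/imaginary pairs), so conversions or copies would be needed at those boundaries.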
That seems like a lot of extra work, especially when calling native libraries. It also doesn't address the whole performance issue, which exists even with scalar operations (e.g. mandel). And at least we're faster than the other systems. |
This gist https://gist.github.com/3150312 modifies the code from the new mandelbrot shootout to examine this issue. For that function, using Julia's complex is almost 100-fold slower than a manual implementation of the complex arithmetic. Here's how you use it:

    load("mandelbrot_with_real.jl")
    main(ARGS, stdout_stream)

Only pay attention to the results on the second run, as the first run is affected by compilation time. |
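For readers without the gist handy, the comparison is roughly the following (a sketch of the two styles, in the spirit of the gist rather than its actual code):

```julia
# Inner loop written with the built-in Complex type.
function mandel_complex(c::Complex{Float64}, maxiter::Int)
    z = c
    for n = 1:maxiter
        abs2(z) > 4.0 && return n - 1
        z = z*z + c
    end
    return maxiter
end

# The same iteration with the complex arithmetic written out by hand
# on two Float64s.
function mandel_manual(cre::Float64, cim::Float64, maxiter::Int)
    zre, zim = cre, cim
    for n = 1:maxiter
        zre*zre + zim*zim > 4.0 && return n - 1
        zre, zim = zre*zre - zim*zim + cre, 2.0*zre*zim + cim
    end
    return maxiter
end
```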
Hmm, I wonder if it's a type inference issue. I added a second function to the gist, and it shows only a tenfold gap, which (while slower) is closer to what you'd expect.

    load("complex_test.jl")
    run_timing(20)
    run_timing(10_000_000)
|
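One way to test the type-inference hypothesis with present-day tooling (which postdates this thread) is `@code_warntype`; the kernel below is a stand-in for the gist's function, not the gist itself:

```julia
# Any non-concrete (highlighted) type in this output would explain an
# order-of-magnitude slowdown like the one reported above.
function iterate_complex(c)
    z = c
    for _ in 1:10
        z = z*z + c
    end
    return z
end

@code_warntype iterate_complex(0.3 + 0.4im)
```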
LLVM strikes again! The manual complex arithmetic version always returns true, so the optimizer skips the entire body of the function. Computing |
Hmm, skipping the body of the code entirely would tend to speed it up. :-) Thanks for figuring that out. |
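A common guard against this benchmarking pitfall is to make the returned value depend on every iteration, so the optimizer cannot prove the loop body dead and delete it. A minimal sketch, reusing the hypothetical `mandel_complex` kernel from the sketch above:

```julia
function bench_mandel(n::Int)
    acc = 0
    for i in 1:n
        # Vary the input with the loop index and fold every result into an
        # accumulator that is returned, so LLVM cannot discard the work.
        c = complex(-0.5 + i/(2n), 0.5)
        acc += mandel_complex(c, 80)
    end
    return acc
end

@time bench_mandel(10_000)   # run twice; the first call includes compilation time
```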
Homepage updated with new benchmark results: http://julialang.org/. We are now beating all other high-level languages on all benchmarks. We're even beating Fortran on all but two benchmarks! We still get spanked on fib and mandel, but man, these are exciting numbers. I can't even imagine what's going to happen once composite types go all immutable. |
Holy smokes! Very exciting, and really, really impressive. |
Actually, isn't it |
I'm not sure that beating Fortran by a few percent counts as a "bad case" ;-) |
I don't know why Fortran would be slower than C in that case; that is very suspicious. |
Agreed. I skimmed the Fortran and it looks reasonable, but I am no Fortran expert. Do we have any Fortran experts? |
Wait...isn't there a bug? In the C version, PtP1 is 5x5. In the Julia case, P.'*P is 20x20. If I'm right, that will narrow the gap between Julia and C tremendously. |
Oh, now I see the comments re Fortran. I bet this explains that oddity too. |
In other words, in Julia it should be |
The Julia version is the original, ported from Matlab, so it is authoritative. Not that it really matters what it does, as long as they all do the same thing. |
But the point is, they don't. The C version takes the 4th power of a 5x5 matrix. The Julia version takes the 4th power of a 20x20 matrix. |
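To make the dimensional discrepancy concrete (in current syntax, since the old `.'` transpose operator is gone): with n = 5 the four random blocks are 5x5, so P is 5x20, and the two products differ enormously in the cost of the 4th power and trace:

```julia
using LinearAlgebra

n = 5
a, b, c, d = randn(n, n), randn(n, n), randn(n, n), randn(n, n)
P = hcat(a, b, c, d)             # 5x20

size(transpose(P) * P)           # (20, 20): what the Julia benchmark computes
size(P * transpose(P))           # (5, 5):   what the C version was computing

tr((transpose(P) * P)^4)         # roughly the quantity randmatstat accumulates
```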
Oh, forehead slap. I wrote this C code. I am ashamed. I guess I'll take a crack at fixing it in the morning. Tim, thanks for finding that. It's like performance manna from heaven. Or from my own stupidity. |
Illustrates the point even better than numbers --- the C code is hard to get right!! |
Very true. It's pretty common for people to look at the C code and say "oh, that's not so bad", but it is — because numerical computing is already hard, and having to do it in C just makes it harder. Case in point. |
Also, have we discovered a new principle? "If the C code is faster than the Fortran code, the C code is probably buggy." |
No problem, Stefan. I agree with Jeff that the C code is much harder to get right. The principle doesn't work universally: C++ is beating Fortran at some tasks these days. Check out the graphs on this page: http://eigen.tuxfamily.org/index.php?title=Benchmark. No one package wins them all (by any stretch), but for some tasks and parameters the C++ code is well ahead of the others. I think the main factor is template metaprogramming, but I'm not sure. On a related point, the work I've been doing to avoid temporaries is interesting with respect to this problem:
(edit) compared with
where
I doubt that's the version you want to use for the perf test (presuming you want to be as readable as the Matlab version), but it is revealing. It's interesting that even for 20x20 matrix multiplication, the allocation of temporaries has such a large effect. Of course, it's also a closer match to what the C code is doing. So if I were to predict your findings, Stefan, it's that even once you change the C code, Julia will still be worse by a factor of at least 1.5. |
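The kind of temporary-avoiding rewrite being described can be sketched with today's in-place `LinearAlgebra.mul!`; this illustrates the idea under current APIs and is not the code Tim was referring to:

```julia
using LinearAlgebra

# Allocating version: the Gram matrix, its powers, and every intermediate
# product are freshly allocated.
stat_alloc(P) = tr((transpose(P) * P)^4)

# Preallocated version: two reusable buffers hold the Gram matrix and its powers.
function stat_inplace!(G, T, P)
    mul!(G, transpose(P), P)   # G = P'P, written into an existing buffer
    mul!(T, G, G)              # T = G^2
    mul!(G, T, T)              # G = G^4, reusing G as the output buffer
    return tr(G)
end

P = randn(5, 20)
G = similar(P, 20, 20)
T = similar(G)
isapprox(stat_inplace!(G, T, P), stat_alloc(P))   # same result, far fewer allocations
```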
Those benchmarks are pretty fascinating. I always find timings expressed as flops or "performance" — i.e. 1/time — to exaggerate differences in what you actually care about, which is, of course, how long your problem takes to compute. Plotting the inverse just spreads out the high end, although I know it's traditional (maybe that's exactly why: it makes comparisons look more exciting). Maybe we should have a 1/time version of our benchmarks... It's nice that it's so easy to write a singly-allocating version of randmatstat after your work, Tim. I agree we shouldn't use that version, because the real sweet spot we're going for is the ease-of-expression-vs-performance optimum. Being 1.5x slower than C ain't bad at all, especially with such simple, intuitive code. |
Regarding the Eigen benchmarks, remember that part of the template metaprogramming for Eigen is delayed evaluation, which could help for Y = alpha * X + beta * Y. Also, some Eigen operations, like gemm, can use SSE instruction pipelines. Not sure if they do here. |
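Julia's closest analogue to Eigen's lazily evaluated expression templates is dot-call fusion, which arrived years after this thread; a sketch for the `Y = alpha * X + beta * Y` example:

```julia
alpha, beta = 2.0, 0.5
X, Y = randn(10_000), randn(10_000)

# Eager: alpha*X, beta*Y, and their sum each allocate a fresh vector.
Y = alpha * X + beta * Y

# Fused and in-place: the whole right-hand side compiles to a single loop that
# writes directly into Y with no temporaries, which is the effect Eigen gets
# from delayed evaluation of expression templates.
@. Y = alpha * X + beta * Y
```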
@StefanKarpinski it is probably most informative to plot the vertical axis, either mflops or execution time, on a logarithmic scale. In many situations "twice as fast" or "half the execution time" is the easiest scale to interpret. The log scale also removes the need to make all the benchmarks take roughly the same amount of time for comparison. |
Indeed, that's probably my favorite way to plot these things since it has the "up is good" intuitiveness and has that appropriate scaling property and doesn't unduly focus on high or low values. |
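A tiny worked example of the scaling point, with hypothetical runtimes: equal speed ratios look unequal on a 1/time ("performance") axis but come out equally spaced on a log-time axis.

```julia
times = [1.0, 2.0, 4.0, 8.0]   # hypothetical runtimes, each 2x the previous

perf = 1 ./ times              # [1.0, 0.5, 0.25, 0.125]: gaps shrink toward the
                               # slow end, visually stretching the fast end
logt = log2.(times)            # [0.0, 1.0, 2.0, 3.0]: every 2x slowdown is the
                               # same distance apart, so "twice as fast" reads directly
```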
@JeffBezanson Should we close the issue on performance of complex numbers? Seems like we are in a good position for now. |
We are in good shape for complex performance, at least with our micro-benchmarks. |
Out of the various benchmark codes, the one where Julia performs poorly compared to C++ is mandel(), which seems to suggest that we need better performance for our complex number implementation.
Are there a couple of key optimizations yet to be implemented that would speed this up, or is it something more fundamental about our complex number implementation?