
improve performance of complex numbers #323

Closed
ViralBShah opened this issue Dec 29, 2011 · 30 comments
Labels
performance Must go faster

Comments

@ViralBShah
Member

Out of the various benchmark codes, the one where julia performs poorly compared to C++ is mandel(), which seems to suggest that we need better performance for our complex number implementation.

Are there a couple of key optimizations yet to be implemented that will speed this up, or is it something more fundamental about our complex number implementation?
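For reference, the mandel microbenchmark spends its inner loop iterating z ← z² + c with complex arithmetic. Here is a rough Python rendering of that computation (the function names, the maxiter=80 cutoff, and the grid bounds are assumptions for illustration, not the benchmark suite's actual code):

```python
def mandel(c, maxiter=80):
    """Iteration count until |z| escapes 2, starting from z = c."""
    z = c
    for n in range(maxiter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return maxiter

def mandelperf():
    # Sum iteration counts over a small grid of the complex plane;
    # every step is a complex multiply, add, and abs.
    return sum(mandel(complex(r / 10.0, i / 10.0))
               for r in range(-20, 6)
               for i in range(-10, 11))
```

Each iteration does nothing but complex arithmetic, which is why the benchmark is so sensitive to how complex numbers are represented and whether they stay unboxed.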

@JeffBezanson
Member

This depends on dealing with packed aggregates (structs) and either immutability or escape analysis. Or some kind of "value types", which is how people usually punt on this.

@ViralBShah
Member Author

If these optimizations are still some way off, should we have a complex array implementation that stores the real and imaginary parts separately?

@JeffBezanson
Member

That seems like a lot of extra work, especially when calling native libraries. It also doesn't address the whole performance issue, which exists even with scalar operations (e.g. mandel). And at least we're faster than the other systems.

@timholy
Member

timholy commented Jul 20, 2012

This gist https://gist.github.com/3150312 modifies the code from the new mandelbrot shootout to examine this issue. For that function, using Julia's complex is almost 100-fold slower than a manual implementation of the complex arithmetic.

Here's how you use it:

load("mandelbrot_with_real.jl")
main(ARGS, stdout_stream)

Only pay attention to the results on the second run, as the first run is affected by compilation time.

@timholy
Member

timholy commented Jul 20, 2012

Hmm, I wonder if it's a type inference issue. I added a second function to the gist, and it shows only a tenfold gap, which (while slower) is closer to what you'd expect.

load("complex_test.jl")
run_timing(20)
run_timing(10_000_000)

@JeffBezanson
Member

LLVM strikes again! The manual complex arithmetic version always returns true, so the optimizer skips the entire body of the function. Computing abs(z) also has a significant effect and brings the manual and Complex128 versions closer. It looks like hypot could be faster. I should try to get the sqrtsd instruction generated inline.
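As background on why abs(z) is costly at all: a complex magnitude routes through hypot rather than a naive sqrt, because the naive form overflows for large components. A quick Python illustration of the trade-off (not the Julia implementation):

```python
import math

re, im = 1e200, 1e200

# Naive magnitude: re*re overflows to inf even though the true
# magnitude (about 1.414e200) is comfortably representable.
naive = math.sqrt(re * re + im * im)

# hypot rescales internally and returns the correct magnitude.
safe = math.hypot(re, im)

print(naive)  # inf
print(safe)   # roughly 1.4142e+200
```

This robustness is exactly what makes hypot slower than a bare sqrt, so there is room to speed it up when the operands are known to be in range.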

@timholy
Member

timholy commented Jul 21, 2012

Hmm, skipping the body of the code entirely would tend to speed it up. :-) Thanks for figuring that out.

JeffBezanson added a commit that referenced this issue Aug 12, 2012
we can now handle cases that require inserting new temp variables
fixes #1145
helps complex performance a lot (#323)
@JeffBezanson
Member

mandel now more than 2x faster!

@StefanKarpinski
Member

Homepage updated with new benchmark results: http://julialang.org/. We are now beating all other high-level languages on all benchmarks. We're even beating Fortran on all but two benchmarks! We still get spanked on fib and mandel, but man, these are exciting numbers. I can't even imagine what's going to happen once composite types go all immutable.

@timholy
Member

timholy commented Aug 13, 2012

Holy smokes! Very exciting, and really, really impressive.

@timholy
Member

timholy commented Aug 13, 2012

Actually, isn't it rand_mat_stat that is the standout poor case for Julia?

@StefanKarpinski
Member

I'm not sure that beating Fortran by a few percent counts as a "bad case" ;-)

@JeffBezanson
Member

I don't know why fortran would be slower than C in that case; that is very suspicious.

@StefanKarpinski
Member

Agreed. I skimmed the Fortran and it looks reasonable, but I am no Fortran expert. Do we have any Fortran experts?

@timholy
Member

timholy commented Aug 13, 2012

Wait...isn't there a bug? In the C version, PtP1 is 5x5. In the Julia case, P.'*P is 20x20.

If I'm right, that will narrow the gap between Julia and C tremendously.
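To make the dimension mismatch concrete, here is a small shape-only sketch in Python (the helper names are hypothetical, not code from any of the benchmarks): horizontal concatenation of four 5x5 blocks gives a 5x20 P, so P'*P is 20x20, while vertical concatenation gives 20x5 and a 5x5 product.

```python
def hcat(*shapes):
    # Shape of [a b c d]: row counts must match, columns add up.
    rows = shapes[0][0]
    assert all(s[0] == rows for s in shapes)
    return (rows, sum(s[1] for s in shapes))

def vcat(*shapes):
    # Shape of [a; b; c; d]: column counts must match, rows add up.
    cols = shapes[0][1]
    assert all(s[1] == cols for s in shapes)
    return (sum(s[0] for s in shapes), cols)

def t_mul(x, y):
    # Shape of x' * y.
    assert x[0] == y[0]
    return (x[1], y[1])

block = (5, 5)                              # each of a, b, c, d is 5x5
P_julia = hcat(block, block, block, block)  # (5, 20)
P_c = vcat(block, block, block, block)      # (20, 5)
print(t_mul(P_julia, P_julia))  # (20, 20): what the Julia version multiplies
print(t_mul(P_c, P_c))          # (5, 5): what the C version multiplies
```

Repeatedly squaring a 20x20 matrix is far more work than squaring a 5x5 one, which accounts for much of the apparent gap.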

@timholy
Member

timholy commented Aug 13, 2012

Oh, now I see the comments re Fortran. I bet this explains that oddity too.

@timholy
Member

timholy commented Aug 13, 2012

In other words, in Julia it should be P = [a; b; c; d].

@JeffBezanson
Member

The julia version is the original version ported from matlab, so it is authoritative. Not that it really matters what it does as long as they all do the same thing.

@timholy
Member

timholy commented Aug 13, 2012

But the point is, they don't. The C version takes the 4th power of a 5x5 matrix. The Julia version takes the 4th power of a 20x20 matrix.

@StefanKarpinski
Member

Oh, forehead slap. I wrote this C code. I am ashamed.

I guess I'll take a crack at fixing it in the morning. Tim, thanks for finding that. It's like performance manna from heaven. Or from my own stupidity.

@JeffBezanson
Member

Illustrates the point even better than numbers --- the C code is hard to get right!!

@StefanKarpinski
Member

Very true. It's pretty common for people to look at the C code and say "oh, that's not so bad", but it is — because numerical computing is already hard, and having to do it in C just makes it harder. Case in point.

@StefanKarpinski
Member

Also, have we discovered a new principle? "If the C code is faster than the Fortran code, the C code is probably buggy."

@timholy
Member

timholy commented Aug 13, 2012

No problem, Stefan. I agree with Jeff that the C code is much harder to get right.

The principle doesn't work universally: C++ is beating Fortran at some tasks these days. Check out the graphs on this page: http://eigen.tuxfamily.org/index.php?title=Benchmark. No one package wins them all (by any stretch), but for some tasks and parameters the C++ code is well ahead of the others. I think the main factor is template metaprogramming, but I'm not sure.

On a related point, the work I've been doing to avoid temporaries is interesting with respect to this problem:

julia> load("rsm.jl")

julia> @time (s1, s2) = randmatstat_fast(10000)
elapsed time: 0.5080680847167969 seconds
(0.7367224771132586,0.7488240938163588)

julia> @time (s1, s2) = randmatstat_fast(10000)
elapsed time: 0.18616795539855957 seconds
(0.7001166261086185,0.7444765680497817)

julia> @time (s1, s2) = randmatstat_fast(10000)
elapsed time: 0.17365694046020508 seconds
(0.7297389301439973,0.76173139425919)

julia> @time (s1, s2) = randmatstat_fast(10000)
elapsed time: 0.18402886390686035 seconds
(0.7029257859746334,0.7211008241161565)

julia> @time (s1, s2) = randmatstat_fast(10000)
elapsed time: 0.17666387557983398 seconds
(0.7451147175777081,0.770844667140391)

julia> @time (s1, s2) = randmatstat_fast(10000)
elapsed time: 0.19351482391357422 seconds
(0.7197005597183664,0.7771155799610724)

julia> @time (s1, s2) = randmatstat_fast(10000)
elapsed time: 0.16952896118164062 seconds
(0.7162939907706058,0.7479821850240462)

(edit) compared with

julia> @time (s1, s2) = randmatstat(10000)
elapsed time: 0.38619279861450195 seconds
(0.7227165628461657,0.7414642774588018)

julia> @time (s1, s2) = randmatstat(10000)
elapsed time: 0.2670421600341797 seconds
(0.7229146596308375,0.7429089280573156)

julia> @time (s1, s2) = randmatstat(10000)
elapsed time: 0.2625889778137207 seconds
(0.7382900533507232,0.755698272447621)

julia> @time (s1, s2) = randmatstat(10000)
elapsed time: 0.2730410099029541 seconds
(0.7198034945602977,0.7399915950520334)

where

function randmatstat_fast(t)
    n = 5
    v = zeros(t)
    w = zeros(t)
    Ptmp1 = Array(Float64, 4*n, 4*n)
    Ptmp2 = similar(Ptmp1)
    Qtmp1 = Array(Float64, 2*n, 2*n)
    Qtmp2 = similar(Qtmp1)
    for i=1:t
        a = randn(n, n)
        b = randn(n, n)
        c = randn(n, n)
        d = randn(n, n)
        P = [a b c d]
        Q = [a b; c d]
        At_mul_B(Ptmp1, P, P)
        At_mul_B(Qtmp1, Q, Q)
        A_mul_B(Ptmp2, Ptmp1, Ptmp1)
        A_mul_B(Ptmp1, Ptmp2, Ptmp2)
        A_mul_B(Qtmp2, Qtmp1, Qtmp1)
        A_mul_B(Qtmp1, Qtmp2, Qtmp2)
        v[i] = trace(Ptmp1)
        w[i] = trace(Qtmp1)
    end
    return (std(v)/mean(v), std(w)/mean(w))
end

I doubt that's the version you want to use for the perf test (presuming you want it to be as readable as the Matlab version), but it is revealing. It's interesting that even for 20x20 matrix multiplication, the allocation of temporaries has such a large effect.

Of course, it's also a closer match to what the C code is doing. So if I were to predict your findings, Stefan, it's that even once you change the C code, Julia will still be worse by a factor of at least 1.5.

@StefanKarpinski
Member

Those benchmarks are pretty fascinating. I always find timings expressed as flops or "performance" (i.e. 1/time) exaggerate the thing you actually care about, which is, of course, how long your problem takes to compute. Comparing the inverse of that just spreads the high end, although I know it's traditional (that might be why: to make comparisons more exciting). Maybe we should have a 1/time version of our benchmarks...

It's nice that it's so easy to write a singly allocating version of randmatstat after your work, Tim. I agree we shouldn't use that version because the real sweet spot we're going for is the ease-of-expression-vs-performance optimum. Being 1.5x C ain't bad at all, especially with such simple intuitive code.

@dmbates
Member

dmbates commented Aug 13, 2012

Regarding the Eigen benchmarks, remember that part of the template metaprogramming for Eigen is delayed evaluation, which could help for Y = alpha * X + beta * Y. Also, some Eigen operations, like gemm, can use SSE instruction pipelines. Not sure if they do here.

@dmbates
Member

dmbates commented Aug 13, 2012

@StefanKarpinski it is probably most informative to plot the vertical axis, either mflops or execution time, on a logarithmic scale. In many situations "twice as fast" or "half the execution time" is the easiest scale to interpret. The log scale also removes the need to make all the benchmarks take roughly the same amount of time for comparison.

@StefanKarpinski
Member

Indeed, that's probably my favorite way to plot these things since it has the "up is good" intuitiveness and has that appropriate scaling property and doesn't unduly focus on high or low values.

@ViralBShah
Member Author

@JeffBezanson Should we close the issue on performance of complex numbers? Seems like we are in a good position for now.

@ViralBShah
Member Author

We are in good shape for complex performance, at least with our micro-benchmarks.

StefanKarpinski pushed a commit that referenced this issue Feb 8, 2018
KristofferC added a commit that referenced this issue Jun 4, 2018
* use win isdir workaround in one more place (#323)

* add a test