
improve performance of complex numbers #323

Closed
ViralBShah opened this issue Dec 29, 2011 · 30 comments
Labels
performance Must go faster

Comments

@ViralBShah
Member

Out of the various benchmark codes, the one where julia performs poorly compared to C++ is mandel(), which seems to suggest that we need better performance for our complex number implementation.

Are there a couple of key optimizations yet to be implemented that will speed this up, or is it something more fundamental about our complex number implementation?
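For reference, the mandel microbenchmark spends its inner loop iterating z ← z² + c with complex arithmetic. Here is a rough Python rendering of that computation (the function names, the maxiter=80 cutoff, and the grid bounds are assumptions for illustration, not the benchmark suite's actual code):

```python
def mandel(c, maxiter=80):
    """Iteration count until |z| escapes 2, starting from z = c."""
    z = c
    for n in range(maxiter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return maxiter

def mandelperf():
    # Sum iteration counts over a small grid of the complex plane;
    # every step is a complex multiply, add, and abs.
    return sum(mandel(complex(r / 10.0, i / 10.0))
               for r in range(-20, 6)
               for i in range(-10, 11))
```

Each iteration does nothing but complex arithmetic, which is why the benchmark is so sensitive to how complex numbers are represented and whether they stay unboxed.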

@JeffBezanson
Member

This depends on dealing with packed aggregates (structs) and either immutability or escape analysis. Or some kind of "value types", which is how people usually punt on this.

@ViralBShah
Member Author

If these optimizations are still some way off, should we have a complex array implementation that stores the real and imaginary parts separately?

@JeffBezanson
Member

That seems like a lot of extra work, especially when calling native libraries. It also doesn't address the whole performance issue, which exists even with scalar operations (e.g. mandel). And at least we're faster than the other systems.

@timholy
Member

timholy commented Jul 20, 2012

This gist https://gist.github.com/3150312 modifies the code from the new mandelbrot shootout to examine this issue. For that function, using Julia's complex is almost 100-fold slower than a manual implementation of the complex arithmetic.

Here's how you use it:

load("mandelbrot_with_real.jl")
main(ARGS, stdout_stream)

Only pay attention to the results on the second run, as the first run is affected by compilation time.

@timholy
Member

timholy commented Jul 20, 2012

Hmm, I wonder if it's a type inference issue. I added a second function to the gist, and it shows only a tenfold gap, which (while slower) is closer to what you'd expect.

load("complex_test.jl")
run_timing(20)
run_timing(10_000_000)

@JeffBezanson
Member

LLVM strikes again! The manual complex arithmetic version always returns true, so the optimizer skips the entire body of the function. Computing abs(z) also has a significant effect and brings the manual and Complex128 versions closer. It looks like hypot could be faster. I should try to get the sqrtsd instruction generated inline.
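As background on why abs(z) is costly at all: a complex magnitude routes through hypot rather than a naive sqrt, because the naive form overflows for large components. A quick Python illustration of the trade-off (not the Julia implementation):

```python
import math

re, im = 1e200, 1e200

# Naive magnitude: re*re overflows to inf even though the true
# magnitude (about 1.414e200) is comfortably representable.
naive = math.sqrt(re * re + im * im)

# hypot rescales internally and returns the correct magnitude.
safe = math.hypot(re, im)

print(naive)  # inf
print(safe)   # roughly 1.4142e+200
```

This robustness is exactly what makes hypot slower than a bare sqrt, so there is room to speed it up when the operands are known to be in range.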

@timholy
Member

timholy commented Jul 21, 2012

Hmm, skipping the body of the code entirely would tend to speed it up. :-) Thanks for figuring that out.

JeffBezanson added a commit that referenced this issue Aug 12, 2012
we can now handle cases that require inserting new temp variables
fixes #1145
helps complex performance a lot (#323)
@JeffBezanson
Member

mandel now more than 2x faster!

@StefanKarpinski
Member

Homepage updated with new benchmark results: http://julialang.org/. We are now beating all other high-level languages on all benchmarks. We're even beating Fortran on all but two benchmarks! We still get spanked on fib and mandel, but man, these are exciting numbers. I can't even imagine what's going to happen once composite types go all immutable.

@timholy
Member

timholy commented Aug 13, 2012

Holy smokes! Very exciting, and really, really impressive.

@timholy
Member

timholy commented Aug 13, 2012

Actually, isn't it rand_mat_stat that is the standout poor case for Julia?

@StefanKarpinski
Member

I'm not sure that beating Fortran by a few percent counts as a "bad case" ;-)

@JeffBezanson
Member

I don't know why fortran would be slower than C in that case; that is very suspicious.

@StefanKarpinski
Member

Agreed. I skimmed the Fortran and it looks reasonable, but I am no Fortran expert. Do we have any Fortran experts?

@timholy
Member

timholy commented Aug 13, 2012

Wait...isn't there a bug? In the C version, PtP1 is 5x5. In the Julia case, P.'*P is 20x20.

If I'm right, that will narrow the gap between Julia and C tremendously.
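To make the dimension mismatch concrete, here is a small shape-only sketch in Python (the helper names are hypothetical, not code from any of the benchmarks): horizontal concatenation of four 5x5 blocks gives a 5x20 P, so P'*P is 20x20, while vertical concatenation gives 20x5 and a 5x5 product.

```python
def hcat(*shapes):
    # Shape of [a b c d]: row counts must match, columns add up.
    rows = shapes[0][0]
    assert all(s[0] == rows for s in shapes)
    return (rows, sum(s[1] for s in shapes))

def vcat(*shapes):
    # Shape of [a; b; c; d]: column counts must match, rows add up.
    cols = shapes[0][1]
    assert all(s[1] == cols for s in shapes)
    return (sum(s[0] for s in shapes), cols)

def t_mul(x, y):
    # Shape of x' * y.
    assert x[0] == y[0]
    return (x[1], y[1])

block = (5, 5)                              # each of a, b, c, d is 5x5
P_julia = hcat(block, block, block, block)  # (5, 20)
P_c = vcat(block, block, block, block)      # (20, 5)
print(t_mul(P_julia, P_julia))  # (20, 20): what the Julia version multiplies
print(t_mul(P_c, P_c))          # (5, 5): what the C version multiplies
```

Repeatedly squaring a 20x20 matrix is far more work than squaring a 5x5 one, which accounts for much of the apparent gap.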

@timholy
Member

timholy commented Aug 13, 2012

Oh, now I see the comments re Fortran. I bet this explains that oddity too.

@timholy
Member

timholy commented Aug 13, 2012

In other words, in Julia it should be P = [a; b; c; d].

@JeffBezanson
Member

The julia version is the original version ported from matlab, so it is authoritative. Not that it really matters what it does as long as they all do the same thing.

@timholy
Member

timholy commented Aug 13, 2012

But the point is, they don't. The C version takes the 4th power of a 5x5 matrix. The Julia version takes the 4th power of a 20x20 matrix.

@StefanKarpinski
Member

Oh, forehead slap. I wrote this C code. I am ashamed.

I guess I'll take a crack at fixing it in the morning. Tim, thanks for finding that. It's like performance manna from heaven. Or from my own stupidity.

@JeffBezanson
Member

Illustrates the point even better than numbers --- the C code is hard to get right!!

@StefanKarpinski
Member

Very true. It's pretty common for people to look at the C code and say "oh, that's not so bad", but it is — because numerical computing is already hard, and having to do it in C just makes it harder. Case in point.

@StefanKarpinski
Member

Also, have we discovered a new principle? "If the C code is faster than the Fortran code, the C code is probably buggy."

@timholy
Member

timholy commented Aug 13, 2012

No problem, Stefan. I agree with Jeff that the C code is much harder to get right.

The principle doesn't work universally: C++ is beating Fortran at some tasks these days. Check out the graphs on this page: http://eigen.tuxfamily.org/index.php?title=Benchmark. No one package wins them all (by any stretch), but for some tasks and parameters the C++ code is well ahead of the others. I think the main factor is template metaprogramming, but I'm not sure.

On a related point, the work I've been doing to avoid temporaries is interesting with respect to this problem:

julia> load("rsm.jl")

julia> @time (s1, s2) = randmatstat_fast(10000)
elapsed time: 0.5080680847167969 seconds
(0.7367224771132586,0.7488240938163588)

julia> @time (s1, s2) = randmatstat_fast(10000)
elapsed time: 0.18616795539855957 seconds
(0.7001166261086185,0.7444765680497817)

julia> @time (s1, s2) = randmatstat_fast(10000)
elapsed time: 0.17365694046020508 seconds
(0.7297389301439973,0.76173139425919)

julia> @time (s1, s2) = randmatstat_fast(10000)
elapsed time: 0.18402886390686035 seconds
(0.7029257859746334,0.7211008241161565)

julia> @time (s1, s2) = randmatstat_fast(10000)
elapsed time: 0.17666387557983398 seconds
(0.7451147175777081,0.770844667140391)

julia> @time (s1, s2) = randmatstat_fast(10000)
elapsed time: 0.19351482391357422 seconds
(0.7197005597183664,0.7771155799610724)

julia> @time (s1, s2) = randmatstat_fast(10000)
elapsed time: 0.16952896118164062 seconds
(0.7162939907706058,0.7479821850240462)

(edit) compared with

julia> @time (s1, s2) = randmatstat(10000)
elapsed time: 0.38619279861450195 seconds
(0.7227165628461657,0.7414642774588018)

julia> @time (s1, s2) = randmatstat(10000)
elapsed time: 0.2670421600341797 seconds
(0.7229146596308375,0.7429089280573156)

julia> @time (s1, s2) = randmatstat(10000)
elapsed time: 0.2625889778137207 seconds
(0.7382900533507232,0.755698272447621)

julia> @time (s1, s2) = randmatstat(10000)
elapsed time: 0.2730410099029541 seconds
(0.7198034945602977,0.7399915950520334)

where

function randmatstat_fast(t)
    n = 5
    v = zeros(t)
    w = zeros(t)
    Ptmp1 = Array(Float64, 4*n, 4*n)
    Ptmp2 = similar(Ptmp1)
    Qtmp1 = Array(Float64, 2*n, 2*n)
    Qtmp2 = similar(Qtmp1)
    for i=1:t
        a = randn(n, n)
        b = randn(n, n)
        c = randn(n, n)
        d = randn(n, n)
        P = [a b c d]
        Q = [a b; c d]
        At_mul_B(Ptmp1, P, P)
        At_mul_B(Qtmp1, Q, Q)
        A_mul_B(Ptmp2, Ptmp1, Ptmp1)
        A_mul_B(Ptmp1, Ptmp2, Ptmp2)
        A_mul_B(Qtmp2, Qtmp1, Qtmp1)
        A_mul_B(Qtmp1, Qtmp2, Qtmp2)
        v[i] = trace(Ptmp1)
        w[i] = trace(Qtmp1)
    end
    return (std(v)/mean(v), std(w)/mean(w))
end

I doubt that's the version you want to use for the perf test (presuming you want it to be as readable as the Matlab version), but it is revealing. It's interesting that even for 20x20 matrix multiplication, the allocation of temporaries has such a large effect.

Of course, it's also a closer match to what the C code is doing. So if I were to predict your findings, Stefan, it's that even once you change the C code, Julia will still be worse by a factor of at least 1.5.

@StefanKarpinski
Member

Those benchmarks are pretty fascinating. I always find timings expressed as flops or "performance" (i.e. 1/time) exaggerate the thing you actually care about, which is, of course, how long your problem takes to compute. Comparing the inverse of that just spreads the high end, although I know it's traditional (that might be why: to make comparisons more exciting). Maybe we should have a 1/time version of our benchmarks...

It's nice that it's so easy to write a singly allocating version of randmatstat after your work, Tim. I agree we shouldn't use that version because the real sweet spot we're going for is the ease-of-expression-vs-performance optimum. Being 1.5x C ain't bad at all, especially with such simple intuitive code.

@dmbates
Member

dmbates commented Aug 13, 2012

Regarding the Eigen benchmarks, remember that part of the template metaprogramming for Eigen is delayed evaluation, which could help for Y = alpha * X + beta * Y. Also, some Eigen operations, like gemm, can use SSE instruction pipelines. Not sure if they do here.

@dmbates
Member

dmbates commented Aug 13, 2012

@StefanKarpinski it is probably most informative to plot the vertical axis, either mflops or execution time, on a logarithmic scale. In many situations "twice as fast" or "half the execution time" is the easiest scale to interpret. The log scale also removes the need to make all the benchmarks take roughly the same amount of time for comparison.

@StefanKarpinski
Member

Indeed, that's probably my favorite way to plot these things since it has the "up is good" intuitiveness and has that appropriate scaling property and doesn't unduly focus on high or low values.

@ViralBShah
Member Author

@JeffBezanson Should we close the issue on performance of complex numbers? Seems like we are in a good position for now.

@ViralBShah
Member Author

We are in good shape for complex performance, at least with our micro-benchmarks.

StefanKarpinski pushed a commit that referenced this issue Feb 8, 2018
KristofferC added a commit that referenced this issue Jun 4, 2018
* use win isdir workaround in one more place (#323)

* add a test