SIMD performance regression tests #13686

Closed
damiendr opened this issue Oct 20, 2015 · 45 comments
Labels: performance (Must go faster), test (This change adds or pertains to unit tests)

Comments

@damiendr
Contributor

There are currently a number of performance regressions with @simd loops, some involving inlining problems.

It would be great to have a test suite for @simd loops that checks for actual vector tags in the LLVM code or SSE instructions in the native code. The current suite only checks the correctness of the results.

I'm willing to contribute that. I'd need a bit of guidance on how to do the PR, though, considering that the tests will likely fail on the latest head.

@timholy
Member

timholy commented Oct 20, 2015

A big 👍 from me. I'd say submit the PR and then hopefully that will trigger focused work on fixing the current problems. Presumably you can comment-out specific tests to check that your testing framework makes sense.

@ViralBShah
Member

cc @ArchRobison

@yuyichao
Contributor

A good starting point would be going through the issues/PRs with the performance label. I think we've been using that label for exactly this.

@damiendr
Contributor Author

OK, I have something almost ready. Here are the things that currently don't vectorize:

  • multiple reductions
  • loops on local variables without a type declaration (e.g. a = Array(Float32, 100) fails vs. a::Array{Float32,1} = Array(Float32, 100) works). That is a regression from 0.4-rc4.
  • loops with field access, e.g. obj.a[i] = ...
  • long expressions (depending on where the parentheses are)
  • function calls that could be vectorized, e.g. sqrt, sin, etc., with the exception of muladd (checked with @fastmath)

Are there any of these that I should not include in the tests?
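For illustration, here is a minimal sketch of two of the cases listed above (hypothetical function and type names, 0.4-era syntax):

# multiple reductions in one loop
function two_sums(a)
    s1 = zero(eltype(a)); s2 = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        s1 += a[i]
        s2 += a[i] * a[i]
    end
    s1, s2
end

# field access inside the loop body (obj.a[i] = ...)
type Wrapper
    a::Vector{Float32}
end
function scale_field!(obj::Wrapper, s)
    @inbounds @simd for i in eachindex(obj.a)
        obj.a[i] *= s
    end
end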

@timholy
Member

timholy commented Oct 20, 2015

That's a great list, and great contribution. I'd suggest submitting it with the failing tests commented out, with

# TODO: uncomment the following tests
# failing tests here...
# TODO: uncomment the above tests

bracketing the failing tests. That might reduce the chances that someone will accidentally delete them.

Then it would be incredibly helpful to file separate issues about each failing case, each issue linking to the commented-out test(s). (Do you know the 'y' GitHub trick?)

@yuyichao
Contributor

loops on local variables without a type declaration (e.g. a = Array(Float32, 100) fails vs. a::Array{Float32,1} = Array(Float32, 100) works). That is a regression from 0.4-rc4.

Can you try #13463 (I'm still working on a Jameson compatible version of it...)?

loops with field access eg. obj.a[i] = ...

Is that an immutable? (It might be harder if it's not.)

function calls that could be vectorized eg. sqrt, sin, etc. with the exception of muladd (checked with @fastmath).

Are you talking about using libmvec or something like that? That will be hard, since AFAIK LLVM can't do that yet.

@jiahao
Member

jiahao commented Oct 20, 2015

I've also found that @simd does not vectorize division. (See #13681 for a discussion)

@yuyichao
Contributor

Seems that the following code vectorizes just fine on LLVM 3.7. Is this an LLVM improvement, or are your examples more complicated?

julia> function f(a, s)
           @inbounds @simd for i in eachindex(a)
               a[i] /= s
           end
       end

@damiendr
Contributor Author

@timholy thanks! Will do that.

@yuyichao I'll check. As for sqrt etc. it seems like LLVM can do it and that it used to work: #9672

@yuyichao
Contributor

As for sqrt etc. it seems like LLVM can do it and that it used to work: #9672

When I check with clang, it does vectorize it at the IR level but then lowers it to scalar function calls.

@damiendr
Contributor Author

@yuyichao it was labelled "Fixed in LLVM-3.5" in #9672, maybe that's why? My Julia HEAD install says it's using LLVM 3.3.

@jiahao
Member

jiahao commented Oct 20, 2015

@yuyichao I see vdivpd in the output of @code_native on your f on LLVM 3.3 too. However, consider generic_scale!, which is essentially

function p(a, s)
    @inbounds for i in eachindex(a)
        a[i] *= s
    end
end

When I add the @simd annotation I see a significant speedup for a = rand(10^8). @code_native spits out different code and in both cases emits vectorized instructions.
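For context, a rough way to compare the two (timings and array size are illustrative, not taken from the thread); p is the function above and p_simd is the same loop with the @simd annotation added:

function p_simd(a, s)
    @inbounds @simd for i in eachindex(a)
        a[i] *= s
    end
end

a = rand(10^8)
p(a, 2.0); p_simd(a, 2.0)   # run once to compile
@time p(a, 2.0)             # plain @inbounds loop
@time p_simd(a, 2.0)        # @simd-annotated loop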

@yuyichao
Contributor

@damiendr I'm on LLVM 3.7.
Just checked, and sqrt indeed vectorizes at both the IR and native level. sin still doesn't work, likely because we pass the openlibm function pointer instead of using the LLVM intrinsics (there's an issue and a PR IIRC). I would expect it to work only at the IR level (at least on Linux) at best, though.

@damiendr
Contributor Author

Is that a immutable? (It might be harder if it's not)

You're right, it works with an immutable but not with a type. It's a fairly common idiom though, so I'll open an issue for it. Right now my workaround is to unpack everything outside the @simd loops (which is OK in my case because it's all generated code anyway).
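A minimal sketch of the difference and of the workaround (hypothetical type names; 0.4 syntax, where type declares a mutable type):

immutable ImmWrap            # field load can be hoisted out of the loop: vectorizes
    a::Vector{Float32}
end

type MutWrap                 # field may be reassigned, so w.a is reloaded: harder to vectorize
    a::Vector{Float32}
end

function scale!(w, s)
    a = w.a                  # workaround: unpack the field before the loop
    @inbounds @simd for i in eachindex(a)
        a[i] *= s
    end
end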

@damiendr
Contributor Author

Here are the tests:
damiendr@cd72ff1

Will finish opening the issues & create a PR.

@damiendr
Contributor Author

OK, so I have some issues I don't understand with the PR. The new tests pass if I run them from the REPL but fail under make test. There isn't much information about why, only "Error during test". Any idea why that could be?

damiendr@a65c2b2

I'm not very familiar with PRs yet, sorry about that!

@damiendr
Contributor Author

Found the culprit: https://github.com/JuliaLang/julia/blob/5fbf28bfaa80098d8d70ca9c2131209949f36a21/test/Makefile

$(TESTS):
    @cd $(SRCDIR) && $(call PRINT_JULIA, $(call spawn,$(JULIA_EXECUTABLE)) --check-bounds=yes --startup-file=no ./runtests.jl $@)

Seems like the tests are run with --check-bounds=yes, which overrides any @inbounds in the code.
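Something like the following, mirroring the Makefile line above, would run just the SIMD tests without the forced bounds checks (the relative path to the julia executable is an assumption):

cd test && ../julia --check-bounds=no --startup-file=no ./runtests.jl simdloop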

@malmaud
Contributor

malmaud commented Oct 20, 2015

Thanks for this effort, @damiendr - it's much appreciated.

@yuyichao
Contributor

@jiahao (finally got back to my computer) FYI, your p(a, s) vectorizes just fine on LLVM 3.7 too (without @simd).

@timholy
Member

timholy commented Oct 20, 2015

There are good reasons to run most tests with forced bounds-checking, but you're right that this is problematic for testing SIMD. Seems worth running the SIMD tests by a separate mechanism.

@yuyichao
Contributor

Should this be part of the performance tests handled by @nanosoldier?

@damiendr
Contributor Author

I think the important bit for avoiding regressions is that these tests run in CI and can make a build/PR fail. Is that the case for the tests in test/perf? If so, it would be quite natural to put them there.

@tkelman
Contributor

tkelman commented Oct 20, 2015

Demanding perf tests need to be run on consistent hardware, and we can run them often, but primarily on master or specifically chosen PRs, not every single PR. We should first try tracking and making perf results visible so regressions get noticed promptly, and see what kind of hardware load that imposes. Gating every single build's success or failure on whether there's a significant perf regression would likely require more hardware than we can dedicate to it initially.

@damiendr
Contributor Author

The tests I added to simdloop.jl don't actually run anything costly. They just compile functions and look at the LLVM IR.
Base.runtests("simdloop") takes about 2 seconds on my laptop. Don't know how that compares to the tests in test/perf but that is certainly insignificant compared to those in make test-all, which take several minutes in total.

@tkelman
Contributor

tkelman commented Oct 21, 2015

2 seconds is fine to put in the normal test suite, but if it relies on running with bounds checks disabled then maybe running them inside a separate spawned process is the best thing to do.

@ArchRobison
Contributor

+1. Thanks for doing this. I recommend checking the LLVM output since it's a little easier to read, though it will still be target dependent. Checking that @simd did better than the auto-vectorizer can be tricky. You might be able to check for the absence of the run-time dependence check, but I forget whether it's easily recognizable.

I'm tied up this week but can take a look next week at cases that refuse to vectorize but seem like they should.

@JeffBezanson added the performance (Must go faster) and test (This change adds or pertains to unit tests) labels Oct 21, 2015
@yuyichao
Contributor

@ArchRobison There's also another LLVM 3.7 regression that is causing SIMD to fail in random cases (I noticed it in fill!, for example). I reported it only to the LLVM bug tracker, not here, since it is reproducible with clang 3.7 as well. It would be nice if you could have a look at that too when you have time.

@damiendr
Contributor Author

So, in the short term, should I...

  • put the tests in /perf or in a separate project
  • create a new make target for tests that need to be run without --check-bounds (I am not very familiar with make but can probably manage something)
  • add code to simdloop.jl to spawn a separate process for these tests?

Or maybe there is a way to override --check-bounds temporarily?

@damiendr reopened this Oct 23, 2015
@KristofferC
Member

I don't see why this is supposed to go into the /perf directory when it is just directly testing codegen in a pass/fail way. I think spawning a separate process for the simd tests would work well.

Overall, I think we would get a lot of use out of tests that not only check correctness but also check the generated code and that the number and size of allocations do not change (whenever that is feasible). It might help catch things like #13735.
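A hedged sketch of what such an allocation check could look like (function name and expected value are made up):

using Base.Test

function axpy_loop!(y, a, x)
    @inbounds @simd for i in eachindex(y)
        y[i] += a * x[i]
    end
end

y, x = zeros(1000), rand(1000)
axpy_loop!(y, 2.0, x)                          # compile first so compilation isn't counted
@test @allocated(axpy_loop!(y, 2.0, x)) == 0   # an in-place kernel should not allocate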

@damiendr
Contributor Author

I found a quick-and-dirty solution to spawn a new process from within simdloop.jl:

<imports>
<types>

if Base.JLOptions().check_bounds == 1
    testproc = addprocs(1, exeflags="--check-bounds=no")[1]
    @fetchfrom testproc include("testdefs.jl")
    @fetchfrom testproc include("simdloop.jl")
else
    <tests>
end

But I'm wondering whether the cleanest solution would be to rewrite runtests.jl to support per-test exec flags. That is much more work, though.

@damiendr
Contributor Author

Incidentally, I realised that ever since this --check-bounds option was added, the entire simdloop test suite has been functionally dead. No vector code was ever generated, and therefore the correctness of vectorized code could not possibly have been checked.

@ArchRobison
Contributor

Ouch. One way to catch this is to write a deliberately abusive use of @simd that generates incorrect results if the code is really vectorized. Then a wrong answer is right, and vice versa :-). Though it's a bit tricky, because LLVM's "vectorizer" largely acts like an auto-vectorizer and will refuse to vectorize some abuses. A loop over i with a[i] = b*a[i+k]+c, where k is a parameter to the containing function and k is actually negative, might do the trick. (Without the b* and +c it might not vectorize, because of the low flop-to-memory ratio.)

I concur with @KristofferC that the tests that check for vectorization by inspecting the compiled code technically belong in some other directory.
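A sketch of the kind of deliberately abusive loop described above (hypothetical function; whether the difference shows up depends on the target's vector width being larger than |k|):

# With k < 0, each iteration reads an element written a few iterations earlier.
# Scalar execution sees the updated values; a vectorized loop (which @simd
# permits despite the dependence) reads stale values within a vector, so a
# "wrong" answer indicates that vectorization really happened.
function abuse!(a, k, b, c)
    @inbounds @simd for i in (1 - k):length(a)
        a[i] = b * a[i+k] + c
    end
    a
end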

@damiendr
Contributor Author

So, the tests (that is, those that should work already) now pass on my machine. However, they still fail on the CI machines, in different places. Does Julia generate CPU-specific LLVM IR?

@damiendr
Contributor Author

After having a look at codegen.cpp and the many #ifdefs, I now believe that my current approach, e.g. looking for vector.body in the LLVM IR, can only work for one particular target platform. I'll check what can be done at the code_lowered/code_typed level...

@yuyichao
Contributor

I've also noticed cases that don't vectorize but still have the vector.* labels. Does checking for the existence of a vector type like <N x T> work?
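Something along those lines might look like the following sketch (hypothetical helper; the regex is only a heuristic):

# Capture the IR as a string and look for an LLVM vector type such as <4 x float>
# (the width is target dependent, hence the pattern).
function llvm_has_vector_type(f, types)
    ir = sprint(io -> code_llvm(io, f, types))
    ismatch(r"<\d+ x (float|double|i\d+)>", ir)
end

# e.g. @test llvm_has_vector_type(f, (Vector{Float64}, Float64))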

@ArchRobison
Contributor

@damiendr : Vectorization is going to be target specific since some targets may lack SIMD support, and cost models (i.e. whether vectorization is profitable) vary even across micro-architectures. So the tests will need to be platform specific.

With respect to vector labels showing up in code without vector instructions, this happens because the vectorizer does two optimizations: using SIMD instructions and interleaving separate loop iterations to hide latency. So presence of the labels indicates that the right semantic information got to the vectorizer, but its cost model indicated that only interleaving was worthwhile. This commonly happens for codes involving scatters or gathers on hardware that requires synthesizing those instructions from serial instructions.

@damiendr
Contributor Author

@ArchRobison Right. Initially I thought that the IR given by code_llvm would be the input to platform-specific transforms but I now see it is an output of these!

I'm not quite sure how to design the tests in these conditions, because ideally we should be testing only Julia's ability to work nicely with the vectorizer, not the vectorizer's cost model and the CPU features.

  • use LLVM options to force vectorization for testing (e.g. connected to a --vectorize={yes|no|all} flag)
  • assume/force a specific CPU for testing
  • have multiple CPU-specific test cases
  • rather than test for vectorized LLVM IR, test for a number of criteria for vectorizable code_typed, for instance conformance to a whitelist of function calls. It looks like this would already catch a lot of the issues I had, like long expressions (Base.afoldl in code_typed) or non-inlined functions.

@ArchRobison
Contributor

I recommend making the tests resemble real field usage as much as possible, so we're testing what we're shipping, e.g. not relying on special options for testing. LLVM puts CPU-specific tests in separate subdirectories, one for each CPU. I think that's a reasonable scheme for us too. We could start with 64-bit x86 because it's prevalent and guaranteed to have at least SSE, and then expand to other targets. [Disclosure: I work for Intel.]

I think it's worth the trouble to test the LLVM IR for vector instructions, since there are all sorts of subtle things that can prevent vectorization.

As far as infrastructure goes, it's worth taking a look at how LLVM's tests combine both the input and the expected output into a single file. It's a nice arrangement maintenance-wise.

@damiendr
Contributor Author

I recommend making the tests resemble real field usage as much as possible, so we're testing what we're shipping. E.g., not rely on special options for testing. LLVM puts CPU-specific tests in separate subdirectories, one for each CPU. I think that's a reasonable scheme for us too. We could start with 64-bit x86 because it's prevalent and guaranteed to have at least SSE, and then expand to other targets.

I'm not sure it's that simple. One test succeeds on my machine (Sandy Bridge Core i7-2635QM) and fails on the CI machine (Nehalem Xeon X5570). I'm not yet 100% sure that there aren't other factors or bugs involved, but it could be that the tests only vectorize with AVX instructions, or just that the cost model is different for the Xeon.

@yuyichao
Contributor

I thought we disabled AVX on non-MCJIT. Are you testing locally with a newer LLVM?

@ArchRobison
Contributor

Does the test that fails on the CI machine use Float32 or Float64? I can see where a cost model might determine that Float64 is not worth the trouble with just SSE, but I would have expected Float32 to vectorize. I'm not sure if I can get Julia to build on our Nehalems here -- we tend to have antique environments on our antique machines.

@damiendr
Contributor Author

I ran some diagnostics and here are some elements:

  • The tests fail on the first types tested, which happen to be Int32 in one case and Float32 in the other
  • the AppVeyor x86_64 build has cpu_target = "x86-64" (vs. "native" on my machine and in the Travis build)
  • the code_warntype output is the same on my machine and on the test machines
  • failure is indeed caused by non-vectorized code_llvm, despite correct execflags
  • I checked the code emitted by my machine and it does contain AVX instructions, even though I use the built-in libLLVM-3.3 (if versioninfo() can be trusted). Here's a snippet:
    vblendvps   %ymm1, %ymm4, %ymm5, %ymm5
    vextractf128    $1, %ymm5, %xmm7
    vpcmpgtd    %xmm0, %xmm7, %xmm7
    vcmpnleps   %ymm4, %ymm6, %ymm4
    vpslld  $31, %xmm4, %xmm6
...
    vpxor   %xmm0, %xmm0, %xmm0
    vpxor   %xmm1, %xmm1, %xmm1
L224:   vpaddd  %xmm0, %xmm1, %xmm0
    vmovhlps    %xmm0, %xmm0, %xmm1 ## xmm1 = xmm0[1,1]
    vpaddd  %xmm1, %xmm0, %xmm0
    vphaddd %xmm0, %xmm0, %xmm0
    vmovd   %xmm0, %eax

None of the test machines seem to support the AVX instruction set.

So a tentative conclusion would be that most of the SIMD examples, including some very simple ones, fail to vectorize on non-AVX architectures.

It would be nice to run the tests in such a way that failures do not stop the next tests from running, to get a better idea of what fails & what does not.

@eschnett
Contributor

@ArchRobison To check for SIMD vectorization, you can craft a loop that involves floating-point operations with round-off errors. Depending on the order in which the operations are performed, you get different round-off, and can thus tell whether SIMD vectorization was employed. This requires a reduction operation, though, so not all SIMD loops can be checked.
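A sketch of that idea (the values are arbitrary; whether the two results actually differ depends on the reduction being reassociated, i.e. vectorized, on the target):

function simd_sum(a)
    s = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        s += a[i]
    end
    s
end

# Alternating large/tiny terms: a strict left-to-right sum keeps losing the small
# terms, while per-lane partial sums keep them, so the two results differ.
a = repmat(Float32[1f8, 1f0, -1f8, 1f0], 1000)
simd_sum(a) == foldl(+, zero(Float32), a)   # false suggests the reduction was reassociated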

@nalimilan
Member

the AppVeyor x86_64 build has cpu_target = "x86-64" (vs. "native" on my machine and in the Travis build)

@damiendr This difference is a plausible explanation of the different results you get. Try passing JULIA_CPU_TARGET=x86_64 on your machine, and you'll likely get the same results as the AppVeyor build.

@jrevels added the potential benchmark (Could make a good benchmark in BaseBenchmarks) label Nov 17, 2015
jrevels added a commit to JuliaCI/BaseBenchmarks.jl that referenced this issue Jan 26, 2016
@jrevels removed the potential benchmark (Could make a good benchmark in BaseBenchmarks) label Jan 26, 2016
@KristofferC
Member

I think the extensive benchmarks, together with the recent codegen tests, cover this pretty well.
