SIMD performance regression tests #13686
A big 👍 from me. I'd say submit the PR and then hopefully that will trigger focused work on fixing the current problems. Presumably you can comment-out specific tests to check that your testing framework makes sense. |
cc @ArchRobison |
A good starting point would be going through the issues/PRs with the performance label. I think we've been using that label for exactly this. |
OK, I have something almost ready. Here are the things that currently don't vectorize:
Are there any of these that I should not include in the tests? |
That's a great list, and a great contribution. I'd suggest submitting it with the failing tests commented out, with
```julia
# TODO: uncomment the following tests
# failing tests here...
# TODO: uncomment the above tests
```
bracketing the failing tests. That might reduce the chances that someone will accidentally delete them. Then it would be incredibly helpful to file separate issues about each failing case, each issue linking to the commented-out test(s). (Do you know the |
Can you try #13463 (I'm still working on a Jameson compatible version of it...)?
Is that an immutable? (It might be harder if it's not.)
Are you talking about using |
I've also found that |
Seems that the following code vectorizes just fine on LLVM 3.7. Is this an LLVM improvement, or are your examples more complicated?
|
@timholy thanks! Will do that. @yuyichao I'll check. As for |
When I check with clang, it does vectorize it at the IR level but then lowers it to scalar function calls. |
@yuyichao I see
When I add the |
@damiendr I'm on LLVM 3.7. |
You're right, it works with an |
Here are the tests: Will finish opening the issues & create a PR. |
OK, so I have some issues I don't understand with the PR. The new tests pass if I run them from the REPL but fail in … I'm not very familiar with PRs yet, sorry about that! |
Found the culprit: https://github.com/JuliaLang/julia/blob/5fbf28bfaa80098d8d70ca9c2131209949f36a21/test/Makefile
```make
$(TESTS):
	@cd $(SRCDIR) && $(call PRINT_JULIA, $(call spawn,$(JULIA_EXECUTABLE)) --check-bounds=yes --startup-file=no ./runtests.jl $@)
```
Seems like the tests are run with --check-bounds, bypassing any |
Thanks for this effort, @damiendr - it's much appreciated. |
@jiahao (Finally got back to my computer.) FYI, your |
There are good reasons to run most tests with forced bounds-checking, but you're right that this is problematic for testing SIMD. Seems worth running the SIMD tests by a separate mechanism. |
Should this be part of the performance tests handled by @nanosoldier? |
I think the important bit for avoiding regressions is that these tests are run in CI and can make a build/PR fail. Is that the case for the tests in tests/perf? If so, it would be quite natural to put them there. |
Demanding perf tests need to be run on consistent hardware, and we can run them often, but primarily on master or specifically chosen PRs, not every single PR. We should first try tracking and making perf results visible so regressions get noticed promptly, and see what kind of hardware load that imposes. Gating every single build's success or failure on whether there's a significant perf regression would likely require more hardware than we can dedicate to it initially. |
The tests I added to simdloop.jl don't actually run anything costly. They just compile functions and look at the LLVM IR. |
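A minimal sketch of that kind of IR check (illustrative function and test names in current Julia syntax, not the actual contents of simdloop.jl):

```julia
using Test

# Illustrative @simd kernel; any loop expected to vectorize would do.
function axpy!(y::Vector{Float32}, a::Float32, x::Vector{Float32})
    @simd for i in eachindex(x)
        @inbounds y[i] += a * x[i]
    end
    return y
end

# Capture the LLVM IR and look for the `vector.body` label that the
# loop vectorizer emits when it vectorizes (or interleaves) the loop.
ir = sprint(io -> code_llvm(io, axpy!, (Vector{Float32}, Float32, Vector{Float32})))
@test occursin("vector.body", ir)
```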
2 seconds is fine to put in the normal test suite, but if it relies on running with bounds checks disabled then maybe running them inside a separate spawned process is the best thing to do. |
+1. Thanks for doing this. I recommend checking the LLVM output since it's a little easier to read, though it will still be target dependent. Checking that … I'm tied up this week but can take a look next week at cases that refuse to vectorize but seem like they should. |
@ArchRobison There's also another LLVM 3.7 regression that is causing SIMD to fail in random cases (I noticed it in |
So, in the short term, should I...
Or maybe there is a way to override |
I don't see why this is supposed to go into the … Overall, I think we would have much use for tests that not only test for correctness but also check the generated code and that the number and size of allocations do not change (whenever feasible). Might help to catch things like #13735 |
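A rough sketch of the allocation-count idea (illustrative names only, not code from this thread): warm the function up first so compilation is not counted, then assert the hot call allocates nothing.

```julia
using Test

# Hypothetical kernel that should not allocate once compiled.
function scale!(y::Vector{Float64}, x::Vector{Float64})
    @inbounds for i in eachindex(x)
        y[i] = 2.0 * x[i]
    end
    return y
end

x = rand(100); y = similar(x)
scale!(y, x)                         # warm-up call to exclude compilation
@test @allocated(scale!(y, x)) == 0  # regression check: no allocations in the hot path
```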
I found a quick-and-dirty solution to spawn a new process from within simdloop.jl:
```julia
<imports>
<types>
if Base.JLOptions().check_bounds == 1
    testproc = addprocs(1, exeflags="--check-bounds=no")[1]
    @fetchfrom testproc include("testdefs.jl")
    @fetchfrom testproc include("simdloop.jl")
else
    <tests>
end
```
But I'm wondering if the cleanest way would not be to rewrite runtests.jl to have per-test exec flags. That is much more work though. |
Incidentally, I realised that ever since this |
Ouch. One way to catch this is to write a deliberately abusive use of … I concur with @KristofferC that the tests that check for vectorization by inspecting the compiled code technically belong in some other directory. |
So, the tests (that is, those that should work already) now pass on my machine. However, they still fail on the CI machines, in different places. Does Julia generate CPU-specific LLVM IR? |
After having a look at |
I've also noticed cases that don't vectorize but still have the |
@damiendr : Vectorization is going to be target specific since some targets may lack SIMD support, and cost models (i.e. whether vectorization is profitable) vary even across micro-architectures. So the tests will need to be platform specific.

With respect to vector labels showing up in code without vector instructions, this happens because the vectorizer does two optimizations: using SIMD instructions and interleaving separate loop iterations to hide latency. So presence of the labels indicates that the right semantic information got to the vectorizer, but its cost model indicated that only interleaving was worthwhile. This commonly happens for codes involving scatters or gathers on hardware that requires synthesizing those instructions from serial instructions. |
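To make the scatter/gather point concrete, here is a hedged illustration (not code from the thread): a loop with indirect indexing that, on hardware without native gather instructions, the vectorizer may choose only to interleave.

```julia
# Indirect loads (a gather): a[idx[i]] cannot become a single wide load on
# hardware without gather support, so the cost model may reject SIMD code
# generation while still interleaving iterations.
function gather_sum(a::Vector{Float64}, idx::Vector{Int})
    s = 0.0
    @simd for i in eachindex(idx)
        @inbounds s += a[idx[i]]
    end
    return s
end
# The IR can then show `vector.body`-style labels without vector memory
# operations, matching the cost-model behaviour described above.
```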
@ArchRobison Right. Initially I thought that the IR given by … I'm not quite sure how to design the tests in these conditions, because ideally we should be testing only Julia's ability to work nicely with the vectorizer, not the vectorizer's cost model and the CPU features.
|
I recommend making the tests resemble real field usage as much as possible, so we're testing what we're shipping, e.g. not relying on special options for testing. LLVM puts CPU-specific tests in separate subdirectories, one for each CPU. I think that's a reasonable scheme for us too. We could start with 64-bit x86 because it's prevalent and guaranteed to have at least SSE, and then expand to other targets. [Disclosure: I work for Intel.] I think it's worth the trouble to test the LLVM IR for vector instructions, since there are all sorts of subtle things that can prevent vectorization. As far as infrastructure, it's worth taking a look at how LLVM's tests combine both the input and the expected output into a single file. It's a nice arrangement maintenance-wise. |
I'm not sure if it's that simple. One test succeeds on my machine (Sandy Bridge Core i7-2635QM) and fails on the CI machine (Nehalem Xeon X5570). I'm not yet 100% sure that there aren't other factors or bugs involved, but it could be that the tests only vectorize with AVX instructions, or just that the cost model is different for the Xeon. |
I thought we disabled AVX on non-MCJIT. Are you testing locally with a newer LLVM? |
Does the test that fails on the CI machine use Float32 or Float64? I can see where a cost model might determine that Float64 is not worth the trouble with just SSE, but I would have expected Float32 to vectorize. I'm not sure if I can get Julia to build on our Nehalems here -- we tend to have antique environments on our antique machines. |
I ran some diagnostics and here are a few findings:
None of the test machines seem to support the AVX instruction set. So a tentative conclusion would be that most of the SIMD examples, including some very simple ones, fail to vectorize on non-AVX architectures. It would be nice to run the tests in such a way that failures do not stop the next tests from running, to get a better idea of what fails & what does not. |
@ArchRobison To check for SIMD vectorization, you can craft a loop that involves floating-point operations with round-off errors. Depending on the order in which the operations are performed, you would get different round-off, and thus be able to tell whether SIMD vectorization was employed. This requires a reduction operation, though, so not all SIMD loops can be checked. |
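A sketch of that round-off trick (illustrative only, not code from the thread): compare a strictly sequential Float32 sum against an @simd sum; if the results differ, the reduction was reassociated.

```julia
# Sequential left-to-right sum: fixed association order.
function sum_serial(xs::Vector{Float32})
    s = 0.0f0
    for x in xs
        s += x
    end
    return s
end

# @simd allows the compiler to split the accumulator into partial sums,
# which changes the rounding of the result when vector hardware is used.
function sum_simd(xs::Vector{Float32})
    s = 0.0f0
    @simd for i in eachindex(xs)
        @inbounds s += xs[i]
    end
    return s
end

xs = rand(Float32, 1_000_000)
# A differing result indicates reassociation (and hence vectorization or
# interleaving); an equal result is inconclusive, so this is only a heuristic.
sum_serial(xs) != sum_simd(xs)
```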
@damiendr This difference is a plausible explanation of the different results you get. Try passing |
I think the extensive benchmarks together with the recent codegen tests cover this pretty ok. |
There are currently a number of performance regressions with `@simd` loops, some involving inlining problems. It would be great to have a test suite for `@simd` loops that checks for actual `vector` tags in the LLVM code or SSE instructions in the native code. The current suite only checks for the correctness of the results. I'm willing to contribute that. I'd need a bit of guidance, though, on how to do the PR, considering that the tests will likely fail with the latest `head`.
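For the native-code side of this, a hedged sketch (illustrative function name, current Julia syntax; the instruction pattern is an assumption about what packed SSE/AVX output would contain):

```julia
# Illustrative @simd kernel.
function scale_add!(y::Vector{Float32}, a::Float32, x::Vector{Float32})
    @simd for i in eachindex(x)
        @inbounds y[i] += a * x[i]
    end
    return y
end

# Capture the native assembly and look for packed single-precision
# instructions (e.g. addps/mulps, or their AVX forms vaddps/vmulps).
asm = sprint(io -> code_native(io, scale_add!, (Vector{Float32}, Float32, Vector{Float32})))
occursin(r"\bv?(add|mul)ps\b", asm)
```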