Add -funroll-all-loops to compiler flags #183
Unroll all loops, even if their number of iterations is uncertain when the loop is entered. Runs faster on my tests. For the difference in resulting assembly, see https://godbolt.org/z/onaEsaEfT
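For readers unfamiliar with the transformation, here is a minimal sketch of what unrolling a loop with an unknown trip count looks like when done by hand. The dot-product loop and the function names `dot_rolled` and `dot_unrolled4` are hypothetical illustrations, not code taken from this project or from the godbolt link, and the compiler's actual output for this flag may differ.

```c
#include <stddef.h>

/* Rolled reference: one multiply-add per loop iteration. */
float dot_rolled(const float *w, const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += w[i] * x[i];
    }
    return acc;
}

/* Hand-unrolled by 4. Because n is only known at run time,
 * a remainder loop handles the last n % 4 elements. */
float dot_unrolled4(const float *w, const float *x, size_t n) {
    float acc = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc += w[i] * x[i];
        acc += w[i + 1] * x[i + 1];
        acc += w[i + 2] * x[i + 2];
        acc += w[i + 3] * x[i + 3];
    }
    for (; i < n; i++) {
        acc += w[i] * x[i];
    }
    return acc;
}
```

Both versions add the products in the same order, so they should return the same value; the unrolled one only reduces how often the loop condition is checked.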
Interesting. It does not change the code. Please post the results of some benchmarks, without and with this flag, for each makefile section. It is important also to compare when using |
I'm not able to reproduce a speedup here on my Linux box. Do you have some timings to share? Also I can't seem to compile this with clang, only with gcc. A warning gets issued that this flag is unsupported. |
on my Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz with
To see the possible slight performance improvement on your particular box, you have to run it a couple of times and take the max, as that corresponds to the actual performance sans noise (noise is always additive). A slight improvement is in line with the slight improvement reported here from manual hand-unrolling: In some cases improvement from may be better. @krzysztof-jusiak reported a huge performance improvement with manual hand-unrolling on his use case. This PR was meant as a probably much better alternative to the proposed hand-unrolling. My guess is that the improvement from unrolling is limited because CPUs implicitly unroll those loops with speculative execution, always making the correct branch prediction in this case.
@kroggen, what exactly do you mean by "It does not change the code"? In the godbolt link above it clearly does, even compared to the hand-unrolling from the PR that inspired this PR, let alone the original vanilla code.
@karpathy> I can't seem to compile this with clang, only with gcc. A warning gets issued that this flag is unsupported.
Right, clang doesn't support the flag. I guess I'd replace |
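The "run it a couple of times and take the max" procedure can be sketched as a small reduction over the measured samples. This assumes the samples are throughputs (for example the tok/s figure quoted later in this thread), so that noise can only lower them; the helper name `best_throughput` is hypothetical.

```c
#include <stddef.h>

/* Returns the largest throughput among n samples, or 0.0 if n == 0.
 * Assumption: noise only ever slows a run down, so the maximum
 * is the sample closest to the noise-free performance. */
double best_throughput(const double *samples, size_t n) {
    double best = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] > best) {
            best = samples[i];
        }
    }
    return best;
}
```

If the samples were wall-clock times instead of throughputs, the same reasoning would call for the minimum rather than the maximum.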
Not 100% sure why it didn't reproduce for me, so I'm just adding a mention to the README: e270c6e |
I agree that on the platforms tested so far it's barely worth it; maybe if others report more performance improvement it'll be worth it. This PR was inspired by #95 where it was reported that with hand-unrolling
I am waiting for something like #93 to be merged to recheck and possibly reopen this PR. |
-funroll-all-loops isn't supported on clang anyway. AFAIK clang -Ofast -fopenmp -funroll-all-loops -D_WIN32 -fprofile-instr-use=default.profdata -o run_clang.exe -I. run.c win.c |
@twobob, yes we noticed :-) see above. If it's close to being merged, I'll spend some time with godbolt to see what I can do for clang, short of manual unrolling. |
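One option that might be worth exploring for clang, short of manual unrolling, is a per-loop unroll hint. The sketch below assumes `#pragma clang loop unroll_count(8)` for clang and `#pragma GCC unroll 8` for GCC 8 or later; the macro `UNROLL8` and the function `dot_hinted` are hypothetical names, and whether the hint reproduces the effect of the flag has not been measured here.

```c
#include <stddef.h>

/* Pick a per-loop unroll hint for the compiler in use.
 * clang also defines __GNUC__, so it has to be checked first. */
#if defined(__clang__)
#define UNROLL8 _Pragma("clang loop unroll_count(8)")
#elif defined(__GNUC__) && __GNUC__ >= 8
#define UNROLL8 _Pragma("GCC unroll 8")
#else
#define UNROLL8 /* no hint available: the loop is left as written */
#endif

/* Hypothetical dot product with an unroll-by-8 request on its loop. */
float dot_hinted(const float *w, const float *x, size_t n) {
    float acc = 0.0f;
    UNROLL8
    for (size_t i = 0; i < n; i++) {
        acc += w[i] * x[i];
    }
    return acc;
}
```

The pragmas are only requests, so the generated assembly would still need to be checked, for instance in godbolt.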
To be fair, I run the 4x manual unroll for now anyway, but it would be nice for others. |
Try with 8. That's what gcc does (see godbolt above). How much speedup do you get? A couple percent, like everybody else? |
I went with achieved tok/s: 40.514851 for MINGW which is a marked improvement for mingw and gcc over the previously ubiquitously superior clang build; what might have been as much as a 15% deficit is sometimes clawed back to almost within noise levels. Windows. Junky machine. |
Inspired by #95