Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add -funroll-all-loops to compiler flags #183

Closed
wants to merge 1 commit into from

Conversation

ozabluda
Copy link
Contributor

Unroll all loops, even if their number of iterations is uncertain when the loop is entered. Runs faster on my tests.

For the difference in resulting assembly, see
https://godbolt.org/z/onaEsaEfT

Inspired by
#95

Unroll all loops, even if their number of iterations is uncertain when the loop is entered. Runs faster on my tests.

For the difference in resulting assembly, see
https://godbolt.org/z/onaEsaEfT
@kroggen
Copy link
Contributor

kroggen commented Jul 30, 2023

Interesting. It does not change the code.

Please post the result of some benchmarks, without and with this flag, for each makefile section

It is important also to compare when using OMP_NUM_THREADS=4 ...

@karpathy
Copy link
Owner

I'm not able to reproduce a speedup here on my Linux box. Do you have some timings to share? Also I can't seem to compile this with clang, only with gcc. A warning gets issued that this flag is unsupported.

@ozabluda
Copy link
Contributor Author

ozabluda commented Jul 31, 2023

on my Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz with ./run stories110M.bin

runfast:
tok/s: 33.337691 (without -funroll-all-loops)
tok/s: 33.860045 (with -funroll-all-loops) (+1.6%)

runopenmp:
tok/s: 48.067861 (without -funroll-all-loops)
tok/s: 48.925556 (with -funroll-all-loops) (+1.8%)

To see the possible slight performance improvement on your particular box, you have to run it a couple of times and take the max, as it corresponds to the actual performance sans noise (noise is always additive).

Slight performance improvement is in line with slight performance improvement reported here from manual hand-unrolling:
#95 (comment)

In some cases improvement from may be better. @krzysztof-jusiak reported huge performance improvement with manual hand-unrolling on his use case. This PR was meant is a just a probably much better alternative to the proposed hand-unrolling.

My guess is that the limited performance improvement from unrolling is that CPUs implicitly unroll those loops with speculative execution, always making the correct branch prediction in this case.

@kroggen, what exactly do you mean "It does not change the code."? In godbolt link above it clearly does, even compared to hand-unrolling from the PR that inspired this PR, let alone original vanilla.

@karpathy>I can't seem to compile this with clang, only with gcc. A warning gets issued that this flag is unsupported.

Right, clang doesn't support the flag. I guess I'd replace
CC = gcc with CC = gcc -funroll-all-loops

@karpathy
Copy link
Owner

karpathy commented Aug 1, 2023

Not 100% sure why it didn't reproduce for me so I'm just adding a mention into readme e270c6e

@karpathy karpathy closed this Aug 1, 2023
@ozabluda
Copy link
Contributor Author

ozabluda commented Aug 1, 2023

I agree that on the platforms tested so far it's barely worth it, maybe if more performance improvement is reported by others it'll be worth it.

This PR was inspired by #95 where it was reported that with hand-unrolling

Result (with float16):

before: 16tok/s
after: 71tok/s

I am waiting for something like #93 to merged to recheck and possibly reopen this PR.

@twobob
Copy link

twobob commented Aug 1, 2023

-funroll-all-loops isnt supported on clang anyway. AFAIK

clang -Ofast -fopenmp -funroll-all-loops -D_WIN32 -fprofile-instr-use=default.profdata -o run_clang.exe -I. run.c win.c
clang: warning: optimization flag '-funroll-all-loops' is not supported [-Wignored-optimization-argument]
clang: warning: optimization flag '-funroll-all-loops' is not supported [-Wignored-optimization-argument]

@ozabluda
Copy link
Contributor Author

ozabluda commented Aug 1, 2023

@twobob, yes we noticed :-) see above. If it's close to being merged, I'll spend some time with godbolt to see what I can do for clang, short of manual unrolling.

@twobob
Copy link

twobob commented Aug 1, 2023

to be fair. I run the 4x manual unroll for now anyway, but it would be nice for others.

@ozabluda
Copy link
Contributor Author

ozabluda commented Aug 1, 2023

Try with 8. That's what gcc does (see godbolt above). How much speedup do you get? A couple percent, like everybody else?

@twobob
Copy link

twobob commented Aug 2, 2023

I went with
win64:
x86_64-w64-mingw32-gcc -Ofast -funroll-all-loops -fopenmp -DCOMPILER="MINGW" -D_WIN32 -o runmingw.exe -I. run.c win.c
winclang:
clang -march=native -fno-math-errno -fopenmp -DCOMPILER="CLANG" -D_CRT_SECURE_NO_WARNINGS -Ofast -D_WIN32 -o run.exe -I. run.c win.c
wingcc:
$(CC) -march=native -fno-math-errno -funroll-all-loops -fopenmp -DCOMPILER="GCC" -Ofast -D_WIN32 -o rungcc.exe -I. run.c win.c

achieved tok/s: 40.514851 for MINGW
achieved tok/s: 40.742363 for GCC
achieved tok/s: 43.186424 for CLANG

which is a marked improvement for mingw and gcc over the previously ubiquitously superior clang build; what might have been as much as a 15% deficits sometimes clawed back within almost noise levels. Windows. Junky machine.
thanks for the tip. And for the godbolt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants