Add -funroll-all-loops to compiler flags #183

ozabluda · 2023-07-30T18:24:57Z

Unroll all loops, even if their number of iterations is uncertain when the loop is entered. Runs faster on my tests.

For the difference in resulting assembly, see
https://godbolt.org/z/onaEsaEfT

Inspired by
#95
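To make the transformation concrete, here is a minimal sketch of what unrolling a loop with a runtime trip count looks like when written by hand. The names `dot_rolled` and `dot_unrolled4` are hypothetical and are not taken from run.c; the code gcc actually emits with -funroll-all-loops is the one in the godbolt link and may differ, for example in the unroll factor.

```c
#include <stddef.h>

/* Plain loop: one multiply-add and one loop-condition check per element. */
float dot_rolled(const float *w, const float *x, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += w[i] * x[i];
    }
    return sum;
}

/* Hand-unrolled by 4: the loop condition is checked once per four elements.
 * Because n is only known at run time, a remainder loop handles the last
 * n % 4 elements. */
float dot_unrolled4(const float *w, const float *x, size_t n) {
    float sum = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum += w[i] * x[i];
        sum += w[i + 1] * x[i + 1];
        sum += w[i + 2] * x[i + 2];
        sum += w[i + 3] * x[i + 3];
    }
    for (; i < n; i++) {
        sum += w[i] * x[i];
    }
    return sum;
}
```

As written, both functions add the products in the same order; the unrolled version only reduces how often the loop condition is evaluated.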

kroggen · 2023-07-30T23:37:45Z

Interesting. It does not change the code.

Please post the results of some benchmarks, with and without this flag, for each Makefile section.

It is important also to compare when using OMP_NUM_THREADS=4 ...

karpathy · 2023-07-31T01:07:28Z

I'm not able to reproduce a speedup here on my Linux box. Do you have some timings to share? Also I can't seem to compile this with clang, only with gcc. A warning gets issued that this flag is unsupported.

ozabluda · 2023-07-31T04:46:39Z

On my Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz with ./run stories110M.bin:

runfast:
tok/s: 33.337691 (without -funroll-all-loops)
tok/s: 33.860045 (with -funroll-all-loops) (+1.6%)

runopenmp:
tok/s: 48.067861 (without -funroll-all-loops)
tok/s: 48.925556 (with -funroll-all-loops) (+1.8%)

To see the possible slight performance improvement on your particular box, you have to run it a couple of times and take the max tok/s. Noise only ever adds time to a run, so the fastest run is the closest to the actual performance sans noise.
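The "take the max" procedure and the percentages above can be sketched as two small helpers. This is only an illustration: `best_of` and `speedup_percent` are hypothetical names, and the tok/s readings are assumed to be collected by hand from repeated ./run invocations.

```c
#include <assert.h>
#include <stddef.h>

/* Return the highest tok/s reading among n repeated runs (n must be > 0).
 * The fastest run is taken as the one least disturbed by noise. */
double best_of(const double *tok_per_s, size_t n) {
    assert(n > 0);
    double best = tok_per_s[0];
    for (size_t i = 1; i < n; i++) {
        if (tok_per_s[i] > best) {
            best = tok_per_s[i];
        }
    }
    return best;
}

/* Relative improvement of candidate over baseline, in percent. */
double speedup_percent(double baseline, double candidate) {
    assert(baseline > 0.0);
    return (candidate / baseline - 1.0) * 100.0;
}
```

Applied to the runfast numbers above, `speedup_percent(33.337691, 33.860045)` gives about 1.57, which rounds to the reported +1.6%.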

Slight performance improvement is in line with slight performance improvement reported here from manual hand-unrolling:
#95 (comment)

In some cases the improvement may be larger. @krzysztof-jusiak reported a huge performance improvement from manual hand-unrolling in his use case. This PR was meant as a probably much better alternative to the proposed hand-unrolling.

My guess is that the performance improvement from unrolling is limited because CPUs implicitly unroll those loops with speculative execution, always making the correct branch prediction in this case.

@kroggen, what exactly do you mean by "It does not change the code"? In the godbolt link above it clearly does, even compared to the hand-unrolling from the PR that inspired this PR, let alone the original vanilla code.

@karpathy>I can't seem to compile this with clang, only with gcc. A warning gets issued that this flag is unsupported.

Right, clang doesn't support the flag. I guess I'd replace
CC = gcc with CC = gcc -funroll-all-loops

karpathy · 2023-08-01T15:59:31Z

Not 100% sure why it didn't reproduce for me so I'm just adding a mention into readme e270c6e

ozabluda · 2023-08-01T18:30:39Z

I agree that on the platforms tested so far it's barely worth it. If others report a bigger performance improvement, it may become worth it.

This PR was inspired by #95, where the following was reported with hand-unrolling:

Result (with float16):

before: 16tok/s
after: 71tok/s

I am waiting for something like #93 to be merged to recheck and possibly reopen this PR.

twobob · 2023-08-01T19:18:37Z

-funroll-all-loops isn't supported on clang anyway, AFAIK.

clang -Ofast -fopenmp -funroll-all-loops -D_WIN32 -fprofile-instr-use=default.profdata -o run_clang.exe -I. run.c win.c
clang: warning: optimization flag '-funroll-all-loops' is not supported [-Wignored-optimization-argument]
clang: warning: optimization flag '-funroll-all-loops' is not supported [-Wignored-optimization-argument]

ozabluda · 2023-08-01T19:28:15Z

@twobob, yes we noticed :-) see above. If it's close to being merged, I'll spend some time with godbolt to see what I can do for clang, short of manual unrolling.
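One possible way to request unrolling on both compilers, short of manual unrolling, is a per-loop pragma. Nobody in this thread tried it, so treat it as a sketch: it assumes gcc 8 or newer for `#pragma GCC unroll` and a clang that accepts `#pragma clang loop unroll_count`. `UNROLL_8` and `dot_hinted` are hypothetical names, and the compiler may handle the hint differently from -funroll-all-loops, so any speedup would still have to be measured.

```c
#include <stddef.h>

/* clang also defines __GNUC__, so it has to be checked first. */
#if defined(__clang__)
#  define UNROLL_8 _Pragma("clang loop unroll_count(8)")
#elif defined(__GNUC__) && (__GNUC__ >= 8)
#  define UNROLL_8 _Pragma("GCC unroll 8")
#else
#  define UNROLL_8 /* unknown compiler: no hint, the loop stays as written */
#endif

/* Dot product whose loop carries an unroll-by-8 hint. */
float dot_hinted(const float *w, const float *x, size_t n) {
    float sum = 0.0f;
    UNROLL_8
    for (size_t i = 0; i < n; i++) {
        sum += w[i] * x[i];
    }
    return sum;
}
```

Unlike a global flag, the hint applies only to the loops it is placed on, and the build does not warn about an unsupported option on clang.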

twobob · 2023-08-01T19:29:14Z

To be fair, I run the 4x manual unroll for now anyway, but it would be nice for others.

ozabluda · 2023-08-01T19:32:13Z

Try with 8. That's what gcc does (see godbolt above). How much speedup do you get? A couple percent, like everybody else?

twobob · 2023-08-02T19:44:32Z

I went with
win64:
x86_64-w64-mingw32-gcc -Ofast -funroll-all-loops -fopenmp -DCOMPILER="MINGW" -D_WIN32 -o runmingw.exe -I. run.c win.c
winclang:
clang -march=native -fno-math-errno -fopenmp -DCOMPILER="CLANG" -D_CRT_SECURE_NO_WARNINGS -Ofast -D_WIN32 -o run.exe -I. run.c win.c
wingcc:
$(CC) -march=native -fno-math-errno -funroll-all-loops -fopenmp -DCOMPILER="GCC" -Ofast -D_WIN32 -o rungcc.exe -I. run.c win.c

achieved tok/s: 40.514851 for MINGW
achieved tok/s: 40.742363 for GCC
achieved tok/s: 43.186424 for CLANG

which is a marked improvement for mingw and gcc over the previously ubiquitously superior clang build; what was sometimes as much as a 15% deficit has been clawed back to almost within noise levels. Windows. Junky machine.
Thanks for the tip, and for the godbolt.

Add -funroll-all-loops to compiler flags

d71c1ae

karpathy closed this Aug 1, 2023
