Out of bounds accesses in gemv leading to segmentation faults on AMD Epyc #4013

Closed
twesterhout opened this issue Apr 19, 2023 · 6 comments · Fixed by #4014

@twesterhout

Hello,

I originally reported this issue to ArrayFire, but I can now reproduce it without ArrayFire, so I believe it is really an issue in OpenBLAS.

I've created a gist with a minimal working prototype: https://gist.github.com/twesterhout/855a4268c357b7d504072d2445e529fc

Compile the code with

gcc test.c -lopenblas

and run it under valgrind:

valgrind ./a.out

Here's the output:

==54975== Memcheck, a memory error detector
==54975== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==54975== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==54975== Command: ./a.out
==54975==
==54975== Thread 8:
==54975== Invalid read of size 16
==54975==    at 0x56ED798: zgemv_n_HASWELL (in /nix/store/nz7x22bdcln7vjws178lx8abrz0cw0v1-openblas-0.3.21/lib/libopenblasp-r0.3.21.so)
==54975==    by 0x4A1A0B0: gemv_kernel (in /nix/store/nz7x22bdcln7vjws178lx8abrz0cw0v1-openblas-0.3.21/lib/libopenblasp-r0.3.21.so)
==54975==    by 0x4BAF170: exec_threads (in /nix/store/nz7x22bdcln7vjws178lx8abrz0cw0v1-openblas-0.3.21/lib/libopenblasp-r0.3.21.so)
==54975==    by 0x4BAF539: exec_blas._omp_fn.1 (in /nix/store/nz7x22bdcln7vjws178lx8abrz0cw0v1-openblas-0.3.21/lib/libopenblasp-r0.3.21.so)
==54975==    by 0x620A10D: gomp_thread_start (in /nix/store/shasq3azl2298vqkvq5mc7vivdqp3yrj-gcc-12.2.0-lib/lib/libgomp.so.1.0.0)
==54975==    by 0x62C1E85: start_thread (in /nix/store/8xk4yl1r3n6kbyn05qhan7nbag7npymx-glibc-2.35-224/lib/libc.so.6)
==54975==    by 0x6347EE3: clone (in /nix/store/8xk4yl1r3n6kbyn05qhan7nbag7npymx-glibc-2.35-224/lib/libc.so.6)
==54975==  Address 0x20cc92050 is 16 bytes after a block of size 50,331,648 alloc'd
==54975==    at 0x484679B: malloc (in /nix/store/nffc5p3n0hxk08fbkvg8v5swiqjp493r-valgrind-3.20.0/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==54975==    by 0x4015DD: random_fill (test.c:24)
==54975==    by 0x401755: main (test.c:67)
==54975==
==54975== Thread 1:
==54975== Invalid read of size 16
==54975==    at 0x56ED798: zgemv_n_HASWELL (in /nix/store/nz7x22bdcln7vjws178lx8abrz0cw0v1-openblas-0.3.21/lib/libopenblasp-r0.3.21.so)
==54975==    by 0x499F1BA: zgemv_ (in /nix/store/nz7x22bdcln7vjws178lx8abrz0cw0v1-openblas-0.3.21/lib/libopenblasp-r0.3.21.so)
==54975==    by 0x5CFA180: zlarf_ (in /nix/store/nz7x22bdcln7vjws178lx8abrz0cw0v1-openblas-0.3.21/lib/libopenblasp-r0.3.21.so)
==54975==    by 0x5C411A1: zgelq2_ (in /nix/store/nz7x22bdcln7vjws178lx8abrz0cw0v1-openblas-0.3.21/lib/libopenblasp-r0.3.21.so)
==54975==    by 0x5C4166C: zgelqf_ (in /nix/store/nz7x22bdcln7vjws178lx8abrz0cw0v1-openblas-0.3.21/lib/libopenblasp-r0.3.21.so)
==54975==    by 0x5C41F26: zgels_ (in /nix/store/nz7x22bdcln7vjws178lx8abrz0cw0v1-openblas-0.3.21/lib/libopenblasp-r0.3.21.so)
==54975==    by 0x5F8D83B: LAPACKE_zgels_work (in /nix/store/nz7x22bdcln7vjws178lx8abrz0cw0v1-openblas-0.3.21/lib/libopenblasp-r0.3.21.so)
==54975==    by 0x5F8D501: LAPACKE_zgels (in /nix/store/nz7x22bdcln7vjws178lx8abrz0cw0v1-openblas-0.3.21/lib/libopenblasp-r0.3.21.so)
==54975==    by 0x4017FE: main (test.c:84)
==54975==  Address 0x20cc92210 is 50,332,112 bytes inside a block of size 50,335,648 in arena "client"
==54975==

==54975==
==54975== HEAP SUMMARY:
==54975==     in use at exit: 36,560 bytes in 67 blocks
==54975==   total heap usage: 867 allocs, 800 frees, 557,720,640 bytes allocated
==54975==
==54975== LEAK SUMMARY:
==54975==    definitely lost: 0 bytes in 0 blocks
==54975==    indirectly lost: 0 bytes in 0 blocks
==54975==      possibly lost: 20,160 bytes in 63 blocks
==54975==    still reachable: 16,400 bytes in 4 blocks
==54975==         suppressed: 0 bytes in 0 blocks
==54975== Rerun with --leak-check=full to see details of leaked memory
==54975==
==54975== For lists of detected and suppressed errors, rerun with: -s
==54975== ERROR SUMMARY: 64 errors from 2 contexts (suppressed: 0 from 0)

The system that I tested on:

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  1
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7502 32-Core Processor
Stepping:            0
CPU MHz:             3338.296
BogoMIPS:            4990.58
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-31
NUMA node1 CPU(s):   32-63
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
@martin-frbg
Collaborator

This could be related to #3740, which I had ascribed to GCC's tree-vectorizer, now active by default in recent gcc versions. Unfortunately, at the time the problem only showed up on non-Linux systems, so the compiler pragma that disables it in the affected files is currently ifdef'd to Windows and OSX only. (In any case, that workaround was only put in after 0.3.21.)

@twesterhout
Author

@martin-frbg it does indeed seem related to #3740. I compiled OpenBLAS with the CFLAGS=-fno-tree-vectorize Make flag, and my test now runs cleanly under valgrind. ArrayFire's test suite also passes. Do you think tree-vectorize should be disabled globally when using gcc, or only for certain files?

@martin-frbg
Collaborator

This probably needs more tests, but from the "experiences" gathered so far it may be sufficient to disable the vectorizer on a per-file basis, i.e. by removing the preprocessor conditionals around the #pragma GCC optimize lines in all files touched by PR #3745. In retrospect, there was already a report of this happening on Linux in February, but I failed to reproduce the problem on my hardware back then. (I have not had a chance to try your testcase yet; if I cannot reproduce this either, then my local hardware is probably too weak to expose the problem.)

@twesterhout
Author

@martin-frbg unfortunately, #4014 is not sufficient. It fixes the issue related to linear solving, but I'm now getting crashes in SVD with complex floats. Disabling tree-vectorize globally again fixes the issue.

@martin-frbg
Collaborator

Damn autoclose... I had already suspected the same pragma would be necessary in the equivalent kernels for complex and for double. Guess I'll need to do a full Arrayfire build for testing eventually, and/or find a big x86 server.

@martin-frbg
Collaborator

Closing as no longer reproducible (including with the ArrayFire testsuite) after #4014 & #4015. Apparently the just-released gcc 13.1 has also fixed the tree-vectorizer misbehaviour of preceding releases, but I'd rather wait and test a bit more before the #pragma gets limited to pre-13 versions.
