Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Windows threading performance scaling #4103

Merged
merged 5 commits into from
Jun 25, 2023

Conversation

mseminatore
Copy link
Contributor

No description provided.

@mseminatore
Copy link
Contributor Author

Failed builds look related to PR #4092

@martin-frbg
Copy link
Collaborator

Appveyor build (using old mingw version of gcc) failed with

lib/libopenblas.a(blas_server_win32.c.obj):blas_server_win32.c:(.text+0x3b8): undefined reference to `InterlockedCompareExchange64'

which does seem related to your changes - the others are just timing out after getting scheduled on old&slow hardware

@mseminatore
Copy link
Contributor Author

mseminatore commented Jun 24, 2023 via email

@martin-frbg
Copy link
Collaborator

Appears to be a deficiency in (mingw-) GCC <= 5.3, I've now seen the gcc intrinsic __sync_val_compare_and_swap used as a replacement e.g. in https://github.com/migueldeicaza/mono-wasm-mono/blob/master/mono/utils/atomic.h (could probably be made conditional on GNUC being defined and less than 6)

@martin-frbg martin-frbg added this to the 0.3.24 milestone Jun 25, 2023
@martin-frbg martin-frbg linked an issue Jun 25, 2023 that may be closed by this pull request
@martin-frbg martin-frbg merged commit 7acff1b into OpenMathLib:develop Jun 25, 2023
@mseminatore mseminatore deleted the windows_perf branch June 25, 2023 20:17
@mmuetzel
Copy link
Contributor

It looks like the MSYS2 runners using gcc are stuck indefinitely during the test for SINGLE-DOUBLE_PRECISION_LAPACK_prototype_linear_equation_routines with this change.

Is that a deficiency in GCC from MSYS2? Or is there something odd with this change?

@martin-frbg
Copy link
Collaborator

I was hoping it was just gh runners acting up, as I see no reason why the jobs would hang precisely there. Trying now with the stealth equivalent of #4102 reverted

@mseminatore
Copy link
Contributor Author

mseminatore commented Jun 27, 2023 via email

@martin-frbg
Copy link
Collaborator

Probably not from the workflow run, guess I'll have to reproduce this locally.

@martin-frbg
Copy link
Collaborator

Seems to get stuck in dgetrf (from TESTING/LIN/xlintstds <dstest.in) :

(gdb) thr 1
[Switching to thread 1 (Thread 8604.0x2ba8)]
#0  0x000000010054cb7f in dgetrf_parallel (args=0xfff94150, range_m=0x0, range_n=0x0,
    sa=0x6ffff0000000, sb=0x6ffff0090000, myid=0) at getrf_parallel.c:627
627     getrf_parallel.c: No such file or directory.
(gdb) bt
#0  0x000000010054cb7f in dgetrf_parallel (args=0xfff94150, range_m=0x0, range_n=0x0,
    sa=0x6ffff0000000, sb=0x6ffff0090000, myid=0) at getrf_parallel.c:627
#1  0x000000010051bf3d in dgetrf_ (M=0xfff94600, N=0xfff94600, a=0xfffb8a60, ldA=0xfff9460c,
    ipiv=0xfffb0370, Info=0xfff94634) at lapack/getrf.c:115
#2  0x000000010050e9b2 in dsgesv (n=101, nrhs=1, a=..., lda=101, ipiv=..., b=..., ldb=101, x=...,
    ldx=101, work=..., swork=..., iter=-3, info=0) at dsgesv.f:417
#3  0x0000000100403833 in ddrvab (dotype=..., nm=11, mval=..., nns=4, nsval=..., thresh=30,
    nmax=132, a=..., afac=..., b=..., x=..., work=..., rwork=..., swork=..., iwork=..., nout=6)
    at ddrvab.f:313
#4  0x0000000100402af3 in dchkab () at dchkab.f:303
#5  0x0000000100402f3d in main (argc=1, argv=0x80004a810) at dchkab.f:364
#6  0x0000000180049b91 in dll_crt0_1 ()
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/dcrt0.cc:1053
#7  0x0000000180047716 in _cygtls::call2 (this=0xffffce00, func=0x180048a40 <dll_crt0_1(void*)>,
    arg=0x0, buf=buf@entry=0xffffcdf0)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:40
#8  0x00000001800477c4 in _cygtls::call (func=<optimized out>, arg=<optimized out>)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:27
#9  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

with these threads

(gdb)  info thre
  Id   Target Id                      Frame
* 1    Thread 8604.0x2ba8 "xlintstds" 0x000000010054cb7f in dgetrf_parallel (args=0xfff94150,
    range_m=0x0, range_n=0x0, sa=0x6ffff0000000, sb=0x6ffff0090000, myid=0)
    at getrf_parallel.c:627
  5    Thread 8604.0x834 "sig"        0x00007ff8d86aed14 in ntdll!ZwReadFile ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
  6    Thread 8604.0x3030 "consm"     0x00007ff8d86af7a4 in ntdll!ZwWaitForMultipleObjects ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
  7    Thread 8604.0x968              0x00007ff8d86aecd4 in ntdll!ZwWaitForSingleObject ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
  8    Thread 8604.0x291c             0x00007ff8d86b28a4 in ntdll!ZwWaitForWorkViaWorkerFactory ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
  9    Thread 8604.0x3364             0x00007ff8d86b28a4 in ntdll!ZwWaitForWorkViaWorkerFactory ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
  19   Thread 8604.0xfb0              0x00007ff8d5d567ae in KERNELBASE!CtrlRoutine ()
   from C:\WINDOWS\System32\KernelBase.dll
(gdb)

(gdb) thr 5
[Switching to thread 5 (Thread 8604.0x834)]
#0  0x00007ff8d86aed14 in ntdll!ZwReadFile () from C:\WINDOWS\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x00007ff8d86aed14 in ntdll!ZwReadFile () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007ff8d5c36e4b in ReadFile () from C:\WINDOWS\System32\KernelBase.dll
#2  0x0000000180139201 in wait_sig ()
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/sigproc.cc:1358
#3  0x0000000180046583 in cygthread::callfunc (this=this@entry=0x18023c840 <threads>,
    issimplestub=issimplestub@entry=false)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygthread.cc:48
#4  0x0000000180046b46 in cygthread::stub (arg=arg@entry=0x18023c840 <threads>)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygthread.cc:91
#5  0x0000000180047716 in _cygtls::call2 (this=0x10ece00,
    func=0x180046ac0 <cygthread::stub(void*)>, arg=0x18023c840 <threads>, buf=buf@entry=0x10ecd20)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:40
#6  0x00000001800477c4 in _cygtls::call (func=<optimized out>, arg=<optimized out>)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:27
#7  0x00007ff8d68626ad in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#8  0x00007ff8d866a9f8 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#9  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thr 6
[Switching to thread 6 (Thread 8604.0x3030)]
#0  0x00007ff8d86af7a4 in ntdll!ZwWaitForMultipleObjects () from C:\WINDOWS\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x00007ff8d86af7a4 in ntdll!ZwWaitForMultipleObjects () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007ff8d5c6fcf9 in WaitForMultipleObjectsEx () from C:\WINDOWS\System32\KernelBase.dll
#2  0x00007ff8d5c6fbfe in WaitForMultipleObjects () from C:\WINDOWS\System32\KernelBase.dll
#3  0x000000018004791a in cygwait (object=object@entry=0x0, timeout=timeout@entry=0x288caf8,
    mask=mask@entry=5) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygwait.cc:75
#4  0x000000018006ef58 in cygwait (mask=5, howlong=40, h=0x0)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygwait.h:45
#5  cygwait (howlong=40, h=0x0) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygwait.h:51
#6  cygwait (howlong=40) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygwait.h:57
#7  fhandler_console::cons_master_thread (p=0x288cba0, ttyp=0x180000000)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/fhandler_console.cc:569
#8  0x000000018007b6ee in cons_master_thread (arg=0x180359790)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/fhandler_console.cc:216
#9  0x0000000180046583 in cygthread::callfunc (this=this@entry=0x18023c898 <threads+88>,
    issimplestub=issimplestub@entry=false)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygthread.cc:48
#10 0x0000000180046b46 in cygthread::stub (arg=arg@entry=0x18023c898 <threads+88>)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygthread.cc:91
#11 0x0000000180047716 in _cygtls::call2 (this=0x288ce00,
    func=0x180046ac0 <cygthread::stub(void*)>, arg=0x18023c898 <threads+88>,
    buf=buf@entry=0x288cd20) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:40
#12 0x00000001800477c4 in _cygtls::call (func=<optimized out>, arg=<optimized out>)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:27
#13 0x00007ff8d68626ad in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#14 0x00007ff8d866a9f8 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#15 0x0000000000000000 in ?? ()
--Type <RET> for more, q to quit, c to continue without paging--
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thr 7
[Switching to thread 7 (Thread 8604.0x968)]
#0  0x00007ff8d86aecd4 in ntdll!ZwWaitForSingleObject () from C:\WINDOWS\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x00007ff8d86aecd4 in ntdll!ZwWaitForSingleObject () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007ff8d5c4480e in WaitForSingleObjectEx () from C:\WINDOWS\System32\KernelBase.dll
#2  0x000000010044a497 in blas_thread_server (arg=0x0) at blas_server_win32.c:233
#3  0x0000000180047716 in _cygtls::call2 (this=0x2a8ce00, func=0x10044a462 <blas_thread_server>,
    arg=0x0, buf=buf@entry=0x2a8cd20)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:40
#4  0x00000001800477c4 in _cygtls::call (func=<optimized out>, arg=<optimized out>)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:27
#5  0x00007ff8d68626ad in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#6  0x00007ff8d866a9f8 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#7  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thr 8
[Switching to thread 8 (Thread 8604.0x291c)]
#0  0x00007ff8d86b28a4 in ntdll!ZwWaitForWorkViaWorkerFactory ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x00007ff8d86b28a4 in ntdll!ZwWaitForWorkViaWorkerFactory ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007ff8d864536e in ntdll!RtlClearThreadWorkOnBehalfTicket ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#2  0x0000000180047716 in _cygtls::call2 (this=0x2c8ce00,
    func=0x7ff8d8645080 <ntdll!RtlClearThreadWorkOnBehalfTicket+112>, arg=0x7f0ce0,
    buf=buf@entry=0x2c8cd20) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:40
#3  0x00000001800477c4 in _cygtls::call (func=<optimized out>, arg=<optimized out>)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:27
#4  0x00007ff8d68626ad in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#5  0x00007ff8d866a9f8 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#6  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thr 9
[Switching to thread 9 (Thread 8604.0x3364)]
#0  0x00007ff8d86b28a4 in ntdll!ZwWaitForWorkViaWorkerFactory ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x00007ff8d86b28a4 in ntdll!ZwWaitForWorkViaWorkerFactory ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007ff8d864536e in ntdll!RtlClearThreadWorkOnBehalfTicket ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#2  0x0000000180047716 in _cygtls::call2 (this=0x2e8ce00,
    func=0x7ff8d8645080 <ntdll!RtlClearThreadWorkOnBehalfTicket+112>, arg=0x7f0ce0,
    buf=buf@entry=0x2e8cd20) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:40
#3  0x00000001800477c4 in _cygtls::call (func=<optimized out>, arg=<optimized out>)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:27
#4  0x00007ff8d68626ad in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#5  0x00007ff8d866a9f8 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#6  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

(this was built with msys-mingw64 (msys2 version 20190524), gcc 11.3 on a quad-core SkylakeX (Xeon W-2123) under Windows 11)

@mseminatore
Copy link
Contributor Author

mseminatore commented Jun 27, 2023

@martin-frbg Let me open a new branch on my side, with a PR to try to fix this. In optimizing away locks I removed one that may be questionable. That was a lock to handle calling exec_blas_async() while worker threads are still processing.

I did not think I was able to find a case where that happens. As a result, I removed a lock and logic that tested for and appended a new work queue to an existing one if an async work-ahead call was made.

I can put that code back and see if it fixes the problem.

@mseminatore
Copy link
Contributor Author

It looks to me like the only place in the entire codebase where exec_blas_async() is explicitly called is from lapack/getrf/getrf_parallel.c @ line 615. If that call can happen before the worker thread queue is drained work can be lost. So this looks like a race condition that maybe shows up only in this specific hw/sw configuration and test.

I will put the protection on that case back in place and send a new PR so we can see if that fixes it.

@mseminatore
Copy link
Contributor Author

See PR #4112 for a possible fix.

@mseminatore
Copy link
Contributor Author

OK, so I don't think that fixed it.

There must be an issue with exec_blas_async_wait() - my guess is the hang is related to the logic in that join point. getrf_parallel.c does some strange things explicitly calling exec_blas_async() and exec_blas_async_wait() along with inner_basic_thread(). I'll try to unravel that logic to understand what may be going on.

@martin-frbg
Copy link
Collaborator

I guess I'll revert your contribution for now, just to avoid having the problematic behavior end up in somebody's build (or even distribution)

@mseminatore
Copy link
Contributor Author

Yes, thank you and sorry. I will study the code further to see if I can come up with a better solution.

Do you know if it is possible to run that particular test that was failing in isolation? I'd like to debug it to understand the calling pattern better.

@martin-frbg
Copy link
Collaborator

martin-frbg commented Jun 28, 2023

Sure - in lapack-netlib/TESTING, run LIN/xlintstds.exe with dstest.in as (standard) input. You can edit dstest.in to cut down the number of matrix sizes it tries, and to keep it from testing dposv as well. (Have not checked which matrix size shows the problem first though - backtrace suggests it is M=101)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Threading performance issue on Windows
3 participants