Improve Windows threading performance scaling #4103

mseminatore · 2023-06-24T02:46:30Z

No description provided.

mseminatore · 2023-06-24T05:11:09Z

Failed builds look related to PR #4092

martin-frbg · 2023-06-24T07:12:06Z

Appveyor build (using old mingw version of gcc) failed with

lib/libopenblas.a(blas_server_win32.c.obj):blas_server_win32.c:(.text+0x3b8): undefined reference to `InterlockedCompareExchange64'

which does seem related to your changes - the others are just timing out after getting scheduled on old&slow hardware

mseminatore · 2023-06-24T15:27:00Z

Thank you, I will look into that.

martin-frbg · 2023-06-24T15:48:33Z

Appears to be a deficiency in (mingw-) GCC <= 5.3, I've now seen the gcc intrinsic __sync_val_compare_and_swap used as a replacement e.g. in https://github.com/migueldeicaza/mono-wasm-mono/blob/master/mono/utils/atomic.h (could probably be made conditional on GNUC being defined and less than 6)

mmuetzel · 2023-06-27T07:03:23Z

It looks like the MSYS2 runners using gcc are stuck indefinitely during the test for SINGLE-DOUBLE_PRECISION_LAPACK_prototype_linear_equation_routines with this change.

Is that a deficiency in GCC from MSYS2? Or is there something odd with this change?

martin-frbg · 2023-06-27T08:46:20Z

I was hoping it was just gh runners acting up, as I see no reason why the jobs would hang precisely there. Trying now with the stealth equivalent of #4102 reverted

mseminatore · 2023-06-27T14:25:25Z

I wonder if it had to do with re-entrancy in `exec_blas_async`. I had that as a question. I will look. Can we get a callstack from the hang?

martin-frbg · 2023-06-27T17:16:18Z

Probably not from the workflow run, guess I'll have to reproduce this locally.

martin-frbg · 2023-06-27T22:36:01Z

Seems to get stuck in dgetrf (from TESTING/LIN/xlintstds <dstest.in) :

(gdb) thr 1
[Switching to thread 1 (Thread 8604.0x2ba8)]
#0  0x000000010054cb7f in dgetrf_parallel (args=0xfff94150, range_m=0x0, range_n=0x0,
    sa=0x6ffff0000000, sb=0x6ffff0090000, myid=0) at getrf_parallel.c:627
627     getrf_parallel.c: No such file or directory.
(gdb) bt
#0  0x000000010054cb7f in dgetrf_parallel (args=0xfff94150, range_m=0x0, range_n=0x0,
    sa=0x6ffff0000000, sb=0x6ffff0090000, myid=0) at getrf_parallel.c:627
#1  0x000000010051bf3d in dgetrf_ (M=0xfff94600, N=0xfff94600, a=0xfffb8a60, ldA=0xfff9460c,
    ipiv=0xfffb0370, Info=0xfff94634) at lapack/getrf.c:115
#2  0x000000010050e9b2 in dsgesv (n=101, nrhs=1, a=..., lda=101, ipiv=..., b=..., ldb=101, x=...,
    ldx=101, work=..., swork=..., iter=-3, info=0) at dsgesv.f:417
#3  0x0000000100403833 in ddrvab (dotype=..., nm=11, mval=..., nns=4, nsval=..., thresh=30,
    nmax=132, a=..., afac=..., b=..., x=..., work=..., rwork=..., swork=..., iwork=..., nout=6)
    at ddrvab.f:313
#4  0x0000000100402af3 in dchkab () at dchkab.f:303
#5  0x0000000100402f3d in main (argc=1, argv=0x80004a810) at dchkab.f:364
#6  0x0000000180049b91 in dll_crt0_1 ()
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/dcrt0.cc:1053
#7  0x0000000180047716 in _cygtls::call2 (this=0xffffce00, func=0x180048a40 <dll_crt0_1(void*)>,
    arg=0x0, buf=buf@entry=0xffffcdf0)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:40
#8  0x00000001800477c4 in _cygtls::call (func=<optimized out>, arg=<optimized out>)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:27
#9  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

with these threads

(gdb)  info thre
  Id   Target Id                      Frame
* 1    Thread 8604.0x2ba8 "xlintstds" 0x000000010054cb7f in dgetrf_parallel (args=0xfff94150,
    range_m=0x0, range_n=0x0, sa=0x6ffff0000000, sb=0x6ffff0090000, myid=0)
    at getrf_parallel.c:627
  5    Thread 8604.0x834 "sig"        0x00007ff8d86aed14 in ntdll!ZwReadFile ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
  6    Thread 8604.0x3030 "consm"     0x00007ff8d86af7a4 in ntdll!ZwWaitForMultipleObjects ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
  7    Thread 8604.0x968              0x00007ff8d86aecd4 in ntdll!ZwWaitForSingleObject ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
  8    Thread 8604.0x291c             0x00007ff8d86b28a4 in ntdll!ZwWaitForWorkViaWorkerFactory ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
  9    Thread 8604.0x3364             0x00007ff8d86b28a4 in ntdll!ZwWaitForWorkViaWorkerFactory ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
  19   Thread 8604.0xfb0              0x00007ff8d5d567ae in KERNELBASE!CtrlRoutine ()
   from C:\WINDOWS\System32\KernelBase.dll
(gdb)

(gdb) thr 5
[Switching to thread 5 (Thread 8604.0x834)]
#0  0x00007ff8d86aed14 in ntdll!ZwReadFile () from C:\WINDOWS\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x00007ff8d86aed14 in ntdll!ZwReadFile () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007ff8d5c36e4b in ReadFile () from C:\WINDOWS\System32\KernelBase.dll
#2  0x0000000180139201 in wait_sig ()
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/sigproc.cc:1358
#3  0x0000000180046583 in cygthread::callfunc (this=this@entry=0x18023c840 <threads>,
    issimplestub=issimplestub@entry=false)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygthread.cc:48
#4  0x0000000180046b46 in cygthread::stub (arg=arg@entry=0x18023c840 <threads>)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygthread.cc:91
#5  0x0000000180047716 in _cygtls::call2 (this=0x10ece00,
    func=0x180046ac0 <cygthread::stub(void*)>, arg=0x18023c840 <threads>, buf=buf@entry=0x10ecd20)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:40
#6  0x00000001800477c4 in _cygtls::call (func=<optimized out>, arg=<optimized out>)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:27
#7  0x00007ff8d68626ad in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#8  0x00007ff8d866a9f8 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#9  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thr 6
[Switching to thread 6 (Thread 8604.0x3030)]
#0  0x00007ff8d86af7a4 in ntdll!ZwWaitForMultipleObjects () from C:\WINDOWS\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x00007ff8d86af7a4 in ntdll!ZwWaitForMultipleObjects () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007ff8d5c6fcf9 in WaitForMultipleObjectsEx () from C:\WINDOWS\System32\KernelBase.dll
#2  0x00007ff8d5c6fbfe in WaitForMultipleObjects () from C:\WINDOWS\System32\KernelBase.dll
#3  0x000000018004791a in cygwait (object=object@entry=0x0, timeout=timeout@entry=0x288caf8,
    mask=mask@entry=5) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygwait.cc:75
#4  0x000000018006ef58 in cygwait (mask=5, howlong=40, h=0x0)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygwait.h:45
#5  cygwait (howlong=40, h=0x0) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygwait.h:51
#6  cygwait (howlong=40) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygwait.h:57
#7  fhandler_console::cons_master_thread (p=0x288cba0, ttyp=0x180000000)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/fhandler_console.cc:569
#8  0x000000018007b6ee in cons_master_thread (arg=0x180359790)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/fhandler_console.cc:216
#9  0x0000000180046583 in cygthread::callfunc (this=this@entry=0x18023c898 <threads+88>,
    issimplestub=issimplestub@entry=false)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygthread.cc:48
#10 0x0000000180046b46 in cygthread::stub (arg=arg@entry=0x18023c898 <threads+88>)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygthread.cc:91
#11 0x0000000180047716 in _cygtls::call2 (this=0x288ce00,
    func=0x180046ac0 <cygthread::stub(void*)>, arg=0x18023c898 <threads+88>,
    buf=buf@entry=0x288cd20) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:40
#12 0x00000001800477c4 in _cygtls::call (func=<optimized out>, arg=<optimized out>)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:27
#13 0x00007ff8d68626ad in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#14 0x00007ff8d866a9f8 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#15 0x0000000000000000 in ?? ()
--Type <RET> for more, q to quit, c to continue without paging--
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thr 7
[Switching to thread 7 (Thread 8604.0x968)]
#0  0x00007ff8d86aecd4 in ntdll!ZwWaitForSingleObject () from C:\WINDOWS\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x00007ff8d86aecd4 in ntdll!ZwWaitForSingleObject () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007ff8d5c4480e in WaitForSingleObjectEx () from C:\WINDOWS\System32\KernelBase.dll
#2  0x000000010044a497 in blas_thread_server (arg=0x0) at blas_server_win32.c:233
#3  0x0000000180047716 in _cygtls::call2 (this=0x2a8ce00, func=0x10044a462 <blas_thread_server>,
    arg=0x0, buf=buf@entry=0x2a8cd20)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:40
#4  0x00000001800477c4 in _cygtls::call (func=<optimized out>, arg=<optimized out>)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:27
#5  0x00007ff8d68626ad in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#6  0x00007ff8d866a9f8 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#7  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thr 8
[Switching to thread 8 (Thread 8604.0x291c)]
#0  0x00007ff8d86b28a4 in ntdll!ZwWaitForWorkViaWorkerFactory ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x00007ff8d86b28a4 in ntdll!ZwWaitForWorkViaWorkerFactory ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007ff8d864536e in ntdll!RtlClearThreadWorkOnBehalfTicket ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#2  0x0000000180047716 in _cygtls::call2 (this=0x2c8ce00,
    func=0x7ff8d8645080 <ntdll!RtlClearThreadWorkOnBehalfTicket+112>, arg=0x7f0ce0,
    buf=buf@entry=0x2c8cd20) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:40
#3  0x00000001800477c4 in _cygtls::call (func=<optimized out>, arg=<optimized out>)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:27
#4  0x00007ff8d68626ad in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#5  0x00007ff8d866a9f8 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#6  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thr 9
[Switching to thread 9 (Thread 8604.0x3364)]
#0  0x00007ff8d86b28a4 in ntdll!ZwWaitForWorkViaWorkerFactory ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x00007ff8d86b28a4 in ntdll!ZwWaitForWorkViaWorkerFactory ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007ff8d864536e in ntdll!RtlClearThreadWorkOnBehalfTicket ()
   from C:\WINDOWS\SYSTEM32\ntdll.dll
#2  0x0000000180047716 in _cygtls::call2 (this=0x2e8ce00,
    func=0x7ff8d8645080 <ntdll!RtlClearThreadWorkOnBehalfTicket+112>, arg=0x7f0ce0,
    buf=buf@entry=0x2e8cd20) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:40
#3  0x00000001800477c4 in _cygtls::call (func=<optimized out>, arg=<optimized out>)
    at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:27
#4  0x00007ff8d68626ad in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#5  0x00007ff8d866a9f8 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#6  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

(this was built with msys-mingw64 (msys2 version 20190524), gcc 11.3 on a quad-core SkylakeX (Xeon W-2123) under Windows 11)

mseminatore · 2023-06-27T23:02:07Z

@martin-frbg Let me open a new branch on my side, with a PR to try to fix this. In optimizing away locks I removed one that may be questionable. That was a lock to handle calling exec_blas_async() while worker threads are still processing.

I did not think I was able to find a case where that happens. As a result, I removed a lock and logic that tested for and appended a new work queue to an existing one if an async work-ahead call was made.

I can put that code back and see if it fixes the problem.

mseminatore · 2023-06-27T23:28:04Z

It looks to me like the only place in the entire codebase where exec_blas_async() is explicitly called is from lapack/getrf/getrf_parallel.c @ line 615. If that call can happen before the worker thread queue is drained work can be lost. So this looks like a race condition that maybe shows up only in this specific hw/sw configuration and test.

I will put the protection on that case back in place and send a new PR so we can see if that fixes it.

mseminatore · 2023-06-27T23:28:36Z

See PR #4112 for a possible fix.

mseminatore · 2023-06-28T03:14:02Z

OK, so I don't think that fixed it.

There must be an issue with exec_blas_async_wait() - my guess is the hang is related to the logic in that join point. getrf_parallel.c does some strange things explicitly calling exec_blas_async() and exec_blas_async_wait() along with inner_basic_thread(). I'll try to unravel that logic to understand what may be going on.

martin-frbg · 2023-06-28T07:51:40Z

I guess I'll revert your contribution for now, just to avoid having the problematic behavior end up in somebody's build (or even distribution)

mseminatore · 2023-06-28T16:58:18Z

Yes, thank you and sorry. I will study the code further to see if I can come up with a better solution.

Do you know if it is possible to run that particular test that was failing in isolation? I'd like to debug it to understand the calling pattern better.

martin-frbg · 2023-06-28T17:13:55Z

Sure - in lapack-netlib/TESTING, run LIN/xlintstds.exe with dstest.in as (standard) input. You can edit dstest.in to cut down the number of matrix sizes it tries, and to keep it from testing dposv as well. (Have not checked which matrix size shows the problem first though - backtrace suggests it is M=101)

mseminatore added 2 commits June 23, 2023 19:42

fix OpenMathLib#4063 threading perf issues on Windows

d301649

fix OpenMathLib#4063 remove unused pool_lock

8caabc5

mseminatore mentioned this pull request Jun 24, 2023

Threading performance issue on Windows #4063

Closed

update contributors

427f9f2

mseminatore added 2 commits June 24, 2023 14:35

attempt to fix old mingw gcc issue

7783a9a

fix missing #endif

d6991dd

martin-frbg added this to the 0.3.24 milestone Jun 25, 2023

martin-frbg linked an issue Jun 25, 2023 that may be closed by this pull request

Threading performance issue on Windows #4063

Closed

martin-frbg merged commit 7acff1b into OpenMathLib:develop Jun 25, 2023

mseminatore deleted the windows_perf branch June 25, 2023 20:17

martin-frbg mentioned this pull request Jun 28, 2023

Revert "Improve Windows threading performance scaling" #4113

Merged

This was referenced Jul 19, 2023

Ensure that early calls to blas_set_num_threads will not overwrite unrelated memory #4151

Merged

restore lock in exec_blas_async to prevent race on queue #4112

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve Windows threading performance scaling #4103

Improve Windows threading performance scaling #4103

mseminatore commented Jun 24, 2023

mseminatore commented Jun 24, 2023

martin-frbg commented Jun 24, 2023

mseminatore commented Jun 24, 2023 via email •

edited

Loading

martin-frbg commented Jun 24, 2023

mmuetzel commented Jun 27, 2023

martin-frbg commented Jun 27, 2023

mseminatore commented Jun 27, 2023 via email •

edited

Loading

martin-frbg commented Jun 27, 2023

martin-frbg commented Jun 27, 2023

mseminatore commented Jun 27, 2023 •

edited

Loading

mseminatore commented Jun 27, 2023

mseminatore commented Jun 27, 2023

mseminatore commented Jun 28, 2023

martin-frbg commented Jun 28, 2023

mseminatore commented Jun 28, 2023

martin-frbg commented Jun 28, 2023 •

edited

Loading

Improve Windows threading performance scaling #4103

Improve Windows threading performance scaling #4103

Conversation

mseminatore commented Jun 24, 2023

mseminatore commented Jun 24, 2023

martin-frbg commented Jun 24, 2023

mseminatore commented Jun 24, 2023 via email • edited Loading

martin-frbg commented Jun 24, 2023

mmuetzel commented Jun 27, 2023

martin-frbg commented Jun 27, 2023

mseminatore commented Jun 27, 2023 via email • edited Loading

martin-frbg commented Jun 27, 2023

martin-frbg commented Jun 27, 2023

mseminatore commented Jun 27, 2023 • edited Loading

mseminatore commented Jun 27, 2023

mseminatore commented Jun 27, 2023

mseminatore commented Jun 28, 2023

martin-frbg commented Jun 28, 2023

mseminatore commented Jun 28, 2023

martin-frbg commented Jun 28, 2023 • edited Loading

mseminatore commented Jun 24, 2023 via email •

edited

Loading

mseminatore commented Jun 27, 2023 via email •

edited

Loading

mseminatore commented Jun 27, 2023 •

edited

Loading

martin-frbg commented Jun 28, 2023 •

edited

Loading