
dpotrf + dpotri: Windows vs Linux #4886

Open
AllinCottrell opened this issue Aug 29, 2024 · 17 comments · May be fixed by #4994

AllinCottrell commented Aug 29, 2024

I've come across what looks like an anomalous difference in performance inverting a positive definite matrix using dpotrf() and dpotri(), on Windows as compared with Linux. This is on a dual-boot SkylakeX laptop, using OpenBLAS 0.3.28, compiled with gcc 14.2.0 on Arch Linux and cross-compiled with x86_64-w64-mingw32-gcc 14.2.0 for Windows 11, in both cases using OpenMP for threading. The configuration flags are mostly the same for the two OpenBLAS builds, except that the Windows build uses DYNAMIC_ARCH=1 while the Linux one is left to auto-detect SkylakeX.

The context is a Gibbs sampling operation with many thousands of iterations, so the performance difference becomes very striking. My test rig iterates inversion of a sequence of p.d. matrices of moderate size, from dimension 4 to 64 by powers of 2. Given the moderate size, multi-threading is not really worthwhile. Best performance is achieved by setting OMP_NUM_THREADS=1; in that case the rig runs very fast on both platforms, with Windows marginally slower than Linux. But if I set the number of OMP threads to equal the number of physical cores (4), which is the default in the program I'm working with,

  • there's just a slight degradation of performance on Linux, but
  • the performance on Windows becomes really horrible, 10 or more times slower than Linux.

I'd be very grateful if anyone can offer insight into what might be going on here. I'd be happy to supply more details depending on what might be relevant.

@martin-frbg (Collaborator)

Can you set OPENBLAS_VERBOSE=2 in the Windows environment please, just to be sure that it uses SKYLAKEX there too as expected? There may be a few places in the code where OpenMP is handled differently on the two platforms, and I guess the libgomp runtime on Windows may differ from the Linux implementation too... I'm currently at a conference with limited access to decent hardware, so it may take me a few days to investigate.

@AllinCottrell (Author)

Thanks for looking into this, Martin. I can confirm that SKYLAKEX is detected on Windows.

@AllinCottrell (Author)

Any more thoughts on this?

@martin-frbg (Collaborator)

Thoughts have been few and far between, as I caught covid in the meantime. Sorry, nothing obvious in the OpenBLAS codebase comes to mind even now. You could try whether setting OMP_WAIT_POLICY=passive has any influence on this misbehaviour.
Did you perchance use an earlier version of OpenBLAS that did not show this? Otherwise, a small, self-contained test case would be helpful for tracking this down.

AllinCottrell (Author) commented Sep 17, 2024

Thanks, Martin. OMP_WAIT_POLICY=passive does have an influence: it makes the problem a good deal worse! We have used earlier versions of OpenBLAS. We noticed the problem only recently, by chance -- it may well have been there before, unnoticed. Anyway I'm attaching a self-contained test-case and I'll inline below the results from running it on a few systems.

Aside from the relatively extreme problem on Windows, it seems to me that in general the matrix size at which multi-threading kicks in is much too small for optimality. In lapack/potrf/potrf_L_parallel.c there's this clause:

if (n <= GEMM_UNROLL_N * 4) {
    info = POTRF_L_SINGLE(args, NULL, range_n, sa, sb, 0);
    return info;
}

In many cases the default value of DGEMM_UNROLL_N is 4, so this policy would start multi-threading at n = 17. From experimentation on several machines with various Intel and AMD processors I think the threshold should be much higher, in the range 100-150.

Anyway, here are the results I have from the test case. The times are for 50000 replications of inversion of a p.d. matrix. "default" means letting OpenBLAS decide how many threads to use, and "single" means forcing use of a single thread. All the machines referenced below are quad-core.

Arch Linux, OpenBLAS 0.3.28, blascore HASWELL
Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz

Times in seconds plus ratio default/single:

  n   default    single     d/s
  4    0.0456    0.0471    0.97
  8    0.0656    0.0652    1.01
 16    0.1386    0.1378    1.01
 17    0.5386    0.1498    3.60
 32    0.7247    0.4257    1.70
 64    2.4502    1.6720    1.47

Windows 11, OpenBLAS 0.3.28, blascore SKYLAKEX
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz

Times in seconds plus ratio default/single:

  n   default    single     d/s
  4    0.0400    0.0680    0.59
  8    2.2960    0.0940   24.43
 16    6.7490    0.1860   36.28
 17    8.9470    0.2000   44.73
 32   15.9350    0.5670   28.10
 64   36.5520    2.2400   16.32

Arch Linux, OpenBLAS 0.3.28, blascore SKYLAKEX
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz

Times in seconds plus ratio default/single:

  n   default    single     d/s
  4    0.0416    0.0425    0.98
  8    0.2349    0.0546    4.30
 16    0.6412    0.1092    5.87
 17    0.8500    0.1417    6.00
 32    1.6334    0.3540    4.61
 64    4.8043    1.3387    3.59

Fedora, OpenBLAS 0.3.21, blascore SANDYBRIDGE
Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz

Times in seconds plus ratio default/single:

  n   default    single     d/s
  4    0.0872    0.0813    1.07
  8    0.1041    0.1049    0.99
 16    0.6036    0.1979    3.05
 17    0.8675    0.2146    4.04
 32    1.7614    0.6149    2.86
 64    6.5231    2.8875    2.26

invpd.c.txt

@martin-frbg (Collaborator)

Thank you very much. Unfortunately I did not manage to do much so far, but at least this does not appear to be a recent regression.

@martin-frbg (Collaborator)

From individual timing of the two functions, the problem appears to be specifically related to POTRI rather than POTRF. There used to be a reimplementation of POTRI in OpenBLAS, but it was disabled ten years ago in #410 due to problems with the code (and subsequent suggestions that the function itself posed no bottleneck). The POTRI from Reference-LAPACK is basically a frontend for TRTRI, which OpenBLAS again reimplements. Unlike the reimplementation for POTRF, this one currently uses full-on multithreading even for the smallest workloads, which is addressed by #4994.
Why this leads to pronounced hangs on Windows is still unclear to me, but it appears to be a misfeature of the libgomp implementation there.

@AllinCottrell (Author)

Thanks, Martin. I too did some more timing tests, and I agree that it's POTRI rather than POTRF that's the trigger for the slowdown on Windows.

@martin-frbg (Collaborator)

The behaviour when compiled with LLVM 19 (and its libomp) appears to be a lot more Linux-like, even without my small correction from #4994.
Unfortunately that version still appears to have some problem with AVX512 code on Windows, although OpenBLAS itself builds and tests without errors. I thought we could finally put those behind us, but your test code crashes with weird symptoms unless I build OpenBLAS for the Haswell target... At least it seems to confirm that something in the way GNU libgomp implements thread idling or locking is to blame for the extremely poor performance originally observed.

@marcingretl

Hi,
look at these results:

  1. clang under msys2, native MS Windows clang (used with the MS SDK), or msvc (cl.exe): all three combinations result in very similar timings:

Using default threading...
Forcing single-threaded behavior...
Times in seconds plus ratio default/single:

  n   default    single     d/s
  4    0.1426    0.0902    1.58
  8    0.1254    0.1251    1.00
 16    0.2436    0.2429    1.00
 17    6.2650    6.0594    1.03
 32    3.6997    3.6944    1.00
 64   11.5803   11.5690    1.00

  2. gcc under msys2:

Using default threading...
Forcing single-threaded behavior...
Times in seconds plus ratio default/single:

  n   default    single     d/s
  4    0.0910    0.0950    0.96
  8    0.1260    0.1250    1.01
 16    0.2420    0.2410    1.00
 17    6.0890    0.2610   23.33
 32    3.7180    0.7400    5.02
 64   11.6410    3.1500    3.70

OpenBLAS 0.3.28, Win10, i5-4460 @3.2GHz

Marcin

@martin-frbg (Collaborator)

Hi Marcin, thanks for the data - is this with USE_OPENMP=1 as well? The very similar timing for the native MSVC build is a bit surprising, as that would be using generic C kernels instead of the optimized GEMM (MSVC still does not support our unix-y style of assembly).

@marcingretl

Hi,
yes: with USE_OPENMP=1 results are pretty much the same. (In addition I re-created *.def and *.lib files.)
On the screenshot: the left-hand side shows msys2 results (gcc, then clang); the right-hand side shows native MS results (clang, then msvc).
[screenshot: compilers]

So it might be related to libgomp, but it may also have something to do with gcc itself.

@martin-frbg (Collaborator)

Interesting, thanks - maybe the Windows 11 scheduler plays a role as well (at least with my Zen 5/5c). In any case, PR #4994 should not hurt - I think.

@AllinCottrell (Author)

Marcin, in your results above it seems that only in the gcc case (top left) is single-threading actually being imposed. In the other cases the supposed "single-threading" makes no difference. In my test code "single" is specified by

omp_set_num_threads(1);

and this doesn't seem to be doing anything in the clang and msvc cases.

@marcingretl

Ahh, you're right!!!
So, forcing single-threaded mode via OMP_NUM_THREADS=1 yields the following results (left-hand side: clang under msys2; right-hand side: native clang, then msvc):
[screenshot: compilers]

@marcingretl

omp_set_num_threads(1);

and this doesn't seem to be doing anything in the clang and msvc cases.

Allin, shouldn't omp_set_num_threads()/omp_get_num_threads() be used inside a #pragma omp parallel block? (That is how I understand the OpenMP standard.)
I'm asking because when I did something like that, both functions started to work for me with clang, though only on Linux.

@AllinCottrell (Author)

My impression is that a #pragma is needed only when launching a team of threads; omp_set_num_threads() can be called from serial code to set the team size used by subsequent parallel regions.
