performance issue with dnrm2 on zen2 processor (AMD ryzen7 3700x) #4188

elc42 · 2023-08-08T17:16:05Z

Hello,
We notice a strange performance of dnrm2 compared to ddot: the performance of dnrm2 is very low compared to ddot.
This is observed with v0.3.23 and v0.3.18, other versions weren't tested.
Below 2 source files that reproduce the problem (sorry for txt extension)
Rename cxxopts.txt and TestOpenBLASNrm2Anomaly.txt to cxxopts.hpp and TestOpenBLASNrm2Anomaly.cpp .
you should change:
const double HighResTimer::m_sys_freq_mhz = 3600;
with the correct frequency for your system.

cxxopts.txt
TestOpenBLASNrm2Anomaly.txt

example:
./a.out --size-range 1000,10000,1000 --ntest 10000

op = ddot

number of test = 10000

n;perf

1000;25045.4
2000;23645.3
3000;16745.6
4000;16813.4
5000;16731.8
6000;17073.6
7000;16958.8
8000;17054.1
9000;16946.9
10000;17089.1

./a.out --nrm2 --size-range 1000,10000,1000 --ntest 10000

op = nrm2

number of test = 10000

n;perf

1000;4342.59
2000;4336.29
3000;4338.41
4000;4342.47
5000;4343.73
6000;4336.11
7000;4340.86
8000;4341.84
9000;4339.63
10000;4337.51

martin-frbg · 2023-08-08T17:50:59Z

ddot on x86_64 is parallelized, dnrm2 currently is not - what do the timings look like if you force both to run single-threaded via OPENBLAS_NUM_THREADS=1 ?

elc42 · 2023-08-09T01:00:49Z

Hello,
setting OPENBLAS_NUM_THREADS=1 don't change anything.

XiWeiGu · 2023-08-09T03:53:25Z

Hello, setting OPENBLAS_NUM_THREADS=1 don't change anything.

When the size of the array is less than or equal to 10000, ddot cannot be parallelized on x86_64, and multi-threading cannot be enabled. The relevant code is in the file 'kernel/x86_64/ddot.c'

        if (inc_x == 0 || inc_y == 0 || n <= 10000)
                nthreads = 1;
        else
                nthreads = num_cpu_avail(1);

So, this should not be related to parallelization.

martin-frbg · 2023-08-09T07:19:57Z

Yes, I misread the array size, Another factor is that dnrm2 on x86_64 is a very old assembly kernel from the original GotoBLAS of 15 years ago, while ddot makes use of AVX2. But nrm2 is due for an overhaul anyway, as the reference implementation of BLAS has updated its definition of ?NRM2 recently. (#4130 #4186)

XiWeiGu · 2023-08-09T08:48:55Z

Yes, I misread the array size, Another factor is that dnrm2 on x86_64 is a very old assembly kernel from the original GotoBLAS of 15 years ago, while ddot makes use of AVX2.

I agree with your statement that the reason for the slower performance of dnrm2 compared to ddot is likely due to the outdated optimization code of dnrm2.

But nrm2 is due for an overhaul anyway, as the reference implementation of BLAS has updated its definition of ?NRM2 recently. (#4130 #4186)

This is truly an urgent and substantial task. If we make changes to the C implementation, it implies that all platform optimizations will become ineffective. What is the current progress on this matter? I have some experience with platform optimizations, which might be helpful.

martin-frbg · 2023-08-09T13:38:41Z

I'm struggling to get anything done this week, so far I have only had a quick look at adding the "negative INCX" support to the existing codes as suggested as a quick workaround for the SciPy folks (seems doable but is essentially putting makeup on a pig). And I just checked that the existing "generic C" implementation has even poorer performance than the old assembly (that is at least an unrolled loop).

martin-frbg mentioned this issue Aug 9, 2023

Prepare the interface for INCX < 0 in the new NRM2 implementation from BLAS 3.10 #4189

Merged

martin-frbg mentioned this issue Nov 13, 2023

Add tests for NRM2 that involve scaling (from Reference-LAPACK PR 925) #4313

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

performance issue with dnrm2 on zen2 processor (AMD ryzen7 3700x) #4188

performance issue with dnrm2 on zen2 processor (AMD ryzen7 3700x) #4188

elc42 commented Aug 8, 2023

martin-frbg commented Aug 8, 2023

elc42 commented Aug 9, 2023

XiWeiGu commented Aug 9, 2023

martin-frbg commented Aug 9, 2023 •

edited

Loading

XiWeiGu commented Aug 9, 2023

martin-frbg commented Aug 9, 2023

performance issue with dnrm2 on zen2 processor (AMD ryzen7 3700x) #4188

performance issue with dnrm2 on zen2 processor (AMD ryzen7 3700x) #4188

Comments

elc42 commented Aug 8, 2023

op = ddot

number of test = 10000

n;perf

op = nrm2

number of test = 10000

n;perf

martin-frbg commented Aug 8, 2023

elc42 commented Aug 9, 2023

XiWeiGu commented Aug 9, 2023

martin-frbg commented Aug 9, 2023 • edited Loading

XiWeiGu commented Aug 9, 2023

martin-frbg commented Aug 9, 2023

martin-frbg commented Aug 9, 2023 •

edited

Loading