Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

performance issue with dnrm2 on zen2 processor (AMD ryzen7 3700x) #4188

Open
elc42 opened this issue Aug 8, 2023 · 6 comments
Open

performance issue with dnrm2 on zen2 processor (AMD ryzen7 3700x) #4188

elc42 opened this issue Aug 8, 2023 · 6 comments

Comments

@elc42
Copy link

elc42 commented Aug 8, 2023

Hello,
We notice a strange performance of dnrm2 compared to ddot: the performance of dnrm2 is very low compared to ddot.
This is observed with v0.3.23 and v0.3.18, other versions weren't tested.
Below 2 source files that reproduce the problem (sorry for txt extension)
Rename cxxopts.txt and TestOpenBLASNrm2Anomaly.txt to cxxopts.hpp and TestOpenBLASNrm2Anomaly.cpp .
you should change:
const double HighResTimer::m_sys_freq_mhz = 3600;
with the correct frequency for your system.

cxxopts.txt
TestOpenBLASNrm2Anomaly.txt

example:
./a.out --size-range 1000,10000,1000 --ntest 10000

op = ddot

number of test = 10000

n;perf

1000;25045.4
2000;23645.3
3000;16745.6
4000;16813.4
5000;16731.8
6000;17073.6
7000;16958.8
8000;17054.1
9000;16946.9
10000;17089.1

./a.out --nrm2 --size-range 1000,10000,1000 --ntest 10000

op = nrm2

number of test = 10000

n;perf

1000;4342.59
2000;4336.29
3000;4338.41
4000;4342.47
5000;4343.73
6000;4336.11
7000;4340.86
8000;4341.84
9000;4339.63
10000;4337.51

@martin-frbg
Copy link
Collaborator

ddot on x86_64 is parallelized, dnrm2 currently is not - what do the timings look like if you force both to run single-threaded via OPENBLAS_NUM_THREADS=1 ?

@elc42
Copy link
Author

elc42 commented Aug 9, 2023

Hello,
setting OPENBLAS_NUM_THREADS=1 don't change anything.

@XiWeiGu
Copy link
Contributor

XiWeiGu commented Aug 9, 2023

Hello, setting OPENBLAS_NUM_THREADS=1 don't change anything.

When the size of the array is less than or equal to 10000, ddot cannot be parallelized on x86_64, and multi-threading cannot be enabled. The relevant code is in the file 'kernel/x86_64/ddot.c'

        if (inc_x == 0 || inc_y == 0 || n <= 10000)
                nthreads = 1;
        else
                nthreads = num_cpu_avail(1);

So, this should not be related to parallelization.

@martin-frbg
Copy link
Collaborator

martin-frbg commented Aug 9, 2023

Yes, I misread the array size, Another factor is that dnrm2 on x86_64 is a very old assembly kernel from the original GotoBLAS of 15 years ago, while ddot makes use of AVX2. But nrm2 is due for an overhaul anyway, as the reference implementation of BLAS has updated its definition of ?NRM2 recently. (#4130 #4186)

@XiWeiGu
Copy link
Contributor

XiWeiGu commented Aug 9, 2023

Yes, I misread the array size, Another factor is that dnrm2 on x86_64 is a very old assembly kernel from the original GotoBLAS of 15 years ago, while ddot makes use of AVX2.

I agree with your statement that the reason for the slower performance of dnrm2 compared to ddot is likely due to the outdated optimization code of dnrm2.

But nrm2 is due for an overhaul anyway, as the reference implementation of BLAS has updated its definition of ?NRM2 recently. (#4130 #4186)

This is truly an urgent and substantial task. If we make changes to the C implementation, it implies that all platform optimizations will become ineffective. What is the current progress on this matter? I have some experience with platform optimizations, which might be helpful.

@martin-frbg
Copy link
Collaborator

I'm struggling to get anything done this week, so far I have only had a quick look at adding the "negative INCX" support to the existing codes as suggested as a quick workaround for the SciPy folks (seems doable but is essentially putting makeup on a pig). And I just checked that the existing "generic C" implementation has even poorer performance than the old assembly (that is at least an unrolled loop).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants