-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
performance issue with dnrm2 on zen2 processor (AMD ryzen7 3700x) #4188
Comments
ddot on x86_64 is parallelized, dnrm2 currently is not - what do the timings look like if you force both to run single-threaded via OPENBLAS_NUM_THREADS=1 ? |
Hello, |
When the size of the array is less than or equal to 10000, ddot cannot be parallelized on x86_64, and multi-threading cannot be enabled. The relevant code is in the file 'kernel/x86_64/ddot.c'
So, this should not be related to parallelization. |
Yes, I misread the array size, Another factor is that dnrm2 on x86_64 is a very old assembly kernel from the original GotoBLAS of 15 years ago, while ddot makes use of AVX2. But nrm2 is due for an overhaul anyway, as the reference implementation of BLAS has updated its definition of ?NRM2 recently. (#4130 #4186) |
I agree with your statement that the reason for the slower performance of dnrm2 compared to ddot is likely due to the outdated optimization code of dnrm2.
This is truly an urgent and substantial task. If we make changes to the C implementation, it implies that all platform optimizations will become ineffective. What is the current progress on this matter? I have some experience with platform optimizations, which might be helpful. |
I'm struggling to get anything done this week, so far I have only had a quick look at adding the "negative INCX" support to the existing codes as suggested as a quick workaround for the SciPy folks (seems doable but is essentially putting makeup on a pig). And I just checked that the existing "generic C" implementation has even poorer performance than the old assembly (that is at least an unrolled loop). |
Hello,
We notice a strange performance of dnrm2 compared to ddot: the performance of dnrm2 is very low compared to ddot.
This is observed with v0.3.23 and v0.3.18, other versions weren't tested.
Below 2 source files that reproduce the problem (sorry for txt extension)
Rename cxxopts.txt and TestOpenBLASNrm2Anomaly.txt to cxxopts.hpp and TestOpenBLASNrm2Anomaly.cpp .
you should change:
const double HighResTimer::m_sys_freq_mhz = 3600;
with the correct frequency for your system.
cxxopts.txt
TestOpenBLASNrm2Anomaly.txt
example:
./a.out --size-range 1000,10000,1000 --ntest 10000
op = ddot
number of test = 10000
n;perf
1000;25045.4
2000;23645.3
3000;16745.6
4000;16813.4
5000;16731.8
6000;17073.6
7000;16958.8
8000;17054.1
9000;16946.9
10000;17089.1
./a.out --nrm2 --size-range 1000,10000,1000 --ntest 10000
op = nrm2
number of test = 10000
n;perf
1000;4342.59
2000;4336.29
3000;4338.41
4000;4342.47
5000;4343.73
6000;4336.11
7000;4340.86
8000;4341.84
9000;4339.63
10000;4337.51
The text was updated successfully, but these errors were encountered: