ILP64 OpenBLAS gives different result from regular OpenBLAS #2779

Closed · Zha0q1 opened this issue Aug 13, 2020 · 26 comments

Zha0q1 (Author) commented Aug 13, 2020

ILP64 OpenBLAS built with:

CXX="g++ -fPIC" CC="gcc -fPIC" make -j DYNAMIC_ARCH=1 DYNAMIC_OLDER=1 USE_OPENMP=1 INTERFACE64=1 BINARY=64 NO_SHARED=0 NO_LAPACK=0

Test program (gemm.cpp):
#include <iostream>
#include <cblas.h>
int main(int argc, char* argv[]) {
    const long n = ((long)1 << 31) - 1;
    std::cout << n << std::endl;
    float* A = new float[n];
    float* B = new float[n];
    float* C = new float[1];
    for (long i = 0; i < n; i++) {
        A[i] = 1;
        B[i] = 1;
    }
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, 1, n, 1.0, A, n, B, 1, 0.0, C, 1);
    std::cout << *C << std::endl;
    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}

Compiled with g++ gemm.cpp -I /usr/local/include/ -lopenblas

On ILP64 OpenBLAS machine:

2147483647
1.93274e+09

On vanilla OpenBLAS machine:

2147483647
2.14748e+09

I think I am linking correctly, because if I change n to ((long)1 << 31) + 1, then:
ILP64

2147483649
1.93274e+09

Vanilla

2147483649
 ** On entry to SGEMM  parameter number  5 had an illegal value
0
brada4 (Contributor) commented Aug 14, 2020

Is it the same CPU core type?
The vanilla result looks like pedantic addition; the other might be FMA, or the other way around.
For completeness you could calculate the same thing in a loop: acc = 0.0; for i in 1..n: acc = acc + a[i] * b[i]

martin-frbg (Collaborator) commented:

Make sure that you use the right flavor of include directory and LD_LIBRARY_PATH for your tests as well. I cannot reproduce your problem on Haswell-class hardware at least.
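
For example, with hypothetical install locations (assuming the ILP64 build was installed under /opt/openblas-ilp64), something like:

g++ gemm.cpp -I /opt/openblas-ilp64/include -L /opt/openblas-ilp64/lib -lopenblas
LD_LIBRARY_PATH=/opt/openblas-ilp64/lib ./a.out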

Zha0q1 (Author) commented Aug 14, 2020

> Is it the same CPU core type?
> The vanilla result looks like pedantic addition; the other might be FMA, or the other way around.
> For completeness you could calculate the same thing in a loop: acc = 0.0; for i in 1..n: acc = acc + a[i] * b[i]

Yes, they are the same machine type.
I tried adding the sum in a loop:

#include <iostream>
#include <cstdint>
#include <cblas.h>
int main(int argc, char* argv[]) {
    const int64_t n = ((int64_t)1 << 31) - 1;
    std::cout << n << std::endl;
    float* A = new float[n];
    float* B = new float[n];
    float* C = new float[1];
    for (long i = 0; i < n; i++) {
        A[i] = 1;
        B[i] = 1;
    }
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, 1, n, 1.0, A, n, B, 1, 0.0, C, 1);
    std::cout << *C << std::endl;

    float acc = 0.0;
    for (long i = 0; i < n; i++)
        acc += A[i] * B[i];
    std::cout << acc << std::endl;

    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}

The results are:
ILP64 OpenBLAS

2147483647
1.93274e+09
1.67772e+07

Vanilla OpenBLAS

2147483647
2.14748e+09
1.67772e+07

> Make sure that you use the right flavor of include directory and LD_LIBRARY_PATH for your tests as well. I cannot reproduce your problem on Haswell-class hardware at least.

Do I need to add any flags beyond g++ gemm.cpp -I /usr/local/include/ -lopenblas? I checked that the executable is linking to the correct libopenblas.so.

brada4 (Contributor) commented Aug 14, 2020

@Zha0q1 you can check the library loaded by ld.so with the ldd command.
What the naive loop reaches is the limit of the "significand"
https://en.wikipedia.org/wiki/Floating-point_arithmetic#IEEE_754:_floating_point_in_modern_computers
i.e. 2^24 (it still rounds properly even near the end), which is about 1.6777216 × 10^7.

Setting acc to double will give a result matching the expected integer solution.
E.g. the Lua programming language has no integers; it relies on doubles representing a particular integer range exactly.

I am puzzled why the ILP64 and "normal" versions do not use the same blocking algorithm and so get similar results, i.e. if you add 64 elements per loop iteration into the accumulator you get further, like in the "normal" example.
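
A minimal sketch of the saturation described above, assuming a plain single-accumulator loop like reference BLAS uses:

#include <iostream>
#include <cstdint>

int main() {
    const int64_t n = (int64_t(1) << 31) - 1;
    float  facc = 0.0f;  // 24-bit significand: stops growing at 2^24 = 16777216
    double dacc = 0.0;   // 53-bit significand: exact for integers up to 2^53
    for (int64_t i = 0; i < n; i++) {
        facc += 1.0f;    // once facc reaches 2^24, adding 1.0f rounds back down
        dacc += 1.0;
    }
    std::cout << facc << std::endl;  // 1.67772e+07
    std::cout << dacc << std::endl;  // 2.14748e+09
    return 0;
}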

Zha0q1 (Author) commented Aug 14, 2020

> @Zha0q1 you can check the library loaded by ld.so with the ldd command.
> What the naive loop reaches is the limit of the "significand"
> https://en.wikipedia.org/wiki/Floating-point_arithmetic#IEEE_754:_floating_point_in_modern_computers
> i.e. 2^24 (it still rounds properly even near the end), which is about 1.6777216 × 10^7.
>
> Setting acc to double will give a result matching the expected integer solution.
> E.g. the Lua programming language has no integers; it relies on doubles representing a particular integer range exactly.
>
> I am puzzled why the ILP64 and "normal" versions do not use the same blocking algorithm and so get similar results, i.e. if you add 64 elements per loop iteration into the accumulator you get further, like in the "normal" example.

After changing acc to double I got:

2147483647
1.93274e+09
2.14748e+09

Also ldd gives:

(nomkl) ubuntu@ip-172-31-10-124:~$ ldd /lib/x86_64-linux-gnu/libdl.so.2
	linux-vdso.so.1 (0x00007ffe471a3000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f7e96daa000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f7e9739f000)

> I am puzzled why the ILP64 and "normal" versions do not use the same blocking algorithm and so get similar results, i.e. if you add 64 elements per loop iteration into the accumulator you get further, like in the "normal" example.

Sorry, I still don't quite understand. Is there a way to fix this?

brada4 (Contributor) commented Aug 14, 2020

That's a property of floating-point numbers: depending on the algorithm, you may saturate the significand or get the arithmetically correct result.
There is nothing to fix; any result between those is "correct". Use a type with enough significand if you intend to sum a wide range of values: double served in this case, and quad would serve for int64.

Zha0q1 (Author) commented Aug 14, 2020

> That's a property of floating-point numbers: depending on the algorithm, you may saturate the significand or get the arithmetically correct result.
> There is nothing to fix; any result between those is "correct". Use a type with enough significand if you intend to sum a wide range of values: double served in this case, and quad would serve for int64.

But when I use vanilla OpenBLAS it does give the arithmetically correct result on the same input. Is there a way to make ILP64 OpenBLAS consistent with that?

brada4 (Contributor) commented Aug 14, 2020

Reference BLAS gives the ~16M result; that's the absolute truth in this case.
Using dgemm via scripting gets the full int32 maximum:

R> a<-matrix(1,2**31-1,1)
R> crossprod(a)

(needs 32+GB RAM to get through)

brada4 (Contributor) commented Aug 14, 2020

The fix you submitted in numpy gets the arithmetic result; its C++ equivalent is using dgemm.
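
A sketch of that dgemm equivalent, assuming the same ILP64 build and 1x1 GEMM shape as the repro above (like the R example, it needs 32+ GB of RAM):

#include <iostream>
#include <cstdint>
#include <cblas.h>

int main() {
    const int64_t n = (int64_t(1) << 31) - 1;
    double* A = new double[n];
    double* B = new double[n];
    double* C = new double[1];
    for (int64_t i = 0; i < n; i++) {
        A[i] = 1.0;
        B[i] = 1.0;
    }
    // Same dot-product-as-GEMM, but double's 53-bit significand holds every
    // integer up to 2^53 exactly, so the sum of 2^31 - 1 ones is exact.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, 1, n, 1.0, A, n, B, 1, 0.0, C, 1);
    std::cout << *C << std::endl;  // 2.14748e+09
    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}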

Zha0q1 (Author) commented Aug 14, 2020

> The fix you submitted in numpy gets the arithmetic result; its C++ equivalent is using dgemm.

Yeah right, but with vanilla OpenBLAS even sgemm gives the correct result.

brada4 (Contributor) commented Aug 14, 2020

> Yeah right, but with vanilla OpenBLAS even sgemm gives the correct result.

Yep, but I don't have a clue how ;-)

Zha0q1 (Author) commented Aug 14, 2020

> Make sure that you use the right flavor of include directory and LD_LIBRARY_PATH for your tests as well. I cannot reproduce your problem on Haswell-class hardware at least.

Are you trying this with the ILP64 build?

brada4 (Contributor) commented Aug 14, 2020

Actually sgemv clamps at the same point, but sdot gets a bit further thanks to the blocking it implies.

Zha0q1 (Author) commented Aug 14, 2020

> Actually sgemv clamps at the same point, but sdot gets a bit further thanks to the blocking it implies.

Ohh, I see what you mean now, so it's similar to loop unrolling? So can we control blocking by some means...?

brada4 (Contributor) commented Aug 14, 2020

> control blocking

Not really; it is just unrolled once in the source file. Knowingly deviating from the reference is neither a fix nor an improvement, and LAPACK solvers more or less expect close-to-standard behaviour in such border cases.
Namely, dgemm fits your bill just fine.
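
A sketch of why unrolling only delays the saturation rather than removing it, using a hypothetical 64-way split of the accumulator (not OpenBLAS's actual kernel code):

#include <iostream>
#include <cstdint>

int main() {
    const int64_t n = (int64_t(1) << 31) - 1;
    const int K = 64;             // illustrative unroll width
    float partial[K] = {0.0f};
    for (int64_t i = 0; i < n; i++)
        partial[i % K] += 1.0f;   // stands in for a[i] * b[i] with a == b == 1
    double total = 0.0;           // combine partials in double to show their true sum
    for (int k = 0; k < K; k++)
        total += partial[k];
    // Each partial still saturates at 2^24, so the total tops out near
    // 64 * 2^24 ~ 1.07e+09: further than a single accumulator's 1.68e+07,
    // but still short of the exact 2.14748e+09.
    std::cout << total << std::endl;
    return 0;
}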

brada4 (Contributor) commented Aug 14, 2020

Report from a Fedora EPEL int64 0.3.3 build on Sandy Bridge: sixteen megs.

access2rohit commented Aug 14, 2020

> There is nothing to fix; any result between those is "correct". Use a type with enough significand if you intend to sum a wide range of values: double served in this case, and quad would serve for int64.

@brada4 I don't understand how you can suggest that we will lose precision when int64 indexing of the array/tensor is used and the data type stays the same.

How can changing the indexing type affect the precision of the output?

brada4 (Contributor) commented Aug 14, 2020

The expected precision is that of standard netlib BLAS, with a single addition per loop. If you need to sum values more than significand+1 bits apart, you need a bigger floating-point variable with an obviously bigger significand range.
Normally you lose N bits of precision over 2^N-ish floating-point operations, so this case is badly marginal: by the point of 2^24 computations, not even a single bit can be trusted while using float.
Probably that count (EDIT: floating ops per output dot) could be computed on entry to the BLAS function and some sort of warning given.

Take another mxnet-ish example: we need the dot product of a huge vector, ok, we get one.
Then we sort it -> we get quite another result.
Then we sort it in reverse -> yet another.

I think there is plainly no way to increase the robustness of the naive reference algorithm in the face of marginal input. The additional precision coming from FMA or blocking is surprising, but don't trust it too much. It won't work that way when actual adjacent inputs are a significand range apart.
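
A small sketch of that order dependence, with a made-up input of one large value among ten million small ones:

#include <iostream>
#include <vector>

int main() {
    std::vector<float> v(10000001, 1.0f);
    v[0] = 1e8f;                  // one large value in front of ten million 1s
    float small_first = 0.0f, large_first = 0.0f;
    for (auto i = v.size(); i-- > 0; )
        small_first += v[i];      // 1s accumulate first, then 1e8 is added
    for (float x : v)
        large_first += x;         // each later +1 falls below 1e8's rounding step
    std::cout << small_first << std::endl;  // 1.1e+08
    std::cout << large_first << std::endl;  // 1e+08
    return 0;
}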

martin-frbg (Collaborator) commented:

Folks, can we get some information on what hardware (OpenBLAS TARGET) you are running this code on? As I wrote earlier today, I get identical results on Kaby Lake (Haswell TARGET) when I set both the include path and the LD_LIBRARY_PATH to the install location of the desired OpenBLAS build. (Also make sure that you actually include the cblas.h created by make install in the destination directory; do not simply take the cblas.h from the source directory.)

Zha0q1 (Author) commented Aug 14, 2020

> Folks, can we get some information on what hardware (OpenBLAS TARGET) you are running this code on? As I wrote earlier today, I get identical results on Kaby Lake (Haswell TARGET) when I set both the include path and the LD_LIBRARY_PATH to the install location of the desired OpenBLAS build. (Also make sure that you actually include the cblas.h created by make install in the destination directory; do not simply take the cblas.h from the source directory.)

Right. I am using an EC2 instance with 64 cores. With cat /proc/cpuinfo I got (for one of the cores):

processor	: 18
vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
stepping	: 1
microcode	: 0xb000038
cpu MHz		: 2732.152
cache size	: 46080 KB
physical id	: 1
siblings	: 32
core id		: 2
cpu cores	: 16
apicid		: 68
initial apicid	: 68
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt ida

brada4 (Contributor) commented Aug 15, 2020

Further back: ldd has to be run against the binary executable whose imports you want to inspect, not against the loader itself.
Actually, I LD_PRELOAD-ed the distro-shipped older OpenBLAS while linking against the distro -lblas.
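
(That is, assuming the compiled test binary is ./a.out, something like ldd ./a.out | grep openblas shows which libopenblas.so actually gets loaded.)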

martin-frbg (Collaborator) commented:

I bet the difference is not 32-bit vs 64-bit integers, but a different number of cores leading to different splitting of the workload and slight differences in rounding. At least I see no difference on a 48-core AMD Epyc (basically using the same HASWELL kernels as your Broadwell-generation Xeon), and around 0.1e9 difference on a pair of 44- and 96-core ARMV8 hosts on drone.io, where I have not yet managed to get all combinations of possible hardware configurations represented in individual CI runs.

Zha0q1 (Author) commented Aug 15, 2020

> I bet the difference is not 32-bit vs 64-bit integers, but a different number of cores leading to different splitting of the workload and slight differences in rounding. At least I see no difference on a 48-core AMD Epyc (basically using the same HASWELL kernels as your Broadwell-generation Xeon), and around 0.1e9 difference on a pair of 44- and 96-core ARMV8 hosts on drone.io, where I have not yet managed to get all combinations of possible hardware configurations represented in individual CI runs.

I was also able to reproduce the same difference on the same machine, with the only difference being the 32-bit vs 64-bit OpenBLAS builds.

martin-frbg (Collaborator) commented Aug 15, 2020

Are you sure that you #included the right version of cblas.h and linked the appropriate libopenblas in each case? Unfortunately I do not have access to any bigger x86_64 hardware than the Epyc 7401 at drone.io (or rather packet.com), but I do not see offhand why going to 64 cores would make a difference (nor why the larger addressable range would matter).
EDIT: results for the ARMV8 are:
96-core ThunderX: w/o INTERFACE64 2.27275e9, with INTERFACE64=1 2.27275e9
44-core (Ampere 8180?): 2.14748e9, with INTERFACE64=1 2.14748e9
(The two use different SGEMM kernels.) Error accumulation can be a bitch; use DGEMM if you need the precision.

Zha0q1 (Author) commented Aug 20, 2020

Update:
I did more tests and found that if I build both LP64 and ILP64 OpenBLAS dev from source (with all other configs the same), then the results are identical.

The difference I described earlier comes from the fact that my EC2 instance ships with a special, possibly optimized, OpenBLAS binary.

EC2 OpenBLAS (LP64, int32): no loss of precision on my example, 2.14748e+09
OpenBLAS dev (ILP64, int64): loss of precision, 1.93274e+09
OpenBLAS dev (LP64, int32): loss of precision, 1.93274e+09

The EC2 OpenBLAS version is:

ubuntu@ip-172-31-11-14:~$ ll /usr/local/lib/ |grep blas
lrwxrwxrwx  1 root root         30 Aug  7 01:38 libopenblas.a -> libopenblas_haswellp-r0.2.20.a
lrwxrwxrwx  1 root root         31 Aug  7 01:38 libopenblas.so -> libopenblas_haswellp-r0.2.20.so*
lrwxrwxrwx  1 root root         31 Aug  7 01:38 libopenblas.so.0 -> libopenblas_haswellp-r0.2.20.so*
-rw-r--r--  1 root root   28498048 Aug  7 01:12 libopenblas_haswellp-r0.2.20.a
-rwxr-xr-x  1 root root   14344848 Aug  7 01:13 libopenblas_haswellp-r0.2.20.so*

My OpenBLAS dev build is:

-rw-r--r--  1 root root   65650108 Aug 15 21:38 libopenblasp-r0.3.10.dev.a
-rwxr-xr-x  1 root root   39439000 Aug 15 21:38 libopenblasp-r0.3.10.dev.so*

Do you think this numerical difference is due to a different OpenBLAS version, a different build config, or something else?

martin-frbg (Collaborator) commented:

The SGEMM kernel for Haswell targets was changed earlier this year (#2361) to increase performance; perhaps this introduced different rounding due to FMA operations. You could try reverting that change (undo the changes in both kernel/x86_64/KERNEL.HASWELL and param.h) or do a quick test with 0.3.7.
