ILP64 OpenBLAS gives different result from regular OpenBLAS #2779

Closed · Zha0q1 opened this issue Aug 13, 2020 · 26 comments

Zha0q1 (Author) commented Aug 13, 2020

ILP64 OpenBLAS built with:

CXX="g++ -fPIC" CC="gcc -fPIC" make -j DYNAMIC_ARCH=1 DYNAMIC_OLDER=1 USE_OPENMP=1 INTERFACE64=1 BINARY=64 NO_SHARED=0 NO_LAPACK=0

Test program (gemm.cpp):
#include <iostream>
#include <cblas.h>
int main(int argc, char* argv[]) {
    const long n = ((long)1 << 31) - 1;
    std::cout << n << std::endl;
    float* A = new float[n];
    float* B = new float[n];
    float* C = new float[1];
    for (long i = 0; i < n; i++) {
        A[i] = 1;
        B[i] = 1;
    }
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, 1, n, 1.0, A, n, B, 1, 0.0, C, 1);
    std::cout << *C << std::endl;
    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}

Compiled with g++ gemm.cpp -I /usr/local/include/ -lopenblas

On ILP64 OpenBLAS machine:

2147483647
1.93274e+09

On vanilla OpenBLAS machine:

2147483647
2.14748e+09

I think I am linking correctly, because if I change n to ((long)1 << 31) + 1, then:
ILP64

2147483649
1.93274e+09

Vanilla

2147483649
 ** On entry to SGEMM  parameter number  5 had an illegal value
0
brada4 (Contributor) commented Aug 14, 2020

Is it the same CPU core type?
The vanilla result looks like pedantic addition; the other might be FMA, or the other way around.
For completeness you could calculate the same thing in a loop: acc = 0.0; for i in 1..n: acc = acc + a[i] * b[i]

martin-frbg (Collaborator) commented:

Make sure that you use the right flavor of include directory and LD_LIBRARY_PATH for your tests as well. I cannot reproduce your problem on Haswell-class hardware at least.
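
For example, with hypothetical install locations (assuming the ILP64 build was installed under /opt/openblas-ilp64), something like:

g++ gemm.cpp -I /opt/openblas-ilp64/include -L /opt/openblas-ilp64/lib -lopenblas
LD_LIBRARY_PATH=/opt/openblas-ilp64/lib ./a.out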

Zha0q1 (Author) commented Aug 14, 2020

> Is it the same CPU core type?
> The vanilla result looks like pedantic addition; the other might be FMA, or the other way around.
> For completeness you could calculate the same thing in a loop: acc = 0.0; for i in 1..n: acc = acc + a[i] * b[i]

Yes, they are the same machine type.
I tried adding the sum in a loop:

#include <iostream>
#include <cstdint>
#include <cblas.h>
int main(int argc, char* argv[]) {
    const int64_t n = ((int64_t)1 << 31) - 1;
    std::cout << n << std::endl;
    float* A = new float[n];
    float* B = new float[n];
    float* C = new float[1];
    for (long i = 0; i < n; i++) {
        A[i] = 1;
        B[i] = 1;
    }
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, 1, n, 1.0, A, n, B, 1, 0.0, C, 1);
    std::cout << *C << std::endl;

    float acc = 0.0;
    for (long i = 0; i < n; i++)
        acc += A[i] * B[i];
    std::cout << acc << std::endl;

    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}

The results are:
ILP64 OpenBLAS

2147483647
1.93274e+09
1.67772e+07

Vanilla OpenBLAS

2147483647
2.14748e+09
1.67772e+07

> Make sure that you use the right flavor of include directory and LD_LIBRARY_PATH for your tests as well. I cannot reproduce your problem on Haswell-class hardware at least.

Do I need to add any flags beyond g++ gemm.cpp -I /usr/local/include/ -lopenblas? I checked that the executable is linking to the correct libopenblas.so.

brada4 (Contributor) commented Aug 14, 2020

@Zha0q1 you can check the library loaded by ld.so with the ldd command.
What the naive loop reaches is the limit of the "significand"
https://en.wikipedia.org/wiki/Floating-point_arithmetic#IEEE_754:_floating_point_in_modern_computers
i.e. 2^24 (it still rounds properly even near the end), which is about 1.6777216 × 10^7.

Setting acc to double will give a result matching the expected integer solution.
E.g. the Lua programming language has no integers; it relies on doubles representing a particular integer range exactly.

I am puzzled why the ILP64 and "normal" versions do not use the same blocking algorithm and so get similar results, i.e. if you add 64 elements per loop iteration into the accumulator you get further, like in the "normal" example.
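
A minimal sketch of the saturation described above, assuming a plain single-accumulator loop like reference BLAS uses:

#include <iostream>
#include <cstdint>

int main() {
    const int64_t n = (int64_t(1) << 31) - 1;
    float  facc = 0.0f;  // 24-bit significand: stops growing at 2^24 = 16777216
    double dacc = 0.0;   // 53-bit significand: exact for integers up to 2^53
    for (int64_t i = 0; i < n; i++) {
        facc += 1.0f;    // once facc reaches 2^24, adding 1.0f rounds back down
        dacc += 1.0;
    }
    std::cout << facc << std::endl;  // 1.67772e+07
    std::cout << dacc << std::endl;  // 2.14748e+09
    return 0;
}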

Zha0q1 (Author) commented Aug 14, 2020

> @Zha0q1 you can check the library loaded by ld.so with the ldd command.
> What the naive loop reaches is the limit of the "significand"
> https://en.wikipedia.org/wiki/Floating-point_arithmetic#IEEE_754:_floating_point_in_modern_computers
> i.e. 2^24 (it still rounds properly even near the end), which is about 1.6777216 × 10^7.
>
> Setting acc to double will give a result matching the expected integer solution.
> E.g. the Lua programming language has no integers; it relies on doubles representing a particular integer range exactly.
>
> I am puzzled why the ILP64 and "normal" versions do not use the same blocking algorithm and so get similar results, i.e. if you add 64 elements per loop iteration into the accumulator you get further, like in the "normal" example.

After changing acc to double I got:

2147483647
1.93274e+09
2.14748e+09

Also ldd gives:

(nomkl) ubuntu@ip-172-31-10-124:~$ ldd /lib/x86_64-linux-gnu/libdl.so.2
	linux-vdso.so.1 (0x00007ffe471a3000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f7e96daa000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f7e9739f000)

> I am puzzled why the ILP64 and "normal" versions do not use the same blocking algorithm and so get similar results, i.e. if you add 64 elements per loop iteration into the accumulator you get further, like in the "normal" example.

Sorry, I still don't quite understand. Is there a way to fix this?

brada4 (Contributor) commented Aug 14, 2020

That's a property of floating-point numbers: depending on the algorithm, you may saturate the significand or get the arithmetically correct result.
There is nothing to fix; any result between those is "correct". Use a type with enough significand if you intend to sum a wide range of values: double served in this case, and quad would serve for int64.

Zha0q1 (Author) commented Aug 14, 2020

> That's a property of floating-point numbers: depending on the algorithm, you may saturate the significand or get the arithmetically correct result.
> There is nothing to fix; any result between those is "correct". Use a type with enough significand if you intend to sum a wide range of values: double served in this case, and quad would serve for int64.

But when I use vanilla OpenBLAS it does give the arithmetically correct result on the same input. Is there a way to make ILP64 OpenBLAS consistent with that?

brada4 (Contributor) commented Aug 14, 2020

Reference BLAS gives the ~16M result; that's the absolute truth in this case.
Using dgemm via scripting gets the full int32 maximum:

R> a<-matrix(1,2**31-1,1)
R> crossprod(a)

(needs 32+GB RAM to get through)

brada4 (Contributor) commented Aug 14, 2020

The fix you submitted in numpy gets the arithmetic result; its C++ equivalent is using dgemm.
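
A sketch of that dgemm equivalent, assuming the same ILP64 build and 1x1 GEMM shape as the repro above (like the R example, it needs 32+ GB of RAM):

#include <iostream>
#include <cstdint>
#include <cblas.h>

int main() {
    const int64_t n = (int64_t(1) << 31) - 1;
    double* A = new double[n];
    double* B = new double[n];
    double* C = new double[1];
    for (int64_t i = 0; i < n; i++) {
        A[i] = 1.0;
        B[i] = 1.0;
    }
    // Same dot-product-as-GEMM, but double's 53-bit significand holds every
    // integer up to 2^53 exactly, so the sum of 2^31 - 1 ones is exact.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, 1, n, 1.0, A, n, B, 1, 0.0, C, 1);
    std::cout << *C << std::endl;  // 2.14748e+09
    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}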

Zha0q1 (Author) commented Aug 14, 2020

> The fix you submitted in numpy gets the arithmetic result; its C++ equivalent is using dgemm.

Yeah right, but with vanilla OpenBLAS even sgemm gives the correct result.

brada4 (Contributor) commented Aug 14, 2020

> Yeah right, but with vanilla OpenBLAS even sgemm gives the correct result.

Yep, but I don't have a clue how ;-)

Zha0q1 (Author) commented Aug 14, 2020

> Make sure that you use the right flavor of include directory and LD_LIBRARY_PATH for your tests as well. I cannot reproduce your problem on Haswell-class hardware at least.

Are you trying this with the ILP64 build?

brada4 (Contributor) commented Aug 14, 2020

Actually sgemv clamps at the same point, but sdot gets a bit further thanks to the blocking it implies.

Zha0q1 (Author) commented Aug 14, 2020

> Actually sgemv clamps at the same point, but sdot gets a bit further thanks to the blocking it implies.

Ohh, I see what you mean now, so it's similar to loop unrolling? So can we control blocking by some means...?

brada4 (Contributor) commented Aug 14, 2020

> control blocking

Not really; it is just unrolled once in the source file. Knowingly deviating from the reference is neither a fix nor an improvement, and LAPACK solvers more or less expect close-to-standard behaviour in such border cases.
Namely, dgemm fits your bill just fine.
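
A sketch of why unrolling only delays the saturation rather than removing it, using a hypothetical 64-way split of the accumulator (not OpenBLAS's actual kernel code):

#include <iostream>
#include <cstdint>

int main() {
    const int64_t n = (int64_t(1) << 31) - 1;
    const int K = 64;             // illustrative unroll width
    float partial[K] = {0.0f};
    for (int64_t i = 0; i < n; i++)
        partial[i % K] += 1.0f;   // stands in for a[i] * b[i] with a == b == 1
    double total = 0.0;           // combine partials in double to show their true sum
    for (int k = 0; k < K; k++)
        total += partial[k];
    // Each partial still saturates at 2^24, so the total tops out near
    // 64 * 2^24 ~ 1.07e+09: further than a single accumulator's 1.68e+07,
    // but still short of the exact 2.14748e+09.
    std::cout << total << std::endl;
    return 0;
}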

brada4 (Contributor) commented Aug 14, 2020

Report from a Fedora EPEL int64 0.3.3 build on Sandy Bridge: sixteen megs.

access2rohit commented Aug 14, 2020

> There is nothing to fix; any result between those is "correct". Use a type with enough significand if you intend to sum a wide range of values: double served in this case, and quad would serve for int64.

@brada4 I don't understand how you can suggest that we will lose precision when int64 indexing of the array/tensor is used and the data type stays the same.

How can changing the indexing type affect the precision of the output?

brada4 (Contributor) commented Aug 14, 2020

The expected precision is that of standard netlib BLAS, with a single addition per loop. If you need to sum values more than significand+1 bits apart, you need a bigger floating-point variable with an obviously bigger significand range.
Normally you lose N bits of precision over 2^N-ish floating-point operations, so this case is badly marginal: by the point of 2^24 computations, not even a single bit can be trusted while using float.
Probably that count (EDIT: floating ops per output dot) could be computed on entry to the BLAS function and some sort of warning given.

Take another mxnet-ish example: we need the dot product of a huge vector, ok, we get one.
Then we sort it -> we get quite another result.
Then we sort it in reverse -> yet another.

I think there is plainly no way to increase the robustness of the naive reference algorithm in the face of marginal input. The additional precision coming from FMA or blocking is surprising, but don't trust it too much. It won't work that way when actual adjacent inputs are a significand range apart.
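
A small sketch of that order dependence, with a made-up input of one large value among ten million small ones:

#include <iostream>
#include <vector>

int main() {
    std::vector<float> v(10000001, 1.0f);
    v[0] = 1e8f;                  // one large value in front of ten million 1s
    float small_first = 0.0f, large_first = 0.0f;
    for (auto i = v.size(); i-- > 0; )
        small_first += v[i];      // 1s accumulate first, then 1e8 is added
    for (float x : v)
        large_first += x;         // each later +1 falls below 1e8's rounding step
    std::cout << small_first << std::endl;  // 1.1e+08
    std::cout << large_first << std::endl;  // 1e+08
    return 0;
}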

martin-frbg (Collaborator) commented:

Folks, can we get some information on what hardware (OpenBLAS TARGET) you are running this code on? As I wrote earlier today, I get identical results on Kaby Lake (Haswell TARGET) when I set both the include path and the LD_LIBRARY_PATH to the install location of the desired OpenBLAS build. (Also make sure that you actually include the cblas.h created by make install in the destination directory; do not simply take the cblas.h from the source directory.)

Zha0q1 (Author) commented Aug 14, 2020

> Folks, can we get some information on what hardware (OpenBLAS TARGET) you are running this code on? As I wrote earlier today, I get identical results on Kaby Lake (Haswell TARGET) when I set both the include path and the LD_LIBRARY_PATH to the install location of the desired OpenBLAS build. (Also make sure that you actually include the cblas.h created by make install in the destination directory; do not simply take the cblas.h from the source directory.)

Right. I am using an EC2 instance with 64 cores. With cat /proc/cpuinfo I got (for one of the cores):

processor	: 18
vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
stepping	: 1
microcode	: 0xb000038
cpu MHz		: 2732.152
cache size	: 46080 KB
physical id	: 1
siblings	: 32
core id		: 2
cpu cores	: 16
apicid		: 68
initial apicid	: 68
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt ida

brada4 (Contributor) commented Aug 15, 2020

Further back: ldd has to be run against the binary executable whose imports you want to inspect, not against the loader itself.
Actually, I LD_PRELOAD-ed the distro-shipped older OpenBLAS while linking against the distro -lblas.
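
(That is, assuming the compiled test binary is ./a.out, something like ldd ./a.out | grep openblas shows which libopenblas.so actually gets loaded.)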

martin-frbg (Collaborator) commented:

I bet the difference is not 32-bit vs 64-bit integers, but a different number of cores leading to different splitting of the workload and slight differences in rounding. At least I see no difference on a 48-core AMD Epyc (basically using the same HASWELL kernels as your Broadwell-generation Xeon), and around 0.1e9 difference on a pair of 44- and 96-core ARMV8 hosts on drone.io, where I have not yet managed to get all combinations of possible hardware configurations represented in individual CI runs.

Zha0q1 (Author) commented Aug 15, 2020

> I bet the difference is not 32-bit vs 64-bit integers, but a different number of cores leading to different splitting of the workload and slight differences in rounding. At least I see no difference on a 48-core AMD Epyc (basically using the same HASWELL kernels as your Broadwell-generation Xeon), and around 0.1e9 difference on a pair of 44- and 96-core ARMV8 hosts on drone.io, where I have not yet managed to get all combinations of possible hardware configurations represented in individual CI runs.

I was also able to reproduce the same difference on the same machine, with the only difference being the 32-bit vs 64-bit OpenBLAS builds.

martin-frbg (Collaborator) commented Aug 15, 2020

Are you sure that you #included the right version of cblas.h and linked the appropriate libopenblas in each case? Unfortunately I do not have access to any bigger x86_64 hardware than the Epyc 7401 at drone.io (or rather packet.com), but I do not see offhand why going to 64 cores would make a difference (nor why the larger addressable range would matter).
EDIT: results for the ARMV8 are:
96-core ThunderX: w/o INTERFACE64 2.27275e9, with INTERFACE64=1 2.27275e9
44-core (Ampere 8180?): 2.14748e9, with INTERFACE64=1 2.14748e9
(The two use different SGEMM kernels.) Error accumulation can be a bitch; use DGEMM if you need the precision.

Zha0q1 (Author) commented Aug 20, 2020

Update:
I did more tests and found that if I build both LP64 and ILP64 OpenBLAS dev from source (with all other configs the same), then the results are identical.

The difference I described earlier comes from the fact that my EC2 instance ships with a special, possibly optimized, OpenBLAS binary.

EC2 OpenBLAS (LP64, int32): no loss of precision on my example, 2.14748e+09
OpenBLAS dev (ILP64, int64): loss of precision, 1.93274e+09
OpenBLAS dev (LP64, int32): loss of precision, 1.93274e+09

The EC2 OpenBLAS version is:

ubuntu@ip-172-31-11-14:~$ ll /usr/local/lib/ |grep blas
lrwxrwxrwx  1 root root         30 Aug  7 01:38 libopenblas.a -> libopenblas_haswellp-r0.2.20.a
lrwxrwxrwx  1 root root         31 Aug  7 01:38 libopenblas.so -> libopenblas_haswellp-r0.2.20.so*
lrwxrwxrwx  1 root root         31 Aug  7 01:38 libopenblas.so.0 -> libopenblas_haswellp-r0.2.20.so*
-rw-r--r--  1 root root   28498048 Aug  7 01:12 libopenblas_haswellp-r0.2.20.a
-rwxr-xr-x  1 root root   14344848 Aug  7 01:13 libopenblas_haswellp-r0.2.20.so*

My OpenBLAS dev build is:

-rw-r--r--  1 root root   65650108 Aug 15 21:38 libopenblasp-r0.3.10.dev.a
-rwxr-xr-x  1 root root   39439000 Aug 15 21:38 libopenblasp-r0.3.10.dev.so*

Do you think this numerical difference is due to a different OpenBLAS version, a different build config, or something else?

martin-frbg (Collaborator) commented:

The SGEMM kernel for Haswell targets was changed earlier this year (#2361) to increase performance; perhaps this introduced different rounding due to FMA operations. You could try reverting that change (undo the changes in both kernel/x86_64/KERNEL.HASWELL and param.h) or do a quick test with 0.3.7.
