OpenBLAS hanged when testing multithreaded affinity #2341

Closed
MacChen02 opened this issue Dec 17, 2019 · 13 comments


@MacChen02
Contributor

OpenBLAS hangs when testing multithreaded affinity.

[Screenshot: hang]

Environment: ARMV8, CentOS 7.6, OpenBLAS-0.3.7
Compile cmd: make TARGET=ARMV8 CC=gcc FC=gfortran DEBUG=1 NO_AFFINITY=0 -j96
Execute cmd: export OMP_NUM_THREADS=32 && ./dgemm.goto 6000 6000

The problem can be reproduced by simulating an abnormal exit: manually stop the process while it is inside the following region of OpenBLAS-0.3.7/driver/others/init.c:
[Screenshot: code]

If the first process exits abnormally before blas_unlock(&common -> lock), the value of common->lock stays at 1. On the next run the shared memory segment already exists, with common->lock=1 and common->magic=SH_MAGIC, so blas_lock enters an infinite loop and the program hangs.
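
To illustrate the mechanism, here is a minimal sketch, not the OpenBLAS code. The names `sketch_header`, `SKETCH_MAGIC`, `sketch_lock_bounded`, `sketch_unlock` and `sketch_die_while_holding` are hypothetical, and a plain struct stands in for the shared memory segment. The only change from an unbounded spin lock is the retry limit, which lets the stale lock show up as a failure instead of a hang.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder value for illustration, not taken from OpenBLAS. */
#define SKETCH_MAGIC 0x510510u

/* Hypothetical stand-in for the header kept in the shared segment. */
typedef struct {
    volatile unsigned long lock;  /* 0 = free, 1 = held */
    unsigned int magic;           /* marks the segment as initialized */
} sketch_header;

/* Spin-lock acquire with an upper bound on retries.
 * Returns true if the lock was taken, false if it stayed held. */
static bool sketch_lock_bounded(volatile unsigned long *lock, long max_spins)
{
    assert(max_spins > 0);
    for (long i = 0; i < max_spins; i++) {
        /* Atomically write 1 and fetch the previous value (GCC/Clang builtin). */
        if (__sync_lock_test_and_set(lock, 1UL) == 0UL)
            return true;
    }
    return false;
}

static void sketch_unlock(volatile unsigned long *lock)
{
    __sync_lock_release(lock);    /* atomically writes 0 */
}

/* Simulates a process that dies between lock and unlock: the header is
 * left with lock == 1 and a valid magic for the next process to find. */
static void sketch_die_while_holding(sketch_header *h)
{
    h->magic = SKETCH_MAGIC;
    h->lock = 1UL;
}
```

After `sketch_die_while_holding(&h)`, a call such as `sketch_lock_bounded(&h.lock, 1000)` returns false. Without the retry limit the same loop would never return, which matches the hang described above.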

Once this has happened, OpenBLAS becomes unusable.

I have attached a patch that checks the value of common->lock first.
0001-hang-multithread-affinity.patch.txt

@brada4 @martin-frbg

@brada4
Contributor

brada4 commented Dec 19, 2019

Looking at it

  • you must start checking thread magic so we manipulate our thread (current logic is exactly reverse)
  • then you zap things only in places you need to if()

No time during Xmas. You have shown the root cause of a long-hidden problem: OpenBLAS messes with other threads....

@martin-frbg
Collaborator

Not sure if the existing code (which dates back to GotoBLAS) is actually incorrect for normal operations. Perhaps there needs to be a separate pass to pick up the bits from threads that met an unexpected fate?

@brada4
Contributor

brada4 commented Dec 19, 2019

Distinguish "ours" from "main" and "others"

@MacChen02
Contributor Author

@brada4 @martin-frbg
What do you think of the patch file? It solves the hang caused by an abnormal interruption.

Another thread may encounter one of the following situations:

  1. common->lock is held
    1. common->shmid is alive: the thread spins on nop instructions, waiting.
    2. common->shmid is dead: this indicates that another thread exited abnormally, so the thread should clear the stale value at this point.
  2. common->lock is free: this is fine.
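
The cases above can be sketched as a small decision function. This is an illustration under assumptions, not the contents of the attached patch: `sketch_action`, `sketch_decide` and `sketch_attach_count` are hypothetical names, and reading the attach count via `shmctl(IPC_STAT)` is only one possible way to judge whether the segment is still "alive".

```c
#include <stdbool.h>
#include <sys/ipc.h>
#include <sys/shm.h>

typedef enum {
    SKETCH_PROCEED,      /* case 2: lock is free */
    SKETCH_WAIT,         /* case 1.1: lock held and owner alive, keep spinning */
    SKETCH_RESET_STALE   /* case 1.2: lock held but owner gone, clear stale state */
} sketch_action;

/* Maps the observed lock value and a liveness verdict to an action. */
static sketch_action sketch_decide(unsigned long lock_value, bool owner_alive)
{
    if (lock_value == 0UL)
        return SKETCH_PROCEED;
    return owner_alive ? SKETCH_WAIT : SKETCH_RESET_STALE;
}

/* Assumed liveness probe: number of processes currently attached to the
 * System V segment, or -1 if the query fails. The count includes the
 * caller if it has already attached the segment. */
static long sketch_attach_count(int shmid)
{
    struct shmid_ds ds;
    if (shmctl(shmid, IPC_STAT, &ds) != 0)
        return -1;
    return (long)ds.shm_nattch;
}
```

A caller could pass `sketch_attach_count(shmid) > 1` as `owner_alive` once it has attached the segment itself. That check is racy: a process that is attached but slow is indistinguishable from a healthy lock holder only in one direction, and a holder that detaches late could still be judged dead too early, so this sketch does not settle whether resetting the lock is safe during normal operation.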

I think the "common -> magic != SH_MAGIC" check is used to wait for another thread that is handling the NUMA mapping and so on.

@martin-frbg
Collaborator

Sorry, I still do not see how this could occur in a real-life situation, as opposed to willfully knocking down threads during the early initialization phase of OpenBLAS?

@MacChen02
Contributor Author

@martin-frbg

The problem has appeared several times on my device, and also several times on an x86 platform (Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz). It occurs with a small probability in some abnormal situations.

I only reproduced the problem by simulating an abnormal situation; it does actually occur during the early initialization phase of OpenBLAS.

@brada4
Contributor

brada4 commented Jan 4, 2020

Does it affect the default build, which does not play with affinity and lets the operating system scheduler (10 years fresher) place processes on processors?

Where you say NUMA, it is actually placing threads on CPUs in order, nothing modern there. Improved robustness will help there too.

A more modern approach (given the absence of good NUMA awareness, like memory-to-cpu binding) would be here:
https://www.postgresql.org/message-id/[email protected]

@martin-frbg
Collaborator

BTW which compiler are you using? CentOS 7.6 seems to come with a very old version of gcc (4.8.5) by default. (I'd still like to understand, and if possible fix, the underlying issue of the "abnormal situations" leading to unexpected, unhandled thread death)

@MacChen02
Contributor Author

@brada4
The problem doesn't affect the default build without affinity; it only happens when affinity is enabled.
It has nothing to do with NUMA features.

@martin-frbg
The gcc version is 4.8.5.
The root cause is something like manually aborting the program. The probability is small, but I ran into it, hence this issue.

I have not thought of other possible causes yet.

@brada4
Contributor

brada4 commented Jan 9, 2020

There is a newer compiler at softwarecollections.org, named devtoolset-?-gcc

@MacChen02
Contributor Author

@brada4
I don't use devtoolset-?-gcc as the system compiler.
How would it differ from gcc with respect to this problem?

@brada4
Contributor

brada4 commented Jan 10, 2020

It is selectable, you don't have to change the system compiler.
Slightly dated instructions here:
https://github.com/xianyi/OpenBLAS/wiki/faq#binutils

@martin-frbg
Collaborator

If the scenario is "only" about killing a thread at an inopportune moment where it holds a lock, changing the compiler is unlikely to improve anything. I am still worried that this simple patch could create an equally undesirable and much less obvious problem where a thread could be wrongly pronounced dead during normal operation.
