threadpool: skip polling for unused threads #9461
Conversation
Haven't looked at the changes yet, but on Mac have you tried running the thread sanitizer? It detects data races when running in CPU-only mode, even with this PR. Started happening after #8672. Here are the steps to run it:

cmake \
    -DCMAKE_BUILD_TYPE=Debug \
    -DLLAMA_SANITIZE_THREAD=ON \
    -DGGML_METAL=OFF \
    -DGGML_LLAMAFILE=OFF ..

make -j && ./bin/llama-simple -m ${model} -p "hello" -ngl 0

You should then see a number of thread sanitizer warnings in the output. We have to find a way to resolve these warnings.
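For reference, the kind of problem the thread sanitizer reports is an unsynchronized cross-thread access. A minimal, hypothetical example (unrelated to the ggml code) that it flags when built with -fsanitize=thread:

```c
// data-race.c: two threads increment a plain int with no synchronization
#include <pthread.h>
#include <stdio.h>

static int counter = 0;

static void * worker(void * arg) {
    (void) arg;
    for (int i = 0; i < 100000; i++) {
        counter++; // unsynchronized read-modify-write -> data race
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);
    return 0;
}
```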
Oh, I assumed those were benign. We did have a bunch of similar thread sanitizer warnings on x86-64 with OpenMP before the threadpool changes, so I figured it was some minor overlap in the matmul kernels.
@ggerganov @slaren Quick update on the thread sanitizer warnings.
I built for Mac with OpenMP and Thread Sanitizer using LLVM 18 installed via homebrew, and we're getting a bunch of those warnings there as well.
That's why I assumed they are sort of known / benign. Perhaps they are not? The threadpool support makes the timing/behavior very similar to OpenMP, which is why those warnings are now showing up in the default builds (i.e. the threadpool is enabled by default on the Mac with Apple toolchains). As I mentioned earlier, we do see a bunch of these sanitizer warnings on x86-64 with OpenMP/threadpool as well. How do you guys want to proceed?
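For reproducibility, here is a sketch of such a build; the homebrew LLVM paths and the GGML_OPENMP option are my assumptions and may need adjusting for your setup:

```sh
cmake \
    -DCMAKE_BUILD_TYPE=Debug \
    -DCMAKE_C_COMPILER=$(brew --prefix llvm)/bin/clang \
    -DCMAKE_CXX_COMPILER=$(brew --prefix llvm)/bin/clang++ \
    -DGGML_OPENMP=ON \
    -DLLAMA_SANITIZE_THREAD=ON \
    -DGGML_METAL=OFF \
    -DGGML_LLAMAFILE=OFF ..
make -j && ./bin/llama-simple -m ${model} -p "hello" -ngl 0
```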
I don't think that the warnings reported by the thread sanitizer here are benign. OpenMP has known compatibility issues with the thread sanitizer since it is not aware of the synchronization mechanism used by OpenMP, but this should not happen when using plain pthreads and atomics. I believe this is due to using relaxed memory order in ggml_barrier. It could probably be done with a more relaxed memory order, but these changes (on top of this PR) seem to fix the tsan warnings:

diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index c3b462b3..a49d3992 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -3188,7 +3188,7 @@ static void ggml_barrier(struct ggml_threadpool * threadpool) {
}
#else
static void ggml_barrier(struct ggml_threadpool * threadpool) {
- int n_threads = atomic_load_explicit(&threadpool->n_threads_cur, memory_order_relaxed);
+ int n_threads = atomic_load_explicit(&threadpool->n_threads_cur, memory_order_seq_cst);
if (n_threads == 1) {
return;
}
@@ -3196,16 +3196,16 @@ static void ggml_barrier(struct ggml_threadpool * threadpool) {
atomic_int * n_barrier = &threadpool->n_barrier;
atomic_int * n_barrier_passed = &threadpool->n_barrier_passed;
- int passed_old = atomic_load_explicit(n_barrier_passed, memory_order_relaxed);
+ int passed_old = atomic_load_explicit(n_barrier_passed, memory_order_seq_cst);
if (atomic_fetch_add(n_barrier, 1) == n_threads - 1) {
// last thread
atomic_store(n_barrier, 0);
- atomic_fetch_add_explicit(n_barrier_passed, 1, memory_order_relaxed);
+ atomic_fetch_add_explicit(n_barrier_passed, 1, memory_order_seq_cst);
} else {
// wait for other threads
while (true) {
- if (atomic_load_explicit(n_barrier_passed, memory_order_relaxed) != passed_old) {
+ if (atomic_load_explicit(n_barrier_passed, memory_order_seq_cst) != passed_old) {
return;
}
ggml_thread_cpu_relax();
@@ -12879,7 +12879,8 @@ UseGgmlGemm1:;
if (ith == 0) {
// Every thread starts at ith, so the first unprocessed chunk is nth. This save a bit of coordination right at the start.
- atomic_store_explicit(&params->threadpool->current_chunk, nth, memory_order_relaxed);
+ //atomic_store_explicit(&params->threadpool->current_chunk, nth, memory_order_relaxed);
+ atomic_store(&params->threadpool->current_chunk, nth);
}
ggml_barrier(params->threadpool);
@@ -12990,7 +12991,8 @@ UseGgmlGemm2:;
break;
}
- current_chunk = atomic_fetch_add_explicit(&params->threadpool->current_chunk, 1, memory_order_relaxed);
+ //current_chunk = atomic_fetch_add_explicit(&params->threadpool->current_chunk, 1, memory_order_relaxed);
+ current_chunk = atomic_fetch_add(&params->threadpool->current_chunk, 1);
}
}
@slaren This patch fixes the sanitizer warnings on my end.
Yes, I agree. From reading about this, it appears that when OpenMP is enabled the sanitizers can report issues, which is to be expected and there is not much we can do about it. We just have to make sure there are no issues when OpenMP is off.
I did a bunch of digging and actually convinced myself that the warnings are benign :). In our case, we just need to make sure that the thread sanitizer understands that ggml_barrier() enforces ordering. Take a look at the latest updates: I made everything explicit and documented it in the code. The thread sanitizer is happy now and performance looks good (same as before).
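To make the approach concrete, here is a simplified sketch of the idea (not the actual ggml_barrier code; field names are abbreviated): polling is done with relaxed loads to stay cheap, while the read-modify-write operations with memory_order_seq_cst act as full fences on barrier entry and exit.

```c
#include <stdatomic.h>

struct barrier {
    atomic_int n_threads;
    atomic_int n_barrier;
    atomic_int n_barrier_passed;
};

static void barrier_wait(struct barrier * b) {
    int n_threads = atomic_load_explicit(&b->n_threads, memory_order_relaxed);
    if (n_threads == 1) {
        return;
    }

    int passed_old = atomic_load_explicit(&b->n_barrier_passed, memory_order_relaxed);

    // entry: seq_cst read-modify-write acts as a full fence
    if (atomic_fetch_add_explicit(&b->n_barrier, 1, memory_order_seq_cst) == n_threads - 1) {
        // last thread to arrive: reset the counter and release the others
        atomic_store_explicit(&b->n_barrier, 0, memory_order_relaxed);
        atomic_fetch_add_explicit(&b->n_barrier_passed, 1, memory_order_seq_cst);
        return;
    }

    // wait for the last thread, polling with cheap relaxed loads
    while (atomic_load_explicit(&b->n_barrier_passed, memory_order_relaxed) == passed_old) {
        // spin (the real code uses a cpu-relax hint here)
    }

    // exit: full fence so the writes other threads did before the barrier are visible
    atomic_thread_fence(memory_order_seq_cst);
}
```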
Quick update. In terms of the overall correctness I further convinced myself that we should be good on that. As I mentioned above, the main thing we need to make sure of is that all CPUs finish processing an Op (matmul, copy, etc.) and that all of the memory writes complete before we exit ggml_barrier(). The updates to the threadpool state itself are done only from the main thread, under the mutex (which is also a full barrier), and while the worker threads are either spinning or sleeping.
If you can think of a scenario where we do have a true race condition, do let me know. Maybe I'm missing something.
I am not convinced that the current barrier implementation is correct under the C memory model.
I'm going to try to convince you :)
There is no need for the threads to complete Op processing at the same time. Re: just using strict memory order everywhere: it's hard to measure the overhead with high-level tests.
I have no doubt that what you are saying is true in practice for the specific hardware. It certainly is for x86, where all atomic loads/stores have rel/acq semantics and, chances are, both versions of the code generate the exact same asm. I defer to your knowledge about the way this works on ARM. But ultimately we are not programming for any specific hardware, we are programming for the C virtual machine and the semantics specified thereof. Quoting cppreference.com on release-acquire ordering: if an atomic store in thread A is tagged memory_order_release, an atomic load in thread B from the same variable is tagged memory_order_acquire, and the load in thread B reads a value written by the store in thread A, then the store in thread A synchronizes-with the load in thread B.
The important part here is "and the load in thread B reads a value written by the store in thread A". Thread 0 in your example does not load a value written by thread 1 or thread 2, so there is no guarantee that it will see the writes that happened before their stores.
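As a minimal illustration of the quoted rule (a hypothetical example, not ggml code): the write to data only becomes visible through the release store / acquire load pair, and only if the load actually reads the stored value.

```c
#include <stdatomic.h>

static atomic_int flag = 0;
static int data = 0;

// thread A
static void producer(void) {
    data = 42;                                              // plain write
    atomic_store_explicit(&flag, 1, memory_order_release);  // release store
}

// thread B
static void consumer(void) {
    if (atomic_load_explicit(&flag, memory_order_acquire) == 1) {
        // the acquire load read the value written by the release store,
        // so it synchronizes-with it and data == 42 is guaranteed here
        int x = data;
        (void) x;
    }
    // if the load reads 0 instead, nothing synchronizes and there is no
    // guarantee about data: this is the point being made about thread 0
}
```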
@slaren Sorry for the delayed response. The threading/sequence example I provided above is actually generic and assumes the C/C++ memory order semantics (not a specific arch). Perhaps I shortened the ops a bit too much. The key piece is the read-modify-write (atomic_fetch_add) with memory_order_seq_cst on barrier entry and exit. Here is the reference from the same source (https://en.cppreference.com/w/c/atomic/memory_order): a read-modify-write operation with memory_order_seq_cst is both an acquire and a release operation, and a single total order exists in which all threads observe all modifications in the same order.
On arm64 (armv8.2-a and up) that translates to the LDADDAL instruction. In other words, once all the threads go through that read-modify-write, all of their prior writes are visible to the other threads. Btw, the Thread Sanitizer issue I linked to earlier (about the fences) is similar in the sense that atomic_fetch_add_explicit(n_barrier, 1, memory_order_seq_cst) is acting as a full fence, and OpenMP causes the exact same confusion for the Thread Sanitizer. Now, the new test that I added (it does tons of ggml_barriers) did highlight the need for making sure all threads go through a full memory fence on ggml_barrier entrance and exit. M2 Max and Snapdragon Gen 3 are looking good, but I didn't yet get a chance to do more testing on the Ryzen, EPYC and X-Elite. Will try to do that later today and provide an update.
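If it helps, one way to double-check the codegen claim is to compile a tiny probe and inspect the assembly; the exact target/arch flags below are assumptions and may need adjusting:

```c
// probe.c
// e.g.: clang -O2 -S --target=aarch64-linux-gnu -march=armv8.2-a probe.c -o -
// with LSE atomics available, the seq_cst fetch-add should lower to LDADDAL
#include <stdatomic.h>

int enter_barrier(atomic_int * n_barrier) {
    return atomic_fetch_add_explicit(n_barrier, 1, memory_order_seq_cst);
}
```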
Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1). This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur). n_threads_cur is now an atomic_int to explicitly tell the thread sanitizer that it is written from one thread and read from other threads (not a race condition).
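A rough sketch of the check being described (struct and field names here are illustrative, not the exact ggml ones):

```c
#include <stdatomic.h>
#include <stdbool.h>

struct threadpool {
    atomic_int n_threads_cur;   // number of threads used for the current graph
};

struct compute_state {
    struct threadpool * threadpool;
    int ith;                    // this worker's index
};

// unused threads (ith >= n_threads_cur) skip the polling rounds entirely
static bool thread_should_poll(const struct compute_state * st) {
    int n_cur = atomic_load_explicit(&st->threadpool->n_threads_cur, memory_order_relaxed);
    return st->ith < n_cur;
}
```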
Avoid using strict memory order while polling, yet make sure that all threads go through a full memory barrier (memory fence) on ggml_barrier entrance and exit.
This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead.
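A rough sketch in the spirit of such a test, using the public ggml C API (the sizes, op count and the use of ggml_graph_compute_with_ctx are illustrative assumptions, not the actual test code):

```c
#include "ggml.h"

int main(void) {
    struct ggml_init_params ip = {
        /*.mem_size   =*/ 64*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(ip);

    // tensor data is left uninitialized; we only care about the barrier overhead
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);

    // chain many tiny matmuls; each op ends with a ggml_barrier across the threads
    struct ggml_tensor * cur = x;
    for (int i = 0; i < 1000; i++) {
        cur = ggml_mul_mat(ctx, a, cur);
    }

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, cur);

    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 8);

    ggml_free(ctx);
    return 0;
}
```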
Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order to keep it efficient; once a new graph is detected we do a full fence using a read-modify-write with strict memory order.
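An illustrative sketch of that handshake (the names are not the exact ggml ones): workers poll a graph counter with relaxed loads, and once a new graph is detected they perform a seq_cst read-modify-write, which acts as a full fence before they touch the graph data published by the main thread.

```c
#include <stdatomic.h>

struct pool {
    atomic_int n_graph;  // bumped by the main thread when a new graph is ready
};

static void worker_wait_for_graph(struct pool * p, int * last_graph) {
    // cheap polling: relaxed loads only
    while (atomic_load_explicit(&p->n_graph, memory_order_relaxed) == *last_graph) {
        // spin, or eventually go to sleep on a condition variable
    }
    // full fence via a read-modify-write with strict memory order before
    // reading the graph/work data written by the main thread
    *last_graph = atomic_fetch_add_explicit(&p->n_graph, 0, memory_order_seq_cst);
}
```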
Do not use threadpool->ec (exit code) to decide whether to exit the compute loop. threadpool->ec is not atomic, which makes the thread sanitizer rightfully unhappy about it. Instead, introduce an atomic threadpool->abort flag for this. This is consistent with how we handle threadpool->stop or pause. While at it, add an explicit atomic_load for n_threads_cur for consistency.
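A minimal sketch of that change (illustrative names): the compute loop checks an atomic abort flag, while the non-atomic exit code is only read after the workers are done.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct threadpool {
    atomic_bool abort;  // set when graph computation should be aborted
    int         ec;     // exit code: plain int, read only after the workers stop
};

// checked inside the compute loop instead of the non-atomic exit code
static bool should_abort(struct threadpool * tp) {
    return atomic_load_explicit(&tp->abort, memory_order_relaxed);
}
```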
Fixes a use-after-free detected by the GCC thread sanitizer on x86-64; for some reason the LLVM sanitizer does not detect this issue.
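For context, a sketch of the teardown order that avoids that use-after-free, assuming the ggml threadpool API names (ggml_threadpool_free): the threadpool, which owns the worker threads, has to go away before the context whose data those workers may still reference.

```c
#include "ggml.h"

static void cleanup(struct ggml_threadpool * tp, struct ggml_context * ctx) {
    ggml_threadpool_free(tp);  // first: stop and join the worker threads
    ggml_free(ctx);            // then: release the context memory
}
```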
Take a look at the latest. Seems like this should be good to go.
Tested on M2 Max, Snapdragon Gen 3 (S24 Ultra), Ryzen 3950X, EPYC 7543.
Performance-wise things are looking good. On arm64 our threadpool does quite a bit better than OpenMP. Here are some microbenchmark numbers using that new test.
S24 Ultra
Our default Android NDK armv8.7 build, with and without OpenMP.
AMD Ryzen 9 3950X
LLVM 18 build.
Looks good to me. I get these results with a 13900K.
Definitely makes sense to follow up on improving the n_threads > 8 cases on x86-64 in the next iteration (i.e. as part of the threadpool v3 that we discussed).
I've done a few tests on M1 Pro, M2 Ultra and Ryzen 9 5950X and all seems good. Thank you.
OpenMP with Metal is broken after this commit on an M2 with Sequoia 15.0, using clang 18.1.8.
Oh, interesting. Can you please share how exactly you built it?
Ok, I was able to reproduce it on M2 Max with Sequoia 15 and LLVM 18.
Interestingly enough, if just a single layer runs on the CPU then it works fine.
I'll try to figure out what exactly broke with OpenMP in this case. It's not immediately obvious.
Fixed in
Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1).
For smaller graphs/models/prompts the unused threads may end up always polling and never sleeping because we keep getting new graphs to work on.
This PR adds support for skipping the polling for unused threads (ith >= n_threads_cur). They simply go to sleep and we wake them up when we get a new graph to work on.
n_threads_cur is now an atomic_int to explicitly tell the compiler and thread sanitizer that it is written from one thread and read from other threads (free from race conditions). All loads and stores use relaxed memory order so there is no additional overhead.
Here are some scenarios with the default build on M2 Max, with debug prints for n_threads_cur updates and for threads going to sleep.
Full offload (Metal)
8 threads are started. Only 1 is active, so the other 7 skip the polling and go to sleep.
CPU only
8 threads are started and they are all active. Hybrid polling, which is enabled by default, prevents them from going to sleep.
No KV offload
8 threads are started, and we alternate between using all 8 and just one for different parts of the graph.