per-caller-thread limits #76
Comments
As far as I know, it's not possible in a multi-threaded setting, because we can only set the maximum number of threads of the native C libraries (BLAS, OpenMP) globally. I think MKL has a way to specify a thread-local maximum number of threads, but I'm not sure it solves this issue. It might be different if your algorithm uses multi-processing: then you should be able to set the number of threads in each subprocess. You'd still have to keep track of the budget at each step; the scheduling part is outside the scope of threadpoolctl.
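For illustration, a minimal sketch of that multi-processing workaround, using the real `threadpool_limits` context manager. The 8-thread budget, the even split across workers, and the matrix sizes are arbitrary assumptions, not anything threadpoolctl prescribes:

```python
# Sketch: each worker process caps its own BLAS threads, so a budget of
# 8 threads can be divided across 4 workers of 2 threads each. The
# splitting policy is entirely up to the caller, not threadpoolctl.
import numpy as np
from multiprocessing import Pool
from threadpoolctl import threadpool_limits

def worker(args):
    a, n_threads = args
    # Cap BLAS threads inside this worker process only.
    with threadpool_limits(limits=n_threads, user_api="blas"):
        return float(np.linalg.norm(a @ a.T))

if __name__ == "__main__":
    total_budget, n_workers = 8, 4          # assumed budget and split
    per_worker = total_budget // n_workers
    tasks = [(np.random.rand(500, 500), per_worker) for _ in range(n_workers)]
    with Pool(n_workers) as pool:
        print(pool.map(worker, tasks))
```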
The standard OpenMP and BLAS APIs do not provide a generic way to do this, as @jeremiedbb said above. It would be great to lobby BLAS implementation developers to provide a consistent API to set the parallelism budget on a per-BLAS-call thread-local basis. I think the BLIS developers intended to do so a while ago, but I have not followed their development recently, and as far as I know OpenBLAS does not provide anything like this.
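As a concrete example of such a thread-local API, MKL does ship `mkl_set_num_threads_local()`, which caps threads for the calling thread only. A hedged sketch of calling it via ctypes follows; the library name and its presence on the loader path are assumptions (it differs per platform, e.g. `mkl_rt.dll` on Windows), and OpenBLAS has no equivalent:

```python
# Sketch, assuming a NumPy/SciPy stack linked against MKL and the MKL
# runtime being loadable. mkl_set_num_threads_local() is a real MKL
# function: it affects only the calling thread and returns the previous value.
import ctypes

mkl = ctypes.CDLL("libmkl_rt.so")  # assumption: MKL runtime on the loader path

def set_local_mkl_threads(n):
    """Cap MKL threads for the current thread; returns the previous value.
    Passing 0 reverts this thread to the global setting."""
    return mkl.mkl_set_num_threads_local(ctypes.c_int(n))

prev = set_local_mkl_threads(2)  # BLAS calls on this thread now use <= 2 threads
# ... NumPy/BLAS-heavy work here ...
set_local_mkl_threads(0)         # restore the global setting for this thread
```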
@jeremiedbb seems correct in that, if one uses multi-processing, it is possible to use `threadpool_limits` within each subprocess to divide the thread budget.
Based on the last reply, I have the feeling that we can close this issue. |
Note that the multi-processing workaround has significant disadvantages.
What you describe is something far beyond the scope of threadpoolctl. I think what you want is close to what TBB offers with a full-fledged task scheduler. However, that would require all the threaded tools of the ecosystem (BLAS, machine learning libraries, signal processing libraries...) to use TBB instead of OpenMP... and currently in the Python world, for instance, Cython does not have syntactic support for interfacing with a TBB runtime (as far as I know). Also note that TBB has its limitations w.r.t. over-subscription in practical deployment scenarios like Docker containers, see: uxlfoundation/oneTBB#190 . They might be fixable though.
I can reopen the issue with a more descriptive title; however, it's unlikely to ever be solved, because major BLAS implementations (e.g. OpenBLAS) do not offer such control (maybe BLIS does?) and this is not part of the OpenMP standard either (as far as I know).
It is indeed a tough problem and might not be solvable in general, and OpenMP/BLAS etc. don't make it easy. It is much easier if "everyone" agrees on a single scheduler technology (e.g. TBB). This is a place where Julia has an advantage: being new, and having had a multi-threaded scheduler within the language from an early stage, most packages tend to just use it, so there's automatic balancing across multi-threaded apps. That said, if we don't keep this as an open issue then things won't ever get any better...
The thing is that this problem will not be solved in `threadpoolctl` itself: the underlying OpenMP and BLAS runtimes would have to expose such an API first.
The `threadpool_limits` are global. This makes it difficult to avoid oversubscription when invoking parallel operations (e.g., NumPy functions) from within a parallel divide-and-conquer algorithm.

Ideally, parallel multi-threading frameworks would be fully multi-threading-aware, that is, have a limit on the total number of threads used, regardless of how many threads are generating requests. This, however, seems too much to ask for :-(

A simpler modification would be to set per-caller-thread limits. This way, a divide-and-conquer algorithm could, at each step, subdivide the total budget of threads. As a secondary upside, an odd budget of 2n+1 threads could be split into n threads for one sub-task and n+1 threads for the other, fully utilizing all threads, rather than setting a global budget of n threads for each caller (missing out on one thread) or n+1 for each (oversubscribing), as sketched below.
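To make the request concrete, here is a sketch of what such a per-caller API could look like. `thread_local_limits` is hypothetical (threadpoolctl only provides the process-global `threadpool_limits`), and the `task` object with its `can_split`/`split`/`compute`/`merge` methods is a placeholder for a divide-and-conquer workload:

```python
# Hypothetical sketch only: thread_local_limits does NOT exist in
# threadpoolctl. It illustrates splitting an odd budget of 2n+1 threads
# into n and n+1 at each divide-and-conquer step.
import threading

def solve(task, budget):
    if budget <= 1 or not task.can_split():
        with thread_local_limits(limits=max(budget, 1)):  # hypothetical per-caller cap
            return task.compute()  # BLAS-heavy leaf work, capped at `budget` threads
    left, right = task.split()
    lo, hi = budget // 2, budget - budget // 2  # 2n+1 splits into n and n+1
    out = {}
    t = threading.Thread(target=lambda: out.update(l=solve(left, lo)))
    t.start()
    out["r"] = solve(right, hi)  # the calling thread takes the larger half
    t.join()
    return task.merge(out["l"], out["r"])
```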
Is such finer-grained control over thread limits possible? If so, I'd love to see support for it in `threadpoolctl`.