Oversubscription slowdown when run under container CPU quotas #190
Hi, we ran into the same problem several weeks ago. Sadly, the IPC solution described in the PR referenced above is not an option for us in terms of performance, and we had to fall back to reading the cgroup files (https://github.com/root-project/root/blob/a7495ae4f697f9bf285835f004af3f14f330b0eb/core/imt/src/TPoolManager.cxx#L32). However, we are reluctant to believe we are the only ones running into this problem when virtualization of hardware resources is so widespread nowadays. Is this something the TBB team is thinking of addressing in the near future? Do you see it as something to be figured out on the user side, or do you agree that TBB should take care of it?
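For readers facing the same issue, below is a minimal sketch of this kind of workaround. It is not the ROOT code linked above; it assumes the cgroup v1 layout under `/sys/fs/cgroup/cpu` and a TBB version that provides `tbb::global_control`, and the helper name is purely illustrative.

```cpp
// Illustrative only: derive a thread budget from the cgroup v1 CFS quota and
// cap TBB's parallelism accordingly (cgroup v2 exposes the same data in cpu.max).
#include <algorithm>
#include <cmath>
#include <fstream>
#include <thread>
#include <tbb/global_control.h>

// Number of CPUs implied by the CFS quota, or the hardware concurrency when
// no quota is configured (cfs_quota_us is -1) or the files cannot be read.
static int cfs_cpu_limit() {
    const int hw = static_cast<int>(std::max(1u, std::thread::hardware_concurrency()));
    long long quota = -1, period = -1;
    std::ifstream quota_file("/sys/fs/cgroup/cpu/cpu.cfs_quota_us");
    std::ifstream period_file("/sys/fs/cgroup/cpu/cpu.cfs_period_us");
    if (!(quota_file >> quota) || !(period_file >> period) || quota <= 0 || period <= 0)
        return hw;  // no quota set, or files unavailable: keep the usual default
    const int limit = static_cast<int>(std::ceil(static_cast<double>(quota) / period));
    return std::max(1, std::min(limit, hw));
}

int main() {
    // Cap the number of worker threads TBB may use for this process.
    tbb::global_control gc(tbb::global_control::max_allowed_parallelism,
                           static_cast<std::size_t>(cfs_cpu_limit()));
    // ... run the TBB-based workload here ...
}
```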
One possible solution is to launch your program on a restricted number of CPU cores using the
However, I am not sure how to select the CPU ids to make sure that containers running on the same host do not use overlapping affinity masks.
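For completeness, here is a minimal sketch of what such pinning could look like on Linux. The choice of CPU ids 0-3 is purely hypothetical, and coordinating non-overlapping ids across containers is exactly the open question above.

```cpp
// Illustrative only: restrict the current process to a fixed set of CPU ids.
// TBB's default concurrency on Linux generally follows the process affinity
// mask, so limiting the mask also limits the number of worker threads.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for CPU_SET and sched_setaffinity in <sched.h>
#endif
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = 0; cpu < 4; ++cpu)  // hypothetical choice: CPUs 0-3
        CPU_SET(cpu, &mask);
    if (sched_setaffinity(0 /* this process */, sizeof(mask), &mask) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }
    // ... start the TBB-based workload here ...
    return 0;
}
```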
Thanks, @ogrisel. Unfortunately, setting affinity masks is not something we are interested in except in exceptional cases (for instance, dealing with NUMA domains). For our case, I'd rather read the cgroup files once. In any case, I am still interested in an answer from the TBB developers on future support for this issue.
@tbbdev please pay attention
Can you clarify, please, how TBB could interpret CPU quotas? As I understand it, the mentioned quotas specify the share of CPU time available to a particular application. They do not set the number of threads or affinity constraints, so it is the OS's responsibility to schedule threads in accordance with the quota.
Hi, @alexey-katranov. In this context, my suggestion would be to interpret the quota/period ratio to decide the default number of threads. If you read a quota of two times the period, that indicates you can use up to 2 CPUs' worth of runtime per period, which would map to two threads assuming they are busy 100% of the time. Non-integer ratios could be rounded up to the nearest integer.

A common example nowadays: you have 2 containers sharing a node with 8 cores, both of them running multithreaded workloads. These containers aren't pinned to particular CPUs but are assigned a fraction of the bandwidth of the machine, using CFS Bandwidth Control (for instance, by launching the containers with Docker's --cpus option). The scheduler implements this by running all threads of a control group for a fraction (cfs_quota) of an execution period (cfs_period); once they have used up the quota, it stops them, and they remain stopped until the start of the next period.

The problem arises when, in multithreaded workloads, the quota/period ratio is lower than the number of logical cores (TBB's default). For instance, if on this 8-core machine the cfs_quota is 800ms and the period is 200ms, we get a quota/period ratio of 4, meaning we can use up to 4 CPUs' worth of runtime. To use this quota efficiently, it would be great if we spawned 4 threads in each container, but TBB will spawn 8. In that case, each thread will run on a different CPU, but only for 50% of an execution period, since we can only use up to 4 CPUs' worth of runtime. After half of the period, all 8 threads will be yanked out of execution by the operating system and put to wait until they can run again in the next period. These context switches turn out to be very costly.

Just scale the above example up to a machine with 100 logical cores where each container gets assigned two CPUs' worth of quota. Now 100 threads are allowed to use as much CPU as two threads running at full speed. That means they constantly have to be switched out and back in, and they spend 98% of each cfs_period waiting. That is not even counting the overhead of context-switching 100 threads.
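One way to observe the throttling described above from inside a container is to read the cgroup `cpu.stat` counters. A minimal sketch, assuming the cgroup v1 cpu controller is mounted at `/sys/fs/cgroup/cpu`:

```cpp
// Illustrative only: print the CFS throttling counters for this cgroup.
// nr_throttled counts periods in which the group ran out of quota, and
// throttled_time (ns) is how long its threads sat waiting for the next period.
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream stat("/sys/fs/cgroup/cpu/cpu.stat");
    if (!stat) {
        std::cerr << "cpu.stat not found (different cgroup layout?)\n";
        return 1;
    }
    std::string key;
    long long value;
    while (stat >> key >> value)
        std::cout << key << " = " << value << '\n';
    return 0;
}
```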
Hi, @alexey-katranov. We still didn't get an answer on future support for this issue :)
You have
@jeremyong, as I understand it, the issue is about deploying an existing TBB-based application, so there is no possibility of recompiling it.
Would it be possible to have an environment variable to set the equivalent of
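TBB does not currently read such an environment variable, and for already-built applications it would have to be honored by the library itself. As a rough sketch of what the suggestion amounts to, shown at application level and using the hypothetical variable name `TBB_NUM_THREADS` from the report below:

```cpp
// Illustrative only: honor a hypothetical TBB_NUM_THREADS environment variable
// at application level (TBB itself does not read such a variable).
#include <cstdlib>
#include <tbb/global_control.h>

int main() {
    std::size_t n = 0;
    if (const char* env = std::getenv("TBB_NUM_THREADS"))
        n = std::strtoul(env, nullptr, 10);
    if (n == 0)  // variable absent or invalid: keep TBB's default concurrency
        n = tbb::global_control::active_value(
                tbb::global_control::max_allowed_parallelism);
    tbb::global_control gc(tbb::global_control::max_allowed_parallelism, n);
    // ... run the TBB-based workload here ...
}
```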
If you have 2 apps (A & B) deployed in containers and each app/container has a quota of 50% of the total CPUs of the host, app A will never be able to use more than 50% of the CPUs, even if B is idle. So I don't see how the solution proposed by @xvallspl could be detrimental.
Is it desired behavior that an application cannot utilize more than 50% of the CPU? At first glance, it seems inefficient.
Yes, because they can be different services running concurrently in different containers on the same host, and you don't want one service to degrade the performance of another. In compute-intensive scenarios, you could have many spark / dask workers, each allocated 4 CPU threads via CFS quotas, running on a cluster of big machines with 48 physical cores each. The containers would be scheduled by a spark or dask orchestrator that talks to kubernetes to dynamically allocate more or fewer workers based on the cluster-wide load (pending tasks). So it's the job of the spark / dask scheduler to allocate and release resources efficiently, and of kubernetes to pack pods / docker containers onto the physical machines (and possibly to talk to the underlying cloud infrastructure to dynamically provision or release machines for the cluster). But the workers should not trigger over-subscription by each trying to use 48 threads when they are only allowed to use 4 CPUs each.
Sure, I do not want one service to degrade the performance of another. If each service can utilize its quota, that is OK, but my concern is: what if one service needs more while another service does not need all of its resources? In that case, can the OS increase the quota allocation for the first service for some time? Or can the quota mechanism lead to system under-utilization if some process does not use its quota?
@alexey-katranov The problem you are describing has to be solved when configuring the quota. Once the quota has been set, your service won't be able to go over it, regardless of whether its work is spread across multiple processors or not. As @ogrisel pointed out, this is a desired, intended behaviour. It can lead to resource under-utilization, but that's not the point. The idea behind CPU bandwidth control is to limit the amount of resources a task or a group of tasks can consume. Sorry for the late reply, I missed this conversation.
For reference, loky (an alternative to
@alexey-katranov is this issue still relevant?
When running an application in a Linux container environment (e.g. docker containers), it is often the case that the orchestrator configuration (kubernetes, docker compose/swarm) sets CPU quotas via Linux cgroups to avoid having one container use up all the CPU of the neighboring apps running on the same host.
However TBB does not seem to introspect `/sys/fs/cgroup/cpu/cpu.cfs_quota_us` / `/sys/fs/cgroup/cpu/cpu.cfs_period_us` to understand how many tasks it can run concurrently, resulting in significant slowdown caused by over-subscription. As it is not possible to set a `TBB_NUM_THREADS` environment variable in the container deployment configuration, this makes it challenging to efficiently deploy TBB-enabled apps on docker-managed servers.

Here is a reproducing setup using numpy from the default anaconda channel on a host machine with 48 threads (24 physical cores):
By using sequential execution, or OpenMP with an appropriately configured environment, the problem disappears:
Of course, if OpenMP is used without setting `OMP_NUM_THREADS` to match the docker CPU quota, one also gets a similar over-subscription problem as encountered with TBB:

Edit: the first version of this report mentioned `MKL_THREADING_LAYER=omp` instead of `MKL_THREADING_LAYER=tbb` in the first command (with duration 20.227s). I confirm that we also get 20s+ with `MKL_THREADING_LAYER=tbb`.