
Oversubscription slowdown when run under container CPU quotas #190

Open
ogrisel opened this issue Oct 25, 2019 · 19 comments

@ogrisel

ogrisel commented Oct 25, 2019

When running an application in a Linux container environment (e.g. Docker containers), the orchestrator configuration (Kubernetes, Docker Compose/Swarm) often sets CPU quotas via Linux cgroups to prevent one containerized app from using all the CPU of the neighboring apps running on the same host.

However, TBB does not seem to introspect /sys/fs/cgroup/cpu/cpu.cfs_quota_us / /sys/fs/cgroup/cpu/cpu.cfs_period_us to understand how many tasks it can run concurrently, resulting in a significant slowdown caused by over-subscription. As it is not possible to set a TBB_NUM_THREADS environment variable in the container deployment configuration, it is challenging to efficiently deploy TBB-enabled apps on Docker-managed servers.
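
For illustration, here is a minimal sketch (hypothetical, not TBB code, and assuming the cgroup v1 paths above) of the kind of introspection this would amount to: read the quota and period, and derive how many CPUs' worth of runtime the container is allowed per period.

// cgroup_cpu_count.cpp -- hypothetical sketch, cgroup v1 paths assumed
#include <algorithm>
#include <cmath>
#include <fstream>
#include <iostream>
#include <thread>

// Derive how many CPUs' worth of runtime the container may use per period.
int effective_cpu_count() {
    long long quota = -1, period = -1;
    std::ifstream quota_file("/sys/fs/cgroup/cpu/cpu.cfs_quota_us");
    std::ifstream period_file("/sys/fs/cgroup/cpu/cpu.cfs_period_us");
    quota_file >> quota;
    period_file >> period;

    int hw = static_cast<int>(std::max(1u, std::thread::hardware_concurrency()));
    if (quota <= 0 || period <= 0)   // quota is -1 when no limit is set
        return hw;

    // Round the quota/period ratio up, but never exceed the hardware count.
    int allowed = static_cast<int>(std::ceil(double(quota) / double(period)));
    return std::min(std::max(allowed, 1), hw);
}

int main() {
    std::cout << "effective CPU count: " << effective_cpu_count() << "\n";
}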

Here is a reproducing setup using numpy from the default anaconda channel on a host machine with 48 threads (24 physical cores):

$ cat oversubscribe_tbb.py
import numpy as np
from time import time

data = np.random.randn(1000, 1000)
print(f"Calling np.linalg.eig shape={data.shape}:",
      end=" ", flush=True)
tic = time()
np.linalg.eig(data)
print(f"{time() - tic:.3f}s")

$ docker run --cpus 2 -ti -v `pwd`:/io continuumio/miniconda3 bash
(base) # conda install -y numpy tbb
(base) # MKL_THREADING_LAYER=tbb python /io/oversubscribe_tbb.py     
one eig, shape=(1000, 1000): 20.227s

With sequential execution, or OpenMP with an appropriately configured environment, the problem disappears:

(base) # MKL_THREADING_LAYER=sequential python /io/oversubscribe_tbb.py
Calling np.linalg.eig shape=(1000, 1000): 1.636s
(base) # MKL_THREADING_LAYER=omp OMP_NUM_THREADS=2 python /io/oversubscribe_tbb.py
Calling np.linalg.eig shape=(1000, 1000): 1.484s

Of course, if OpenMP is used without setting OMP_NUM_THREADS to match the Docker CPU quota, one also gets a similar over-subscription problem as encountered with TBB:

(base) # MKL_THREADING_LAYER=omp python /io/oversubscribe_tbb.py     
Calling np.linalg.eig shape=(1000, 1000): 22.703s

Edit: the first version of this report mentioned MKL_THREADING_LAYER=omp instead of MKL_THREADING_LAYER=tbb in the first command (with duration 20.227s). I confirm that we also get 20s+ with MKL_THREADING_LAYER=tbb.

@ogrisel
Author

ogrisel commented Oct 25, 2019

Libraries version numbers:

(base) # conda list mkl
# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
mkl                       2019.4                      243  
mkl-service               2.3.0            py37he904b0f_0  
mkl_fft                   1.0.14           py37ha843d7b_0  
mkl_random                1.1.0            py37hd6b4f25_0  
(base) # conda list tbb
# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
tbb                       2019.8               hfd86e86_0 

@xvallspl

xvallspl commented Jan 6, 2020

Hi, we ran into the same problem several weeks ago.

Sadly, the IPC solution described in the PR referenced above is not an option for us in terms of performance, and we had to resort to reading cgroup files (https://github.com/root-project/root/blob/a7495ae4f697f9bf285835f004af3f14f330b0eb/core/imt/src/TPoolManager.cxx#L32).

However, we find it hard to believe that we are the only ones running into this problem, given how widespread the virtualization of hardware resources is nowadays. Is this something the TBB team is thinking of addressing in the near future? Do you see it as something to be figured out on the user side, or do you agree that TBB should take care of it?

@ogrisel
Author

ogrisel commented Jan 17, 2020

One possible solution is to launch your program on a restricted number of CPU cores using the taskset command to match the CPU quotas of the docker container. In this case TBB is able to introspect the affinity constraints and limit the number of threads it spawns to match the available hardware resources.
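
A quick way to see this effect (a hypothetical check, not taken from this thread, assuming TBB 2019 or later) is to print the parallelism TBB would pick by default; launched under taskset, the value follows the affinity mask instead of the full core count:

// check_concurrency.cpp -- hypothetical helper
#include <tbb/task_arena.h>
#include <cstdio>

int main() {
    // Reports the default parallelism for this process, which follows the
    // process affinity mask set by e.g. `taskset -c 0,1 ...`.
    std::printf("TBB default concurrency: %d\n",
                tbb::this_task_arena::max_concurrency());
    return 0;
}

Run under taskset -c 0,1 this should report 2, while without taskset it reports the number of logical CPUs of the host.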

@ogrisel
Author

ogrisel commented Jan 17, 2020

However, I am not sure how to select the CPU ids to make sure that containers running on the same host do not use overlapping affinity masks.

@xvallspl

One possible solution is to launch your program on a restricted number of CPU cores using the taskset command to match the CPU quotas of the docker container. In this case TBB is able to introspect the affinity constraints and limit the number of threads it spawns to match the available hardware resources.

Thanks, @ogrisel. Unfortunately, setting affinity masks is not something we are interested in, except in exceptional cases (for instance, dealing with NUMA domains). For our case, I'd rather read the cgroup files once.

In any case, I am still interested in an answer from the TBB developers on future support for this issue.

@anton-malakhov

@tbbdev please pay attention

@alexey-katranov
Contributor

Can you clarify, please, how TBB could interpret CPU quotas? As I understand it, the mentioned quotas specify the share of CPU time available for a particular application. They do not set the number of threads or affinity constraints. So, it is the OS's responsibility to schedule threads in accordance with the quota.

@xvallspl

xvallspl commented Feb 6, 2020

Hi, @alexey-katranov.

In this context, my suggestion would be to interpret the CPU quota/period ratio to decide the default number of threads. If you read a quota of two times the period, that indicates you can use up to 2 CPUs' worth of runtime for that period, which would map to two threads, assuming they are busy 100% of the time. Non-integer ratios could be rounded up to the nearest integer.

A common example nowadays: you have 2 containers sharing a node with 8 cores, both of them running multithreaded workloads. These containers aren't pinned to particular CPUs but are assigned a fraction of the bandwidth of the machine, using CFS Bandwidth Control (for instance, launching the containers with Docker's --cpus option).

The scheduler implements this by running all threads of a control group for a fraction (cfs_quota) of an execution period (cfs_period), and once they have used up the quota, it stops them. They remain stopped until the start of the next period. The problem arises when, in multithreaded workloads, the quota/period ratio is lower than the number of logical cores (TBB's default). For instance, if on this 8-core machine the cfs_quota is 800ms and the period is 200ms, we get a quota/period ratio of 4. This means we can use up to 4 CPUs' worth of runtime. To use this quota efficiently, it would be great if we spawned 4 threads in each container, but TBB will spawn 8. In this case, each thread will run on a different CPU, but only for 50% of an execution period, as we can only use up to 4 CPUs' worth of runtime. After half of the period, all 8 threads will be yanked out of execution by the operating system and made to wait until they can run again during the next period. These context switches turn out to be very costly.

Just scale up the above example to a machine with 100 logical cores, and each container gets assigned two CPUs worth of quota. Now, 100 threads will be allowed to use as much CPU as two threads at full speed. That means that they constantly have to be switched out and switched in, and they are waiting 98% of the cfs_period. That's not even counting the overhead due to context-switching 100 threads.

@xvallspl

Hi, @alexey-katranov

Didn't get an answer on future support for this issue :)

@jeremyong

You have tbb::global_control, so why not observe the quota yourself, pass the desired concurrency limit as a command line argument/environment variable and just set it yourself?
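
A minimal sketch of that suggestion (MY_APP_NUM_THREADS is a made-up, application-level variable here, not something TBB reads on its own):

// set_limit.cpp -- hypothetical sketch of an application-side limit
#include <tbb/global_control.h>
#include <tbb/parallel_for.h>
#include <cstdlib>
#include <memory>

int main() {
    int limit = 0;
    if (const char* env = std::getenv("MY_APP_NUM_THREADS"))
        limit = std::atoi(env);

    // Constrain TBB only if the variable was set to something sensible;
    // the limit holds for the lifetime of the global_control object.
    std::unique_ptr<tbb::global_control> gc;
    if (limit > 0)
        gc.reset(new tbb::global_control(
            tbb::global_control::max_allowed_parallelism, limit));

    // The rest of the application then uses at most `limit` threads.
    tbb::parallel_for(0, 1000, [](int) { /* work */ });
    return 0;
}

The deployment configuration would then set, e.g., MY_APP_NUM_THREADS=2 to match the container quota, but this does require control over the application code, which is the caveat raised below.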

@alexey-katranov
Contributor

@jeremyong, as I understand, the issue is about deploying an existing TBB-based application, so there is no possibility to recompile it.
@xvallspl, I am thinking about the downside that TBB cannot utilize the whole system if other containers do not use their quotas. E.g., if we have two containers with equal quotas but only one container requires parallelism (e.g. with TBB), we will have only about half of the possible utilization. Is that expected?

@ogrisel
Author

ogrisel commented Mar 20, 2020

You have tbb::global_control, so why not observe the quota yourself, pass the desired concurrency limit as a command line argument/environment variable and just set it yourself?

Would it be possible to have an environment variable to set the equivalent of tbb::global_control? This would make it possible for the sysadmins / devops engineers in charge of the Kubernetes / Docker deployment configuration to tune their services for their production infrastructure without having to change the code of the applications they deploy.

@ogrisel
Author

ogrisel commented Mar 20, 2020

@xvallspl, I am thinking about the downside that TBB cannot utilize the whole system if other containers do not use their quotas. E.g., if we have two containers with equal quotas but only one container requires parallelism (e.g. with TBB), we will have only about half of the possible utilization. Is that expected?

If you have 2 apps (A & B) deployed in containers and each app/container has a quota of 50% of the total CPUs of the host, app A will never be able to use more than 50% of the CPUs, even if B is idle.

So I don't see how the solution proposed by @xvallspl could be detrimental.

@alexey-katranov
Contributor

alexey-katranov commented Mar 20, 2020

app A will never be able to use more than 50% of the CPUs, even if B is idle

Is it desired behavior that an application cannot utilize more than 50% of the CPU? At first glance, it seems inefficient.

@ogrisel
Author

ogrisel commented Mar 20, 2020

Is it desired behavior that an application cannot utilize more than 50% of the CPU? At first glance, it seems inefficient.

Yes, because these can be different services running concurrently in different containers on the same host, and you don't want one service to be able to degrade the performance of another.

In compute-intensive scenarios, you could have many Spark / Dask workers, each allocated 4 CPU threads via CFS quotas, running on a cluster of big machines with 48 physical cores each. The containers would be scheduled by a Spark or Dask scheduler/orchestrator that talks to Kubernetes to dynamically allocate more or fewer workers based on the cluster-wide load (pending tasks).

So it's the job of the Spark / Dask scheduler to allocate and release resources efficiently, and of Kubernetes to pack pods / Docker containers onto the physical machines (and possibly to talk to the underlying cloud infrastructure to dynamically provision or release machines to be part of the cluster). But the workers should not trigger over-subscription by each trying to use 48 threads when they are only allowed to use 4 CPUs each.

@alexey-katranov
Contributor

Yes, because these can be different services running concurrently in different containers on the same host, and you don't want one service to be able to degrade the performance of another.

Sure, I do not want one service to degrade the performance of another. If each service can utilize its quota, it is OK, but my concern is: what if one service needs more while another service does not need all of its resources? In that case, can the OS increase the quota allocation for the first service for some time? Or can the quota mechanism lead to system under-utilization if some process does not use its quota?

@xvallspl

@alexey-katranov The problem you are describing has to be solved when configuring the quota. Once the quota has been set, your service won't be able to go over it, regardless of whether it is executed across multiple processors or not.

As @ogrisel pointed out, this is a desired, intended behaviour. It can lead to resource under-utilization, but that's not the point. The idea behind CPU bandwidth control is to limit the amount of resources a task or a group of tasks can consume.

Sorry for the late reply, I missed this conversation.

@ogrisel
Author

ogrisel commented Oct 2, 2020

For reference, loky (an alternative to the Python standard library's concurrent.futures) is CPU-quota aware, using the following logic:

https://github.com/joblib/loky/blob/f15594a44420abfa9b398be7ff3c9180c2858bf4/loky/backend/context.py#L181-L195

@arunparkugan

@alexey-katranov is this issue still relevant?
