Make it possible to limit memory usage of processes #3055
Comments
Yes, we would need this as well. We want to evaluate our AutoML system under various conditions: how well it works with various sizes of search space, what the best pipeline it can find is given a time limit, but also what it can find given a memory limit. For example, being allowed to construct pipelines that take 20 GB of RAM can lead to different solutions than being restricted to at most 1 GB of RAM. To me it seems the best way to achieve this is to limit how much memory a worker can consume; the resource module seems to provide a way to set this.

What is tricky here is deciding what the limit should be. Should all workers get an equal memory chunk, or should we just make sure that all workers together on a node do not use more than some total? Do we want to support a memory limit per job (we could set the memory limit before starting a job, and remove it afterwards)? I think setting it per job could help with the evaluation I mentioned above. Setting it for all workers together (but not per worker) could help with another situation: on Kubernetes, if you put a hard resource limit on the container, Kubernetes will kill your pod if it consumes more resources. So your pod has to make sure it does not use more than that, and Ray currently provides no way to set this. In that case it would be better if all workers together shared one limit, and then probably the worker that is last to cross that limit would get killed (and its job terminated).
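A minimal sketch of the per-worker idea, assuming Python's standard resource module (mentioned above); the 1 GB cap and the helper name are illustrative and not part of Ray's API:

import resource

ONE_GB = 1 * 1024 ** 3  # illustrative cap; a real flag would make this configurable

def limit_worker_memory(max_bytes=ONE_GB):
    # Tighten only the soft address-space limit of the current process so that
    # allocations beyond it raise MemoryError instead of triggering the kernel
    # OOM killer. RLIMIT_AS bounds virtual memory, not resident memory, so the
    # cap is approximate. Unix only.
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        max_bytes = min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))

Such a helper would have to run inside each worker process, before any user code executes.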
One solution I am now trying is setting oom_score_adj for the worker process; based on simple testing it seems to work well:

import os

# Mark the current process as the preferred victim of the Linux OOM killer.
with open('/proc/{pid}/oom_score_adj'.format(pid=os.getpid()), 'w') as oom_score_adj_file:
    oom_score_adj_file.write('1000')

Nice, maybe we should add a flag to have an option to do that automatically.
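A hedged sketch of what such an opt-in flag could look like when run at worker startup; the environment variable name below is invented for illustration and is not an existing Ray setting:

import os

# Only raise the OOM score when the (hypothetical) flag is set.
if os.environ.get('RAY_WORKER_PREFER_OOM_KILL') == '1':
    try:
        with open('/proc/{pid}/oom_score_adj'.format(pid=os.getpid()), 'w') as f:
            f.write('1000')  # 1000 = most likely victim for the kernel OOM killer
    except OSError:
        pass  # not on Linux, or /proc is unavailable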
Hi, I'm a bot from the Ray team :) To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months. If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.

Unstale.

Hi, I'm a bot from the Ray team :) To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months. If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.

Unstale.
Currently it is possible that if the worker processes take up too much memory, the Linux OOM killer will kill random processes, which looks like random failures.
We should investigate how to limit the memory consumption of workers and actors (hopefully there is a way to impose a limit on the sum of the memory consumption of multiple processes) and notify the user if the memory consumption of custom code goes over that limit.
(I was just running into this with one of our users)
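A rough sketch of the group-limit idea, assuming a monitoring process that knows the worker PIDs; it uses the third-party psutil package and a hypothetical enforce_group_limit helper, polling the combined resident memory and killing the largest consumer with a clear message instead of letting the kernel OOM killer pick at random:

import os
import signal

import psutil  # third-party; assumed available for this sketch

def rss_by_pid(pids):
    # Resident memory (bytes) of each live worker process.
    sizes = {}
    for pid in pids:
        try:
            sizes[pid] = psutil.Process(pid).memory_info().rss
        except psutil.NoSuchProcess:
            continue
    return sizes

def enforce_group_limit(pids, limit_bytes):
    # If the workers together exceed the limit, report and kill the largest
    # consumer rather than leaving the choice to the kernel.
    sizes = rss_by_pid(pids)
    if sizes and sum(sizes.values()) > limit_bytes:
        victim = max(sizes, key=sizes.get)
        print('Workers use %d bytes (limit %d); killing worker %d'
              % (sum(sizes.values()), limit_bytes, victim))
        os.kill(victim, signal.SIGKILL)

Calling enforce_group_limit(worker_pids, limit) periodically from a monitor loop would surface a clear error to the user before the kernel steps in.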