Make it possible to limit memory usage of processes #3055

Open
pcmoritz opened this issue Oct 13, 2018 · 8 comments
Labels
enhancement (Request for new feature and/or capability), k8s, P3 (Issue moderate in impact or severity)

Comments

@pcmoritz
Contributor

pcmoritz commented Oct 13, 2018

Currently, if the worker processes take up too much memory, the Linux OOM killer may kill seemingly arbitrary processes, which surfaces to the user as random failures.

We should investigate how to limit the memory consumption of workers and actors (hopefully there is a way to impose a limit on the combined memory consumption of multiple processes) and notify the user if the memory consumption of their custom code goes over the limit.

(I just ran into this with one of our users.)

@mitar
Member

mitar commented Mar 6, 2019

Yes, we would need this as well. We want to evaluate our AutoML system under various conditions: how well it works with various sizes of search space, what the best pipeline is that it can find given a time limit, and also what the best pipeline is that it can find given a memory limit. For example, being allowed to construct pipelines that take 20 GB of RAM can lead to different solutions than being restricted to at most 1 GB of RAM.

To me it seems the best way to achieve this is to limit how much memory a worker can consume. The Python resource module seems to provide a way to set this (a rough sketch follows this comment).

What is tricky here is deciding how large this limit should be. Is it total_limit / number_of_workers? total_limit / number_of_workers_on_a_node? (total_limit - plasma_limit) / number_of_workers?

Should all workers get an equal memory chunk, or should we just make sure that all workers on a node together do not use more memory than the total? Do we want to require a memory limit per job (we could set the memory limit before starting a job and remove it afterwards)?

I think setting the limit per job could help with the evaluation I mentioned above. Setting it for all workers together (but not per worker) could help with another situation: on Kubernetes, if you put a hard resource limit on the container, Kubernetes will kill your pod if it consumes more than that, so your pod has to make sure it does not use more resources, and Ray currently offers no way to configure this. In that case it would be better if all workers together shared one limit, and the worker that is last to cross that limit would get killed (and the job terminated).
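
For illustration, here is a minimal sketch of the per-worker approach using Python's standard resource module; the budget numbers and the (total_limit - plasma_limit) / number_of_workers split are assumptions made up for the example, not values Ray provides.

    import resource

    # Assumed example budget: 16 GB total on the node, 4 GB reserved for the
    # plasma object store, 4 workers per node.
    total_limit = 16 * 1024**3
    plasma_limit = 4 * 1024**3
    number_of_workers = 4
    per_worker_limit = (total_limit - plasma_limit) // number_of_workers

    # Cap this worker's address space; allocations beyond the cap raise
    # MemoryError inside the worker instead of inviting the kernel OOM killer.
    resource.setrlimit(resource.RLIMIT_AS, (per_worker_limit, per_worker_limit))

Note that RLIMIT_AS limits virtual memory rather than resident memory, so in practice the cap may need to be set more generously than the physical budget.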

@mitar
Member

mitar commented Mar 7, 2019

One solution I am now trying is setting /proc/<pid>/oom_score_adj to 1000 for all Ray workers, which makes the workers the processes most likely to be killed by the OOM killer. Because I run the Ray workers inside one Docker container, I can then set a memory limit (cgroup) on the whole Docker container, so the OOM killer is invoked when the container goes over the limit but kills workers before anything else in the container.

Based on simple testing, it seems to work well.
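
A rough sketch of this approach (assuming psutil is installed and that worker processes can be recognized by their command line, which is only an assumption for illustration; the hard cap itself comes from the container's cgroup, e.g. docker run --memory=8g):

    import psutil

    # Make every Ray worker the OOM killer's preferred victim by raising its
    # oom_score_adj to the maximum value (1000). The command-line match below
    # is a guess and may need adjusting for your setup.
    for proc in psutil.process_iter(["pid", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "ray" in cmdline and "worker" in cmdline:
            try:
                path = "/proc/{pid}/oom_score_adj".format(pid=proc.info["pid"])
                with open(path, "w") as oom_score_adj_file:
                    oom_score_adj_file.write("1000")
            except (PermissionError, FileNotFoundError):
                pass  # process already exited or we lack permission

The oom_score_adj change only steers the kernel's choice of victim toward the workers; the Docker/cgroup limit is what actually enforces the memory cap.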

@robertnishihara
Collaborator

Nice, maybe we should add a flag so that this can be done automatically.

@mitar
Member

mitar commented Mar 7, 2019

    import os
    # Mark the current process as the most likely OOM-kill victim.
    with open('/proc/{pid}/oom_score_adj'.format(pid=os.getpid()), 'w') as oom_score_adj_file:
        oom_score_adj_file.write('1000')

@stale

stale bot commented Nov 15, 2020

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the stale label Nov 15, 2020
@mitar
Member

mitar commented Nov 16, 2020

Unstale.

@stale stale bot removed the stale label Nov 16, 2020
@stale

stale bot commented Mar 16, 2021

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the stale label Mar 16, 2021
@mitar
Member

mitar commented Mar 16, 2021

Unstale.

@stale stale bot removed the stale label Mar 16, 2021
@rkooo567 rkooo567 added the enhancement and P3 labels Mar 31, 2021
@richardliaw richardliaw added the k8s label Jul 8, 2021

5 participants