Make it possible to limit memory usage of processes #3055

Open
pcmoritz opened this issue Oct 13, 2018 · 8 comments
Labels
enhancement (Request for new feature and/or capability), k8s, P3 (Issue moderate in impact or severity)

Comments

@pcmoritz
Contributor

pcmoritz commented Oct 13, 2018

Currently, if the worker processes take up too much memory, the Linux OOM killer may kill seemingly arbitrary processes, which surfaces to the user as random failures.

We should investigate how to limit the memory consumption of workers and actors (hopefully there is a way to impose a limit on the combined memory consumption of multiple processes) and notify the user if the memory consumption of their custom code goes over the limit.

(I just ran into this with one of our users.)

@mitar
Member

mitar commented Mar 6, 2019

Yes, we would need this as well. We want to evaluate our AutoML system under various conditions: how well it works with various sizes of search space, what the best pipeline is that it can find given a time limit, and also what the best pipeline is that it can find given a memory limit. For example, being allowed to construct pipelines that take 20 GB of RAM can lead to different solutions than being restricted to at most 1 GB of RAM.

To me it seems the best way to achieve this is to limit how much memory a worker can consume. The Python resource module seems to provide a way to set this (a rough sketch follows this comment).

What is tricky here is deciding how large this limit should be. Is it total_limit / number_of_workers? total_limit / number_of_workers_on_a_node? (total_limit - plasma_limit) / number_of_workers?

Should all workers get an equal memory chunk, or should we just make sure that all workers on a node together do not use more memory than the total? Do we want to require a memory limit per job (we could set the memory limit before starting a job and remove it afterwards)?

I think setting the limit per job could help with the evaluation I mentioned above. Setting it for all workers together (but not per worker) could help with another situation: on Kubernetes, if you put a hard resource limit on the container, Kubernetes will kill your pod if it consumes more than that, so your pod has to make sure it does not use more resources, and Ray currently offers no way to configure this. In that case it would be better if all workers together shared one limit, and the worker that is last to cross that limit would get killed (and the job terminated).
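
For illustration, here is a minimal sketch of the per-worker approach using Python's standard resource module; the budget numbers and the (total_limit - plasma_limit) / number_of_workers split are assumptions made up for the example, not values Ray provides.

    import resource

    # Assumed example budget: 16 GB total on the node, 4 GB reserved for the
    # plasma object store, 4 workers per node.
    total_limit = 16 * 1024**3
    plasma_limit = 4 * 1024**3
    number_of_workers = 4
    per_worker_limit = (total_limit - plasma_limit) // number_of_workers

    # Cap this worker's address space; allocations beyond the cap raise
    # MemoryError inside the worker instead of inviting the kernel OOM killer.
    resource.setrlimit(resource.RLIMIT_AS, (per_worker_limit, per_worker_limit))

Note that RLIMIT_AS limits virtual memory rather than resident memory, so in practice the cap may need to be set more generously than the physical budget.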

@mitar
Member

mitar commented Mar 7, 2019

One solution I am now trying is setting /proc/<pid>/oom_score_adj to 1000 for all Ray workers, which makes the workers the processes most likely to be killed by the OOM killer. Because I run the Ray workers inside one Docker container, I can then set a memory limit (cgroup) on the whole Docker container, so the OOM killer is invoked when the container goes over the limit but kills workers before anything else in the container.

Based on simple testing, it seems to work well.
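
A rough sketch of this approach (assuming psutil is installed and that worker processes can be recognized by their command line, which is only an assumption for illustration; the hard cap itself comes from the container's cgroup, e.g. docker run --memory=8g):

    import psutil

    # Make every Ray worker the OOM killer's preferred victim by raising its
    # oom_score_adj to the maximum value (1000). The command-line match below
    # is a guess and may need adjusting for your setup.
    for proc in psutil.process_iter(["pid", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "ray" in cmdline and "worker" in cmdline:
            try:
                path = "/proc/{pid}/oom_score_adj".format(pid=proc.info["pid"])
                with open(path, "w") as oom_score_adj_file:
                    oom_score_adj_file.write("1000")
            except (PermissionError, FileNotFoundError):
                pass  # process already exited or we lack permission

The oom_score_adj change only steers the kernel's choice of victim toward the workers; the Docker/cgroup limit is what actually enforces the memory cap.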

@robertnishihara
Collaborator

Nice, maybe we should add a flag so that this can be done automatically.

@mitar
Member

mitar commented Mar 7, 2019

    import os
    # Mark the current process as the most likely OOM-kill victim.
    with open('/proc/{pid}/oom_score_adj'.format(pid=os.getpid()), 'w') as oom_score_adj_file:
        oom_score_adj_file.write('1000')

@stale

stale bot commented Nov 15, 2020

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the stale label Nov 15, 2020
@mitar
Member

mitar commented Nov 16, 2020

Unstale.

@stale stale bot removed the stale label Nov 16, 2020
@stale

stale bot commented Mar 16, 2021

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the stale label Mar 16, 2021
@mitar
Member

mitar commented Mar 16, 2021

Unstale.

@stale stale bot removed the stale label Mar 16, 2021
@rkooo567 rkooo567 added the enhancement and P3 labels Mar 31, 2021
@richardliaw richardliaw added the k8s label Jul 8, 2021

5 participants