[core] (cgroups 11/n) Raylet will move system processes into cgroup on startup #56522
Conversation
Commits:
- … to perform cgroup operations.
- … instead of clone for older kernel headers < 5.7 (which is what we have in CI).
- Fix CI.
@edoakes I addressed all of your comments except for the RAY_CHECK. I'll leave that as a TODO. There are a few more in cgroups 12/n that need to be upgraded too.
I've kicked off MacOS and Windows tests to be extra super duper ultra sure that post-merge won't break. Let's wait for them to pass, and I'll ping for merge.
CI post-merge is in bad shape, but all failures are unrelated. @edoakes this should be good to merge.
CI is a little too red for me to be comfortable merging this; I don't want to get in the habit of force merging. Let's hold off until the test issues are resolved.
Test failure: https://buildkite.com/ray-project/premerge/builds/49657#019977d7-530a-4b78-8296-3e948219c6c0/179-5349 -- I don't see it on the tracker; is it relevant?
…cation cgroup (#56549)

This PR stacks on #56522. For more details about the resource isolation project, see #54703. This PR makes the raylet move the runtime_env and dashboard agents into the system cgroup. Workers are now spawned inside the application cgroup. It introduces the following:
* I've added a new target, `raylet_cgroup_types`, which defines the type used by all functions that need to add a process to a cgroup.
* A new parameter is added to the `NodeManager`, `WorkerPool`, `AgentManager`, and `Process` constructors. The parameter is a callback that uses the CgroupManager to add a process to the respective cgroup.
* The callback is created in `main.cc`.
* `main.cc` owns the CgroupManager because it needs to outlive the `WorkerPool`.
* `process.cc` calls the callback after fork() in the child process, so nothing else can happen in the forked process before it's moved into the correct cgroup (see the sketch after this list).
* Integration tests in Python for end-to-end testing of cgroups, with system and application processes moved into their respective cgroups. The tests are inside `python/ray/tests/resource_isolation/test_resource_isolation_integration.py` and have similar setup/teardown to the C++ integration tests introduced in #55063.
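To make the fork-then-move ordering concrete, here is a rough Python analogue of that pattern; the real logic is C++ in the raylet, and the helper name and cgroup path below are hypothetical. The point is that the child joins its cgroup before exec(), so the target program never runs outside the cgroup.

```python
# Illustrative sketch only (the raylet does this in C++): fork a child, move it
# into a cgroup, then exec the target program. Assumes a Linux host with
# cgroup v2 and enough privileges to write the child's pid into cgroup.procs.
import os


def spawn_in_cgroup(argv, cgroup_procs_path):
    pid = os.fork()
    if pid == 0:
        # Child: join the cgroup first, then exec the target program.
        with open(cgroup_procs_path, "w") as f:
            f.write(str(os.getpid()))
        os.execvp(argv[0], argv)
    # Parent: return the child's pid for bookkeeping.
    return pid


if __name__ == "__main__":
    child = spawn_in_cgroup(
        ["sleep", "5"], "/sys/fs/cgroup/ray_node/application/cgroup.procs"
    )
    os.waitpid(child, 0)
```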
This PR stacks on #56352. For more details about the resource isolation project, see #54703.

This PR makes the raylet move the system processes into the system cgroup on startup if resource isolation is enabled. It introduces the following:
* A new raylet CLI arg, `--system-pids`, which is a comma-separated string of pids of system processes that are started before the raylet (see the sketch after this description). As of today, it contains:
  * On the head node: gcs_server, dashboard_api_server, ray client server, monitor (autoscaler).
  * On every node (including the head node): process subreaper, log monitor.
* End-to-end integration tests for resource isolation with the Ray SDK (`ray.init`) and the Ray CLI (`ray --start`).

There are a few rough edges (I've added a comment on the PR where relevant):
1. The construction of ResourceIsolationConfig is spread across multiple call-sites (create the object, add the object store memory, add the system pids). The big positive of doing it this way was to fail fast on invalid user input (in scripts.py and worker.py). I think it needs to have at least two components: the user input (cgroup_path, system_reserved_memory, ...) and the derived input (system_pids, total_system_reserved_memory).
2. How should we determine which processes to move? Right now I'm using `self.all_processes` in `node.py`. It _should_ contain all processes started so far, but there's no guarantee.
3. How intrusive should the integration test be? Should we count the number of pids inside the system cgroup? (This was answered in #56549.)
4. How should a user set up multiple nodes on the same VM? I haven't written an integration test for it yet because there are multiple options for how to set this up.
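For illustration, here is a minimal, hypothetical Python sketch of the mechanism the `--system-pids` flag ultimately relies on: on cgroup v2, a process is moved into a cgroup by writing its pid into that cgroup's `cgroup.procs` file. This is not the raylet's actual implementation (that lives in the C++ cgroup code); the function name and cgroup path below are made up for the example.

```python
# Hypothetical sketch: move a comma-separated list of pids (as passed via
# --system-pids) into a cgroup v2 "system" cgroup. Assumes the cgroup already
# exists and the caller has permission to write its cgroup.procs file.
import os


def move_system_pids(system_pids: str, system_cgroup: str) -> None:
    procs_file = os.path.join(system_cgroup, "cgroup.procs")
    for pid in (p.strip() for p in system_pids.split(",") if p.strip()):
        # cgroup v2 moves one process per write to cgroup.procs.
        with open(procs_file, "w") as f:
            f.write(pid)


if __name__ == "__main__":
    # Example invocation; the cgroup path is illustrative.
    move_system_pids("1234,5678", "/sys/fs/cgroup/ray_node/system")
```

An integration test in the spirit of rough edge 3 could then read `cgroup.procs` back and count the pids it finds there.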