Merged
183 commits
05c4dbc
[core] (cgroups 1/n) Adding a sys/fs filesystem driver
israbbani Jul 24, 2025
645f9a0
adding the copyright
israbbani Jul 24, 2025
2bb2c5b
Adding a fallback for creating processes inside cgroups with fork/exec
israbbani Jul 24, 2025
4793094
adding a pause in the tests to see what's up with the container
israbbani Jul 25, 2025
85d0ebf
Update src/ray/common/cgroup2/cgroup_driver_interface.h
israbbani Jul 25, 2025
3a5a020
Comments
israbbani Jul 25, 2025
68b0c93
Merge branch 'irabbani/cgroups-1' of github.com:ray-project/ray into …
israbbani Jul 25, 2025
f52354b
Putting the cgroupv2 tests into a separate target
israbbani Jul 29, 2025
148d04d
removing test sleep
israbbani Jul 29, 2025
d3f8b79
Removing a docstring
israbbani Jul 29, 2025
d76ff15
enabling CI tests
israbbani Jul 29, 2025
2798ea5
fixing absl imports
israbbani Jul 29, 2025
3fda505
commenting local
israbbani Jul 29, 2025
9e1e931
doxygen formatting
israbbani Jul 29, 2025
f066f34
Merge branch 'master' into irabbani/cgroups-1
israbbani Jul 30, 2025
e6b4926
removing integration tests
israbbani Jul 30, 2025
f4e0cb2
final cleanup
israbbani Jul 30, 2025
544ba83
iwyu
israbbani Jul 30, 2025
669ba99
Merge branch 'master' into irabbani/cgroups-1
israbbani Jul 30, 2025
2e341d6
we cpplintin!
israbbani Jul 30, 2025
9e46ce6
Update src/ray/common/cgroup2/sysfs_cgroup_driver.cc
israbbani Jul 30, 2025
7c745c5
Apply suggestions from code review
israbbani Jul 30, 2025
d7eb863
bug
israbbani Jul 30, 2025
ff64534
Merge branch 'irabbani/cgroups-1' of github.com:ray-project/ray into …
israbbani Jul 30, 2025
da4b475
[core] Integration tests for SysFsCgroupDriver.
israbbani Jul 30, 2025
37e205f
Cleaning up cgroup_test_utils and attempting to
israbbani Jul 30, 2025
7b83932
broken
israbbani Jul 31, 2025
b911d25
up
israbbani Jul 31, 2025
63506dc
upup
israbbani Jul 31, 2025
e6f1ae9
Merge branch 'master' into irabbani/cgroups-2
israbbani Jul 31, 2025
ead9de1
up
israbbani Jul 31, 2025
d0bcf4d
Adding shell scripts to do cgroup setup/teardown
israbbani Aug 27, 2025
08c36d8
Merge branch 'master' into irabbani/cgroups-2
israbbani Aug 27, 2025
758955a
Merged and fixed a few issues
israbbani Aug 27, 2025
e59ac62
fixing test target for CI
israbbani Aug 27, 2025
8866592
maybe this will trigger tests
israbbani Aug 27, 2025
5364a1d
runforever
israbbani Aug 27, 2025
c77e1f8
up
israbbani Aug 27, 2025
fe54541
up
israbbani Aug 27, 2025
67b21d5
cleaning up todos and docs
israbbani Aug 27, 2025
6e6bc32
one more
israbbani Aug 27, 2025
c399d45
adding separate target for unit tests now
israbbani Aug 27, 2025
2cb4f6e
typo
israbbani Aug 27, 2025
dd25a97
come unit and integration test targets
israbbani Aug 27, 2025
4a95598
missing flag
israbbani Aug 27, 2025
cc51788
plz work
israbbani Aug 28, 2025
d31eb1a
one more
israbbani Aug 28, 2025
d43a5d3
[core] Adding CgroupManager to create, modify, and delete the cgroup
israbbani Sep 3, 2025
a458406
disabling cgroup test
israbbani Sep 3, 2025
01023b9
Addressing feedback
israbbani Sep 3, 2025
bb5d866
ci change
israbbani Sep 3, 2025
f4a8553
Begrudingly using the random id generator from id.h
israbbani Sep 3, 2025
17d1008
instructions for running locally
israbbani Sep 3, 2025
3423eab
adding instructions to run locally
israbbani Sep 3, 2025
5357ea3
Merge branch 'master' into irabbani/cgroups-2
israbbani Sep 3, 2025
1ecfdda
Merge branch 'irabbani/cgroups-2' into irabbani/cgroups-3
israbbani Sep 3, 2025
17c07da
Cleaning up comments
israbbani Sep 3, 2025
e044fcd
fixing ci
israbbani Sep 3, 2025
b59dbc4
Merge branch 'irabbani/cgroups-2' of github.com:ray-project/ray into …
israbbani Sep 3, 2025
13eee38
Merge branch 'irabbani/cgroups-2' into irabbani/cgroups-3
israbbani Sep 3, 2025
f698183
ci
israbbani Sep 3, 2025
3b37051
Removing the no_windows tags and replacing it with the bazel
israbbani Sep 3, 2025
946ec90
Merge branch 'irabbani/cgroups-3' of github.com:ray-project/ray into …
israbbani Sep 4, 2025
ca63baa
Merge branch 'irabbani/cgroups-2' into irabbani/cgroups-3
israbbani Sep 4, 2025
0fe9113
Merge branch 'master' into irabbani/cgroups-3
israbbani Sep 4, 2025
398ef88
[core] cgroups (4/n) adding constraint bounds checking to the
israbbani Sep 4, 2025
ca83426
Merge branch 'master' into irabbani/cgroups-3
israbbani Sep 4, 2025
f7f04db
Merge branch 'irabbani/cgroups-3' into irabbani/cgroups-4
israbbani Sep 4, 2025
dfd9b07
Build with clang to find bugs locally!
israbbani Sep 4, 2025
1884da5
Merge branch 'irabbani/cgroups-4' of github.com:ray-project/ray into …
israbbani Sep 4, 2025
e0bbac8
[core] (cgroups 5/n) Adding methods the following methods to
israbbani Sep 4, 2025
2457558
Merge branch 'master' into irabbani/cgroups-3
israbbani Sep 4, 2025
36101f4
[core] (cgroups 6/n) CgroupManager cleans up the entire cgroup hierarchy
israbbani Sep 4, 2025
a145a81
Adding a very long happy path test
israbbani Sep 5, 2025
fc85704
Merge branch 'irabbani/cgroups-3' into irabbani/cgroups-4
israbbani Sep 5, 2025
b5f6c5e
Addressing feedback.
israbbani Sep 5, 2025
03c731e
Merge branch 'master' into irabbani/cgroups-3
edoakes Sep 5, 2025
6fc9652
[core] (cgroups 7/n) cleaning up old cgroup integration code for raylet
israbbani Sep 5, 2025
4aeabf4
Merge branch 'irabbani/cgroups-3' into irabbani/cgroups-4
israbbani Sep 5, 2025
44a5844
Merge branch 'irabbani/cgroups-4' into irabbani/cgroups-5
israbbani Sep 5, 2025
4de334c
Merge branch 'irabbani/cgroups-5' into irabbani/cgroups-6
israbbani Sep 5, 2025
70ed06d
Merge branch 'irabbani/cgroups-6' into irabbani/cgroups-7
israbbani Sep 5, 2025
89f49e5
cleaning up and adding comments
israbbani Sep 5, 2025
dd0bf98
Merge branch 'irabbani/cgroups-6' of github.com:ray-project/ray into …
israbbani Sep 5, 2025
fd39d7e
removing a few more references
israbbani Sep 5, 2025
4e8e20c
Merge branch 'irabbani/cgroups-7' of github.com:ray-project/ray into …
israbbani Sep 5, 2025
9304014
[core] (cgroups 8/n) Wiring CgroupManager into the raylet. Creating
israbbani Sep 5, 2025
eee8982
Merge branch 'master' into irabbani/cgroups-5
israbbani Sep 6, 2025
20bf0dd
Merge branch 'irabbani/cgroups-6' into irabbani/cgroups-7
israbbani Sep 6, 2025
699f3ba
Adding better error messages for creating the CgroupManager.
israbbani Sep 6, 2025
58e0c1a
Removing node_manager configs
israbbani Sep 6, 2025
e2e957d
unnecessary comment
israbbani Sep 6, 2025
9e7a2ef
Merge branch 'irabbani/cgroups-7' into irabbani/cgroups-8
israbbani Sep 6, 2025
fd9ef0d
Merge branch 'irabbani/cgroups-5' into irabbani/cgroups-6
israbbani Sep 6, 2025
f86a010
Merge branch 'irabbani/cgroups-6' into irabbani/cgroups-7
israbbani Sep 8, 2025
29a7ae4
Merge branch 'irabbani/cgroups-7' into irabbani/cgroups-8
israbbani Sep 8, 2025
1c17e3f
Merge branch 'master' into irabbani/cgroups-6
israbbani Sep 8, 2025
a843dd4
Merge branch 'irabbani/cgroups-6' into irabbani/cgroups-7
israbbani Sep 8, 2025
43a180e
Merge branch 'irabbani/cgroups-7' into irabbani/cgroups-8
israbbani Sep 8, 2025
4c3469b
[core] (cgroups 9/n) End-to-end integration of ray start with cgroups
israbbani Sep 8, 2025
b577600
Merge branch 'irabbani/cgroups-8' into irabbani/cgroups-9
israbbani Sep 8, 2025
7ca430b
fixing ci testing arg
israbbani Sep 8, 2025
7583fae
Merge branch 'irabbani/cgroups-9' of github.com:ray-project/ray into …
israbbani Sep 8, 2025
a3164a4
bad merge
israbbani Sep 8, 2025
b1d8f39
removing unused test files
israbbani Sep 8, 2025
299040a
accidentally modified the print
israbbani Sep 8, 2025
b80da98
Merge branch 'irabbani/cgroups-8' into irabbani/cgroups-9
israbbani Sep 8, 2025
4a581d7
[core] (cgroups 7/n) cleaning up old cgroup integration code for rayl…
israbbani Sep 9, 2025
df8f925
Merge branch 'master' into irabbani/cgroups-6
israbbani Sep 9, 2025
0059039
Merge branch 'master' into irabbani/cgroups-6
israbbani Sep 9, 2025
9342ed5
Merge branch 'irabbani/cgroups-6' into irabbani/cgroups-8
israbbani Sep 9, 2025
5b89821
fixing ci
israbbani Sep 9, 2025
4af9d3f
Merge branch 'irabbani/cgroups-8' of github.com:ray-project/ray into …
israbbani Sep 9, 2025
e386cb5
Merge branch 'master' into irabbani/cgroups-6
israbbani Sep 9, 2025
74612b0
Merge branch 'irabbani/cgroups-6' into irabbani/cgroups-8
israbbani Sep 9, 2025
dcf558b
Merge branch 'irabbani/cgroups-8' into irabbani/cgroups-9
israbbani Sep 9, 2025
32a96f2
bad merge
israbbani Sep 9, 2025
72a2796
Merge branch 'irabbani/cgroups-9' of github.com:ray-project/ray into …
israbbani Sep 9, 2025
1f67fd1
up
israbbani Sep 9, 2025
69125b9
up
israbbani Sep 9, 2025
247bc40
fixing tests
israbbani Sep 10, 2025
9b759cd
up
israbbani Sep 10, 2025
04cf359
one more
israbbani Sep 10, 2025
1aa14ea
up
israbbani Sep 10, 2025
7b0518b
Missing test fixture inside the conftest
israbbani Sep 11, 2025
1e9b214
fixing node manager test fixture
israbbani Sep 11, 2025
f847c13
[core] Adding support in CgroupManager and CgroupDriver to move process
israbbani Sep 9, 2025
e0bc7ae
Merge branch 'irabbani/cgroups-9' into irabbani/cgroups-10
israbbani Sep 11, 2025
98d6dcd
deleted a fixture
israbbani Sep 11, 2025
06a9e6d
Merge branch 'irabbani/cgroups-9' into irabbani/cgroups-10
israbbani Sep 11, 2025
cc3af83
Merge branch 'master' into irabbani/cgroups-6
israbbani Sep 11, 2025
161dd95
[core] (cgroups 8/n) Wiring CgroupManager into the raylet. (#56297)
israbbani Sep 11, 2025
0db3fa5
plz pass ci
israbbani Sep 11, 2025
a64f258
one more
israbbani Sep 11, 2025
323d0c7
deleting unused log lines
israbbani Sep 11, 2025
e5f77fe
Merge branch 'irabbani/cgroups-6' into irabbani/cgroups-9
israbbani Sep 11, 2025
ab86642
trying without --build-type
israbbani Sep 11, 2025
92bb7ce
Merge branch 'master' into irabbani/cgroups-6
israbbani Sep 11, 2025
d819ace
Adding tests and documentation
israbbani Sep 11, 2025
9533518
Added the manual tag to exclude resource isolation tests from regular
israbbani Sep 11, 2025
7e29e73
Merge branch 'irabbani/cgroups-9' into irabbani/cgroups-10
israbbani Sep 11, 2025
7581d1c
Ignoring return value for EXPECT_DEATH in unit tests
israbbani Sep 11, 2025
7dba5bf
Merge branch 'irabbani/cgroups-10' of github.com:ray-project/ray into…
israbbani Sep 11, 2025
67f025f
Merge branch 'irabbani/cgroups-6' into irabbani/cgroups-9
israbbani Sep 11, 2025
2d64808
Merge branch 'master' into irabbani/cgroups-9
israbbani Sep 12, 2025
2c94f96
fixing format and precommit after merge
israbbani Sep 12, 2025
06f3461
deleting file
israbbani Sep 12, 2025
1329a4e
Merge branch 'master' into irabbani/cgroups-9
israbbani Sep 12, 2025
64d7421
Merge branch 'irabbani/cgroups-9' into irabbani/cgroups-10
israbbani Sep 12, 2025
e54dcb4
Merge branch 'master' into irabbani/cgroups-9
israbbani Sep 12, 2025
7591e52
Merge branch 'irabbani/cgroups-9' into irabbani/cgroups-10
israbbani Sep 12, 2025
9d828f4
Adding documentation. Deleting use of RAY_CHECK_WITH_DISPLAY
israbbani Sep 12, 2025
7cadb90
Merge branch 'irabbani/cgroups-10' of github.com:ray-project/ray into…
israbbani Sep 12, 2025
512c278
Merge branch 'master' into irabbani/cgroups-9
israbbani Sep 14, 2025
67e45f1
Merge branch 'irabbani/cgroups-9' into irabbani/cgroups-10
israbbani Sep 14, 2025
1b52efa
[core] (cgroups 11/n) The raylet will nove move system processes
israbbani Sep 12, 2025
c728e71
Merge branch 'irabbani/cgroups-10' into irabbani/cgroups-11
israbbani Sep 15, 2025
be297f4
Fixing failing tests and cleaning up log messages
israbbani Sep 15, 2025
7ac21fc
dead code
israbbani Sep 15, 2025
5ffa1f6
Merge branch 'master' into irabbani/cgroups-9
israbbani Sep 15, 2025
9831d88
Merge branch 'master' into irabbani/cgroups-9
israbbani Sep 15, 2025
134666b
removing monkeypatch and fixing ci
israbbani Sep 15, 2025
1f5f75f
Merge branch 'irabbani/cgroups-9' of github.com:ray-project/ray into …
israbbani Sep 15, 2025
b61f2cf
ci
israbbani Sep 15, 2025
4e79027
Merge branch 'master' into irabbani/cgroups-9
israbbani Sep 16, 2025
74e93b7
commenting out local test path
israbbani Sep 16, 2025
8638043
Merge branch 'irabbani/cgroups-9' of github.com:ray-project/ray into …
israbbani Sep 16, 2025
fe6cb92
fixing comment
israbbani Sep 16, 2025
c72414d
Merge branch 'irabbani/cgroups-9' into irabbani/cgroups-10
israbbani Sep 16, 2025
bdd2a12
Merge branch 'irabbani/cgroups-10' into irabbani/cgroups-11
israbbani Sep 16, 2025
06b466f
Merge branch 'master' of github.com:ray-project/ray into irabbani/cgr…
israbbani Sep 18, 2025
547210c
feedback
israbbani Sep 18, 2025
ac295a0
more feedback
israbbani Sep 18, 2025
79fc924
Merge branch 'master' into irabbani/cgroups-11
israbbani Sep 18, 2025
fda4446
deleting the comment
israbbani Sep 18, 2025
5142b7a
Merge branch 'irabbani/cgroups-11' of github.com:ray-project/ray into…
israbbani Sep 18, 2025
0144b38
oops
israbbani Sep 18, 2025
d51242b
feedback
israbbani Sep 18, 2025
30326a8
hmm
israbbani Sep 18, 2025
d40a91a
Merge branch 'master' into irabbani/cgroups-11
israbbani Sep 19, 2025
7ef0cb6
Merge branch 'master' into irabbani/cgroups-11
israbbani Sep 23, 2025
93d0473
feedback
israbbani Sep 23, 2025
7738c28
Merge branch 'master' into irabbani/cgroups-11
israbbani Sep 23, 2025
13 changes: 13 additions & 0 deletions python/ray/_private/node.py
@@ -1185,6 +1185,10 @@ def start_raylet(
create_err=True,
)

self.resource_isolation_config.add_system_pids(
self._get_system_processes_for_resource_isolation()
)

process_info = ray._private.services.start_raylet(
self.redis_address,
self.gcs_address,
@@ -1427,6 +1431,15 @@ def start_ray_processes(self):

self.start_raylet(plasma_directory, fallback_directory, object_store_memory)

def _get_system_processes_for_resource_isolation(self) -> str:
"""Returns a list of system processes that will be isolated by raylet.

NOTE: If a new system process is started before the raylet starts up, it needs to be
added to self.all_processes so it can be moved into the raylet's managed cgroup
hierarchy.
"""
return ",".join(str(p[0].process.pid) for p in self.all_processes.values())

def _kill_process_type(
self,
process_type,
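
To make the handoff concrete, here is a minimal sketch (not part of the diff) of what _get_system_processes_for_resource_isolation produces. The SimpleNamespace values are hypothetical stand-ins that only mimic the p[0].process.pid shape used above, not the real ProcessInfo objects in node.py:

from types import SimpleNamespace

# Hypothetical stand-in for self.all_processes: one list of process records
# per process type, where the first record carries the running process.
all_processes = {
    "gcs_server": [SimpleNamespace(process=SimpleNamespace(pid=1001))],
    "log_monitor": [SimpleNamespace(process=SimpleNamespace(pid=1002))],
}

# Same join as _get_system_processes_for_resource_isolation: one pid per
# process type, serialized as a comma-separated string for the raylet.
system_pids = ",".join(str(p[0].process.pid) for p in all_processes.values())
print(system_pids)  # -> "1001,1002"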
6 changes: 5 additions & 1 deletion python/ray/_private/resource_isolation_config.py
@@ -43,10 +43,10 @@
system_reserved_cpu: Optional[float] = None,
system_reserved_memory: Optional[int] = None,
):

self._resource_isolation_enabled = enable_resource_isolation
self.cgroup_path = cgroup_path
self.system_reserved_memory = system_reserved_memory
self.system_pids = ""
# cgroupv2 cpu.weight calculated from system_reserved_cpu
# assumes ray uses all available cores.
self.system_reserved_cpu_weight: int = None
@@ -115,6 +115,10 @@ def add_object_store_memory(self, object_store_memory: int):
)
self._constructed = True

def add_system_pids(self, system_pids: str):
"""A comma-separated list of pids to move into the system cgroup."""
self.system_pids = system_pids

@staticmethod
def _validate_and_get_cgroup_path(cgroup_path: Optional[str]) -> str:
"""Returns the ray_constants.DEFAULT_CGROUP_PATH if cgroup_path is not
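
A hedged usage sketch of how ResourceIsolationConfig is exercised in this PR. The values are illustrative and mirror the tests further down; the exact call order is illustrative as well:

from ray._private.resource_isolation_config import ResourceIsolationConfig

config = ResourceIsolationConfig(
    enable_resource_isolation=True,
    cgroup_path="/sys/fs/cgroup/resource_isolation_test",  # test-only base cgroup
    system_reserved_cpu=1,
    system_reserved_memory=1024**3,
)
# Marks the config as fully constructed (per the self._constructed = True
# line in the hunk above).
config.add_object_store_memory(1024**3)
# Comma-separated pids collected by node.py before start_raylet is called.
config.add_system_pids("1001,1002")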
1 change: 1 addition & 0 deletions python/ray/_private/services.py
@@ -1904,6 +1904,7 @@ def start_raylet(
command.append(
f"--system-reserved-memory-bytes={resource_isolation_config.system_reserved_memory}"
)
command.append(f"--system-pids={resource_isolation_config.system_pids}")

if raylet_stdout_filepath:
command.append(f"--stdout_filepath={raylet_stdout_filepath}")
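
The corresponding fragment of the raylet command line built in services.start_raylet when resource isolation is enabled (a sketch continuing the config object from the previous snippet; the real command contains many more flags, and "raylet" stands in for the resolved raylet executable path):

command = [
    "raylet",  # placeholder for the resolved raylet binary path
    # ... many other flags elided ...
    f"--system-reserved-memory-bytes={config.system_reserved_memory}",
    f"--system-pids={config.system_pids}",
]
# e.g. [..., "--system-reserved-memory-bytes=1073741824", "--system-pids=1001,1002"]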
@@ -1,12 +1,11 @@
import os
import sys
from pathlib import Path

import pytest
from click.testing import CliRunner

import ray
import ray._private.ray_constants as ray_constants
import ray._private.utils as utils
import ray.scripts.scripts as scripts
from ray._private.resource_isolation_config import ResourceIsolationConfig

@@ -21,6 +20,7 @@
#
# Run these commands locally before running the test suite:
# sudo mkdir -p /sys/fs/cgroup/resource_isolation_test
# echo "+cpu +memory" | sudo tee -a /sys/fs/cgroup/resource_isolation_test/cgroup.subtree_control
# sudo chown -R $(whoami):$(whoami) /sys/fs/cgroup/resource_isolation_test/
# sudo chmod -R u+rwx /sys/fs/cgroup/resource_isolation_test/
# echo $$ | sudo tee /sys/fs/cgroup/resource_isolation_test/cgroup.procs
@@ -32,44 +32,27 @@
# _BASE_CGROUP_PATH = "/sys/fs/cgroup/resource_isolation_test"


def test_resource_isolation_enabled_creates_cgroup_hierarchy(ray_start_cluster):
cluster = ray_start_cluster
base_cgroup = _BASE_CGROUP_PATH
resource_isolation_config = ResourceIsolationConfig(
enable_resource_isolation=True,
cgroup_path=base_cgroup,
system_reserved_memory=1024**3,
system_reserved_cpu=1,
)
# Need to use a worker node because the driver cannot delete the head node.
cluster.add_node(num_cpus=0)
ray.init(address=cluster.address)

worker_node = cluster.add_node(
num_cpus=1, resource_isolation_config=resource_isolation_config
)
worker_node_id = worker_node.node_id
cluster.wait_for_nodes()

# Make sure the worker node is up and running.
@ray.remote
def task():
return "hellodarknessmyoldfriend"

ray.get(task.remote(), timeout=5)

# TODO(#54703): This test is deliberately overspecified right now. The test shouldn't
# care about the cgroup hierarchy. It should just verify that application and system processes
# are started in a cgroup with the correct constraints. This will be updated once cgroup
# process management is completed.
node_cgroup = Path(base_cgroup) / f"ray_node_{worker_node_id}"
# TODO(#54703): This test is deliberately overspecified right now. The test shouldn't
# care about the cgroup hierarchy. It should just verify that application and system processes
# are started in a cgroup with the correct constraints. This will be updated once cgroup
# process management is completed.
def assert_cgroup_hierarchy_exists_for_node(
node_id: str, resource_isolation_config: ResourceIsolationConfig
):
base_cgroup_for_node = resource_isolation_config.cgroup_path
node_cgroup = Path(base_cgroup_for_node) / f"ray_node_{node_id}"
system_cgroup = node_cgroup / "system"
system_leaf_cgroup = system_cgroup / "leaf"
application_cgroup = node_cgroup / "application"
application_leaf_cgroup = application_cgroup / "leaf"

# 1) Check that the cgroup hierarchy is created correctly for the node.
assert node_cgroup.is_dir()
assert system_cgroup.is_dir()
assert system_cgroup.is_dir()
assert system_leaf_cgroup.is_dir()
assert application_cgroup.is_dir()
assert application_leaf_cgroup.is_dir()

# 2) Verify the constraints are applied correctly.
system_cgroup_memory_min = system_cgroup / "memory.min"
@@ -87,14 +70,24 @@ def task():
10000 - resource_isolation_config.system_reserved_cpu_weight
)

# 3) Gracefully shutting down the node cleans up everything. Don't need to check
# everything. If the base_cgroup is deleted, then all clean up succeeded.
cluster.remove_node(worker_node)
# 3) Check to see that all system pids are inside the system cgroup
system_leaf_cgroup_procs = system_leaf_cgroup / "cgroup.procs"
# At least the raylet process is always moved.
with open(system_leaf_cgroup_procs, "r") as cgroup_procs_file:
lines = cgroup_procs_file.readlines()
assert (
len(lines) > 0
), f"Expected only system process passed into the raylet. Found {lines}"


def assert_cgroup_hierarchy_cleaned_up_for_node(
node_id: str, resource_isolation_config: ResourceIsolationConfig
):
base_cgroup_for_node = resource_isolation_config.cgroup_path
node_cgroup = Path(base_cgroup_for_node) / f"ray_node_{node_id}"
assert not node_cgroup.is_dir()


# The following tests will test integration of resource isolation
# with the 'ray start' command.
@pytest.fixture
def cleanup_ray():
"""Shutdown all ray instances"""
@@ -114,19 +107,41 @@ def test_ray_start_invalid_resource_isolation_config(cleanup_ray):
assert isinstance(result.exception, ValueError)


def test_ray_start_resource_isolation_config_default_values(monkeypatch, cleanup_ray):
monkeypatch.setattr(utils, "get_num_cpus", lambda *args, **kwargs: 16)
# The DEFAULT_CGROUP_PATH override is only relevant when running locally.
monkeypatch.setattr(ray_constants, "DEFAULT_CGROUP_PATH", _BASE_CGROUP_PATH)

def test_ray_start_resource_isolation_creates_cgroup_hierarchy_and_cleans_up(
monkeypatch, cleanup_ray
):
object_store_memory = 1024**3
system_reserved_memory = 1024**3
system_reserved_cpu = 1
resource_isolation_config = ResourceIsolationConfig(
cgroup_path=_BASE_CGROUP_PATH,
enable_resource_isolation=True,
system_reserved_cpu=system_reserved_cpu,
system_reserved_memory=system_reserved_memory,
)
node_id = ray.NodeID.from_random().hex()
os.environ["RAY_OVERRIDE_NODE_ID_FOR_TESTING"] = node_id
runner = CliRunner()
result = runner.invoke(
scripts.start,
["--head", "--enable-resource-isolation"],
[
"--head",
"--enable-resource-isolation",
"--cgroup-path",
_BASE_CGROUP_PATH,
"--system-reserved-cpu",
system_reserved_cpu,
"--system-reserved-memory",
system_reserved_memory,
"--object-store-memory",
object_store_memory,
],
)
# TODO(#54703): Need to rewrite this test to check for side-effects on the cgroup
# hierarchy once the rest of the implemetation is complete.
assert result.exit_code == 0
resource_isolation_config.add_object_store_memory(object_store_memory)
assert_cgroup_hierarchy_exists_for_node(node_id, resource_isolation_config)
runner.invoke(scripts.stop)
assert_cgroup_hierarchy_cleaned_up_for_node(node_id, resource_isolation_config)


# The following tests will test integration of resource isolation
@@ -144,50 +159,31 @@ def test_ray_init_resource_isolation_disabled_by_default(ray_shutdown):
assert not node.resource_isolation_config.is_enabled()


def test_ray_init_with_resource_isolation_default_values(monkeypatch, ray_shutdown):
total_system_cpu = 10
monkeypatch.setattr(utils, "get_num_cpus", lambda *args, **kwargs: total_system_cpu)
# The DEFAULT_CGROUP_PATH override is only relevant when running locally.
monkeypatch.setattr(ray_constants, "DEFAULT_CGROUP_PATH", _BASE_CGROUP_PATH)
ray.init(address="local", enable_resource_isolation=True)
node = ray._private.worker._global_node
assert node is not None
assert node.resource_isolation_config.is_enabled()


def test_ray_init_with_resource_isolation_override_defaults(ray_shutdown):
cgroup_path = _BASE_CGROUP_PATH
system_reserved_cpu = 1
system_reserved_memory = 1 * 10**9
object_store_memory = 1 * 10**9
system_reserved_memory = 1024**3
object_store_memory = 1024**3
resource_isolation_config = ResourceIsolationConfig(
enable_resource_isolation=True,
cgroup_path=cgroup_path,
cgroup_path=_BASE_CGROUP_PATH,
system_reserved_cpu=system_reserved_cpu,
system_reserved_memory=system_reserved_memory,
)
resource_isolation_config.add_object_store_memory(object_store_memory)
ray.init(
address="local",
enable_resource_isolation=True,
_cgroup_path=cgroup_path,
_cgroup_path=_BASE_CGROUP_PATH,
system_reserved_cpu=system_reserved_cpu,
system_reserved_memory=system_reserved_memory,
object_store_memory=object_store_memory,
)
node = ray._private.worker._global_node
# TODO(#54703): Need to rewrite this test to check for side-effects on the cgroup
# hierarchy once the rest of the implemetation is complete.
assert node is not None
assert node.resource_isolation_config.is_enabled()
assert (
node.resource_isolation_config.system_reserved_cpu_weight
== resource_isolation_config.system_reserved_cpu_weight
)
assert (
node.resource_isolation_config.system_reserved_memory
== resource_isolation_config.system_reserved_memory
)
node_id = node.node_id
assert_cgroup_hierarchy_exists_for_node(node_id, resource_isolation_config)
ray.shutdown()
assert_cgroup_hierarchy_cleaned_up_for_node(node_id, resource_isolation_config)


if __name__ == "__main__":
2 changes: 1 addition & 1 deletion src/ray/common/cgroup2/cgroup_manager.cc
@@ -298,7 +298,7 @@ Status CgroupManager::AddProcessToSystemCgroup(const std::string &pid) {
// TODO(#54703): Add link to OSS documentation once available.
RAY_CHECK(!s.IsNotFound()) << "Failed to move process " << pid << " into system cgroup "
<< system_leaf_cgroup_
<< "because the cgroup was not found. "
<< " because the cgroup was not found. "
"If resource isolation is enabled, Ray's cgroup "
"hierarchy must not be modified "
"while Ray is running.";
1 change: 1 addition & 0 deletions src/ray/raylet/BUILD.bazel
@@ -304,6 +304,7 @@ ray_cc_binary(
"//src/ray/util:stream_redirection_options",
"//src/ray/util:time",
"@com_github_gflags_gflags//:gflags",
"@com_google_absl//absl/strings",
"@nlohmann_json",
],
)
23 changes: 22 additions & 1 deletion src/ray/raylet/main.cc
@@ -21,6 +21,7 @@
#include <utility>
#include <vector>

#include "absl/strings/str_split.h"
#include "gflags/gflags.h"
#include "nlohmann/json.hpp"
#include "ray/common/asio/instrumented_io_context.h"
@@ -143,6 +144,10 @@ DEFINE_int64(system_reserved_memory_bytes,
"be applied as a memory.min constraint to the system cgroup. If "
"enable-resource-isolation is true, then this cannot be -1");

DEFINE_string(system_pids,
"",
"A comma-separated list of pids to move into the system cgroup.");

absl::flat_hash_map<std::string, std::string> parse_node_labels(
const std::string &labels_json_str) {
absl::flat_hash_map<std::string, std::string> labels;
@@ -253,6 +258,7 @@ int main(int argc, char *argv[]) {
const std::string cgroup_path = FLAGS_cgroup_path;
const int64_t system_reserved_cpu_weight = FLAGS_system_reserved_cpu_weight;
const int64_t system_reserved_memory_bytes = FLAGS_system_reserved_memory_bytes;
const std::string system_pids = FLAGS_system_pids;

RAY_CHECK_NE(FLAGS_cluster_id, "") << "Expected cluster ID.";
ray::ClusterID cluster_id = ray::ClusterID::FromHex(FLAGS_cluster_id);
@@ -271,10 +277,11 @@ int main(int argc, char *argv[]) {
"system_reserved_cpu_weight must be set to a value between [1,10000]";
RAY_CHECK_NE(system_reserved_memory_bytes, -1)
<< "Failed to start up raylet. If enable_resource_isolation is set to true, "
"system_reserved_memory_byres must be set to a value > 0";
"system_reserved_memory_bytes must be set to a value > 0";

std::unique_ptr<ray::SysFsCgroupDriver> cgroup_driver =
std::make_unique<ray::SysFsCgroupDriver>();

ray::StatusOr<std::unique_ptr<ray::CgroupManager>> cgroup_manager_s =
ray::CgroupManager::Create(std::move(cgroup_path),
node_id,
@@ -294,6 +301,20 @@
<< "Resource isolation with cgroups is only supported in linux. Please set "
"enable_resource_isolation to false. This is likely a misconfiguration.";
#endif

// Move system processes into the system cgroup.
std::vector<std::string> system_pids_to_move = absl::StrSplit(system_pids, ",");
system_pids_to_move.emplace_back(std::to_string(ray::GetPID()));
for (const auto &pid : system_pids_to_move) {
ray::Status s = cgroup_manager->AddProcessToSystemCgroup(pid);
// TODO(#54703): This could be upgraded to a RAY_CHECK.
Review thread on the TODO above:

Contributor (author): I'm not sure if this should be a RAY_CHECK or not yet. CgroupManager::AddProcessToSystemCgroup already fails a RAY_CHECK if the cgroup doesn't exist or permissions are wrong. The only other errors that are possible here are if the pid is invalid or the process doesn't exist.
I think there's a very strong case to be made that if a system process does not exist when you try to move it into the cgroup, we should fail fast.

Contributor: Considering that pids are passed to the raylet by argument, I think the case for RAY_CHECK is strong. Let's do it.

Collaborator: +1, RAY_CHECK it boss (with a good error)

if (!s.ok()) {
RAY_LOG(WARNING) << absl::StrFormat(
"Failed to move process %s into system cgroup with error %s",
pid,
s.ToString());
}
}
}

// Configuration for the node manager.
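
For readability, a loose Python model of the C++ loop above — the real code uses absl::StrSplit and CgroupManager::AddProcessToSystemCgroup, and add_to_system_cgroup below is a hypothetical stand-in returning an (ok, error) pair rather than a ray::Status. It shows the current warn-and-continue behavior that the review thread proposes tightening into a RAY_CHECK:

def move_system_processes(system_pids: str, own_pid: int, add_to_system_cgroup) -> None:
    # Mirrors the loop in main.cc: split the --system-pids flag on commas,
    # append the raylet's own pid, then move each process into the system
    # cgroup, logging a warning (today) instead of crashing on failure.
    pids_to_move = system_pids.split(",")
    pids_to_move.append(str(own_pid))
    for pid in pids_to_move:
        ok, err = add_to_system_cgroup(pid)
        if not ok:
            print(f"WARNING: failed to move process {pid} into system cgroup: {err}")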