@israbbani israbbani commented Oct 16, 2025

This PR stacks on #57776.

For more details about the resource isolation project see #54703.

When Ray calculates the number of CPUs available on the machine, it checks whether it is running in a container. However, it truncates the resulting CPU count to a whole number.

This PR makes two changes:

  • Raise a ValueError if the number of CPUs on the machine is <= DEFAULT_MIN_SYSTEM_RESERVED_CPU_CORES. Previously, the check was < DEFAULT_MIN_SYSTEM_RESERVED_CPU_CORES.
  • Return fractional CPUs from ray._private.utils.get_num_cpus when the optional truncate parameter is set to False.
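
The two changes listed above can be illustrated with a minimal sketch. It is not Ray's implementation: the helper names `resolve_cpu_count` and `check_enough_cpus` are hypothetical, and the value given to DEFAULT_MIN_SYSTEM_RESERVED_CPU_CORES here is an assumption made only so the example runs.

```python
import math

# Assumed value for illustration only; the real constant is defined in Ray.
DEFAULT_MIN_SYSTEM_RESERVED_CPU_CORES = 1.0


def resolve_cpu_count(detected_cpus: float, truncate: bool = True) -> float:
    """Hypothetical helper: round down to whole CPUs unless truncate is False."""
    if truncate:
        return float(math.floor(detected_cpus))
    return detected_cpus


def check_enough_cpus(available_cpus: float) -> None:
    """Hypothetical helper: reject machines with no CPU left for user work.

    Uses <= rather than <, mirroring the comparison change described above.
    """
    if available_cpus <= DEFAULT_MIN_SYSTEM_RESERVED_CPU_CORES:
        raise ValueError(
            f"{available_cpus} CPUs available, but more than "
            f"{DEFAULT_MIN_SYSTEM_RESERVED_CPU_CORES} are required."
        )


# A container limited to 1.8 CPUs passes the check only when not truncated.
check_enough_cpus(resolve_cpu_count(1.8, truncate=False))
```

Under these assumptions, truncating 1.8 gives 1.0, which the stricter <= check rejects, while the fractional value passes.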

cpus available on the machine. This will prevent us from rounding down
when running in a container that has cpu.max set.
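
For context, cgroup v2 exposes a CPU limit in cpu.max as two fields, a quota and a period, where a quota of `max` means no limit. The sketch below shows how a fractional CPU count can be derived from that file's contents and why rounding down loses capacity. The function name `cpus_from_cpu_max` and the sample values are assumptions for illustration, not Ray's code.

```python
from typing import Optional


def cpus_from_cpu_max(cpu_max_contents: str) -> Optional[float]:
    """Hypothetical helper: convert cgroup v2 cpu.max contents to a CPU count.

    The file holds "<quota> <period>" in microseconds, or "max <period>"
    when no limit is set. Returns None when there is no limit.
    """
    quota, period = cpu_max_contents.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


# Sample contents chosen for illustration: 180 ms of CPU time per 100 ms period.
limit = cpus_from_cpu_max("180000 100000")
print(limit)       # fractional CPU count
print(int(limit))  # what truncation would report instead
```

With the sample input, the fractional result is 1.8 CPUs, whereas truncation reports 1.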

Signed-off-by: irabbani <[email protected]>
@israbbani israbbani changed the base branch from master to irabbani/cgroups-18 October 16, 2025 17:37
@israbbani israbbani added core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests labels Oct 16, 2025
@israbbani israbbani marked this pull request as ready for review October 16, 2025 18:02
@israbbani israbbani requested a review from a team as a code owner October 16, 2025 18:02
@israbbani
Contributor Author

Tested on Anyscale with a 2-core machine. It now works with the default parameters.

lscpu | grep "CPU(s)"
CPU(s):                                  2

cat /sys/fs/cgroup/ray-node_e06784dc2316943d0918f0257d1d2cb24d24605ea9022d745cb23fe4/user/cpu.weight
4445

cat /sys/fs/cgroup/ray-node_e06784dc2316943d0918f0257d1d2cb24d24605ea9022d745cb23fe4/system/cpu.weight
5555
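
The two weights above sum to 10000, the upper bound of the cgroup v2 cpu.weight range. A minimal sketch of one way such a split could be computed follows, assuming the system share is proportional to the reserved cores and rounded down. The function name `split_cpu_weight` and the proportional rule are assumptions, not confirmed by this thread.

```python
from typing import Tuple

# Upper bound of the cgroup v2 cpu.weight range (valid values are 1 to 10000).
CPU_WEIGHT_MAX = 10000


def split_cpu_weight(system_reserved_cpus: float, total_cpus: float) -> Tuple[int, int]:
    """Hypothetical helper: split CPU_WEIGHT_MAX between system and user cgroups.

    Assumes the system weight is the reserved fraction of all CPUs, rounded
    down, and the user cgroup receives the remainder.
    """
    if not 0 < system_reserved_cpus < total_cpus:
        raise ValueError("reserved CPUs must be positive and below the total")
    system_weight = int(CPU_WEIGHT_MAX * system_reserved_cpus / total_cpus)
    user_weight = CPU_WEIGHT_MAX - system_weight
    return system_weight, user_weight


# Illustrative inputs only: 1 reserved core out of 2 available CPUs.
print(split_cpu_weight(1.0, 2.0))
```

Under this assumed rule, reserving 1.0 of 1.8 CPUs would yield (5555, 4445), which matches the values shown above. The thread does not state the actual inputs, so treat that correspondence as a guess.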

From the logs

(base) ray@ip-10-0-251-150:~/default$ grep "CgroupManager" /tmp/ray/session_latest/logs/raylet.out
{"asctime":"2025-10-16 12:49:53,974","levelname":"I","message":"Initializing CgroupManager at base cgroup at '/sys/fs/cgroup'. Ray's cgroup hierarchy will under the node cgroup at '/sys/fs/cgroup/ray-node_e06784dc2316943d0918f0257d1d2cb24d24605ea9022d745cb23fe4' with [memory, cpu] controllers enabled. The system cgroup at '/sys/fs/cgroup/ray-node_e06784dc2316943d0918f0257d1d2cb24d24605ea9022d745cb23fe4/system' will have [memory] controllers enabled with [cpu.weight=5555, memory.min=5946149682] constraints. The user cgroup '/sys/fs/cgroup/ray-node_e06784dc2316943d0918f0257d1d2cb24d24605ea9022d745cb23fe4/user' will have no controllers enabled with [cpu.weight=4445] constraints. The user cgroup will contain the [/sys/fs/cgroup/ray-node_e06784dc2316943d0918f0257d1d2cb24d24605ea9022d745cb23fe4/user/workers, /sys/fs/cgroup/ray-node_e06784dc2316943d0918f0257d1d2cb24d24605ea9022d745cb23fe4/user/non-ray] cgroups.","component":"raylet","filename":"cgroup_manager.cc","lineno":212}

@edoakes edoakes left a comment

add comment in followup

  """
- available_system_cpus = utils.get_num_cpus()
+ available_system_cpus = utils.get_num_cpus(truncate=False)

should leave a comment for why we don't truncate

@edoakes edoakes merged commit 40a027e into irabbani/cgroups-18 Oct 16, 2025
5 checks passed
@edoakes edoakes deleted the irabbani/cgroups-19 branch October 16, 2025 20:07
edoakes added a commit that referenced this pull request Oct 17, 2025
)

For more details about the resource isolation project see
#54703.

This PR moves the driver into the workers cgroup when it registers with
the NodeManager. Also updates the tests to reflect this.

This now includes changes from #57800.

---------

Signed-off-by: irabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025