Skip to content

Conversation

@israbbani
Copy link
Contributor

@israbbani israbbani commented Jul 24, 2025

Ray is going to use cgroups to implement resource isolation between system processes (raylet, dashboard_agent, gcs) and application processes (workers). See #54703 for more details.

If you're interested in learning more about cgroups, here are a few resources:

The following classes are introduced:

  • CgroupDriver (interface that should be agnostic of the filesystem)
  • SysFsCgroupDriver (implements CgroupDriver by interacting with the /sys/fs/cgroup filesystem)

Ray will currently only the following cgroup controllers with their respective constraints:

  • cpu (with cpu.weight)
  • memory (with memory.min)

I've tried to introduce a few patterns and best practices that are worth pointing out:

Class Design

  • Returning Status/StatusOr from all public methods that can fail to facilitate proper error handling in the caller.
  • Separating the interface from the implementation (see CgroupDriverInterface.h) to facilitate dependency injection for tests. This also future-proofs the class for if/when we implement a Systemd based cgroup driver which is what K8S does.
  • Separating the interface also has the added benefit of improving compile times because the interface header files are lightweight.

Documentation

  • I prefer to use the doxygen format for class and function documentation. These are comments that can be recognized by doxygen tooling and be used to produce API reference docs.
  • Writing documentation for the appropriate level thinking about the user/consumer of the API
    • CgroupDriver is an abstract class, but it's what the consumer of the API will import and use. It's documentation is API focused and outlines failure modes and return types. This is akin to man pages (see man 2 open).
    • SysFsCgroupDriver is a class that is aware of the filesystem. This is going to talk about which files on sys/fs/cgroup are used and how it interacts with them. The important thing here is to document the specific gotchas of those files. The documentation reflects this change in level of abstraction.
  • Most importantly, comments are written for the future reader and not for the writer. They need to be written with the goal of being understood.

Testing

  • Unit tests use no mocking and are written strictly against the public API of the class under test.
  • Unit tests use the naming convention UnitOfWorkConditionUnderTestExpectedOutput.

to perform cgroup operations.

Signed-off-by: irabbani <[email protected]>
@israbbani israbbani added the go add ONLY when ready to merge, run all tests label Jul 24, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @israbbani, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the foundational groundwork for Ray to leverage Linux cgroupv2 for resource management. It introduces a clear interface for cgroup operations and provides a concrete implementation that interacts directly with the sys/fs cgroupv2 pseudo-filesystem, enabling future features for fine-grained control over process resources.

Highlights

  • Cgroup Driver Interface: Introduced CgroupDriverInterface, a new C++ abstract class defining a set of common operations for managing cgroups, including checking cgroupv2 status, creating/validating cgroups, moving processes, enabling/disabling controllers, and applying resource constraints.
  • SysFS Cgroupv2 Driver Implementation: Added SysFsCgroupDriver, an implementation of CgroupDriverInterface that interacts with the Linux sys/fs pseudo-filesystem to perform cgroupv2 operations. This driver handles file system interactions, permissions, and error handling specific to cgroupv2.
  • Cgroup Testing Infrastructure: Introduced comprehensive testing utilities (TempCgroupDirectory, TempFile, StartChildProcessInCgroup) and dedicated integration and unit tests to validate the SysFsCgroupDriver's functionality and ensure correct interaction with the cgroupv2 filesystem.
  • New Status Code for Permissions: Added StatusCode::PermissionDenied to the ray::Status system, along with helper methods, to provide more granular error reporting for operations that fail due to insufficient user permissions.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new cgroup driver for interacting with cgroup v2 via the sysfs interface. The changes are extensive, including the driver interface and implementation, testing utilities, and both unit and integration tests. The code is well-structured, and the addition of comprehensive tests is commendable.

I've identified a few critical issues related to compilation errors and unhandled exceptions that must be addressed. I've also included some high and medium severity suggestions to improve correctness, robustness, and code clarity.

Overall, this is a solid foundation for cgroup management within Ray.

Signed-off-by: irabbani <[email protected]>
instead of clone for older kernel headers < 5.7 (which is what we have
in CI)

Signed-off-by: irabbani <[email protected]>
Copy link
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't review the tests yet

Checks that the cgroup is valid. See return values for details of which
invariants are checked.

@param cgroup the absolute path of the cgroup.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming this is some java docstring style?

we should keep it standard, either:

  1. follow what we currently have
  2. let's commit to migrating to this style if there's a strong benefit to it

(FWIW I find the style you have more readable)

Copy link
Contributor Author

@israbbani israbbani Jul 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the PR description to talk about this. The style is called doxygen. It can be used to automatically generate API docs for developers. I think it has a few advantages

  • I find it more readable what we currently have.
  • The automated API docs will be useful for public contributions and internal contributors to search through APIs. E.g. it can help answer questions like do we have a utility class that joins paths?
  • There are special tags that the docs generator can use e.g. @param, @tparam (for template params), @ref, @see etc. The whole list is here, but we can start with a few of these.

I'd prefer to change our current ones to these and use this as a standard moving forward.

@edoakes
Copy link
Collaborator

edoakes commented Jul 25, 2025

@israbbani re: the test structure, it might be better to separate integration tests into their own subdirectory. That way, bazel test ... can be used to run all unit tests and return quickly. Some ideas for how it could be structured:

  • tests/ <- unit tests
  • integration_tests/ <- integration tests
  • tests/unit/ <- unit tests
  • tests/integration/ <- integration tests
  • tests/*.cc <- unit tests
  • tests/integration/ <- integration tests

@israbbani
Copy link
Contributor Author

@israbbani re: the test structure, it might be better to separate integration tests into their own subdirectory. That way, bazel test ... can be used to run all unit tests and return quickly. Some ideas for how it could be structured:

  • tests/ <- unit tests

  • integration_tests/ <- integration tests

  • tests/unit/ <- unit tests

  • tests/integration/ <- integration tests

  • tests/*.cc <- unit tests

  • tests/integration/ <- integration tests

I'm open to all 3. I think I prefer because it's ambiguous.

  • tests/unit/ <- unit tests
  • tests/integration/ <- integration tests


/**
A utility interface that allows the caller to check if cgroupv2 is mounted correctly
and perform cgroup operations on the system. It currently only supports using
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a slight nit/question, we're saying this class currently only supports using the memory and cpu controllers, do we intend to add future support in this class in the future? If so, what needs to be added? How? If we don't intend this class to have an expansion of scope, let's remove this language and just say it 'supports x'.

This feedback though is in 'hot take' territory, I can't quote a style guideline, so take it with a grain of salt. It's mostly just my own preference. When I encounter classes that have comments like "we will add x" or "this right now does y" and then I see the last commit to the file was some years ago, my takeaway is that this is perhaps a dead effort.

Copy link
Collaborator

@edoakes edoakes Jul 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree! Not too hot at all :)

Unless there are specific plans to add more controllers, I would concisely state what is supported support and leave it at that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I've reworded it. PTAL

If cgroupv2 is not enabled, or enabled along with cgroupv1, returns Invalid
with the appropriate error message.

@see systemd's documentation for more information about unified mode:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit(?): make sure doxygen understands how to deal with this (as the source url is on a new line)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works surprisingly well! Here's a screenshot from the generated docs:
image

Comment on lines +91 to +104
ray::StatusOr<std::unique_ptr<TempDirectory>> TempDirectory::Create() {
std::string path = "/tmp/XXXXXX";
char *ret = mkdtemp(path.data());
if (ret == nullptr) {
return ray::Status::UnknownError(
absl::StrFormat("Failed to create a temp directory. "
"Cgroup tests expect tmpfs to be mounted and only run on Linux.\n"
"Error: %s",
strerror(errno)));
}
std::unique_ptr<TempDirectory> temp_dir =
std::make_unique<TempDirectory>(std::move(path));
return ray::StatusOr<std::unique_ptr<TempDirectory>>(std::move(temp_dir));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO for tests we can just RAY_CHECK and abort if we fail setup preconditions. this is fine too, just more work

Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
@israbbani israbbani marked this pull request as ready for review July 30, 2025 18:18
@israbbani israbbani requested a review from a team as a code owner July 30, 2025 18:18
Copy link
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all nits, 🚢

Comment on lines 113 to 114
// Note: the process needs execute permissions for the cgroup directory
// to traverse the filesystem.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

odd!

israbbani and others added 5 commits July 30, 2025 18:31
Signed-off-by: irabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
@israbbani israbbani added the core Issues that should be addressed in Ray Core label Jul 30, 2025
@israbbani israbbani requested a review from a team July 30, 2025 23:25
@edoakes edoakes merged commit 6000520 into master Jul 31, 2025
5 checks passed
@edoakes edoakes deleted the irabbani/cgroups-1 branch July 31, 2025 00:03
avibasnet31 pushed a commit to avibasnet31/ray that referenced this pull request Aug 2, 2025
…oup operations. (ray-project#54898)

Ray is going to use cgroups to implement resource isolation between
system processes (raylet, dashboard_agent, gcs) and application
processes (workers). See ray-project#54703
for more details.

If you're interested in learning more about cgroups, here are a few
resources:
- https://docs.kernel.org/admin-guide/cgroup-v2.html (the docs)
- https://www.youtube.com/watch?v=kcnFQgg9ToY (a video) 

The following classes are introduced:
- CgroupDriver (interface that should be agnostic of the filesystem)
- SysFsCgroupDriver (implements CgroupDriver by interacting with the
/sys/fs/cgroup filesystem)

Ray will currently only the following cgroup controllers with their
respective constraints:
- cpu (with cpu.weight) 
- memory (with memory.min) 

I've tried to introduce a few patterns and best practices that are worth
pointing out:

### Class Design
- Returning Status/StatusOr<T> from all public methods that can fail to
facilitate proper error handling in the caller.
- Separating the interface from the implementation (see
CgroupDriverInterface.h) to facilitate dependency injection for tests.
This also future-proofs the class for if/when we implement a Systemd
based cgroup driver which is what K8S does.
- Separating the interface also has the added benefit of improving
compile times because the interface header files are lightweight.

### Documentation
- I prefer to use the doxygen format for class and function
documentation. These are comments that can be recognized by doxygen
tooling and be used to produce API reference docs.
- Writing documentation for the appropriate level thinking about the
user/consumer of the API
- CgroupDriver is an abstract class, but it's what the consumer of the
API will import and use. It's documentation is API focused and outlines
failure modes and return types. This is akin to man pages (see man 2
open).
- SysFsCgroupDriver is a class that is aware of the filesystem. This is
going to talk about which files on sys/fs/cgroup are used and how it
interacts with them. The important thing here is to document the
specific gotchas of those files. The documentation reflects this change
in level of abstraction.
- Most importantly, comments are written for the future reader and not
for the writer. They need to be written with the goal of being
understood.

### Testing
- Unit tests use no mocking and are written strictly against the public
API of the class under test.
- Unit tests use the naming convention
UnitOfWorkConditionUnderTestExpectedOutput.

---------

Signed-off-by: irabbani <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: avigyabb <[email protected]>
avibasnet31 pushed a commit to avibasnet31/ray that referenced this pull request Aug 2, 2025
…oup operations. (ray-project#54898)

Ray is going to use cgroups to implement resource isolation between
system processes (raylet, dashboard_agent, gcs) and application
processes (workers). See ray-project#54703
for more details.

If you're interested in learning more about cgroups, here are a few
resources:
- https://docs.kernel.org/admin-guide/cgroup-v2.html (the docs)
- https://www.youtube.com/watch?v=kcnFQgg9ToY (a video) 

The following classes are introduced:
- CgroupDriver (interface that should be agnostic of the filesystem)
- SysFsCgroupDriver (implements CgroupDriver by interacting with the
/sys/fs/cgroup filesystem)

Ray will currently only the following cgroup controllers with their
respective constraints:
- cpu (with cpu.weight) 
- memory (with memory.min) 

I've tried to introduce a few patterns and best practices that are worth
pointing out:

### Class Design
- Returning Status/StatusOr<T> from all public methods that can fail to
facilitate proper error handling in the caller.
- Separating the interface from the implementation (see
CgroupDriverInterface.h) to facilitate dependency injection for tests.
This also future-proofs the class for if/when we implement a Systemd
based cgroup driver which is what K8S does.
- Separating the interface also has the added benefit of improving
compile times because the interface header files are lightweight.

### Documentation
- I prefer to use the doxygen format for class and function
documentation. These are comments that can be recognized by doxygen
tooling and be used to produce API reference docs.
- Writing documentation for the appropriate level thinking about the
user/consumer of the API
- CgroupDriver is an abstract class, but it's what the consumer of the
API will import and use. It's documentation is API focused and outlines
failure modes and return types. This is akin to man pages (see man 2
open).
- SysFsCgroupDriver is a class that is aware of the filesystem. This is
going to talk about which files on sys/fs/cgroup are used and how it
interacts with them. The important thing here is to document the
specific gotchas of those files. The documentation reflects this change
in level of abstraction.
- Most importantly, comments are written for the future reader and not
for the writer. They need to be written with the goal of being
understood.

### Testing
- Unit tests use no mocking and are written strictly against the public
API of the class under test.
- Unit tests use the naming convention
UnitOfWorkConditionUnderTestExpectedOutput.

---------

Signed-off-by: irabbani <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: avigyabb <[email protected]>
elliot-barn pushed a commit that referenced this pull request Aug 4, 2025
…oup operations. (#54898)

Ray is going to use cgroups to implement resource isolation between
system processes (raylet, dashboard_agent, gcs) and application
processes (workers). See #54703
for more details.

If you're interested in learning more about cgroups, here are a few
resources:
- https://docs.kernel.org/admin-guide/cgroup-v2.html (the docs)
- https://www.youtube.com/watch?v=kcnFQgg9ToY (a video) 

The following classes are introduced:
- CgroupDriver (interface that should be agnostic of the filesystem)
- SysFsCgroupDriver (implements CgroupDriver by interacting with the
/sys/fs/cgroup filesystem)

Ray will currently only the following cgroup controllers with their
respective constraints:
- cpu (with cpu.weight) 
- memory (with memory.min) 

I've tried to introduce a few patterns and best practices that are worth
pointing out:

### Class Design
- Returning Status/StatusOr<T> from all public methods that can fail to
facilitate proper error handling in the caller.
- Separating the interface from the implementation (see
CgroupDriverInterface.h) to facilitate dependency injection for tests.
This also future-proofs the class for if/when we implement a Systemd
based cgroup driver which is what K8S does.
- Separating the interface also has the added benefit of improving
compile times because the interface header files are lightweight.

### Documentation
- I prefer to use the doxygen format for class and function
documentation. These are comments that can be recognized by doxygen
tooling and be used to produce API reference docs.
- Writing documentation for the appropriate level thinking about the
user/consumer of the API
- CgroupDriver is an abstract class, but it's what the consumer of the
API will import and use. It's documentation is API focused and outlines
failure modes and return types. This is akin to man pages (see man 2
open).
- SysFsCgroupDriver is a class that is aware of the filesystem. This is
going to talk about which files on sys/fs/cgroup are used and how it
interacts with them. The important thing here is to document the
specific gotchas of those files. The documentation reflects this change
in level of abstraction.
- Most importantly, comments are written for the future reader and not
for the writer. They need to be written with the goal of being
understood.

### Testing
- Unit tests use no mocking and are written strictly against the public
API of the class under test.
- Unit tests use the naming convention
UnitOfWorkConditionUnderTestExpectedOutput.

---------

Signed-off-by: irabbani <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
kamil-kaczmarek pushed a commit that referenced this pull request Aug 4, 2025
…oup operations. (#54898)

Ray is going to use cgroups to implement resource isolation between
system processes (raylet, dashboard_agent, gcs) and application
processes (workers). See #54703
for more details.

If you're interested in learning more about cgroups, here are a few
resources:
- https://docs.kernel.org/admin-guide/cgroup-v2.html (the docs)
- https://www.youtube.com/watch?v=kcnFQgg9ToY (a video)

The following classes are introduced:
- CgroupDriver (interface that should be agnostic of the filesystem)
- SysFsCgroupDriver (implements CgroupDriver by interacting with the
/sys/fs/cgroup filesystem)

Ray will currently only the following cgroup controllers with their
respective constraints:
- cpu (with cpu.weight)
- memory (with memory.min)

I've tried to introduce a few patterns and best practices that are worth
pointing out:

### Class Design
- Returning Status/StatusOr<T> from all public methods that can fail to
facilitate proper error handling in the caller.
- Separating the interface from the implementation (see
CgroupDriverInterface.h) to facilitate dependency injection for tests.
This also future-proofs the class for if/when we implement a Systemd
based cgroup driver which is what K8S does.
- Separating the interface also has the added benefit of improving
compile times because the interface header files are lightweight.

### Documentation
- I prefer to use the doxygen format for class and function
documentation. These are comments that can be recognized by doxygen
tooling and be used to produce API reference docs.
- Writing documentation for the appropriate level thinking about the
user/consumer of the API
- CgroupDriver is an abstract class, but it's what the consumer of the
API will import and use. It's documentation is API focused and outlines
failure modes and return types. This is akin to man pages (see man 2
open).
- SysFsCgroupDriver is a class that is aware of the filesystem. This is
going to talk about which files on sys/fs/cgroup are used and how it
interacts with them. The important thing here is to document the
specific gotchas of those files. The documentation reflects this change
in level of abstraction.
- Most importantly, comments are written for the future reader and not
for the writer. They need to be written with the goal of being
understood.

### Testing
- Unit tests use no mocking and are written strictly against the public
API of the class under test.
- Unit tests use the naming convention
UnitOfWorkConditionUnderTestExpectedOutput.

---------

Signed-off-by: irabbani <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Kamil Kaczmarek <[email protected]>
minerharry pushed a commit to minerharry/ray that referenced this pull request Aug 5, 2025
…oup operations. (ray-project#54898)

Ray is going to use cgroups to implement resource isolation between
system processes (raylet, dashboard_agent, gcs) and application
processes (workers). See ray-project#54703
for more details.

If you're interested in learning more about cgroups, here are a few
resources:
- https://docs.kernel.org/admin-guide/cgroup-v2.html (the docs)
- https://www.youtube.com/watch?v=kcnFQgg9ToY (a video)

The following classes are introduced:
- CgroupDriver (interface that should be agnostic of the filesystem)
- SysFsCgroupDriver (implements CgroupDriver by interacting with the
/sys/fs/cgroup filesystem)

Ray will currently only the following cgroup controllers with their
respective constraints:
- cpu (with cpu.weight)
- memory (with memory.min)

I've tried to introduce a few patterns and best practices that are worth
pointing out:

### Class Design
- Returning Status/StatusOr<T> from all public methods that can fail to
facilitate proper error handling in the caller.
- Separating the interface from the implementation (see
CgroupDriverInterface.h) to facilitate dependency injection for tests.
This also future-proofs the class for if/when we implement a Systemd
based cgroup driver which is what K8S does.
- Separating the interface also has the added benefit of improving
compile times because the interface header files are lightweight.

### Documentation
- I prefer to use the doxygen format for class and function
documentation. These are comments that can be recognized by doxygen
tooling and be used to produce API reference docs.
- Writing documentation for the appropriate level thinking about the
user/consumer of the API
- CgroupDriver is an abstract class, but it's what the consumer of the
API will import and use. It's documentation is API focused and outlines
failure modes and return types. This is akin to man pages (see man 2
open).
- SysFsCgroupDriver is a class that is aware of the filesystem. This is
going to talk about which files on sys/fs/cgroup are used and how it
interacts with them. The important thing here is to document the
specific gotchas of those files. The documentation reflects this change
in level of abstraction.
- Most importantly, comments are written for the future reader and not
for the writer. They need to be written with the goal of being
understood.

### Testing
- Unit tests use no mocking and are written strictly against the public
API of the class under test.
- Unit tests use the naming convention
UnitOfWorkConditionUnderTestExpectedOutput.

---------

Signed-off-by: irabbani <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: minerharry <[email protected]>
minerharry pushed a commit to minerharry/ray that referenced this pull request Aug 5, 2025
…oup operations. (ray-project#54898)

Ray is going to use cgroups to implement resource isolation between
system processes (raylet, dashboard_agent, gcs) and application
processes (workers). See ray-project#54703
for more details.

If you're interested in learning more about cgroups, here are a few
resources:
- https://docs.kernel.org/admin-guide/cgroup-v2.html (the docs)
- https://www.youtube.com/watch?v=kcnFQgg9ToY (a video)

The following classes are introduced:
- CgroupDriver (interface that should be agnostic of the filesystem)
- SysFsCgroupDriver (implements CgroupDriver by interacting with the
/sys/fs/cgroup filesystem)

Ray will currently only the following cgroup controllers with their
respective constraints:
- cpu (with cpu.weight)
- memory (with memory.min)

I've tried to introduce a few patterns and best practices that are worth
pointing out:

### Class Design
- Returning Status/StatusOr<T> from all public methods that can fail to
facilitate proper error handling in the caller.
- Separating the interface from the implementation (see
CgroupDriverInterface.h) to facilitate dependency injection for tests.
This also future-proofs the class for if/when we implement a Systemd
based cgroup driver which is what K8S does.
- Separating the interface also has the added benefit of improving
compile times because the interface header files are lightweight.

### Documentation
- I prefer to use the doxygen format for class and function
documentation. These are comments that can be recognized by doxygen
tooling and be used to produce API reference docs.
- Writing documentation for the appropriate level thinking about the
user/consumer of the API
- CgroupDriver is an abstract class, but it's what the consumer of the
API will import and use. It's documentation is API focused and outlines
failure modes and return types. This is akin to man pages (see man 2
open).
- SysFsCgroupDriver is a class that is aware of the filesystem. This is
going to talk about which files on sys/fs/cgroup are used and how it
interacts with them. The important thing here is to document the
specific gotchas of those files. The documentation reflects this change
in level of abstraction.
- Most importantly, comments are written for the future reader and not
for the writer. They need to be written with the goal of being
understood.

### Testing
- Unit tests use no mocking and are written strictly against the public
API of the class under test.
- Unit tests use the naming convention
UnitOfWorkConditionUnderTestExpectedOutput.

---------

Signed-off-by: irabbani <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: minerharry <[email protected]>
mjacar pushed a commit to mjacar/ray that referenced this pull request Aug 5, 2025
…oup operations. (ray-project#54898)

Ray is going to use cgroups to implement resource isolation between
system processes (raylet, dashboard_agent, gcs) and application
processes (workers). See ray-project#54703
for more details.

If you're interested in learning more about cgroups, here are a few
resources:
- https://docs.kernel.org/admin-guide/cgroup-v2.html (the docs)
- https://www.youtube.com/watch?v=kcnFQgg9ToY (a video)

The following classes are introduced:
- CgroupDriver (interface that should be agnostic of the filesystem)
- SysFsCgroupDriver (implements CgroupDriver by interacting with the
/sys/fs/cgroup filesystem)

Ray will currently only the following cgroup controllers with their
respective constraints:
- cpu (with cpu.weight)
- memory (with memory.min)

I've tried to introduce a few patterns and best practices that are worth
pointing out:

### Class Design
- Returning Status/StatusOr<T> from all public methods that can fail to
facilitate proper error handling in the caller.
- Separating the interface from the implementation (see
CgroupDriverInterface.h) to facilitate dependency injection for tests.
This also future-proofs the class for if/when we implement a Systemd
based cgroup driver which is what K8S does.
- Separating the interface also has the added benefit of improving
compile times because the interface header files are lightweight.

### Documentation
- I prefer to use the doxygen format for class and function
documentation. These are comments that can be recognized by doxygen
tooling and be used to produce API reference docs.
- Writing documentation for the appropriate level thinking about the
user/consumer of the API
- CgroupDriver is an abstract class, but it's what the consumer of the
API will import and use. It's documentation is API focused and outlines
failure modes and return types. This is akin to man pages (see man 2
open).
- SysFsCgroupDriver is a class that is aware of the filesystem. This is
going to talk about which files on sys/fs/cgroup are used and how it
interacts with them. The important thing here is to document the
specific gotchas of those files. The documentation reflects this change
in level of abstraction.
- Most importantly, comments are written for the future reader and not
for the writer. They need to be written with the goal of being
understood.

### Testing
- Unit tests use no mocking and are written strictly against the public
API of the class under test.
- Unit tests use the naming convention
UnitOfWorkConditionUnderTestExpectedOutput.

---------

Signed-off-by: irabbani <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Michael Acar <[email protected]>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
…oup operations. (ray-project#54898)

Ray is going to use cgroups to implement resource isolation between
system processes (raylet, dashboard_agent, gcs) and application
processes (workers). See ray-project#54703
for more details.

If you're interested in learning more about cgroups, here are a few
resources:
- https://docs.kernel.org/admin-guide/cgroup-v2.html (the docs)
- https://www.youtube.com/watch?v=kcnFQgg9ToY (a video)

The following classes are introduced:
- CgroupDriver (interface that should be agnostic of the filesystem)
- SysFsCgroupDriver (implements CgroupDriver by interacting with the
/sys/fs/cgroup filesystem)

Ray will currently only the following cgroup controllers with their
respective constraints:
- cpu (with cpu.weight)
- memory (with memory.min)

I've tried to introduce a few patterns and best practices that are worth
pointing out:

### Class Design
- Returning Status/StatusOr<T> from all public methods that can fail to
facilitate proper error handling in the caller.
- Separating the interface from the implementation (see
CgroupDriverInterface.h) to facilitate dependency injection for tests.
This also future-proofs the class for if/when we implement a Systemd
based cgroup driver which is what K8S does.
- Separating the interface also has the added benefit of improving
compile times because the interface header files are lightweight.

### Documentation
- I prefer to use the doxygen format for class and function
documentation. These are comments that can be recognized by doxygen
tooling and be used to produce API reference docs.
- Writing documentation for the appropriate level thinking about the
user/consumer of the API
- CgroupDriver is an abstract class, but it's what the consumer of the
API will import and use. It's documentation is API focused and outlines
failure modes and return types. This is akin to man pages (see man 2
open).
- SysFsCgroupDriver is a class that is aware of the filesystem. This is
going to talk about which files on sys/fs/cgroup are used and how it
interacts with them. The important thing here is to document the
specific gotchas of those files. The documentation reflects this change
in level of abstraction.
- Most importantly, comments are written for the future reader and not
for the writer. They need to be written with the goal of being
understood.

### Testing
- Unit tests use no mocking and are written strictly against the public
API of the class under test.
- Unit tests use the naming convention
UnitOfWorkConditionUnderTestExpectedOutput.

---------

Signed-off-by: irabbani <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: jugalshah291 <[email protected]>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
…oup operations. (ray-project#54898)

Ray is going to use cgroups to implement resource isolation between
system processes (raylet, dashboard_agent, gcs) and application
processes (workers). See ray-project#54703
for more details.

If you're interested in learning more about cgroups, here are a few
resources:
- https://docs.kernel.org/admin-guide/cgroup-v2.html (the docs)
- https://www.youtube.com/watch?v=kcnFQgg9ToY (a video)

The following classes are introduced:
- CgroupDriver (interface that should be agnostic of the filesystem)
- SysFsCgroupDriver (implements CgroupDriver by interacting with the
/sys/fs/cgroup filesystem)

Ray will currently only the following cgroup controllers with their
respective constraints:
- cpu (with cpu.weight)
- memory (with memory.min)

I've tried to introduce a few patterns and best practices that are worth
pointing out:

### Class Design
- Returning Status/StatusOr<T> from all public methods that can fail to
facilitate proper error handling in the caller.
- Separating the interface from the implementation (see
CgroupDriverInterface.h) to facilitate dependency injection for tests.
This also future-proofs the class for if/when we implement a Systemd
based cgroup driver which is what K8S does.
- Separating the interface also has the added benefit of improving
compile times because the interface header files are lightweight.

### Documentation
- I prefer to use the doxygen format for class and function
documentation. These are comments that can be recognized by doxygen
tooling and be used to produce API reference docs.
- Writing documentation for the appropriate level thinking about the
user/consumer of the API
- CgroupDriver is an abstract class, but it's what the consumer of the
API will import and use. It's documentation is API focused and outlines
failure modes and return types. This is akin to man pages (see man 2
open).
- SysFsCgroupDriver is a class that is aware of the filesystem. This is
going to talk about which files on sys/fs/cgroup are used and how it
interacts with them. The important thing here is to document the
specific gotchas of those files. The documentation reflects this change
in level of abstraction.
- Most importantly, comments are written for the future reader and not
for the writer. They need to be written with the goal of being
understood.

### Testing
- Unit tests use no mocking and are written strictly against the public
API of the class under test.
- Unit tests use the naming convention
UnitOfWorkConditionUnderTestExpectedOutput.

---------

Signed-off-by: irabbani <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Douglas Strodtman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants