-
Notifications
You must be signed in to change notification settings - Fork 7k
[core] (cgroups 3/n) Creating CgroupManager to setup Ray's cgroup hierarchy and clean it up #56186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
to perform cgroup operations. Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
instead of clone for older kernel headers < 5.7 (which is what we have in CI) Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: Ibrahim Rabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: Ibrahim Rabbani <[email protected]>
Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: Ibrahim Rabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
fix CI. Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
Signed-off-by: Ibrahim Rabbani <[email protected]>
|
Test failures look unrelated (from serve) |
| hdrs = [ | ||
| "cgroup_manager_interface.h", | ||
| ], | ||
| target_compatible_with = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🫶
| int64_t max_cpu_weight = supported_constraints_.at(kCPUWeightConstraint).Max(); | ||
| int64_t application_cgroup_cpu_weight = max_cpu_weight - system_reserved_cpu_weight; | ||
|
|
||
| RAY_LOG(INFO) << absl::StrFormat( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems a little noisy for an info level, but only once per raylet startup so should be ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an SRE, I found log lines like this one very useful when debugging issues. As a rule of thumb I think we should log the configuration each component starts up with (especially if it's only created once in the lifecycle of the application).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. This one is nice that all of the info is logged in one place. We have some other startup logs that are noisy because we log each bit in a separate log line from different components.
| #include "ray/common/cgroup2/scoped_cgroup_operation.h" | ||
| #include "ray/common/status_or.h" | ||
|
|
||
| namespace ray { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe should have a namespace cgroup ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm open to trying it. I haven't really wrapped my head around what best practices should be around namespaces. I've added an item to #54703. I'll play around with it at the end.
| 3. move all processes from the system cgroup into the base cgroup. | ||
| 4. delete the node, system, and application cgroups respectively. | ||
| Cleanup is best-effort. If any step fails, it will log a warning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why's this? to avoid crashing before other cleanup can happen at a higher level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. I figured we'd want to attempt the rest of the graceful shutdown process of the raylet even if cgroup cleanup didn't succeed fully.
| enabled_controllers_ == other.enabled_controllers_; | ||
| } | ||
| }; | ||
| class FakeCgroupDriver : public CgroupDriverInterface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤌
|
minor comments; ping for merge |
|
@edoakes ready for merge. Thanks! |
…anager. (#56246) This PR continues to implement the CgroupManager. CgroupManager will be used by the Raylet to manage the cgroup hierarchy. The implementation will be completed in subsequent PRs. This PR stacks on #56186. For more details about the resource isolation project see #54703. In this PR: * CgroupManager now bound checks constraints (e.g. cpu.weight is within [1,10000]. * CgroupDriver no longer bound checks constraints. --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
…rarchy and clean it up (ray-project#56186) Signed-off-by: irabbani <[email protected]> Signed-off-by: Ibrahim Rabbani <[email protected]> Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: sampan <[email protected]>
…anager. (ray-project#56246) This PR continues to implement the CgroupManager. CgroupManager will be used by the Raylet to manage the cgroup hierarchy. The implementation will be completed in subsequent PRs. This PR stacks on ray-project#56186. For more details about the resource isolation project see ray-project#54703. In this PR: * CgroupManager now bound checks constraints (e.g. cpu.weight is within [1,10000]. * CgroupDriver no longer bound checks constraints. --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: sampan <[email protected]>
…rarchy and clean it up (ray-project#56186) Signed-off-by: irabbani <[email protected]> Signed-off-by: Ibrahim Rabbani <[email protected]> Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: jugalshah291 <[email protected]>
…anager. (ray-project#56246) This PR continues to implement the CgroupManager. CgroupManager will be used by the Raylet to manage the cgroup hierarchy. The implementation will be completed in subsequent PRs. This PR stacks on ray-project#56186. For more details about the resource isolation project see ray-project#54703. In this PR: * CgroupManager now bound checks constraints (e.g. cpu.weight is within [1,10000]. * CgroupDriver no longer bound checks constraints. --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: jugalshah291 <[email protected]>
…rarchy and clean it up (ray-project#56186) Signed-off-by: irabbani <[email protected]> Signed-off-by: Ibrahim Rabbani <[email protected]> Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: yenhong.wong <[email protected]>
…anager. (ray-project#56246) This PR continues to implement the CgroupManager. CgroupManager will be used by the Raylet to manage the cgroup hierarchy. The implementation will be completed in subsequent PRs. This PR stacks on ray-project#56186. For more details about the resource isolation project see ray-project#54703. In this PR: * CgroupManager now bound checks constraints (e.g. cpu.weight is within [1,10000]. * CgroupDriver no longer bound checks constraints. --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: yenhong.wong <[email protected]>
…rarchy and clean it up (ray-project#56186) Signed-off-by: irabbani <[email protected]> Signed-off-by: Ibrahim Rabbani <[email protected]> Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: zac <[email protected]>
…anager. (ray-project#56246) This PR continues to implement the CgroupManager. CgroupManager will be used by the Raylet to manage the cgroup hierarchy. The implementation will be completed in subsequent PRs. This PR stacks on ray-project#56186. For more details about the resource isolation project see ray-project#54703. In this PR: * CgroupManager now bound checks constraints (e.g. cpu.weight is within [1,10000]. * CgroupDriver no longer bound checks constraints. --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: zac <[email protected]>
…rarchy and clean it up (#56186) Signed-off-by: irabbani <[email protected]> Signed-off-by: Ibrahim Rabbani <[email protected]> Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: Douglas Strodtman <[email protected]>
…anager. (ray-project#56246) This PR continues to implement the CgroupManager. CgroupManager will be used by the Raylet to manage the cgroup hierarchy. The implementation will be completed in subsequent PRs. This PR stacks on ray-project#56186. For more details about the resource isolation project see ray-project#54703. In this PR: * CgroupManager now bound checks constraints (e.g. cpu.weight is within [1,10000]. * CgroupDriver no longer bound checks constraints. --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: Douglas Strodtman <[email protected]>
…rarchy and clean it up (ray-project#56186) Signed-off-by: irabbani <[email protected]> Signed-off-by: Ibrahim Rabbani <[email protected]> Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
…anager. (ray-project#56246) This PR continues to implement the CgroupManager. CgroupManager will be used by the Raylet to manage the cgroup hierarchy. The implementation will be completed in subsequent PRs. This PR stacks on ray-project#56186. For more details about the resource isolation project see ray-project#54703. In this PR: * CgroupManager now bound checks constraints (e.g. cpu.weight is within [1,10000]. * CgroupDriver no longer bound checks constraints. --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
This is the first PR introduces the
CgroupManagerInterface. This will be used by the Raylet to manage the cgroup hierarchy. The implementation will be completed in subsequent PRs.This PR stacks on #55063.
For more details about the resource isolation project see #54703.
The cgroup hierarchy for Ray will be:
The current implementation will only support
I've signposted design decisions with comments in the code. Here's a summary:
There are placeholders in the code for future work: