-
Notifications
You must be signed in to change notification settings - Fork 7k
[core] (cgroups 14/n) Clean up bazel targets and support cross-platform build. #57244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
CgroupManagerFactory which constructs a cross-platform cgroup manager with selective compilation Signed-off-by: irabbani <[email protected]>
and CgroupManagerFactory are the only public targets. CgroupManagerFactory will delegate to the appropriate implementation for each platform. Signed-off-by: irabbani <[email protected]>
build. The Cgroup subsystem only exposes CgroupManagerInterface and CgroupManagerFactory as public targets. Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
… irabbani/cgroups-14 Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
… irabbani/cgroups-14 Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
Signed-off-by: israbbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
… irabbani/cgroups-14 Signed-off-by: irabbani <[email protected]>
edoakes
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
| // TODO(54703): Refactor the configs into a struct called CgroupManagerConfig | ||
| // and delegate input validation and error messages to it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah!
| RAY_CHECK(!cgroup_path.empty()) | ||
| << "Failed to start CgroupManager. If enable_resource_isolation is set to true, " | ||
| "cgroup_path cannot be empty."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the general case, should we structure the factories so that they RAY_CHECK on initialization failure or return a status code? the paranoid/perfectionist version would be the latter. practically I'm not sure that it makes much of a difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capturing our discussion for posterity.
If there's a panic or a fatal error, I would keep the actual RAY_CHECK as close to the error as possible. You're not worried about clean up and you can dump as much context as possible.
I would return a Status is if there's a chance that the caller can either recover provide more useful information.
However, this has the unpleasant side-effect of having landmines buried deep in the call stack.. which makes me sad.
The antidote to this sadness is to have as few FATAL errors as possible and most failures should be recoverable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Invariant checking on component startups or detecting misconfigurations is a valid use of RAY_CHECK.
|
minor comments; ping when ready to merge |
Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: Ibrahim Rabbani <[email protected]>
Signed-off-by: irabbani <[email protected]>
|
Gonna trigger 1 more set of MacOS and Windows tests. I'll let you know when they pass. |
Signed-off-by: israbbani <[email protected]>
|
@edoakes the MacOS failure looks unrelated to the change or to core: |
8d3516c to
bfd2482
Compare
…rm build. (ray-project#57244) For more details about the resource isolation project see ray-project#54703. This PR introduces two public bazel targets from the `//src/ray/common/cgroup2` subsystem. * `CgroupManagerFactory` is a cross-platform target that exports a working CgroupManager on Linux if resource isolation is enabled. It exports a Noop implementation if running on a non-Linux platform or if resource isolation is not enabled on Linux. * `CgroupManagerInterface` is the public API of CgroupManager. It also introduces a few other changes 1. All resource isolation related configuration parsing and input validation has been moved into CgroupManagerFactory. 2. NodeManager now controls the lifecycle (and destruction) of CgroupManager. 3. SysFsCgroupDriver uses a linux header file to find the path of the mount file instead of hardcoding because different linux distributions can use different files. --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: Josh Kodi <[email protected]>
…ses. (#57269) This PR stacks on #57244. For more details about the resource isolation project see #54703. In the previous ray cgroup hierarchy, all processes that were in the path `--cgroup-path` were moved into the system cgroup. This changes the hierarchy to now have a separate cgroup for all non-ray processes. The new cgroup hierarchy looks like ``` cgroup_path (e.g. /sys/fs/cgroup) | ray-node_<node_id> | | system user | | | leaf workers non-ray ``` The cgroups contain the following processes * system/leaf (all ray non-worker processes e.g. raylet, runtime_env_agent, gcs_server, ...) * user/workers (all ray worker processes) * user/non-ray (all non-ray processes migrated from cgroup_path). Note: If you're running ray inside a container, all non-ray processes running in the container will be migrated to `user/non-ray` The following controllers will be enabled * cgroup_path (cpu, memory) * ray-node_<node_id> (cpu, memory) * system (memory) The following constraints are applied * system (cpu.weight, memory.min) * user (cpu.weight) --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
…rm build. (ray-project#57244) For more details about the resource isolation project see ray-project#54703. This PR introduces two public bazel targets from the `//src/ray/common/cgroup2` subsystem. * `CgroupManagerFactory` is a cross-platform target that exports a working CgroupManager on Linux if resource isolation is enabled. It exports a Noop implementation if running on a non-Linux platform or if resource isolation is not enabled on Linux. * `CgroupManagerInterface` is the public API of CgroupManager. It also introduces a few other changes 1. All resource isolation related configuration parsing and input validation has been moved into CgroupManagerFactory. 2. NodeManager now controls the lifecycle (and destruction) of CgroupManager. 3. SysFsCgroupDriver uses a linux header file to find the path of the mount file instead of hardcoding because different linux distributions can use different files. --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
…rm build. (#57244) For more details about the resource isolation project see #54703. This PR introduces two public bazel targets from the `//src/ray/common/cgroup2` subsystem. * `CgroupManagerFactory` is a cross-platform target that exports a working CgroupManager on Linux if resource isolation is enabled. It exports a Noop implementation if running on a non-Linux platform or if resource isolation is not enabled on Linux. * `CgroupManagerInterface` is the public API of CgroupManager. It also introduces a few other changes 1. All resource isolation related configuration parsing and input validation has been moved into CgroupManagerFactory. 2. NodeManager now controls the lifecycle (and destruction) of CgroupManager. 3. SysFsCgroupDriver uses a linux header file to find the path of the mount file instead of hardcoding because different linux distributions can use different files. --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
…ses. (#57269) This PR stacks on #57244. For more details about the resource isolation project see #54703. In the previous ray cgroup hierarchy, all processes that were in the path `--cgroup-path` were moved into the system cgroup. This changes the hierarchy to now have a separate cgroup for all non-ray processes. The new cgroup hierarchy looks like ``` cgroup_path (e.g. /sys/fs/cgroup) | ray-node_<node_id> | | system user | | | leaf workers non-ray ``` The cgroups contain the following processes * system/leaf (all ray non-worker processes e.g. raylet, runtime_env_agent, gcs_server, ...) * user/workers (all ray worker processes) * user/non-ray (all non-ray processes migrated from cgroup_path). Note: If you're running ray inside a container, all non-ray processes running in the container will be migrated to `user/non-ray` The following controllers will be enabled * cgroup_path (cpu, memory) * ray-node_<node_id> (cpu, memory) * system (memory) The following constraints are applied * system (cpu.weight, memory.min) * user (cpu.weight) --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
…rm build. (ray-project#57244) For more details about the resource isolation project see ray-project#54703. This PR introduces two public bazel targets from the `//src/ray/common/cgroup2` subsystem. * `CgroupManagerFactory` is a cross-platform target that exports a working CgroupManager on Linux if resource isolation is enabled. It exports a Noop implementation if running on a non-Linux platform or if resource isolation is not enabled on Linux. * `CgroupManagerInterface` is the public API of CgroupManager. It also introduces a few other changes 1. All resource isolation related configuration parsing and input validation has been moved into CgroupManagerFactory. 2. NodeManager now controls the lifecycle (and destruction) of CgroupManager. 3. SysFsCgroupDriver uses a linux header file to find the path of the mount file instead of hardcoding because different linux distributions can use different files. --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
…ses. (ray-project#57269) This PR stacks on ray-project#57244. For more details about the resource isolation project see ray-project#54703. In the previous ray cgroup hierarchy, all processes that were in the path `--cgroup-path` were moved into the system cgroup. This changes the hierarchy to now have a separate cgroup for all non-ray processes. The new cgroup hierarchy looks like ``` cgroup_path (e.g. /sys/fs/cgroup) | ray-node_<node_id> | | system user | | | leaf workers non-ray ``` The cgroups contain the following processes * system/leaf (all ray non-worker processes e.g. raylet, runtime_env_agent, gcs_server, ...) * user/workers (all ray worker processes) * user/non-ray (all non-ray processes migrated from cgroup_path). Note: If you're running ray inside a container, all non-ray processes running in the container will be migrated to `user/non-ray` The following controllers will be enabled * cgroup_path (cpu, memory) * ray-node_<node_id> (cpu, memory) * system (memory) The following constraints are applied * system (cpu.weight, memory.min) * user (cpu.weight) --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
…rm build. (ray-project#57244) For more details about the resource isolation project see ray-project#54703. This PR introduces two public bazel targets from the `//src/ray/common/cgroup2` subsystem. * `CgroupManagerFactory` is a cross-platform target that exports a working CgroupManager on Linux if resource isolation is enabled. It exports a Noop implementation if running on a non-Linux platform or if resource isolation is not enabled on Linux. * `CgroupManagerInterface` is the public API of CgroupManager. It also introduces a few other changes 1. All resource isolation related configuration parsing and input validation has been moved into CgroupManagerFactory. 2. NodeManager now controls the lifecycle (and destruction) of CgroupManager. 3. SysFsCgroupDriver uses a linux header file to find the path of the mount file instead of hardcoding because different linux distributions can use different files. --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: xgui <[email protected]>
…ses. (ray-project#57269) This PR stacks on ray-project#57244. For more details about the resource isolation project see ray-project#54703. In the previous ray cgroup hierarchy, all processes that were in the path `--cgroup-path` were moved into the system cgroup. This changes the hierarchy to now have a separate cgroup for all non-ray processes. The new cgroup hierarchy looks like ``` cgroup_path (e.g. /sys/fs/cgroup) | ray-node_<node_id> | | system user | | | leaf workers non-ray ``` The cgroups contain the following processes * system/leaf (all ray non-worker processes e.g. raylet, runtime_env_agent, gcs_server, ...) * user/workers (all ray worker processes) * user/non-ray (all non-ray processes migrated from cgroup_path). Note: If you're running ray inside a container, all non-ray processes running in the container will be migrated to `user/non-ray` The following controllers will be enabled * cgroup_path (cpu, memory) * ray-node_<node_id> (cpu, memory) * system (memory) The following constraints are applied * system (cpu.weight, memory.min) * user (cpu.weight) --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: xgui <[email protected]>
…rm build. (#57244) For more details about the resource isolation project see #54703. This PR introduces two public bazel targets from the `//src/ray/common/cgroup2` subsystem. * `CgroupManagerFactory` is a cross-platform target that exports a working CgroupManager on Linux if resource isolation is enabled. It exports a Noop implementation if running on a non-Linux platform or if resource isolation is not enabled on Linux. * `CgroupManagerInterface` is the public API of CgroupManager. It also introduces a few other changes 1. All resource isolation related configuration parsing and input validation has been moved into CgroupManagerFactory. 2. NodeManager now controls the lifecycle (and destruction) of CgroupManager. 3. SysFsCgroupDriver uses a linux header file to find the path of the mount file instead of hardcoding because different linux distributions can use different files. --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: elliot-barn <[email protected]>
…ses. (#57269) This PR stacks on #57244. For more details about the resource isolation project see #54703. In the previous ray cgroup hierarchy, all processes that were in the path `--cgroup-path` were moved into the system cgroup. This changes the hierarchy to now have a separate cgroup for all non-ray processes. The new cgroup hierarchy looks like ``` cgroup_path (e.g. /sys/fs/cgroup) | ray-node_<node_id> | | system user | | | leaf workers non-ray ``` The cgroups contain the following processes * system/leaf (all ray non-worker processes e.g. raylet, runtime_env_agent, gcs_server, ...) * user/workers (all ray worker processes) * user/non-ray (all non-ray processes migrated from cgroup_path). Note: If you're running ray inside a container, all non-ray processes running in the container will be migrated to `user/non-ray` The following controllers will be enabled * cgroup_path (cpu, memory) * ray-node_<node_id> (cpu, memory) * system (memory) The following constraints are applied * system (cpu.weight, memory.min) * user (cpu.weight) --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: elliot-barn <[email protected]>
…rm build. (ray-project#57244) For more details about the resource isolation project see ray-project#54703. This PR introduces two public bazel targets from the `//src/ray/common/cgroup2` subsystem. * `CgroupManagerFactory` is a cross-platform target that exports a working CgroupManager on Linux if resource isolation is enabled. It exports a Noop implementation if running on a non-Linux platform or if resource isolation is not enabled on Linux. * `CgroupManagerInterface` is the public API of CgroupManager. It also introduces a few other changes 1. All resource isolation related configuration parsing and input validation has been moved into CgroupManagerFactory. 2. NodeManager now controls the lifecycle (and destruction) of CgroupManager. 3. SysFsCgroupDriver uses a linux header file to find the path of the mount file instead of hardcoding because different linux distributions can use different files. --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
…ses. (ray-project#57269) This PR stacks on ray-project#57244. For more details about the resource isolation project see ray-project#54703. In the previous ray cgroup hierarchy, all processes that were in the path `--cgroup-path` were moved into the system cgroup. This changes the hierarchy to now have a separate cgroup for all non-ray processes. The new cgroup hierarchy looks like ``` cgroup_path (e.g. /sys/fs/cgroup) | ray-node_<node_id> | | system user | | | leaf workers non-ray ``` The cgroups contain the following processes * system/leaf (all ray non-worker processes e.g. raylet, runtime_env_agent, gcs_server, ...) * user/workers (all ray worker processes) * user/non-ray (all non-ray processes migrated from cgroup_path). Note: If you're running ray inside a container, all non-ray processes running in the container will be migrated to `user/non-ray` The following controllers will be enabled * cgroup_path (cpu, memory) * ray-node_<node_id> (cpu, memory) * system (memory) The following constraints are applied * system (cpu.weight, memory.min) * user (cpu.weight) --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
…rm build. (ray-project#57244) For more details about the resource isolation project see ray-project#54703. This PR introduces two public bazel targets from the `//src/ray/common/cgroup2` subsystem. * `CgroupManagerFactory` is a cross-platform target that exports a working CgroupManager on Linux if resource isolation is enabled. It exports a Noop implementation if running on a non-Linux platform or if resource isolation is not enabled on Linux. * `CgroupManagerInterface` is the public API of CgroupManager. It also introduces a few other changes 1. All resource isolation related configuration parsing and input validation has been moved into CgroupManagerFactory. 2. NodeManager now controls the lifecycle (and destruction) of CgroupManager. 3. SysFsCgroupDriver uses a linux header file to find the path of the mount file instead of hardcoding because different linux distributions can use different files. --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: Aydin Abiar <[email protected]>
…ses. (ray-project#57269) This PR stacks on ray-project#57244. For more details about the resource isolation project see ray-project#54703. In the previous ray cgroup hierarchy, all processes that were in the path `--cgroup-path` were moved into the system cgroup. This changes the hierarchy to now have a separate cgroup for all non-ray processes. The new cgroup hierarchy looks like ``` cgroup_path (e.g. /sys/fs/cgroup) | ray-node_<node_id> | | system user | | | leaf workers non-ray ``` The cgroups contain the following processes * system/leaf (all ray non-worker processes e.g. raylet, runtime_env_agent, gcs_server, ...) * user/workers (all ray worker processes) * user/non-ray (all non-ray processes migrated from cgroup_path). Note: If you're running ray inside a container, all non-ray processes running in the container will be migrated to `user/non-ray` The following controllers will be enabled * cgroup_path (cpu, memory) * ray-node_<node_id> (cpu, memory) * system (memory) The following constraints are applied * system (cpu.weight, memory.min) * user (cpu.weight) --------- Signed-off-by: Ibrahim Rabbani <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: Aydin Abiar <[email protected]>
For more details about the resource isolation project see #54703.
This PR introduces two public bazel targets from the
//src/ray/common/cgroup2subsystem.CgroupManagerFactoryis a cross-platform target that exports a working CgroupManager on Linux if resource isolation is enabled. It exports a Noop implementation if running on a non-Linux platform or if resource isolation is not enabled on Linux.CgroupManagerInterfaceis the public API of CgroupManager.It also introduces a few other changes