Skip to content
Merged
Show file tree
Hide file tree
Changes from 33 commits
Commits
Show all changes
36 commits
Select commit Hold shift + click to select a range
849370a
[WIP] Cgroups in CI
israbbani Mar 18, 2025
7b14643
Fixing a minor bug. Adding some debugging log lines
israbbani Mar 18, 2025
94e25e5
Fixing test
israbbani Mar 18, 2025
d82dc73
Fixing the test again. Adding the right magic words!
israbbani Mar 18, 2025
51adfd2
Adding a shell test to CI to make sure CI groupv2 is mounted
israbbani Mar 18, 2025
3ad76ea
Adding a proper CI integration test for docker
israbbani Mar 18, 2025
0fe192f
Removing unused code
israbbani Mar 18, 2025
59bbda2
Fixing the CI test
israbbani Mar 18, 2025
7d92f16
Adding --privileged-container flag to CI test
israbbani Mar 19, 2025
da908f9
Fixing CI test dependencies
israbbani Mar 19, 2025
216d578
One more
israbbani Mar 19, 2025
5e810e6
hacking the test
israbbani Mar 19, 2025
bbc7a14
Skip building ray when using cgroups
israbbani Mar 19, 2025
58123b3
Merge branch 'master' into irabbani/ci-adventures
israbbani Mar 19, 2025
9df5448
Removing a log line and fixing a typo
israbbani Mar 19, 2025
1ae9cd6
Renaming flag to --privileged. Fixing a typo.
israbbani Mar 19, 2025
801ae0c
Changed test scoping for CI
israbbani Mar 19, 2025
1ba8495
Adding a sleep to debug on machine
israbbani Mar 19, 2025
a386ce4
One more typo
israbbani Mar 19, 2025
37dd164
Sleeping
israbbani Mar 19, 2025
86cfd09
YAMLLLLL
israbbani Mar 19, 2025
3718e8a
or true!
israbbani Mar 20, 2025
f3c5df3
Adding some logging to debug
israbbani Mar 20, 2025
49fc60b
The logs did show me!
israbbani Mar 20, 2025
5b75e72
Moving the --privileged flag before the other flags
israbbani Mar 20, 2025
15e4d89
it would be nice to test these locally
israbbani Mar 20, 2025
db50587
Adding an eternal timeout to the test so
israbbani Mar 20, 2025
7f5074e
another one!
israbbani Mar 20, 2025
625b27c
Cleanup
israbbani Mar 20, 2025
a0bf010
Debugging again
israbbani Mar 20, 2025
a326ebc
And we should be done!
israbbani Mar 20, 2025
c7e32af
Final cleanup!
israbbani Mar 20, 2025
d8c3e4d
Addressing comments
israbbani Mar 21, 2025
0d18f89
Moving privileged flag back into tester_container
israbbani Mar 24, 2025
a57748e
Removing parallel-workers
israbbani Mar 24, 2025
e10ce69
Responding to feedback
israbbani Mar 24, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .bazelrc
Original file line number Diff line number Diff line change
Expand Up @@ -210,6 +210,8 @@ build:ubsan --linkopt -fsanitize=undefined
build:ubsan --linkopt -fno-sanitize-recover=all
build:ubsan --per_file_copt="-external/com_github_grpc_grpc/.*@-fsanitize=undefined"

build:cgroup --sandbox_writable_path=/sys/fs/cgroup --config=llvm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious what is --config=llvm?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it means inherit build:llvm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's defined in .bazelrc file. I'm not sure if this is the right one for what we want. @aslonnie maybe --config=clang is better because it's consistent with our other cpp tests?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename can happen in another follow up.

@israbbani maybe checkin the change of this .bazelrc file first (and I can approve it), and then your iteration on this PR can have much better cache hit rate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for some reason the cache hit rate has improved on its own. Maybe one of the microchecks passed and published the docker image?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it means inherit build:llvm

Sorry for the confusion, my real question is why do we need the config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aslonnie asked me to include it as the base config for C++ testing. It adds the following properties:

# LLVM (clang & libc++) build flags common across Linux installations and systems.
# On Ubuntu, the remaining configurations can be generated by running ci/env/install-llvm-binaries.sh
build:llvm --action_env=CXXFLAGS=-stdlib=libc++
build:llvm --action_env=LDFLAGS=-stdlib=libc++
build:llvm --action_env=BAZEL_CXXOPTS=-stdlib=libc++
build:llvm --action_env=BAZEL_LINKLIBS=-l%:libc++.a:-l%:libc++abi.a
build:llvm --action_env=BAZEL_LINKOPTS=-lm:-pthread
build:llvm --define force_libcpp=enabled

My understanding is that it's needed if we're going to use the cgroup config to run C++ tests. Are you asking why we need the cgroup config?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you asking why we need the cgroup config?

No, I'm just curious why do we need these llvm config.
Because I don't see it in all C++ related configs, for example

ray/.bazelrc

Lines 128 to 134 in 5e05c2f

build:asan --strip=never
build:asan --copt -g
build:asan --copt -fsanitize=address
build:asan --copt -DADDRESS_SANITIZER
build:asan --copt -fno-omit-frame-pointer
build:asan --linkopt -fsanitize=address
build:asan --no//:jemalloc_flag

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good question. We may need to audit these and removes one that we don't need.


# Import local specific llvm config options, which can be generated by
# ci/env/install-llvm-dependencies.sh
try-import %workspace%/.llvm-local.bazelrc
Expand Down
12 changes: 12 additions & 0 deletions .buildkite/cicd.rayci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -13,3 +13,15 @@ steps:
depends_on:
- oss-ci-base_test
- forge
- label: ":coral: reef: privileged container tests"
commands:
- bazel run //ci/ray_ci:test_in_docker --
//ci/ray_ci:test_privileged ci
--cache-test-results --parallelism-per-worker 2
--build-name oss-ci-base_test
--build-type cgroup
--privileged
instance_type: small
depends_on:
- oss-ci-base_test
- forge
13 changes: 13 additions & 0 deletions ci/ray_ci/BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -208,3 +208,16 @@ py_test(
ci_require("pytest"),
],
)

# This test is only run on linux machines
# with docker containers that have --privileged
# enabled.
py_test(
name = "test_privileged",
size = "small",
srcs = ["test_privileged.py"],
tags = [
"team:ci"
],
deps = [ci_require("pytest")],
)
10 changes: 6 additions & 4 deletions ci/ray_ci/container.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@

from typing import List, Tuple, Optional


_CUDA_COPYRIGHT = """
==========
== CUDA ==
Expand Down Expand Up @@ -57,6 +58,7 @@ def __init__(
self.volumes = volumes or []
self.envs = envs or []
self.envs += _DOCKER_ENV
self.privileged = False

def run_script_with_output(self, script: List[str]) -> str:
"""
Expand Down Expand Up @@ -109,10 +111,10 @@ def get_run_command(
:param gpu_ids: ids of gpus on the host machine
"""
artifact_mount_host, artifact_mount_container = self.get_artifact_mount()
command = [
"docker",
"run",
"-i",
command = ["docker", "run", "-i"]
if self.privileged:
command.append("--privileged")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why it needs to be added at this location? this split of command feels a bit weird..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was having trouble getting it to work if the --privileged flag was at the end of the command. It works now so it must've been another problem. I moved it back

command += [
"--rm",
"--volume",
f"{artifact_mount_host}:{artifact_mount_container}",
Expand Down
8 changes: 5 additions & 3 deletions ci/ray_ci/linux_container.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,13 +19,15 @@ def __init__(
volumes: Optional[List[str]] = None,
envs: Optional[List[str]] = None,
tmp_filesystem: Optional[str] = None,
privileged: bool = False,
) -> None:
super().__init__(docker_tag, volumes, envs)

if tmp_filesystem is not None:
if tmp_filesystem != "tmpfs":
raise ValueError("Only tmpfs is supported for tmp filesystem")
self.tmp_filesystem = tmp_filesystem
self.privileged = privileged

def install_ray(
self, build_type: Optional[str] = None, mask: Optional[str] = None
Expand Down Expand Up @@ -78,16 +80,16 @@ def get_run_command_extra_args(
"--mount",
f"type={self.tmp_filesystem},destination=/tmp",
]
for cap in _DOCKER_CAP_ADD:
extra_args += ["--cap-add", cap]
if not self.privileged:
for cap in _DOCKER_CAP_ADD:
extra_args += ["--cap-add", cap]
if gpu_ids:
extra_args += ["--gpus", f'"device={",".join(map(str, gpu_ids))}"']
extra_args += [
"--workdir",
"/rayci",
"--shm-size=2.5gb",
]

return extra_args

def get_artifact_mount(self) -> Tuple[str, str]:
Expand Down
2 changes: 2 additions & 0 deletions ci/ray_ci/linux_tester_container.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ def __init__(
build_type: Optional[str] = None,
install_mask: Optional[str] = None,
tmp_filesystem: Optional[str] = None,
privileged: bool = False,
) -> None:
LinuxContainer.__init__(
self,
Expand All @@ -28,6 +29,7 @@ def __init__(
"/var/run/docker.sock:/var/run/docker.sock",
],
tmp_filesystem=tmp_filesystem,
privileged=privileged,
)
TesterContainer.__init__(
self,
Expand Down
50 changes: 50 additions & 0 deletions ci/ray_ci/test_privileged.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
import os
import pytest
import sys

from pathlib import Path

# In privileged containers, we expect the following
# cgroupv1 is disabled
# cgroupv2 is enabled and mounted on /sys/fs/cgroup
# the user running tests has read and write access to the cgroup subtree
# memory and cpu controllers are enabled

_MOUNT_FILE_PATH = "/proc/mounts"
_CGROUP2_PATH = "/sys/fs/cgroup"
_CTRL_FILE = "cgroup.controllers"
_EXPECTED_CTRLS = ["memory", "cpu"]


# mount file format:
# cgroup /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
def test_only_cgroupv2_mounted_rw():
found_cgroupv2 = False
found_cgroupv1 = False
with open(Path(_MOUNT_FILE_PATH)) as f:
for line in f:
c = line.split()
found_cgroupv2 = found_cgroupv2 or (
c[2] == "cgroup2" and c[1] == _CGROUP2_PATH and "rw" in c[3]
)
found_cgroupv1 = found_cgroupv1 or (c[2] == "cgroup")
assert found_cgroupv2 and not found_cgroupv1


def test_cgroupv2_rw_for_test_user():
assert os.access(_CGROUP2_PATH, os.R_OK) and os.access(_CGROUP2_PATH, os.W_OK)


def test_cgroupv2_controllers_enabled():
with open(os.path.join(_CGROUP2_PATH, _CTRL_FILE)) as f:
enabled = f.readlines()
assert len(enabled) == 1
enabled_ctrls = enabled[0].split()
for expected_ctrl in _EXPECTED_CTRLS:
assert (
expected_ctrl in enabled_ctrls
), f"Expected {expected_ctrl} to be enabled for cgroups2, but it is not"


if __name__ == "__main__":
sys.exit(pytest.main(["-v", __file__]))
19 changes: 19 additions & 0 deletions ci/ray_ci/test_tester.py
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,25 @@ def test_get_tag_matcher() -> None:
)


def test_linux_privileged() -> None:
with mock.patch(
"ci.ray_ci.linux_tester_container.LinuxTesterContainer.install_ray",
return_value=None,
):
container = _get_container(
team="core",
operating_system="linux",
workers=3,
worker_id=1,
parallelism_per_worker=2,
network=None,
gpus=0,
tmp_filesystem=None,
privileged=True,
)
assert container.privileged


def test_get_container() -> None:
with mock.patch(
"ci.ray_ci.linux_tester_container.LinuxTesterContainer.install_ray",
Expand Down
12 changes: 12 additions & 0 deletions ci/ray_ci/tester.py
Original file line number Diff line number Diff line change
Expand Up @@ -159,6 +159,7 @@
"asan-clang",
"ubsan",
"tsan-clang",
"cgroup",
# java build types
"java",
# do not build ray
Expand Down Expand Up @@ -188,6 +189,13 @@
type=str,
help=("Filesystem to use for /tmp"),
)
@click.option(
"--privileged",
is_flag=True,
show_default=True,
default=False,
help="Run the test in a privileged Docker container",
)
def main(
targets: List[str],
team: str,
Expand All @@ -212,6 +220,7 @@ def main(
install_mask: Optional[str],
bisect_run_test_target: Optional[str],
tmp_filesystem: Optional[str],
privileged: bool,
) -> None:
if not bazel_workspace_dir:
raise Exception("Please use `bazelisk run //ci/ray_ci`")
Expand Down Expand Up @@ -241,6 +250,7 @@ def main(
build_type=build_type,
skip_ray_installation=skip_ray_installation,
install_mask=install_mask,
privileged=privileged,
)
if build_only:
sys.exit(0)
Expand Down Expand Up @@ -291,6 +301,7 @@ def _get_container(
build_type: Optional[str] = None,
install_mask: Optional[str] = None,
skip_ray_installation: bool = False,
privileged: bool = False,
) -> TesterContainer:
shard_count = workers * parallelism_per_worker
shard_start = worker_id * parallelism_per_worker
Expand All @@ -312,6 +323,7 @@ def _get_container(
build_type=build_type,
tmp_filesystem=tmp_filesystem,
install_mask=install_mask,
privileged=privileged,
)

if operating_system == "windows":
Expand Down
2 changes: 2 additions & 0 deletions ci/ray_ci/tester_container.py
Original file line number Diff line number Diff line change
Expand Up @@ -252,6 +252,8 @@ def _run_tests_in_docker(
test_cmd += "--config=ubsan "
if self.build_type == "tsan-clang":
test_cmd += "--config=tsan-clang "
if self.build_type == "cgroup":
test_cmd += "--config=cgroup "
for env in test_envs:
test_cmd += f"--test_env {env} "
if test_arg:
Expand Down
2 changes: 1 addition & 1 deletion ci/ray_ci/tests.env.Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ if [[ "$BUILD_TYPE" == "skip" || "${BUILD_TYPE}" == "ubsan" ]]; then
exit 0
fi

if [[ "$BUILD_TYPE" == "clang" || "$BUILD_TYPE" == "asan-clang" || "$BUILD_TYPE" == "tsan-clang" ]]; then
if [[ "$BUILD_TYPE" == "clang" || "$BUILD_TYPE" == "asan-clang" || "$BUILD_TYPE" == "tsan-clang" || "$BUILD_TYPE" == "cgroup" ]]; then
echo "--- Install LLVM dependencies (and skip building ray package)"
bash ci/env/install-llvm-binaries.sh
exit 0
Expand Down