
[RFC] k8s-native worker pool #14077

Open
ericl opened this issue Feb 12, 2021 · 48 comments
Labels
enhancement (Request for new feature and/or capability) · P2 (Important issue, but not time-critical) · size-large

Comments

@ericl
Contributor

ericl commented Feb 12, 2021

We should consider supporting a k8s-native worker pool for when Ray is being run with the k8s operator. This means launching worker processes in individual pods separate from the raylet pod, rather than having workers living together in the big raylet pod.

The advantages of having workers in separate pods:

  • [major] Can have separate containers per worker (one mechanism for implementing runtime_env RFC: [RFC] runtime_env for actors and tasks #14019)
  • [major] Can have resource limit enforcement for individual workers
  • [minor] Workers can be targeted individually by k8s label selectors for metrics scraping, load balancing, etc.
  • [minor] Can get individual logs for workers

Disadvantages:

  • K8s doesn't support nested containers, so the implementation gets a bit complicated
  • Raylet pods will need k8s API access to create worker pods
  • Workers start slightly slower (container overhead is added on top of process overhead)

Proposal:

  • The raylet pod requests the resources (e.g., {"cpu": {"request": 32}} for a 32-worker raylet pod).
  • The raylet creates worker pods with zero requests but with specified limits (e.g., {"cpu": {"request": 0, "limit": 1}}). The pods are given constraining labels (e.g., node affinity) to force them to be co-located with the raylet.
  • Somehow the shared memory socket is shared with these workers (?)
  • These worker pods join the raylet as worker processes.
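To make the proposal concrete, here is a minimal sketch of what the raylet-side pod creation could look like, using the kubernetes Python client. Everything specific in it (image, entrypoint, namespace, pinning via a downward-API node name) is an illustrative assumption, not part of the RFC:

import os

from kubernetes import client, config

def create_worker_pod(worker_id: int, namespace: str = "ray") -> None:
    config.load_incluster_config()  # use the raylet pod's service account credentials
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(
            name=f"ray-worker-{worker_id}",
            labels={"ray.io/role": "worker"},  # individually targetable by label selectors
        ),
        spec=client.V1PodSpec(
            # Co-locate with the raylet by pinning to its node (node name assumed
            # to be injected into the raylet pod via the downward API).
            node_name=os.environ["RAY_NODE_NAME"],
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="worker",
                    image="rayproject/ray:latest",            # placeholder image
                    command=["python", "default_worker.py"],  # placeholder entrypoint
                    # Zero request but a hard limit, per the proposal above.
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "0"},
                        limits={"cpu": "1"},
                    ),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)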
ericl added the RFC (RFC issues) label Feb 12, 2021
ericl changed the title from [RFC] k8s worker pool to [RFC] k8s-native worker pool Feb 12, 2021
@edoakes
Contributor

edoakes commented Feb 12, 2021

@ericl one thought about the privilege concern: is it possible to use k8s RBAC so that the Raylet can only create and delete these specific worker pods?
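For reference, a rough sketch of how that could look with a namespaced Role, built here with the kubernetes Python client; the "ray" namespace and "raylet" ServiceAccount names are made up for illustration:

from kubernetes import client, config

config.load_incluster_config()
rbac = client.RbacAuthorizationV1Api()

# Role limited to pod management within the Ray namespace only.
role = client.V1Role(
    metadata=client.V1ObjectMeta(name="ray-worker-pods", namespace="ray"),
    rules=[
        client.V1PolicyRule(
            api_groups=[""],  # core API group
            resources=["pods"],
            verbs=["create", "delete", "get", "list", "watch"],
        )
    ],
)
rbac.create_namespaced_role(namespace="ray", body=role)

# Bind it to the ServiceAccount the raylet pod runs under.
binding = client.V1RoleBinding(
    metadata=client.V1ObjectMeta(name="raylet-worker-pods", namespace="ray"),
    subjects=[
        # Note: this class is named RbacV1Subject in newer client releases.
        client.V1Subject(kind="ServiceAccount", name="raylet", namespace="ray"),
    ],
    role_ref=client.V1RoleRef(
        api_group="rbac.authorization.k8s.io", kind="Role", name="ray-worker-pods"
    ),
)
rbac.create_namespaced_role_binding(namespace="ray", body=binding)

One caveat: RBAC scopes by resource type and namespace, not by individual pod, so the raylet could still delete any pod in that namespace; a dedicated namespace per Ray cluster keeps the blast radius small.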

@ericl
Contributor Author

ericl commented Feb 12, 2021

Probably? cc @thomasdesr @DmitriGekhtman if you have thoughts on how the access control could be implemented here

@ericl
Contributor Author

ericl commented Feb 12, 2021

Another thing is I'm not sure how to support shared memory here (how do we give workers access to that shared memory fd?).

@clarkzinzow
Contributor

Another thing is I'm not sure how to support shared memory here (how do we give workers access to that shared memory fd?).

I've gotten something similar to this working between a driver pod and a DaemonSet raylet in the past using bidirectional mount propagation on a shared hostpath volume for the raylet and plasma socket paths, where the mount is propagated to the host and all other containers/pods that share the volume. The drivers then use HostToContainer mount propagation.

An issue with this approach is that if there's more than one raylet per node, you start running into hostPath conflicts. But that could easily be solved by adding the node ID to the hostPath.
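For reference, a sketch of the relevant pod-spec fragments, written here as Python dicts in the k8s manifest schema (paths and names are placeholders):

# Shared hostPath volume, defined in both the raylet and driver/worker pods.
# Suffixing the path with a node or session ID avoids the conflict noted above.
session_volume = {
    "name": "ray-session",
    "hostPath": {"path": "/tmp/ray", "type": "DirectoryOrCreate"},
}

# Raylet (DaemonSet) container: Bidirectional propagates its mounts back to the
# host and on to every other container that mounts this volume. Bidirectional
# propagation requires a privileged container.
raylet_mount = {
    "name": "ray-session",
    "mountPath": "/tmp/ray",
    "mountPropagation": "Bidirectional",
}

# Driver/worker containers: HostToContainer only receives mounts from the host.
worker_mount = {
    "name": "ray-session",
    "mountPath": "/tmp/ray",
    "mountPropagation": "HostToContainer",
}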

@clarkzinzow
Contributor

I assume that slow worker pool scale up would be a significant disadvantage? Pod startup times can be a few seconds or longer, and could be particularly bad for dynamic worker provisioning, such as prestarting workers on lease request, dedicated workers, IO workers, etc.

@ericl
Contributor Author

ericl commented Feb 12, 2021

Workers already take a couple of seconds to start, so I don't see this as a significant disadvantage. Things might get a bit slower with the extra overhead, but Ray is already designed around the fact that worker startup is a high-overhead operation.

@simon-mo
Contributor

On: "raylet creates worker pods ", can raylet just tell the k8s operator (somehow) to create the worker pods?

@ericl
Contributor Author

ericl commented Feb 13, 2021

Yeah, that's a good idea. An initial version could request pod creation through the operator, which would avoid needing to talk to the k8s apiserver directly.

@DmitriGekhtman
Contributor

On: "raylet creates worker pods ", can raylet just tell the k8s operator (somehow) to create the worker pods?

+1 for separation of the control plane and data plane.

@DmitriGekhtman
Contributor

DmitriGekhtman commented Feb 14, 2021

The solution @clarkzinzow described (raylet DaemonSet, worker pods) seems pretty reasonable; it matches Ray's architecture to the K8s architecture pretty well.
Why not do that?

@kfstorm
Member

kfstorm commented Mar 2, 2021

Nice RFC! We've been struggling with Java worker memory restriction for a long time. A couple of things we may need to think about:

  • Worker lifetime management (is it still possible with this RFC?)
    • Currently Raylet is able to kill workers by sending signals.
    • A worker will quit automatically if its ppid becomes 1 (meaning that the parent process has died); see the sketch after this list.
  • Since workers are not in the Raylet pod anymore, will drivers be able to run in standalone pods as well?
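A simplified sketch of that ppid-based check (illustrative only, not Ray's actual code):

import os
import sys
import time

def exit_if_orphaned(poll_interval_s: float = 1.0) -> None:
    # When the parent (the raylet) dies, the worker process is reparented
    # to init, so a parent pid of 1 means the raylet is gone.
    while True:
        if os.getppid() == 1:
            sys.exit(1)
        time.sleep(poll_interval_s)

Note that with pod-per-worker the raylet is no longer the worker's parent process, so both this check and signal-based killing would need a replacement, e.g. deleting the worker pod or a heartbeat over the existing socket.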

@Qstar
Member

Qstar commented Mar 2, 2021

Great topic! This feature will give better control over separation and more flexible scheduling.

@chenk008
Contributor

chenk008 commented Mar 2, 2021

Big +1!

In AntGroup, we are considering two schemes for k8s-native workers:

  • Pod per worker
  1. deploy the raylet as a daemon; the raylet asks the k8s apiserver to create worker pods
  2. combine the GCS scheduler with the k8s scheduler using pod affinity; this can reuse k8s resource restrictions
  3. use a k8s PersistentVolume to share the session dir among the worker pods on the same k8s node
  4. the autoscaler creates/deletes k8s nodes
  • Container per worker, in one pod
  1. deploy the raylet as a pod and mount the containerd socket into the raylet pod
  2. the raylet asks containerd to create new worker containers in the current pod; every worker runs in a separate container with its own cgroup limit, and the pod has a parent cgroup for all the worker containers
  3. use container mounts to share the session dir with the worker containers
  4. it's easy to run on bare metal
  5. the autoscaler creates/deletes pods

We would be glad to join this RFC.

@ericl
Contributor Author

ericl commented Mar 2, 2021

Currently Raylet is able to kill workers by sending signals.

I think so; the pod will exit in this case once the process it contains is shut down.

Since workers are not in the Raylet pod anymore, will drivers be able to run in standalone pods as well?

It should become possible, though it would still need to be co-located with a Raylet.

In AntGroup, we are considering two schemes for k8s-native workers
Pod per worker
Container per worker, in one pod

cc @edoakes, this is also very similar to our internal discussions. Our conclusion was that the container-per-worker approach is a bit problematic due to:

  • the need for privileged containers to allow container-in-container (this is the main sticking point; if we didn't need privileged containers, this would be simpler)
  • the inability to address the containers within the pod using k8s labels and tooling

However, the pod-per-worker approach is somewhat complicated. The initial approach suggested in the RFC is untenable since it assigns CPU shares of 0 to worker pods, meaning they would get zero CPU under contention, and there doesn't seem to be a way to work around this without modifying the k8s scheduler. The DaemonSet approach or similar may be the way to go.

Btw @kfstorm @chenk008, our conclusion was that k8s-native workers are desirable, but we won't have the bandwidth to implement this ourselves until Q3. If you have resources to start work on k8s-native workers in Q2, though, we could work together on that.

@chenk008
Contributor

chenk008 commented Mar 3, 2021

@ericl

Container per worker, in one pod

Currently we create the worker containers in the same pod as the raylet, not container-in-container. In other words, the raylet container and the worker containers run at the same level of the hierarchy.

@ericl
Contributor Author

ericl commented Mar 3, 2021 via email

I was under the impression pods are immutable, do you mean creating containers out of band that aren't managed by k8s?

@chenk008
Contributor

chenk008 commented Mar 3, 2021

I was under the impression pods are immutable, do you mean creating containers out of band that aren't managed by k8s?


Yes, we make the raylet manage the worker containers, and all worker containers share the network namespace. It looks a little hacky.

The advantages:

  • the raylet doesn't need to interact with the k8s apiserver, only with containerd/dockerd
  • no modification is required in the GCS scheduler
  • workers can use the on-premises resources already requested by the pod; in a Kubernetes cluster shared by multiple workloads, this helps reduce scheduling failures caused by resource competition
  • when running on bare metal, we still have the ability to isolate workers

@thomasdesr
Contributor

From what I understand, the dependency on privileged containers mostly comes from cgroups. If we didn't want to rely on cgroups to enforce Ray's scheduling constraints, some shape of container-in-container should be doable. That said, container-in-container is definitely an uncommon path atm.

@ericl
Contributor Author

ericl commented Mar 3, 2021 via email

@edoakes
Contributor

edoakes commented Mar 3, 2021

One advantage of a solution like the one @chenk008 suggested is that it can easily be replicated when not using Kubernetes, so we can keep the same feature set and code paths for all deployment strategies.

@ericl
Contributor Author

ericl commented Mar 4, 2021

Yeah, agreed; I actually like this solution of "out-of-band docker containers". It seems the main unknowns here are (1) whether k8s users will find this generally acceptable, and (2) whether k8s providers like EKS and GKE support it.

@thomasdesr
Contributor

Out-of-band docker containers are roughly equivalent to privileged containers. If we can figure out how to nest containers sensibly, it offers a big advantage in terms of cleanliness (cleanup is easy and automatic because lineage is obvious).

@ericl
Contributor Author

ericl commented Mar 4, 2021

It seems both GKE and EKS now support privileged containers: https://stackoverflow.com/questions/31124368/allow-privileged-containers-in-kubernetes-on-google-container-gke https://docs.aws.amazon.com/eks/latest/userguide/pod-security-policy.html

@chenk008 have you considered nested containers instead? It sounds slightly better (though I'm not too worried about cleanup, since Ray workers generally kill themselves when the raylet dies). I guess the docker container files might not be cleaned up in either case, unless we eschew docker for container-in-container.

@chenk008
Contributor

chenk008 commented Mar 4, 2021

@ericl
Yeah. In fact, I prefer containerd/dockerd nested containers, because there are some security problems if the raylet directly interacts with containerd/dockerd on the host.
I tested docker-in-docker last year but encountered some problems. We could test it again and try to fix them.

@chenk008
Contributor

chenk008 commented Mar 4, 2021

Podman is a more lightweight container management tool:
https://stackoverflow.com/questions/56032747/how-to-run-podman-from-inside-a-container

Maybe it's a better choice. I will test it and compare it with dockerd/containerd.

@chenk008
Contributor

chenk008 commented Apr 4, 2021

After studying some potential solutions, I think I've found an ideal way to support this feature (a sketch of the resulting launch command follows this list):

  • To start/stop every Ray worker in its own container, use podman/docker; the worker container's entrypoint is java_worker_command or python_worker_command.
  • To support directory sharing, we can use volumes (i.e., Linux bind mounts); mount tmp_dir into every Ray worker container so the worker can connect to the raylet/plasma over a Unix domain socket.
  • To support resource isolation, use the container's cpu/memory cgroups.
  • To support pid management in the raylet, the worker container uses the raylet's pid namespace.
  • To support network communication, the worker container uses the raylet's network namespace.

I'm going to write a PoC to prove the feasibility of these solutions. Please let me know if there are specific topics you'd like to discuss further, and I'll write a more detailed design document for the PoC.
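A hypothetical sketch of what that per-worker launch command could look like; flags, paths, and the image are illustrative, and --pid=host / --network=host approximate sharing the raylet's namespaces when the raylet itself runs on the host:

# Session dir shared with the raylet; the raylet/plasma sockets live here.
tmp_dir = "/tmp/ray/session_latest"
# Placeholder for python_worker_command / java_worker_command.
worker_command = ["python", "default_worker.py"]

container_command = [
    "podman", "run", "--rm",
    "-v", tmp_dir + ":" + tmp_dir,   # bind-mount the session dir into the container
    "--pid=host",                    # share the raylet's pid namespace
    "--network=host",                # share the raylet's network namespace
    "--cpus=1",                      # per-worker cgroup CPU limit
    "--memory=512m",                 # per-worker cgroup memory limit
    "rayproject/ray:latest",         # placeholder worker image
] + worker_command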

@chenk008
Contributor

@ericl @edoakes
I describe the design in this doc:
https://docs.google.com/document/d/1vSdO7NSobdYy7ewe5nenteC0nM9bAffE/edit
Please share your thoughts. We'd appreciate any feedback!

@edoakes
Contributor

edoakes commented Apr 28, 2021 via email

@ericl
Contributor Author

ericl commented May 11, 2021

I see, so it seems HostIPC is an equivalent security issue. The other concerns we could probably work around, but this one seems problematic.

@chenk008
Contributor

Hey @ericl @edoakes, in my opinion, since we want to build Ray as a serverless platform, it makes sense to run the raylet with SYS_ADMIN. I don't think it is very hard to request the SYS_ADMIN capability in most K8s clusters, and we can use seccomp to add syscall restrictions. In AntGroup's K8s clusters, pods are assigned the SYS_ADMIN capability by default.

But the worker containers run without the SYS_ADMIN capability; in other words, jobs run without SYS_ADMIN. We could even drop other capabilities in the worker container and start the worker in a security sandbox; for example, we could use seccomp to forbid the kill syscall in the worker container, which is useful in a multi-tenant cluster.

And we expect the raylet to run as a non-root user, such as admin, so we will start workers in rootless containers, which adds another security layer.

@ericl
Contributor Author

ericl commented May 13, 2021

I see, it does seem better that untrusted user code is run in an unprivileged context. @edoakes @yiranwang52 what do you think? Perhaps we can poll the community to see how acceptable this is?

Note that I don't think we have to satisfy 100% of users, since this feature is optional after all. As long as a majority are happy, I think that is enough to move forward.

@bklaubos

Where is the final decision and design doc for this feature?

@bklaubos

I assume that slow worker pool scale up would be a significant disadvantage? Pod startup times can be a few seconds or longer, and could be particularly bad for dynamic worker provisioning, such as prestarting workers on lease request, dedicated workers, IO workers, etc.

We used Argo Workflows extensively; scaling one worker per pod is going to take a performance hit.
Do not go this route! Be warned.

ericl added this to the Containerized Workers milestone Jun 23, 2021
@chenk008
Contributor

chenk008 commented Jul 6, 2021

I see, so it seems HostIPC is an equivalent security issue. The other concerns we could probably work around, but this one seems problematic.

Hi @ericl, I took another look over the last few days at how Ray uses shared memory. Currently the raylet passes the shared-memory file descriptor to worker processes, so HostIPC is not needed to use shared memory across pod containers when SELinux is disabled. And I think most Kubernetes clusters have SELinux disabled.

I think the approach originally proposed in this RFC is a good option for a lot of scenarios. How about moving on to implement it?
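For reference, the fd-passing mechanism in question is SCM_RIGHTS over a Unix domain socket. Ray's actual implementation is in C++; this Python sketch just illustrates the primitive: as long as both containers can reach the same socket file (e.g., via a shared session dir), the kernel duplicates the descriptor into the receiving process.

import array
import socket

def send_fd(sock: socket.socket, fd: int) -> None:
    # SCM_RIGHTS ancillary data makes the kernel duplicate the fd
    # (e.g., the plasma shared-memory fd) into the peer process.
    sock.sendmsg(
        [b"x"],  # at least one byte of normal data must accompany ancillary data
        [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))],
    )

def recv_fd(sock: socket.socket) -> int:
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[: fds.itemsize])
            return fds[0]
    raise RuntimeError("no file descriptor received")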

@simon-mo
Contributor

@chenk008 circling back: if we assume SYS_ADMIN cannot be turned on and we run container-in-container, is it possible to use something like rootless containers? https://github.com/opencontainers/runc#rootless-containers

@chenk008
Contributor

@simon-mo I think SYS_ADMIN is a necessity.

  1. When we start a container without privileged mode, /sys/fs/cgroup is mounted read-only, so we can't create a new cgroup. But with SYS_ADMIN, we can remount /sys/fs/cgroup as writable. There is a similar issue: Finding the minimal set of privileges for a docker container to spawn rootless containers opencontainers/runc#1456. Below is my test output:
[root@72d5a36f6df4 rootless]# ./runc --root /tmp/runc run mycontainerid
WARN[0000] unable to get oom kill count                  error="no directory specified for memory.oom_control"
ERRO[0000] container_linux.go:380: starting container process caused: process_linux.go:385: applying cgroup configuration for process caused: mkdir /sys/fs/cgroup/cpuset/mycontainerid: read-only file system
  2. And when starting a container, we need to change the process rootfs with pivot_root, which requires SYS_ADMIN: https://man7.org/linux/man-pages/man2/pivot_root.2.html

ericl moved this to "In discussion" in the Ray Core Public Roadmap Nov 11, 2021
ericl added the enhancement (Request for new feature and/or capability), P2 (Important issue, but not time-critical), size-large, and P1 (Issue that should be fixed within a few weeks) labels, and removed the RFC (RFC issues), P2, and P1 labels Nov 17, 2021
@jovany-wang
Contributor

Hi @chenk008, could you share the current progress on this so others know where it stands?

@juliusvonkohout

juliusvonkohout commented Oct 19, 2022

@chenk008 @simon-mo @ericl @edoakes

I have found a solution. With a recent podman and a recent kernel (>5.13, e.g. CentOS 9), I can run rootless podman inside rootless podman: https://www.redhat.com/sysadmin/podman-rootless-overlay
podman run --security-opt label=disable should be enough to run the worker without SYS_ADMIN and any other ugly root stuff in the future.

container_driver = "podman"
container_command = [
    container_driver,
    "run",
    "-v",
    self._ray_tmp_dir + ":" + self._ray_tmp_dir,
    "--cgroup-manager=cgroupfs",
    "--network=host",
    "--pid=host",
    "--ipc=host",
    "--env-host",
]

is still a problem, because of --ipc=host:

host: use the host's shared memory, semaphores, and message queues inside the container. Note: the host mode gives the container full access to local shared memory and is therefore considered insecure.

https://docs.podman.io/en/latest/markdown/podman-run.1.html#ipc-ipc

What is the point of using podman then, if you destroy all its security features? Anyway, we would still need to fix "Error: containers not creating Cgroups must create a private PID namespace: invalid argument"; --ipc=host and --pid=host are the problem. Or we might just use a simple and easy udocker: https://pypi.org/project/udocker/
