
[RFC] k8s-native worker pool #14077

Open
ericl opened this issue Feb 12, 2021 · 48 comments
Labels
enhancement (Request for new feature and/or capability) · P2 (Important issue, but not time-critical) · size-large

Comments

@ericl
Contributor

ericl commented Feb 12, 2021

We should consider supporting a k8s-native worker pool for when Ray is being run with the k8s operator. This means launching worker processes in individual pods separate from the raylet pod, rather than having workers living together in the big raylet pod.

The advantages of having workers in separate pods:

  • [major] Can have separate containers per worker (one mechanism for implementing runtime_env RFC: [RFC] runtime_env for actors and tasks #14019)
  • [major] Can have resource limit enforcement for individual workers
  • [minor] Workers can be targeted individually by k8s label selectors for metrics scraping, load balancing, etc.
  • [minor] Can get individual logs for workers

Disadvantages:

  • K8s doesn't support nested containers, so the implementation gets a bit complicated
  • Raylet pods will need k8s API access to create worker pods
  • Workers start slightly slower (container overhead is added on top of process overhead)

Proposal:

  • The raylet pod requests the resources (e.g., {"cpu": {"request": 32}} for a 32-worker raylet pod).
  • The raylet creates worker pods with zero requests but with specified limits (e.g., {"cpu": {"request": 0, "limit": 1}}). The pods are given constraining labels (e.g., node affinity) to force them to be co-located with the raylet.
  • Somehow the shared memory socket is shared with these workers (?)
  • These worker pods join the raylet as worker processes.
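To make the proposal concrete, here is a minimal sketch of what the raylet-side pod creation could look like, using the kubernetes Python client. Everything specific in it (image, entrypoint, namespace, pinning via a downward-API node name) is an illustrative assumption, not part of the RFC:

import os

from kubernetes import client, config

def create_worker_pod(worker_id: int, namespace: str = "ray") -> None:
    config.load_incluster_config()  # use the raylet pod's service account credentials
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(
            name=f"ray-worker-{worker_id}",
            labels={"ray.io/role": "worker"},  # individually targetable by label selectors
        ),
        spec=client.V1PodSpec(
            # Co-locate with the raylet by pinning to its node (node name assumed
            # to be injected into the raylet pod via the downward API).
            node_name=os.environ["RAY_NODE_NAME"],
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="worker",
                    image="rayproject/ray:latest",            # placeholder image
                    command=["python", "default_worker.py"],  # placeholder entrypoint
                    # Zero request but a hard limit, per the proposal above.
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "0"},
                        limits={"cpu": "1"},
                    ),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)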
ericl added the RFC (RFC issues) label Feb 12, 2021
ericl changed the title from [RFC] k8s worker pool to [RFC] k8s-native worker pool Feb 12, 2021
@edoakes
Contributor

edoakes commented Feb 12, 2021

@ericl one thought about the privilege concern: is it possible to use k8s RBAC so that the Raylet can only create and delete these specific worker pods?
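For reference, a rough sketch of how that could look with a namespaced Role, built here with the kubernetes Python client; the "ray" namespace and "raylet" ServiceAccount names are made up for illustration:

from kubernetes import client, config

config.load_incluster_config()
rbac = client.RbacAuthorizationV1Api()

# Role limited to pod management within the Ray namespace only.
role = client.V1Role(
    metadata=client.V1ObjectMeta(name="ray-worker-pods", namespace="ray"),
    rules=[
        client.V1PolicyRule(
            api_groups=[""],  # core API group
            resources=["pods"],
            verbs=["create", "delete", "get", "list", "watch"],
        )
    ],
)
rbac.create_namespaced_role(namespace="ray", body=role)

# Bind it to the ServiceAccount the raylet pod runs under.
binding = client.V1RoleBinding(
    metadata=client.V1ObjectMeta(name="raylet-worker-pods", namespace="ray"),
    subjects=[
        # Note: this class is named RbacV1Subject in newer client releases.
        client.V1Subject(kind="ServiceAccount", name="raylet", namespace="ray"),
    ],
    role_ref=client.V1RoleRef(
        api_group="rbac.authorization.k8s.io", kind="Role", name="ray-worker-pods"
    ),
)
rbac.create_namespaced_role_binding(namespace="ray", body=binding)

One caveat: RBAC scopes by resource type and namespace, not by individual pod, so the raylet could still delete any pod in that namespace; a dedicated namespace per Ray cluster keeps the blast radius small.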

@ericl
Contributor Author

ericl commented Feb 12, 2021

Probably? cc @thomasdesr @DmitriGekhtman if you have thoughts on how the access control could be implemented here

@ericl
Contributor Author

ericl commented Feb 12, 2021

Another thing is I'm not sure how to support shared memory here (how do we give workers access to that shared memory fd?).

@clarkzinzow
Contributor

Another thing is I'm not sure how to support shared memory here (how do we give workers access to that shared memory fd?).

I've gotten something similar to this working between a driver pod and a DaemonSet raylet in the past using bidirectional mount propagation on a shared hostpath volume for the raylet and plasma socket paths, where the mount is propagated to the host and all other containers/pods that share the volume. The drivers then use HostToContainer mount propagation.

An issue with this approach is that if there's more than one raylet per node, you start running into hostPath conflicts. But that could easily be solved by adding the node ID to the hostPath.
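For reference, a sketch of the relevant pod-spec fragments, written here as Python dicts in the k8s manifest schema (paths and names are placeholders):

# Shared hostPath volume, defined in both the raylet and driver/worker pods.
# Suffixing the path with a node or session ID avoids the conflict noted above.
session_volume = {
    "name": "ray-session",
    "hostPath": {"path": "/tmp/ray", "type": "DirectoryOrCreate"},
}

# Raylet (DaemonSet) container: Bidirectional propagates its mounts back to the
# host and on to every other container that mounts this volume. Bidirectional
# propagation requires a privileged container.
raylet_mount = {
    "name": "ray-session",
    "mountPath": "/tmp/ray",
    "mountPropagation": "Bidirectional",
}

# Driver/worker containers: HostToContainer only receives mounts from the host.
worker_mount = {
    "name": "ray-session",
    "mountPath": "/tmp/ray",
    "mountPropagation": "HostToContainer",
}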

@clarkzinzow
Contributor

I assume that slow worker pool scale up would be a significant disadvantage? Pod startup times can be a few seconds or longer, and could be particularly bad for dynamic worker provisioning, such as prestarting workers on lease request, dedicated workers, IO workers, etc.

@ericl
Contributor Author

ericl commented Feb 12, 2021

Workers already take a couple of seconds to start, so I don't see this as a significant disadvantage. Things might get a bit slower with the extra overhead, but Ray is already designed around the fact that worker startup is a high-overhead operation.

@simon-mo
Contributor

On: "raylet creates worker pods ", can raylet just tell the k8s operator (somehow) to create the worker pods?

@ericl
Contributor Author

ericl commented Feb 13, 2021

Yeah, that's a good idea. An initial version could request pod creation through the operator, which would avoid needing to talk to the k8s apiserver directly.

@DmitriGekhtman
Contributor

On: "raylet creates worker pods ", can raylet just tell the k8s operator (somehow) to create the worker pods?

+1 for separation of the control plane and data plane.

@DmitriGekhtman
Contributor

DmitriGekhtman commented Feb 14, 2021

The solution @clarkzinzow described (raylet DaemonSet, worker pods) seems pretty reasonable; it matches Ray's architecture to the K8s architecture pretty well.
Why not do that?

@kfstorm
Member

kfstorm commented Mar 2, 2021

Nice RFC! We've been struggling with Java worker memory restriction for a long time. A couple of things we may need to think about:

  • Worker lifetime management (is it still possible with this RFC?)
    • Currently Raylet is able to kill workers by sending signals.
    • A worker will quit automatically if its ppid becomes 1 (meaning that the parent process has died); see the sketch after this list.
  • Since workers are not in the Raylet pod anymore, will drivers be able to run in standalone pods as well?
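A simplified sketch of that ppid-based check (illustrative only, not Ray's actual code):

import os
import sys
import time

def exit_if_orphaned(poll_interval_s: float = 1.0) -> None:
    # When the parent (the raylet) dies, the worker process is reparented
    # to init, so a parent pid of 1 means the raylet is gone.
    while True:
        if os.getppid() == 1:
            sys.exit(1)
        time.sleep(poll_interval_s)

Note that with pod-per-worker the raylet is no longer the worker's parent process, so both this check and signal-based killing would need a replacement, e.g. deleting the worker pod or a heartbeat over the existing socket.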

@Qstar
Member

Qstar commented Mar 2, 2021

Great topic! This feature will give better control over separation and more flexible scheduling.

@chenk008
Contributor

chenk008 commented Mar 2, 2021

Big +1!

In AntGroup, we are considering two schemes for k8s-native workers:

  • Pod per worker
  1. deploy the raylet as a daemon; the raylet asks the k8s apiserver to create worker pods
  2. combine the GCS scheduler with the k8s scheduler using pod affinity; this can reuse k8s resource restrictions
  3. use a k8s PersistentVolume to share the session dir among the worker pods on the same k8s node
  4. the autoscaler creates/deletes k8s nodes
  • Container per worker, in one pod
  1. deploy the raylet as a pod and mount the containerd socket into the raylet pod
  2. the raylet asks containerd to create new worker containers in the current pod; every worker runs in a separate container with its own cgroup limit, and the pod has a parent cgroup for all the worker containers
  3. use container mounts to share the session dir with the worker containers
  4. it's easy to run on bare metal
  5. the autoscaler creates/deletes pods

We would be glad to join this RFC.

@ericl
Contributor Author

ericl commented Mar 2, 2021

Currently Raylet is able to kill workers by sending signals.

I think so; the pod will exit in this case once the process it contains is shut down.

Since workers are not in the Raylet pod anymore, will drivers be able to run in standalone pods as well?

It should become possible, though it would still need to be co-located with a Raylet.

In AntGroup, we are considering two schemes for k8s-native workers
Pod per worker
Container per worker, in one pod

cc @edoakes, this is also very similar to our internal discussions. Our conclusion was that the container-per-worker approach is a bit problematic due to:

  • the need for privileged containers to allow container-in-container (this is the main sticking point; if we didn't need privileged containers, this would be simpler)
  • the inability to address the containers within the pod using k8s labels and tooling

However, the pod-per-worker approach is somewhat complicated. The initial approach suggested in the RFC is untenable since it assigns CPU shares of 0 to worker pods, meaning they would get zero CPU under contention, and there doesn't seem to be a way to work around this without modifying the k8s scheduler. The DaemonSet approach or similar may be the way to go.

Btw @kfstorm @chenk008, our conclusion was that k8s-native workers are desirable, but we won't have the bandwidth to implement this ourselves until Q3. If you have resources to start work on k8s-native workers in Q2, though, we could work together on that.

@chenk008
Contributor

chenk008 commented Mar 3, 2021

@ericl

Container per worker, in one pod

Currently we create the worker containers in the same pod as the raylet, not container-in-container. In other words, the raylet container and the worker containers run at the same level of the hierarchy.

@ericl
Contributor Author

ericl commented Mar 3, 2021 via email

I was under the impression pods are immutable, do you mean creating containers out of band that aren't managed by k8s?

@chenk008
Contributor

chenk008 commented Mar 3, 2021

I was under the impression pods are immutable, do you mean creating containers out of band that aren't managed by k8s?


Yes, we make the raylet manage the worker containers, and all worker containers share the network namespace. It looks a little hacky.

The advantages:

  • the raylet doesn't need to interact with the k8s apiserver, only with containerd/dockerd
  • no modification is required in the GCS scheduler
  • workers can use the on-premises resources already requested by the pod; in a Kubernetes cluster shared by multiple workloads, this helps reduce scheduling failures caused by resource competition
  • when running on bare metal, we still have the ability to isolate workers

@thomasdesr
Contributor

From what I understand, the dependency on privileged containers mostly comes from cgroups. If we didn't want to rely on cgroups to enforce Ray's scheduling constraints, some shape of container-in-container should be doable. That said, container-in-container is definitely an uncommon path atm.

@ericl
Contributor Author

ericl commented Mar 3, 2021 via email

@edoakes
Contributor

edoakes commented Mar 3, 2021

One advantage of a solution like the one @chenk008 suggested is that it can easily be replicated when not using Kubernetes, so we can keep the same feature set and code paths for all deployment strategies.

@ericl
Contributor Author

ericl commented Mar 4, 2021

Yeah, agreed; I actually like this solution of "out-of-band docker containers". It seems the main unknowns here are (1) whether k8s users will find this generally acceptable, and (2) whether k8s providers like EKS and GKE support it.

@thomasdesr
Contributor

Out-of-band docker containers are roughly equivalent to privileged containers. If we can figure out how to nest containers sensibly, it offers a big advantage in terms of cleanliness (cleanup is easy and automatic because lineage is obvious).

@ericl
Contributor Author

ericl commented Mar 4, 2021

It seems both GKE and EKS now support privileged containers: https://stackoverflow.com/questions/31124368/allow-privileged-containers-in-kubernetes-on-google-container-gke https://docs.aws.amazon.com/eks/latest/userguide/pod-security-policy.html

@chenk008 have you considered nested containers instead? It sounds slightly better (though I'm not too worried about cleanup, since Ray workers generally kill themselves when the raylet dies). I guess the docker container files might not be cleaned up in either case, unless we eschew docker for container-in-container.

@chenk008
Contributor

chenk008 commented Mar 4, 2021

@ericl
Yeah. In fact, I prefer containerd/dockerd nested containers, because there are some security problems if the raylet directly interacts with containerd/dockerd on the host.
I tested docker-in-docker last year but encountered some problems. We could test it again and try to fix them.

@chenk008
Contributor

chenk008 commented Mar 4, 2021

Podman is a more lightweight container management tool:
https://stackoverflow.com/questions/56032747/how-to-run-podman-from-inside-a-container

Maybe it's a better choice. I will test it and compare it with dockerd/containerd.

@chenk008
Contributor

chenk008 commented Apr 4, 2021

After studying some potential solutions, I think I've found an ideal way to support this feature (a sketch of the resulting launch command follows this list):

  • To start/stop every Ray worker in its own container, use podman/docker; the worker container's entrypoint is java_worker_command or python_worker_command.
  • To support directory sharing, we can use volumes (i.e., Linux bind mounts); mount tmp_dir into every Ray worker container so the worker can connect to the raylet/plasma over a Unix domain socket.
  • To support resource isolation, use the container's cpu/memory cgroups.
  • To support pid management in the raylet, the worker container uses the raylet's pid namespace.
  • To support network communication, the worker container uses the raylet's network namespace.

I'm going to write a PoC to prove the feasibility of these solutions. Please let me know if there are specific topics you'd like to discuss further, and I'll write a more detailed design document for the PoC.
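A hypothetical sketch of what that per-worker launch command could look like; flags, paths, and the image are illustrative, and --pid=host / --network=host approximate sharing the raylet's namespaces when the raylet itself runs on the host:

# Session dir shared with the raylet; the raylet/plasma sockets live here.
tmp_dir = "/tmp/ray/session_latest"
# Placeholder for python_worker_command / java_worker_command.
worker_command = ["python", "default_worker.py"]

container_command = [
    "podman", "run", "--rm",
    "-v", tmp_dir + ":" + tmp_dir,   # bind-mount the session dir into the container
    "--pid=host",                    # share the raylet's pid namespace
    "--network=host",                # share the raylet's network namespace
    "--cpus=1",                      # per-worker cgroup CPU limit
    "--memory=512m",                 # per-worker cgroup memory limit
    "rayproject/ray:latest",         # placeholder worker image
] + worker_command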

@chenk008
Contributor

@ericl @edoakes
I describe the design in this doc:
https://docs.google.com/document/d/1vSdO7NSobdYy7ewe5nenteC0nM9bAffE/edit
Please share your thoughts. We'd appreciate any feedback!

@edoakes
Contributor

edoakes commented Apr 28, 2021 via email

@ericl
Contributor Author

ericl commented May 11, 2021

I see, so it seems HostIPC is an equivalent security issue. The other concerns we could probably work around, but this one seems problematic.

@chenk008
Contributor

Hey @ericl @edoakes, in my opinion, since we want to build Ray as a serverless platform, it makes sense to run the raylet with SYS_ADMIN. I don't think it is very hard to request the SYS_ADMIN capability in most K8s clusters, and we can use seccomp to add syscall restrictions. In AntGroup's K8s clusters, pods are assigned the SYS_ADMIN capability by default.

But the worker containers run without the SYS_ADMIN capability; in other words, jobs run without SYS_ADMIN. We could even drop other capabilities in the worker container and start the worker in a security sandbox; for example, we could use seccomp to forbid the kill syscall in the worker container, which is useful in a multi-tenant cluster.

And we expect the raylet to run as a non-root user, such as admin, so we will start workers in rootless containers, which adds another security layer.

@ericl
Contributor Author

ericl commented May 13, 2021

I see, it does seem better that untrusted user code is run in an unprivileged context. @edoakes @yiranwang52 what do you think? Perhaps we can poll the community to see how acceptable this is?

Note that I don't think we have to satisfy 100% of users, since this feature is optional after all. As long as a majority are happy, I think that is enough to move forward.

@bklaubos

Where is the final decision and design doc for this feature?

@bklaubos

I assume that slow worker pool scale up would be a significant disadvantage? Pod startup times can be a few seconds or longer, and could be particularly bad for dynamic worker provisioning, such as prestarting workers on lease request, dedicated workers, IO workers, etc.

We used Argo Workflows extensively; scaling one worker per pod is going to take a performance hit.
Do not go this route! Be warned.

ericl added this to the Containerized Workers milestone Jun 23, 2021
@chenk008
Contributor

chenk008 commented Jul 6, 2021

I see, so it seems HostIPC is an equivalent security issue. The other concerns we could probably work around, but this one seems problematic.

Hi @ericl, I took another look over the last few days at how Ray uses shared memory. Currently the raylet passes the shared-memory file descriptor to worker processes, so HostIPC is not needed to use shared memory across pod containers when SELinux is disabled. And I think most Kubernetes clusters have SELinux disabled.

I think the approach originally proposed in this RFC is a good option for a lot of scenarios. How about moving on to implement it?
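For reference, the fd-passing mechanism in question is SCM_RIGHTS over a Unix domain socket. Ray's actual implementation is in C++; this Python sketch just illustrates the primitive: as long as both containers can reach the same socket file (e.g., via a shared session dir), the kernel duplicates the descriptor into the receiving process.

import array
import socket

def send_fd(sock: socket.socket, fd: int) -> None:
    # SCM_RIGHTS ancillary data makes the kernel duplicate the fd
    # (e.g., the plasma shared-memory fd) into the peer process.
    sock.sendmsg(
        [b"x"],  # at least one byte of normal data must accompany ancillary data
        [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))],
    )

def recv_fd(sock: socket.socket) -> int:
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[: fds.itemsize])
            return fds[0]
    raise RuntimeError("no file descriptor received")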

@simon-mo
Contributor

@chenk008 circling back: if we assume SYS_ADMIN cannot be turned on and we run container-in-container, is it possible to use something like rootless containers? https://github.com/opencontainers/runc#rootless-containers

@chenk008
Contributor

@simon-mo I think SYS_ADMIN is a necessity.

  1. When we start a container without privileged mode, /sys/fs/cgroup is mounted read-only, so we can't create a new cgroup. But with SYS_ADMIN, we can remount /sys/fs/cgroup as writable. There is a similar issue: Finding the minimal set of privileges for a docker container to spawn rootless containers opencontainers/runc#1456. Below is my test output:
[root@72d5a36f6df4 rootless]# ./runc --root /tmp/runc run mycontainerid
WARN[0000] unable to get oom kill count                  error="no directory specified for memory.oom_control"
ERRO[0000] container_linux.go:380: starting container process caused: process_linux.go:385: applying cgroup configuration for process caused: mkdir /sys/fs/cgroup/cpuset/mycontainerid: read-only file system
  2. And when starting a container, we need to change the process rootfs with pivot_root, which requires SYS_ADMIN: https://man7.org/linux/man-pages/man2/pivot_root.2.html

ericl moved this to "In discussion" in the Ray Core Public Roadmap Nov 11, 2021
ericl added the enhancement (Request for new feature and/or capability), P2 (Important issue, but not time-critical), size-large, and P1 (Issue that should be fixed within a few weeks) labels, and removed the RFC (RFC issues), P2, and P1 labels Nov 17, 2021
@jovany-wang
Contributor

Hi @chenk008, could you share the current progress on this so others know where it stands?

@juliusvonkohout

juliusvonkohout commented Oct 19, 2022

@chenk008 @simon-mo @ericl @edoakes

I have found a solution. With a recent podman and a recent kernel (>5.13, e.g. CentOS 9), I can run rootless podman inside rootless podman: https://www.redhat.com/sysadmin/podman-rootless-overlay
podman run --security-opt label=disable should be enough to run the worker without SYS_ADMIN and any other ugly root stuff in the future.

container_driver = "podman"
container_command = [
    container_driver,
    "run",
    "-v",
    self._ray_tmp_dir + ":" + self._ray_tmp_dir,
    "--cgroup-manager=cgroupfs",
    "--network=host",
    "--pid=host",
    "--ipc=host",
    "--env-host",
]

is still a problem, because of --ipc=host:

host: use the host's shared memory, semaphores, and message queues inside the container. Note: the host mode gives the container full access to local shared memory and is therefore considered insecure.

https://docs.podman.io/en/latest/markdown/podman-run.1.html#ipc-ipc

What is the point of using podman then, if you destroy all its security features? Anyway, we would still need to fix "Error: containers not creating Cgroups must create a private PID namespace: invalid argument"; --ipc=host and --pid=host are the problem. Or we might just use a simple and easy udocker: https://pypi.org/project/udocker/
