[RFC] k8s-native worker pool #14077
Comments
@ericl one thought about the privilege concern: is it possible to use k8s RBAC so that the Raylet can only create and delete these specific worker pods?
Probably? cc @thomasdesr @DmitriGekhtman if you have thoughts on how the access control could be implemented here.
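A minimal sketch of what that RBAC restriction could look like, using the official kubernetes Python client. The namespace, Role name, and verb list are illustrative assumptions, and the RoleBinding that attaches this Role to the raylet pod's ServiceAccount is omitted:

```python
# Hypothetical sketch: a namespaced Role that only allows managing pods, which a
# cluster admin (or the Ray operator) would bind to the raylet's ServiceAccount.
from kubernetes import client, config

config.load_kube_config()  # admin/operator context, not the raylet itself

role = client.V1Role(
    metadata=client.V1ObjectMeta(name="raylet-worker-pods", namespace="ray"),
    rules=[
        client.V1PolicyRule(
            api_groups=[""],  # core API group (pods)
            resources=["pods"],
            verbs=["create", "delete", "get", "list", "watch"],
        )
    ],
)
client.RbacAuthorizationV1Api().create_namespaced_role(namespace="ray", body=role)
```

With only these verbs scoped to a single namespace, a compromised raylet could still delete other pods in that namespace, so name- or label-aware admission policies might be needed for tighter scoping.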
Another thing: I'm not sure how to support shared memory here (how do we give workers access to the shared-memory fd?).
I've gotten something similar to this working between a driver pod and a DaemonSet raylet in the past, using bidirectional mount propagation on a shared hostPath volume for the raylet and plasma socket paths, so that the mount is propagated to the host and to all other containers/pods that share the volume. The drivers then use those propagated socket paths to connect. An issue with this approach is that if there's more than one raylet per node, you start running into hostPath conflicts, but that could be easily solved by adding the node ID to the hostPath.
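For reference, a rough sketch of that volume setup with the kubernetes Python client; the hostPath and mount paths are illustrative, not Ray's actual layout:

```python
# Hypothetical sketch of the shared-hostPath socket volume described above.
from kubernetes import client

socket_volume = client.V1Volume(
    name="ray-sockets",
    host_path=client.V1HostPathVolumeSource(
        path="/tmp/ray/sockets",  # could embed the node/raylet ID to avoid conflicts
        type="DirectoryOrCreate",
    ),
)

# Raylet (DaemonSet) container: Bidirectional, so mounts/files it creates propagate
# back to the host and into any other pod mounting the same hostPath. Bidirectional
# propagation requires a privileged container.
raylet_mount = client.V1VolumeMount(
    name="ray-sockets",
    mount_path="/tmp/ray/sockets",
    mount_propagation="Bidirectional",
)

# Driver/worker containers: HostToContainer is enough to see the raylet's sockets.
driver_mount = client.V1VolumeMount(
    name="ray-sockets",
    mount_path="/tmp/ray/sockets",
    mount_propagation="HostToContainer",
)
```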
I assume that slow worker pool scale-up would be a significant disadvantage? Pod startup times can be a few seconds or longer, which could be particularly bad for dynamic worker provisioning, such as prestarting workers on lease request, dedicated workers, IO workers, etc.
Workers already take a couple of seconds to start, so I don't see this as a significant disadvantage. Things might get a bit slower with the extra overhead, but Ray is already designed around the fact that this is a high-overhead operation.
On "raylet creates worker pods": can the raylet just tell the k8s operator (somehow) to create the worker pods?
Yeah, that's a good idea: an initial version could request pod creation through the operator, which would avoid needing to talk to the k8s apiserver directly.
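Purely for illustration, direct pod creation from the raylet (the path an operator-mediated version would avoid) might look roughly like this with the kubernetes Python client; every field in the pod spec below is a placeholder, not a proposed worker-pod format:

```python
# Illustrative only: a raylet-side request to create one worker pod on its own node.
from kubernetes import client, config

config.load_incluster_config()  # assumes in-cluster credentials with pod create/delete rights
core = client.CoreV1Api()

worker_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        generate_name="ray-worker-",
        namespace="ray",
        labels={"ray.io/raylet-node": "NODE_ID_PLACEHOLDER"},
    ),
    spec=client.V1PodSpec(
        restart_policy="Never",
        node_name="THIS_NODE_PLACEHOLDER",  # co-locate with the raylet that requested it
        containers=[
            client.V1Container(
                name="ray-worker",
                image="rayproject/ray:latest",
                command=["python", "default_worker.py"],  # placeholder worker command
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "1", "memory": "2Gi"},  # nonzero CPU request; see the cpu-shares concern later in the thread
                ),
            )
        ],
    ),
)
core.create_namespaced_pod(namespace="ray", body=worker_pod)
```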
+1 for separation of control plane and data plane.
The solution @clarkzinzow described (raylet DaemonSet, worker pods) seems pretty reasonable; it matches Ray's architecture to the K8s architecture pretty well.
Nice RFC! We've been struggling with Java worker memory restrictions for a long time. A couple of things we may need to think about:
Great topic! This feature will give better control over separation and more flexible scheduling.
Big +1! At AntGroup, we are considering two schemes for k8s-native workers.
We are glad to join this RFC.
I think so; the pod will exit in this case once the process it contains shuts down.
It should become possible, though it would still need to be co-located with a Raylet.
cc @edoakes, this is also very similar to our internal discussions. Our conclusion was that the container-per-worker approach is a bit problematic due to:
However, the pod-per-worker approach is somewhat complicated. The initial approach suggested in the RFC is untenable since it assigns cpu shares=0 to worker pods, meaning they will get zero CPU under contention, and there doesn't seem to be a way to work around this without modifying the k8s scheduler. The DaemonSet approach or similar may be the way to go. By the way, @kfstorm @chenk008, our conclusion was that k8s-native workers are desirable, but we won't have the bandwidth to implement this ourselves until Q3. If you have resources to start work on k8s-native workers in Q2, though, we could work together on that.
@ericl In one pod, container per worker: currently we create the worker container in the same pod as the raylet, not container-in-container. In other words, the raylet container and the worker containers run at the same hierarchy level.
I was under the impression pods are immutable; do you mean creating containers out of band that aren't managed by k8s?
Yes, we make the raylet manage the worker containers, and all worker containers share the network namespace. It looks a little hacky. The advantages:
From what I understand, the dependency on privileged containers mostly comes from cgroups. If we didn't want to rely on cgroups to enforce Ray's scheduling constraints, some shape of container-in-container should be doable. That said, container-in-container is definitely an uncommon path at the moment.
As I understand it, this is more like allowing docker access to create containers on the side, not container-in-container, correct? So the security issues are only around unrestricted access to docker?
One advantage of a solution like the one @chenk008 suggested is that it can easily be replicated when not using Kubernetes, so we can keep the same feature set and code paths for all deployment strategies.
Yeah, agreed. I actually like this solution of "out-of-band docker containers". It seems the main unknowns here are (1) whether k8s users will find this generally acceptable, and (2) whether k8s providers like EKS or GKE support it.
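A hypothetical sketch of the "out-of-band docker containers" idea using the Docker SDK for Python: the raylet (which needs access to the Docker socket, hence the security question above) starts worker containers that join its own network and IPC namespaces. The image, command, container name, and limits are all illustrative:

```python
# Illustrative only: launch a worker container out of band, sharing the raylet's namespaces.
import docker

docker_client = docker.from_env()  # requires access to the Docker socket

raylet_container = "raylet"  # name or ID of the raylet's own container (placeholder)

worker = docker_client.containers.run(
    image="rayproject/ray:latest",
    command=["python", "default_worker.py"],       # placeholder worker command
    detach=True,
    network_mode=f"container:{raylet_container}",  # share the raylet's network namespace
    ipc_mode=f"container:{raylet_container}",      # share the raylet's IPC namespace (/dev/shm)
    volumes={"/tmp/ray": {"bind": "/tmp/ray", "mode": "rw"}},  # raylet/plasma socket paths
    nano_cpus=1_000_000_000,                       # 1 CPU, so workers aren't at zero shares
    mem_limit="2g",
)
```

Because these containers sit beside the raylet container rather than inside it, cleanup is not automatic: the raylet (or something watching it) has to remove them, which is the lineage concern raised in the next comment.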
Out-of-band docker containers are roughly equivalent to privileged containers. If we can figure out how to nest containers sensibly, it offers a big advantage in terms of cleanliness (cleanup is easy and automatic because lineage is obvious).
It seems both GKE and EKS now support privileged containers: https://stackoverflow.com/questions/31124368/allow-privileged-containers-in-kubernetes-on-google-container-gke https://docs.aws.amazon.com/eks/latest/userguide/pod-security-policy.html @chenk008 have you considered nested containers instead? It sounds slightly better (though I'm not too worried about cleanup, since Ray workers generally kill themselves when the raylet dies). I guess the docker container files might not be cleaned up in either case, unless we eschew docker for container-in-container.
@ericl Maybe it's a better choice. I will test it and compare it with dockerd/containerd.
After studying some potential solutions, I think I've found an ideal way to support this feature.
I'm going to write a PoC to prove the feasibility of these solutions. Please let me know if there are specific topics you want to discuss further, and I'll write a more detailed design document for this PoC.
@ericl @edoakes I describe the design in this doc: https://docs.google.com/document/d/1vSdO7NSobdYy7ewe5nenteC0nM9bAffE/edit Please share your thoughts; we would appreciate any feedback!
Thanks! We are going to do a detailed review of this on Thursday and will get back to you with feedback/questions.
I see, so it seems HostIPC is an equivalent security issue. The other concerns we could probably work around, but this one seems problematic.
Hey @ericl @edoakes, in my opinion, since we want to build Ray as a serverless platform, it makes sense to run the raylet with elevated privileges, but the worker containers run without them. We also expect the raylet to run as a non-root user, such as admin, so we will start workers in rootless containers, which adds a new security layer.
I see, it does seem better that untrusted user code runs in an unprivileged context. @edoakes @yiranwang52 what do you think? Perhaps we can poll the community to see how acceptable this is? Note that I don't think we have to satisfy 100% of users, since this feature is optional after all. As long as a majority are happy, I think that is enough to move forward.
Where is the final decision and design doc for this feature?
We use Argo Workflows extensively, and scaling to one worker per pod is going to take a performance hit.
Hi @ericl, I took another look over the last few days at how Ray uses shared memory. Currently the raylet passes the shared file descriptor to worker processes, so HostIPC is not needed to use shared memory across pod containers when SELinux is disabled, and I think most Kubernetes clusters have SELinux disabled. I think the approach originally proposed in this RFC is a good option for a lot of scenarios. How about moving on to implement it?
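To make the fd-passing point concrete, here is a minimal, Linux-only illustration of the underlying mechanism (SCM_RIGHTS over a Unix domain socket, Python 3.9+); it is not Ray's actual code path:

```python
# Minimal illustration: hand a shared-memory fd to another endpoint over a
# Unix domain socket, then map the same memory on the receiving side.
import mmap
import os
import socket

raylet_side, worker_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# "Raylet" side: create an anonymous shared-memory segment and send its fd.
shm_fd = os.memfd_create("plasma-demo")  # Linux-only
os.ftruncate(shm_fd, 4096)
socket.send_fds(raylet_side, [b"shm"], [shm_fd])

# "Worker" side: receive the fd (the kernel duplicates it) and map the same memory.
msg, fds, flags, addr = socket.recv_fds(worker_side, 1024, 1)
shared = mmap.mmap(fds[0], 4096)
shared[:5] = b"hello"  # visible to anyone else mapping the same fd
```

Since the transferred fd itself grants access, this presumably works across containers that can reach the raylet's Unix socket without HostIPC, as long as nothing such as SELinux blocks the socket or the fd transfer, which is the caveat the comment above raises.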
@chenk008 circling back, if we assume
@simon-mo I think
Hi @chenk008, could you list the current progress on this to let others know?
@chenk008 @simon-mo @ericl @edoakes I have found a solution. With a recent podman and a recent kernel (>5.13, e.g. CentOS 9) I can run rootless podman inside rootless podman: https://www.redhat.com/sysadmin/podman-rootless-overlay (see ray/python/ray/_private/runtime_env/container.py, lines 26 to 37 at 3e357c8).
https://docs.podman.io/en/latest/markdown/podman-run.1.html#ipc-ipc
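A hypothetical sketch of how a worker-launch command with rootless podman and an explicit --ipc setting could be assembled; the flags and paths are illustrative and only loosely modeled on the container.py lines referenced above:

```python
# Illustrative only: build and run a rootless-podman command for a worker,
# sharing the raylet's socket directory and an IPC namespace via --ipc.
import subprocess

def build_podman_command(image, worker_cmd):
    return [
        "podman", "run", "--rm",
        "--ipc=host",                # or e.g. --ipc=container:<raylet>, per the podman-run docs
        "--network=host",
        "-v", "/tmp/ray:/tmp/ray",   # share raylet/plasma socket paths (placeholder)
        image,
        *worker_cmd,
    ]

cmd = build_podman_command("rayproject/ray:latest", ["python", "default_worker.py"])
subprocess.run(cmd, check=True)
```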
We should consider supporting a k8s-native worker pool for when Ray is being run with the k8s operator. This means launching worker processes in individual pods separate from the raylet pod, rather than having workers living together in the big raylet pod.
The advantages of having workers in separate pods:
Disadvantages:
Proposal: