
[gcp] "No such container" error after ray up #29671

Open
Tehada opened this issue Oct 25, 2022 · 3 comments
Labels
bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), gcp, infra (autoscaler, ray client, kuberay, related issues), P3 (Issue moderate in impact or severity)

Comments


Tehada commented Oct 25, 2022

What happened + What you expected to happen

After executing `ray up` and waiting for the command to finish, I can't use the cluster properly, because the container is in "Exited" status (I confirmed this by raw SSH into the VM). If I run `ray exec` or `ray attach`, I get a "No such container" error with the container name from Ray's config. This happens on Google Cloud with the VM image "projects/deeplearning-platform-release/global/images/family/common-cpu" (this image is in the default GCP config in Ray's repo). I managed to trace the cause of the problem: this particular VM has a c2d-startup script that runs several other scripts, and one of them restarts the Docker engine after Ray has started its container. That of course stops Ray's container, and Ray does not restart it. (The hacky workaround I used was to run `ray up` again immediately after the first call, which healed the cluster.) In `journalctl -xe` this looks something like this:

Oct 25 17:09:01 ray-cluster-minimal-head-4b5952a2-compute CRON[5660]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 25 17:09:01 ray-cluster-minimal-head-4b5952a2-compute CRON[5661]: (root) CMD (/opt/deeplearning/bin/run_diagnostic_tool.sh 2>&1)
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute CRON[5660]: (CRON) info (No MTA installed, discarding output)
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute CRON[5660]: pam_unix(cron:session): session closed for user root
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up docker-ce-rootless-extras (5:20.10.20~3-0~debian-buster) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up libc-dev-bin (2.28-10+deb10u2) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up libdns-export1104 (1:9.11.5.P4+dfsg-5.1+deb10u8) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up isc-dhcp-client (4.4.1-2+deb10u2) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up docker-ce (5:20.10.20~3-0~debian-buster) ...
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Reloading.
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Reloading.
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Stopping Docker Application Container Engine...
-- Subject: A stop job for unit docker.service has begun execution
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- A stop job for unit docker.service has begun execution.
-- 
-- The job identifier is 1667.
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:03.503321085Z" level=info msg="Processing signal 'terminated'"
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.607816080Z" level=info msg="shim disconnected" id=4f0cd322189e45d38de56400315d112b2f42209ea
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.607911268Z" level=warning msg="cleaning up after shim disconnected" id=4f0cd322189e45d38de5
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.607930270Z" level=info msg="cleaning up dead shim"
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:03.607866268Z" level=info msg="ignoring event" container=4f0cd322189e45d38de56400315d112b2f42209e
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.620542271Z" level=warning msg="cleanup warnings time=\"2022-10-25T17:09:03Z\" level=info ms
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: var-lib-docker-overlay2-9ce1ec35220b7ab41ab99dd6da7a4c996b6f1404ad3b9281ab19b6fb9a7355e5-merged.mount: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit var-lib-docker-overlay2-9ce1ec35220b7ab41ab99dd6da7a4c996b6f1404ad3b9281ab19b6fb9a7355e5-merged.mount has successfully entered the 'dead' state.
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:05.675187400Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" m
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:05.675463479Z" level=info msg="Daemon shutdown complete"
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: docker.service: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- The unit docker.service has successfully entered the 'dead' state.
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Stopped Docker Application Container Engine.
-- Subject: A stop job for unit docker.service has finished

Using the VM image "projects/cos-cloud/global/images/cos-101-17162-40-16" instead, the problem seems to disappear.
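For illustration, here is a sketch of where that image would go in a cluster config, following the layout of Ray's GCP example config (the cluster name, project, machine type, and disk size below are placeholder assumptions, not values from this report; only the `sourceImage` line is the point):

```yaml
# Fragment of a Ray cluster config for GCP (layout per Ray's example
# GCP YAML; everything except sourceImage is an illustrative placeholder).
cluster_name: example-cluster

provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: my-project  # placeholder

available_node_types:
    ray_head_default:
        node_config:
            machineType: n1-standard-2  # illustrative
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                    diskSizeGb: 50
                    # COS image that avoided the docker restart in this report,
                    # instead of the default deeplearning-platform-release image:
                    sourceImage: projects/cos-cloud/global/images/cos-101-17162-40-16
```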

So I'm wondering whether this problem could be addressed in the docs, since the full example config uses this image. I'm not sure whether it would be simple to implement some kind of synchronization during `ray up` initialization to avoid this problem consistently.
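A sketch of the diagnosis and hacky workaround described above (the container name `ray_container`, the config filename `cluster.yaml`, and the instance name are placeholders I'm assuming, not values from this report):

```shell
# Bring the cluster up. On the deeplearning-platform-release image, the
# VM's c2d-startup script may restart dockerd after this returns,
# which kills the Ray container.
ray up cluster.yaml -y

# Confirm the symptom over raw SSH to the head node (bypassing
# `ray attach`, which fails with "No such container"). The container
# name comes from the docker section of your cluster config.
gcloud compute ssh <head-node-name> --command='sudo docker ps -a --filter name=ray_container'
# If the STATUS column shows "Exited", the container was killed.

# Hacky workaround from this report: immediately run `ray up` again,
# which restarts the container and heals the cluster.
ray up cluster.yaml -y
```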

Versions / Dependencies

2.0.0

Reproduction script

Issue Severity

Low: It annoys or frustrates me.

@Tehada Tehada added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 25, 2022
@hora-anyscale hora-anyscale added core Issues that should be addressed in Ray Core P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 28, 2022
@cadedaniel
Member

Thanks for reporting this issue! cc @DmitriGekhtman, what do you think about this issue? It seems like we should either update the docs to not use this image, update Ray to attempt to heal the container automatically, or both.

cc @jjyao

@DmitriGekhtman
Contributor

> update the docs to not use this image

Probably that.

@jjyao jjyao added the infra autoscaler, ray client, kuberay, related issues label Oct 28, 2022
@jjyao
Collaborator

jjyao commented Oct 28, 2022

Hmm, even if we don't mention this image in our docs, people may still use it and run into this error. Should we fix the code to handle this case, or do we consider the c2d-startup script behavior unexpected and unsupported?
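For the "fix the code" option, one possible direction (purely an assumption on my part, not something this thread confirms Ray supports) would be to start the container with a Docker restart policy, so a dockerd restart brings it back automatically:

```shell
# Hypothetical: if the autoscaler launched the container with a restart
# policy, a dockerd restart would no longer strand the cluster.
# (rayproject/ray:2.0.0 matches the Ray version in this report; the
# container name and command are placeholders.)
docker run -d --restart unless-stopped --name ray_container \
    rayproject/ray:2.0.0 sleep infinity

# Verify the policy is attached:
docker inspect -f '{{.HostConfig.RestartPolicy.Name}}' ray_container
# With the flag above this prints: unless-stopped
```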

@richardliaw richardliaw changed the title [Core] "No such container" error after ray up [gcp] "No such container" error after ray up Nov 21, 2022
@hora-anyscale hora-anyscale added P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks labels Dec 14, 2022
@jjyao jjyao added P3 Issue moderate in impact or severity and removed P2 Important issue, but not time-critical labels Oct 30, 2024