Tehada opened this issue on Oct 25, 2022 · 3 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), gcp, infra (autoscaler, ray client, kuberay, related issues), P3 (Issue moderate in impact or severity)
What happened + What you expected to happen
After executing `ray up` and waiting for the command to finish, I can't use the cluster properly because the container is in "Exited" status (I confirmed this with a raw SSH into the VM). If I run `ray exec` or `ray attach`, I get a "No such container" error naming the container from Ray's config. This happens on Google Cloud with the VM image "projects/deeplearning-platform-release/global/images/family/common-cpu" (this image was in the default GCP config in Ray's repo). I managed to trace the cause: this VM image has a c2d-startup script which runs several other scripts, and one of them restarts the Docker engine after Ray has started its container. That of course stops Ray's container, and Ray does not restart it. The hacky workaround I used is simply to run `ray up` again immediately after the first call, which healed the cluster (a sketch of the check and workaround follows the log below). In `journalctl -xe` this looks something like this:
Oct 25 17:09:01 ray-cluster-minimal-head-4b5952a2-compute CRON[5660]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 25 17:09:01 ray-cluster-minimal-head-4b5952a2-compute CRON[5661]: (root) CMD (/opt/deeplearning/bin/run_diagnostic_tool.sh 2>&1)
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute CRON[5660]: (CRON) info (No MTA installed, discarding output)
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute CRON[5660]: pam_unix(cron:session): session closed for user root
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up docker-ce-rootless-extras (5:20.10.20~3-0~debian-buster) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up libc-dev-bin (2.28-10+deb10u2) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up libdns-export1104 (1:9.11.5.P4+dfsg-5.1+deb10u8) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up isc-dhcp-client (4.4.1-2+deb10u2) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up docker-ce (5:20.10.20~3-0~debian-buster) ...
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Reloading.
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Reloading.
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Stopping Docker Application Container Engine...
-- Subject: A stop job for unit docker.service has begun execution
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A stop job for unit docker.service has begun execution.
--
-- The job identifier is 1667.
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:03.503321085Z" level=info msg="Processing signal 'terminated'"
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.607816080Z" level=info msg="shim disconnected" id=4f0cd322189e45d38de56400315d112b2f42209ea
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.607911268Z" level=warning msg="cleaning up after shim disconnected" id=4f0cd322189e45d38de5
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.607930270Z" level=info msg="cleaning up dead shim"
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:03.607866268Z" level=info msg="ignoring event" container=4f0cd322189e45d38de56400315d112b2f42209e
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.620542271Z" level=warning msg="cleanup warnings time=\"2022-10-25T17:09:03Z\" level=info ms
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: var-lib-docker-overlay2-9ce1ec35220b7ab41ab99dd6da7a4c996b6f1404ad3b9281ab19b6fb9a7355e5-merged.mount: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit var-lib-docker-overlay2-9ce1ec35220b7ab41ab99dd6da7a4c996b6f1404ad3b9281ab19b6fb9a7355e5-merged.mount has successfully entered the 'dead' state.
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:05.675187400Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" m
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:05.675463479Z" level=info msg="Daemon shutdown complete"
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: docker.service: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit docker.service has successfully entered the 'dead' state.
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Stopped Docker Application Container Engine.
-- Subject: A stop job for unit docker.service has finished
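For anyone hitting the same thing, here is a minimal sketch of the check and the re-run workaround described above. The config file name, container name, and head-node instance name are placeholders; substitute your own cluster config, the `container_name` from its `docker` section, and your head VM.

```bash
#!/usr/bin/env bash
# Sketch of the manual workaround; all names below are placeholders.
set -euo pipefail

CLUSTER_CONFIG=cluster.yaml                    # your Ray cluster config
CONTAINER_NAME=ray_container                   # docker.container_name from that config
HEAD_INSTANCE=ray-cluster-minimal-head-xxxx    # head node VM name

# First ray up; the c2d-startup script may restart dockerd shortly afterwards
# and take down the container that Ray just started.
ray up -y "$CLUSTER_CONFIG"

# Raw SSH to the VM (not into the container) to check the container state.
gcloud compute ssh "$HEAD_INSTANCE" --command "docker ps -a --filter name=$CONTAINER_NAME"

# If the container shows "Exited", a second ray up healed the cluster for me.
ray up -y "$CLUSTER_CONFIG"

# Sanity check: this should no longer fail with "No such container".
ray exec "$CLUSTER_CONFIG" "echo container is back"
```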
With the VM image "projects/cos-cloud/global/images/cos-101-17162-40-16" the problem seems to disappear.
So I'm wondering whether this could be addressed in the docs, since the full example config uses the problematic image. I'm not sure whether it would be simple to implement some kind of synchronization during `ray up` initialization to avoid this problem consistently (a rough sketch of one possible guard follows below).
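One crude way to get that synchronization, assuming the only culprit is the startup script re-installing and restarting Docker as in the log above, would be a host-side wait before Ray starts its container, e.g. as an entry in the cluster config's `initialization_commands`. This is only a heuristic sketch, not a verified fix:

```bash
# Wait until the image's startup script has finished installing packages
# (the log shows it setting up docker-ce via apt) before Ray touches Docker.
while pgrep -x apt-get >/dev/null || pgrep -x dpkg >/dev/null; do
  echo "waiting for the startup script's package installs to finish..."
  sleep 10
done

# Then require docker.service to stay active for several consecutive checks,
# so that any pending daemon restart has most likely already happened.
stable=0
while [ "$stable" -lt 6 ]; do
  if systemctl is-active --quiet docker; then
    stable=$((stable + 1))
  else
    stable=0
  fi
  sleep 5
done
echo "docker looks stable, proceeding"
```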
Versions / Dependencies
2.0.0
Reproduction script
Issue Severity
Low: It annoys or frustrates me.
Tehada added the bug and triage (Needs triage) labels on Oct 25, 2022
hora-anyscale added the core and P1 (Issue that should be fixed within a few weeks) labels and removed the triage label on Oct 28, 2022
Thanks for reporting this issue! cc @DmitriGekhtman what do you think about this issue? Seems like we should either update the docs to not use this image, or update Ray to attempt to heal the container automatically, or both.
Hmm, even if we don't mention this image in our docs, people may still use it and run into this error. Should we fix the code to handle this case, or do we consider the c2d-startup script's behavior unexpected and unsupported?
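One direction for handling it in code, just a sketch and not something validated here: Docker's own restart policies bring a container back when the daemon restarts, so passing `--restart unless-stopped` through the `run_options` list under `docker` in the cluster config might be enough on its own; whether the attach/exec logic copes with a container that came back this way is untested. A standalone illustration of the Docker behavior being relied on:

```bash
# Standalone demo (not Ray's actual docker invocation): a container started with
# a restart policy is brought back automatically when dockerd itself restarts.
docker run -d --name restart-demo --restart unless-stopped busybox sleep 3600
sudo systemctl restart docker
sleep 5
docker ps --filter name=restart-demo   # should show the container running again
docker rm -f restart-demo
```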