Tehada opened this issue on Oct 25, 2022 · 3 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), gcp, infra (autoscaler, ray client, kuberay, related issues), P3 (Issue moderate in impact or severity)
What happened + What you expected to happen
After executing `ray up` and waiting for the command to finish, I can't use the cluster properly because the container is in "Exited" status (I confirmed this with a raw SSH into the VM). If I run `ray exec` or `ray attach`, I get a "No such container" error naming the container from Ray's config. This happens on Google Cloud with the VM image "projects/deeplearning-platform-release/global/images/family/common-cpu" (this image was in the default GCP config in Ray's repo). I managed to trace the cause: this VM image has a c2d-startup script which runs several other scripts, and one of them restarts the Docker engine after Ray has started its container. That of course stops Ray's container, and Ray does not restart it. The hacky workaround I used is simply to run `ray up` again immediately after the first call, which healed the cluster (a sketch of the check and workaround follows the log below). In `journalctl -xe` this looks something like this:
Oct 25 17:09:01 ray-cluster-minimal-head-4b5952a2-compute CRON[5660]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 25 17:09:01 ray-cluster-minimal-head-4b5952a2-compute CRON[5661]: (root) CMD (/opt/deeplearning/bin/run_diagnostic_tool.sh 2>&1)
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute CRON[5660]: (CRON) info (No MTA installed, discarding output)
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute CRON[5660]: pam_unix(cron:session): session closed for user root
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up docker-ce-rootless-extras (5:20.10.20~3-0~debian-buster) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up libc-dev-bin (2.28-10+deb10u2) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up libdns-export1104 (1:9.11.5.P4+dfsg-5.1+deb10u8) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up isc-dhcp-client (4.4.1-2+deb10u2) ...
Oct 25 17:09:02 ray-cluster-minimal-head-4b5952a2-compute c2d-startup[451]: Setting up docker-ce (5:20.10.20~3-0~debian-buster) ...
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Reloading.
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Reloading.
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Stopping Docker Application Container Engine...
-- Subject: A stop job for unit docker.service has begun execution
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A stop job for unit docker.service has begun execution.
--
-- The job identifier is 1667.
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:03.503321085Z" level=info msg="Processing signal 'terminated'"
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.607816080Z" level=info msg="shim disconnected" id=4f0cd322189e45d38de56400315d112b2f42209ea
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.607911268Z" level=warning msg="cleaning up after shim disconnected" id=4f0cd322189e45d38de5
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.607930270Z" level=info msg="cleaning up dead shim"
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:03.607866268Z" level=info msg="ignoring event" container=4f0cd322189e45d38de56400315d112b2f42209e
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute containerd[530]: time="2022-10-25T17:09:03.620542271Z" level=warning msg="cleanup warnings time=\"2022-10-25T17:09:03Z\" level=info ms
Oct 25 17:09:03 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: var-lib-docker-overlay2-9ce1ec35220b7ab41ab99dd6da7a4c996b6f1404ad3b9281ab19b6fb9a7355e5-merged.mount: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit var-lib-docker-overlay2-9ce1ec35220b7ab41ab99dd6da7a4c996b6f1404ad3b9281ab19b6fb9a7355e5-merged.mount has successfully entered the 'dead' state.
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:05.675187400Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" m
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute dockerd[773]: time="2022-10-25T17:09:05.675463479Z" level=info msg="Daemon shutdown complete"
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: docker.service: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit docker.service has successfully entered the 'dead' state.
Oct 25 17:09:05 ray-cluster-minimal-head-4b5952a2-compute systemd[1]: Stopped Docker Application Container Engine.
-- Subject: A stop job for unit docker.service has finished
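For anyone hitting the same thing, here is a minimal sketch of the check and the re-run workaround described above. The config file name, container name, and head-node instance name are placeholders; substitute your own cluster config, the `container_name` from its `docker` section, and your head VM.

```bash
#!/usr/bin/env bash
# Sketch of the manual workaround; all names below are placeholders.
set -euo pipefail

CLUSTER_CONFIG=cluster.yaml                    # your Ray cluster config
CONTAINER_NAME=ray_container                   # docker.container_name from that config
HEAD_INSTANCE=ray-cluster-minimal-head-xxxx    # head node VM name

# First ray up; the c2d-startup script may restart dockerd shortly afterwards
# and take down the container that Ray just started.
ray up -y "$CLUSTER_CONFIG"

# Raw SSH to the VM (not into the container) to check the container state.
gcloud compute ssh "$HEAD_INSTANCE" --command "docker ps -a --filter name=$CONTAINER_NAME"

# If the container shows "Exited", a second ray up healed the cluster for me.
ray up -y "$CLUSTER_CONFIG"

# Sanity check: this should no longer fail with "No such container".
ray exec "$CLUSTER_CONFIG" "echo container is back"
```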
With the VM image "projects/cos-cloud/global/images/cos-101-17162-40-16" the problem seems to disappear.
So I'm wondering whether this could be addressed in the docs, since the full example config uses the problematic image. I'm not sure whether it would be simple to implement some kind of synchronization during `ray up` initialization to avoid this problem consistently (a rough sketch of one possible guard follows below).
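One crude way to get that synchronization, assuming the only culprit is the startup script re-installing and restarting Docker as in the log above, would be a host-side wait before Ray starts its container, e.g. as an entry in the cluster config's `initialization_commands`. This is only a heuristic sketch, not a verified fix:

```bash
# Wait until the image's startup script has finished installing packages
# (the log shows it setting up docker-ce via apt) before Ray touches Docker.
while pgrep -x apt-get >/dev/null || pgrep -x dpkg >/dev/null; do
  echo "waiting for the startup script's package installs to finish..."
  sleep 10
done

# Then require docker.service to stay active for several consecutive checks,
# so that any pending daemon restart has most likely already happened.
stable=0
while [ "$stable" -lt 6 ]; do
  if systemctl is-active --quiet docker; then
    stable=$((stable + 1))
  else
    stable=0
  fi
  sleep 5
done
echo "docker looks stable, proceeding"
```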
Versions / Dependencies
2.0.0
Reproduction script
Issue Severity
Low: It annoys or frustrates me.
Tehada added the bug and triage (Needs triage) labels on Oct 25, 2022
hora-anyscale added the core and P1 (Issue that should be fixed within a few weeks) labels and removed the triage label on Oct 28, 2022
Thanks for reporting this issue! cc @DmitriGekhtman what do you think about this issue? Seems like we should either update the docs to not use this image, or update Ray to attempt to heal the container automatically, or both.
Hmm, even if we don't mention this image in our docs, people may still use it and run into this error. Should we fix the code to handle this case, or do we consider the c2d-startup script's behavior unexpected and unsupported?
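One direction for handling it in code, just a sketch and not something validated here: Docker's own restart policies bring a container back when the daemon restarts, so passing `--restart unless-stopped` through the `run_options` list under `docker` in the cluster config might be enough on its own; whether the attach/exec logic copes with a container that came back this way is untested. A standalone illustration of the Docker behavior being relied on:

```bash
# Standalone demo (not Ray's actual docker invocation): a container started with
# a restart policy is brought back automatically when dockerd itself restarts.
docker run -d --name restart-demo --restart unless-stopped busybox sleep 3600
sudo systemctl restart docker
sleep 5
docker ps --filter name=restart-demo   # should show the container running again
docker rm -f restart-demo
```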