Write /etc/environment before the lingering session is started #362
The race is that as soon as cloud-init writes the Even though it only checks every 10s, sometimes this succeeds early enough that it can advance to the "check for sshfs" stage before the boot scripts get to the point of installing rootless The "check for sshfs" script only checks every 3 seconds if For this reason we must terminate any current session after writing
https://github.com/lima-vm/lima/runs/4004061587?check_suite_focus=true
This happens on a restart because all requirements are satisfied before the boot script even starts: all additional packages have already been installed, and the check for containerd also only verifies that the software is installed. Therefore I've now added a final requirement that all the boot scripts have finished, by copying
I think it is unrelated to this PR, but I've experienced one curious situation in my local testing:
jan@lima-vmnet:~$ nerdctl ps
WARN[0000] environment variable XDG_RUNTIME_DIR is not set, see https://rootlesscontaine.rs/getting-started/common/login/
FATA[0000] rootless containerd not running? (hint: use `containerd-rootless-setuptool.sh install` to start rootless containerd): environment variable XDG_RUNTIME_DIR is not set, see https://rootlesscontaine.rs/getting-started/common/login/
jan@lima-vmnet:~$ env|grep XDG
XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop
jan@lima-vmnet:~$ loginctl
No sessions.
I don't understand how it is possible to not have a session at all even though I'm logged in via I've only seen this once, and could not reproduce, so ignoring for now. But I wanted to record it here in case somebody else encounters this too.
I've switched this PR back to draft status. Even though it now passes all tests, I'm suspicious of this:
I thought the guest agent was running as
I guess the SSH connection from the host to the guest is killed when you terminate the session
(We should switch away from SSH to virtserial for communication with the guestagent)
Yes, but the guest agent connection can only be established once the ga is running (after
Not as part of this PR. 😸
And make sure we are not reusing a session that was started by a requirements check before /etc/environment was updated. Signed-off-by: Jan Dubois <[email protected]>
Is this still WIP?
Yes, it is, but it is also possible that I have fixed the remaining issue with the refactoring of the shutdown logic I did for the socket forwarding PR. Will try to confirm later today.
a12ae55 to 51ddcdf
This turned out not to be correct, and I had to add another check that the instance is "ssh-ready" before setting up sshfs mounts etc. I believe this PR is now good (assuming I didn't break CI).
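The readiness checks discussed in this thread all follow the same pattern: run a check, and if it fails, wait a few seconds and try again. A minimal sketch of such a polling loop is shown here. The `retry` helper and its arguments are hypothetical names chosen for illustration; they are not taken from the lima code base.

```shell
# Hypothetical helper, not part of lima.
# retry ATTEMPTS INTERVAL CMD [ARGS...]
# Runs CMD until it succeeds or ATTEMPTS tries have been used up.
retry() {
    attempts="$1"
    interval="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Stop polling as soon as the check succeeds.
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        # Only sleep when another attempt will follow.
        if [ "$i" -le "$attempts" ]; then
            sleep "$interval"
        fi
    done
    return 1
}
```

For example, `retry 20 3 some_check` would run `some_check` up to 20 times, 3 seconds apart. A check that can succeed too early, as described above, is still a problem with any loop of this kind, which is why an extra "ssh-ready" condition was needed.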
pkg/cidata/cidata.TEMPLATE.d/boot.sh
Outdated
@@ -64,5 +59,9 @@ if [ -d "${LIMA_CIDATA_MNT}"/provision.user ]; then
done
fi

# Signal that provisioning is done. The instance-id in the meta-data file changes on every boot,
# so any copy from a previous boot cycle will have different content.
cp "${LIMA_CIDATA_MNT}"/meta-data /etc/lima-boot-done
Can we use /run/lima rather than /etc?
If you need persistence it should be /var/lib/lima
No, /run is perfect.
I want the opposite of persistence, which is why I copy the meta-data file, which I know has different content on every restart. We have no way to "reset" the ready markers before cloud-init allows ssh access, so the requirements checks could move past them before they have been reset. Having different content for the marker on every boot solves this.
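To illustrate why a marker with per-boot content avoids the stale-marker problem, here is a minimal sketch of how a check could compare the marker against the current meta-data file. The `boot_done` function name and the paths passed to it are assumptions made for this example; the actual requirement check in the PR may be written differently.

```shell
# Hypothetical helper, not part of lima.
# boot_done MARKER META_DATA
# Succeeds only if MARKER is a byte-for-byte copy of META_DATA.
# A marker left over from a previous boot has a different instance-id,
# so it fails the comparison, as does a marker that does not exist yet.
boot_done() {
    marker="$1"
    meta_data="$2"
    # cmp -s prints nothing and reports the result via its exit status.
    cmp -s "$marker" "$meta_data" 2>/dev/null
}
```

The design point is that the check never has to trust the mere existence of the file, so nothing needs to delete the old marker before ssh access becomes possible.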
Updated
The boot scripts must terminate an existing user session after /etc/environment has been updated to make sure the user session (which may linger) has the updated values. This reset will break the SSH control path, breaking sshfs mounts, so the hostagent must wait until the instance is "ssh-ready" for persistent connections. A similar "boot-done" status check is added as a "final" requirement so that `limactl start` doesn't return until all the boot scripts have finished. Signed-off-by: Jan Dubois <[email protected]>
51ddcdf to ab9ff6c
Thanks, merging, but please consider updating docs/internal.md to explain the /run files
Should address #351 (comment)
Fixes #365
Also makes sure we are not reusing a session that was started by a requirements check before /etc/environment was updated.
This is a race condition that seems to happen maybe 25% of the time when an instance is started:
If we don't terminate it, then it will be reused for starting rootless components (which will then be missing proxy settings), and also for later ssh connections because we enable-linger.
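As a rough sketch of the session reset described above, a boot script could end the user's existing sessions once /etc/environment has been written, so that the next login starts a fresh session with the new values. The `reset_user_session` function and the `LOGINCTL` override variable are hypothetical and exist only for this example; `loginctl terminate-user` is the real systemd command, but the exact invocation used in the PR is not shown in this thread.

```shell
# Hypothetical sketch, not the code from this PR.
# LOGINCTL can be overridden, for instance to do a dry run.
LOGINCTL="${LOGINCTL:-loginctl}"

# reset_user_session USER
# Ends all sessions of USER so a later login picks up the new environment.
reset_user_session() {
    user="$1"
    # Do nothing on systems where loginctl is not available.
    command -v "$LOGINCTL" >/dev/null 2>&1 || return 0
    # The command may fail when the user has no session; that is not an error here.
    "$LOGINCTL" terminate-user "$user" || true
}
```

Because this also tears down SSH connections that belong to the terminated session, anything that relies on a persistent connection has to be set up afterwards, which matches the "ssh-ready" requirement mentioned in the commit message.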