Write /etc/environment before the lingering session is started #362

Merged (3 commits) on Nov 4, 2021

Conversation

@jandubois (Member) commented Oct 23, 2021:

Should address #351 (comment)
Fixes #365

Also makes sure we are not reusing a session that was started by a requirements check before /etc/environment was updated.

This is a race condition that seems to happen maybe 25% of the time when an instance is started:

+ loginctl user-status 501
jan (501)
           Since: Sat 2021-10-23 07:17:02 UTC; 2s ago
           State: active
        Sessions: *2
          Linger: no
            Unit: user-501.slice
                  ├─session-2.scope
                  │ ├─1341 sshd: jan [priv]
                  │ ├─1438 sshd: jan@notty
                  │ ├─1440 /bin/bash
                  │ ├─1441 timeout 30s bash -c until command -v sshfs; do sleep 3; done
                  │ ├─1442 bash -c until command -v sshfs; do sleep 3; done
                  │ └─1443 sleep 3
                  └─[email protected]
                    └─init.scope
                      ├─1344 /lib/systemd/systemd --user
                      └─1345 (sd-pam)

If we don't terminate it, it will be reused for starting the rootless components (which will then be missing the proxy settings), and also for later ssh connections, because we enable lingering for the user.

@jandubois jandubois marked this pull request as ready for review October 26, 2021 00:47
@jandubois jandubois requested a review from AkihiroSuda October 26, 2021 00:50
@jandubois jandubois added this to the v0.7.3 milestone Oct 26, 2021
@jandubois (Member Author) commented:

This is a race condition

The race is that as soon as cloud-init writes the authorized_keys to disk, the host agent can ssh into the instance.

Even though it only checks every 10s, sometimes this succeeds early enough that it can advance to the "check for sshfs" stage before the boot scripts get to the point of installing rootless containerd and enabling linger mode. This is even more likely when sshfs must be downloaded and installed from a repo.

The "check for sshfs" script only checks every 3 seconds whether sshfs is available. If linger mode is enabled before this check succeeds (and before its session terminates), the session becomes permanent. Since that session may have been started before /etc/environment was written, it can be missing the proxy settings and the env variables from lima.yaml.

For this reason we must terminate any current session after writing /etc/environment to ensure the lingering session uses the right environment values.
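
A minimal sketch of that termination step, assuming the loginctl-based approach and a LIMA_CIDATA_USER variable like the one used by the other boot scripts:

    # Run after /etc/environment has been written. Any session that was started
    # earlier (e.g. by a requirements check) would otherwise keep the stale
    # environment and could become the lingering session.
    if loginctl show-user "${LIMA_CIDATA_USER}" >/dev/null 2>&1; then
        loginctl terminate-user "${LIMA_CIDATA_USER}"
    fi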

@AkihiroSuda (Member) commented:

TEST| [INFO] Make sure proxy setting is updated
TEST| [INFO] FTP_PROXY: expected=FTP_PROXY=my.proxy:8021 got=FTP_PROXY=http://192.168.5.2:2121
Error: [ERROR] proxy environment variable not set to correct value

https://github.com/lima-vm/lima/runs/4004061587?check_suite_focus=true

@jandubois (Member Author) commented:

TEST| [INFO] Make sure proxy setting is updated
TEST| [INFO] FTP_PROXY: expected=FTP_PROXY=my.proxy:8021 got=FTP_PROXY=http://192.168.5.2:2121
Error: [ERROR] proxy environment variable not set to correct value

This happens on a restart because all requirements are satisfied before the boot script even starts: all additional packages have already been installed, and the check for containerd also only verifies that the software is installed.

Therefore limactl shell is running before /etc/environment has been updated for the reboot.

I've now added a final requirement that all the boot scripts have finished: boot.sh copies /mnt/lima-cidata/meta-data to /etc/lima-meta-data, and the final requirement checks that the two files are identical. Since we change the instance-id on every boot, this check will only pass once boot.sh is really done.
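
A rough sketch of that final requirement, using the file names mentioned above (the exact script differs; this just shows the comparison the hostagent would run inside the guest):

    # Only succeeds once boot.sh has copied the meta-data of the *current* boot,
    # because the instance-id inside it changes on every boot.
    cmp -s /mnt/lima-cidata/meta-data /etc/lima-meta-data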

@jandubois (Member Author) commented:

I think it is unrelated to this PR, but I've experienced one curious situation in my local testing:

jan@lima-vmnet:~$ nerdctl ps
WARN[0000] environment variable XDG_RUNTIME_DIR is not set, see https://rootlesscontaine.rs/getting-started/common/login/
FATA[0000] rootless containerd not running? (hint: use `containerd-rootless-setuptool.sh install` to start rootless containerd): environment variable XDG_RUNTIME_DIR is not set, see https://rootlesscontaine.rs/getting-started/common/login/
jan@lima-vmnet:~$ env|grep XDG
XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop
jan@lima-vmnet:~$ loginctl
No sessions.

I don't understand how it is possible to not have a session at all even though I'm logged in via ssh. Since the session doesn't exist, the XDG_* variables aren't set either (except for XDG_DATA_DIRS, which we set in ~/.profile), so nerdctl fails.
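
For comparison, on a healthy instance the logind session is what provides the runtime directory; a quick sanity check (paths are the systemd defaults, not anything Lima-specific) looks like:

    loginctl                          # should list the current ssh session
    echo "${XDG_RUNTIME_DIR:-unset}"  # normally /run/user/$(id -u), exported by pam_systemd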

I've only seen this once, and could not reproduce, so ignoring for now. But I wanted to record it here in case somebody else encounters this too.

@jandubois jandubois marked this pull request as draft October 26, 2021 07:16
@jandubois (Member Author) commented:

I've switched this PR back to draft status. Even though it now passes all tests, I'm suspicious of this:

time="2021-10-26T07:12:58Z" level=info msg="[hostagent] Waiting for the final requirement 1 of 1: \"boot scripts must have finished\""
time="2021-10-26T07:12:59Z" level=info msg="[hostagent] Not forwarding TCP 127.0.0.53:53"
time="2021-10-26T07:12:59Z" level=info msg="[hostagent] Not forwarding TCP 0.0.0.0:22"
time="2021-10-26T07:12:59Z" level=info msg="[hostagent] Not forwarding TCP [::]:22"
time="2021-10-26T07:13:04Z" level=warning msg="[hostagent] connection to the guest agent was closed unexpectedly"
time="2021-10-26T07:13:14Z" level=info msg="[hostagent] Waiting for the final requirement 1 of 1: \"boot scripts must have finished\""

I thought the guest agent was running as root now, so it should not be affected by terminating the user session, but I've seen the unexpected "connection to guest agent was closed" message twice now and want to understand where it comes from.

@AkihiroSuda (Member) commented:

I guess the SSH connection from the host to the guest is killed when you terminate the session

@AkihiroSuda (Member) commented:

(We should switch away from SSH to virtserial for communication with the guestagent)

@jandubois (Member Author) commented:

I guess the SSH connection from the host to the guest is killed when you terminate the session

Yes, but the guest agent connection can only be established once the ga is running (after 25-guest-agent-base), and the session is terminated in 07-etc-environment. So a new ssh connection and user session should be started for the ga, and it should not be interrupted again. Will have to look some more tomorrow...

(We should switch away from SSH to virtserial for communication with the guestagent)

Not as part of this PR. 😸

Commit message:

And make sure we are not reusing a session that was started by a requirements check before /etc/environment was updated.

Signed-off-by: Jan Dubois <[email protected]>
@AkihiroSuda (Member) commented:

Is this still WIP?

@jandubois (Member Author) commented:

Is this still WIP?

Yes, it is, but it is also possible that I have fixed the remaining issue with the refactoring of the shutdown logic I did for the socket forwarding PR. Will try to confirm later today.

@jandubois (Member Author) commented:

Yes, but the guest agent connection can only be established once the ga is running (after 25-guest-agent-base),

This turned out not to be correct, and I had to add another check that the instance is "ssh-ready" before setting up sshfs mounts etc.
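
The extra check is conceptually just a marker polled over ssh before the hostagent sets up sshfs mounts and other persistent connections; the marker name below is an assumption for illustration:

    # Hypothetical "ssh-ready" requirement: passes only after the boot scripts
    # are past the step that terminates user sessions.
    test -e /run/lima-ssh-ready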

I believe this PR is now good (assuming I didn't break CI).

@jandubois jandubois marked this pull request as ready for review November 2, 2021 20:26
@@ -64,5 +59,9 @@ if [ -d "${LIMA_CIDATA_MNT}"/provision.user ]; then
done
fi

# Signal that provisioning is done. The instance-id in the meta-data file changes on every boot,
# so any copy from a previous boot cycle will have different content.
cp "${LIMA_CIDATA_MNT}"/meta-data /etc/lima-boot-done
Member commented:

Can we use /run/lima rather than /etc?

Member commented:

If you need persistence it should be /var/lib/lima

@jandubois (Member Author) commented Nov 3, 2021:

No, /run is perfect.

I want the opposite of persistence, which is why I copy the meta-data file: I know it has different content on every restart. We have no way to "reset" the ready markers before cloud-init allows ssh access, so the requirements checks could move past markers left over from a previous boot. Having different content for the marker on every boot solves this.

Updated
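
Presumably the hunk shown above then becomes something like the following, with the marker moved under /run as requested (exact path assumed):

    # Signal that provisioning is done. /run is a tmpfs that is cleared on every
    # boot, and the instance-id inside meta-data changes per boot anyway.
    cp "${LIMA_CIDATA_MNT}"/meta-data /run/lima-boot-done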

Commit message:

The boot scripts must terminate an existing user session after
/etc/environment has been updated to make sure the user session
(which may linger) has the updated values.

This reset will break the SSH control path, breaking sshfs mounts,
so the hostagent must wait until the instance is "ssh-ready" for
persistent connections.

A similar "boot-done" status check is added as a "final" requirement
so that `limactl start` doesn't return until all the boot scripts
have finished.

Signed-off-by: Jan Dubois <[email protected]>
@AkihiroSuda (Member) left a comment:

Thanks, merging, but please consider updating docs/internal.md to explain the /run files


Successfully merging this pull request may close these issues:

http_proxy et al. not set in a new VM until restart