Write /etc/environment before the lingering session is started #362

Merged (3 commits) on Nov 4, 2021

Conversation

@jandubois (Member) commented Oct 23, 2021:

Should address #351 (comment)
Fixes #365

Also makes sure we are not reusing a session that was started by a requirements check before /etc/environment was updated.

This is a race condition that seems to happen maybe 25% of the time when an instance is started:

+ loginctl user-status 501
jan (501)
           Since: Sat 2021-10-23 07:17:02 UTC; 2s ago
           State: active
        Sessions: *2
          Linger: no
            Unit: user-501.slice
                  ├─session-2.scope
                  │ ├─1341 sshd: jan [priv]
                  │ ├─1438 sshd: jan@notty
                  │ ├─1440 /bin/bash
                  │ ├─1441 timeout 30s bash -c until command -v sshfs; do sleep 3; done
                  │ ├─1442 bash -c until command -v sshfs; do sleep 3; done
                  │ └─1443 sleep 3
                  └─[email protected]
                    └─init.scope
                      ├─1344 /lib/systemd/systemd --user
                      └─1345 (sd-pam)

If we don't terminate it, it will be reused for starting the rootless components (which will then be missing the proxy settings), and also for later ssh connections, because we enable lingering for the user.

@jandubois jandubois marked this pull request as ready for review October 26, 2021 00:47
@jandubois jandubois requested a review from AkihiroSuda October 26, 2021 00:50
@jandubois jandubois added this to the v0.7.3 milestone Oct 26, 2021
@jandubois (Member Author) commented:

This is a race condition

The race is that as soon as cloud-init writes the authorized_keys to disk, the host agent can ssh into the instance.

Even though it only checks every 10s, sometimes this succeeds early enough that it can advance to the "check for sshfs" stage before the boot scripts get to the point of installing rootless containerd and enabling linger mode. This is even more likely when sshfs must be downloaded and installed from a repo.

The "check for sshfs" script only checks every 3 seconds whether sshfs is available. If linger mode is enabled before this check succeeds (and before its session terminates), the session becomes permanent. Since that session may have been started before /etc/environment was written, it can be missing the proxy settings and the env variables from lima.yaml.

For this reason we must terminate any current session after writing /etc/environment to ensure the lingering session uses the right environment values.
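
A minimal sketch of that termination step, assuming the loginctl-based approach and a LIMA_CIDATA_USER variable like the one used by the other boot scripts:

    # Run after /etc/environment has been written. Any session that was started
    # earlier (e.g. by a requirements check) would otherwise keep the stale
    # environment and could become the lingering session.
    if loginctl show-user "${LIMA_CIDATA_USER}" >/dev/null 2>&1; then
        loginctl terminate-user "${LIMA_CIDATA_USER}"
    fi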

@AkihiroSuda (Member) commented:

TEST| [INFO] Make sure proxy setting is updated
TEST| [INFO] FTP_PROXY: expected=FTP_PROXY=my.proxy:8021 got=FTP_PROXY=http://192.168.5.2:2121
Error: [ERROR] proxy environment variable not set to correct value

https://github.com/lima-vm/lima/runs/4004061587?check_suite_focus=true

@jandubois (Member Author) commented:

TEST| [INFO] Make sure proxy setting is updated
TEST| [INFO] FTP_PROXY: expected=FTP_PROXY=my.proxy:8021 got=FTP_PROXY=http://192.168.5.2:2121
Error: [ERROR] proxy environment variable not set to correct value

This happens on a restart because all requirements are satisfied before the boot script even starts: all additional packages have already been installed, and the check for containerd also only verifies that the software is installed.

Therefore limactl shell is running before /etc/environment has been updated for the reboot.

I've now added a final requirement that all the boot scripts have finished: boot.sh copies /mnt/lima-cidata/meta-data to /etc/lima-meta-data, and the final requirement checks that the two files are identical. Since we change the instance-id on every boot, this check will only pass once boot.sh is really done.
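
A rough sketch of that final requirement, using the file names mentioned above (the exact script differs; this just shows the comparison the hostagent would run inside the guest):

    # Only succeeds once boot.sh has copied the meta-data of the *current* boot,
    # because the instance-id inside it changes on every boot.
    cmp -s /mnt/lima-cidata/meta-data /etc/lima-meta-data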

@jandubois (Member Author) commented:

I think it is unrelated to this PR, but I've experienced one curious situation in my local testing:

jan@lima-vmnet:~$ nerdctl ps
WARN[0000] environment variable XDG_RUNTIME_DIR is not set, see https://rootlesscontaine.rs/getting-started/common/login/
FATA[0000] rootless containerd not running? (hint: use `containerd-rootless-setuptool.sh install` to start rootless containerd): environment variable XDG_RUNTIME_DIR is not set, see https://rootlesscontaine.rs/getting-started/common/login/
jan@lima-vmnet:~$ env|grep XDG
XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop
jan@lima-vmnet:~$ loginctl
No sessions.

I don't understand how it is possible to not have a session at all even though I'm logged in via ssh. Since the session doesn't exist, the XDG_* variables aren't set either (except for XDG_DATA_DIRS, which we set in ~/.profile), so nerdctl fails.
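
For comparison, on a healthy instance the logind session is what provides the runtime directory; a quick sanity check (paths are the systemd defaults, not anything Lima-specific) looks like:

    loginctl                          # should list the current ssh session
    echo "${XDG_RUNTIME_DIR:-unset}"  # normally /run/user/$(id -u), exported by pam_systemd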

I've only seen this once, and could not reproduce, so ignoring for now. But I wanted to record it here in case somebody else encounters this too.

@jandubois jandubois marked this pull request as draft October 26, 2021 07:16
@jandubois (Member Author) commented:

I've switched this PR back to draft status. Even though it now passes all tests, I'm suspicious of this:

time="2021-10-26T07:12:58Z" level=info msg="[hostagent] Waiting for the final requirement 1 of 1: \"boot scripts must have finished\""
time="2021-10-26T07:12:59Z" level=info msg="[hostagent] Not forwarding TCP 127.0.0.53:53"
time="2021-10-26T07:12:59Z" level=info msg="[hostagent] Not forwarding TCP 0.0.0.0:22"
time="2021-10-26T07:12:59Z" level=info msg="[hostagent] Not forwarding TCP [::]:22"
time="2021-10-26T07:13:04Z" level=warning msg="[hostagent] connection to the guest agent was closed unexpectedly"
time="2021-10-26T07:13:14Z" level=info msg="[hostagent] Waiting for the final requirement 1 of 1: \"boot scripts must have finished\""

I thought the guest agent was running as root now, so it should not be affected by terminating the user session, but I've seen the unexpected "connection to guest agent was closed" message twice now and want to understand where it comes from.

@AkihiroSuda (Member) commented:

I guess the SSH connection from the host to the guest is killed when you terminate the session

@AkihiroSuda (Member) commented:

(We should switch away from SSH to virtserial for communication with the guestagent)

@jandubois (Member Author) commented:

I guess the SSH connection from the host to the guest is killed when you terminate the session

Yes, but the guest agent connection can only be established once the ga is running (after 25-guest-agent-base), and the session is terminated in 07-etc-environment. So a new ssh connection and user session should be started for the ga, and it should not be interrupted again. Will have to look some more tomorrow...

(We should switch away from SSH to virtserial for communication with the guestagent)

Not as part of this PR. 😸

Commit message:

And make sure we are not reusing a session that was started by a requirements check before /etc/environment was updated.

Signed-off-by: Jan Dubois <[email protected]>
@AkihiroSuda (Member) commented:

Is this still WIP?

@jandubois (Member Author) commented:

Is this still WIP?

Yes, it is, but it is also possible that I have fixed the remaining issue with the refactoring of the shutdown logic I did for the socket forwarding PR. Will try to confirm later today.

@jandubois (Member Author) commented:

Yes, but the guest agent connection can only be established once the ga is running (after 25-guest-agent-base),

This turned out not to be correct, and I had to add another check that the instance is "ssh-ready" before setting up sshfs mounts etc.
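
The extra check is conceptually just a marker polled over ssh before the hostagent sets up sshfs mounts and other persistent connections; the marker name below is an assumption for illustration:

    # Hypothetical "ssh-ready" requirement: passes only after the boot scripts
    # are past the step that terminates user sessions.
    test -e /run/lima-ssh-ready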

I believe this PR is now good (assuming I didn't break CI).

@jandubois jandubois marked this pull request as ready for review November 2, 2021 20:26
@@ -64,5 +59,9 @@ if [ -d "${LIMA_CIDATA_MNT}"/provision.user ]; then
done
fi

# Signal that provisioning is done. The instance-id in the meta-data file changes on every boot,
# so any copy from a previous boot cycle will have different content.
cp "${LIMA_CIDATA_MNT}"/meta-data /etc/lima-boot-done
Member commented:

Can we use /run/lima rather than /etc?

Member commented:

If you need persistence it should be /var/lib/lima

@jandubois (Member Author) commented Nov 3, 2021:

No, /run is perfect.

I want the opposite of persistence, which is why I copy the meta-data file: I know it has different content on every restart. We have no way to "reset" the ready markers before cloud-init allows ssh access, so the requirements checks could move past markers left over from a previous boot. Having different content for the marker on every boot solves this.

Updated
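
Presumably the hunk shown above then becomes something like the following, with the marker moved under /run as requested (exact path assumed):

    # Signal that provisioning is done. /run is a tmpfs that is cleared on every
    # boot, and the instance-id inside meta-data changes per boot anyway.
    cp "${LIMA_CIDATA_MNT}"/meta-data /run/lima-boot-done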

Commit message:

The boot scripts must terminate an existing user session after
/etc/environment has been updated to make sure the user session
(which may linger) has the updated values.

This reset will break the SSH control path, breaking sshfs mounts,
so the hostagent must wait until the instance is "ssh-ready" for
persistent connections.

A similar "boot-done" status check is added as a "final" requirement
so that `limactl start` doesn't return until all the boot scripts
have finished.

Signed-off-by: Jan Dubois <[email protected]>
@AkihiroSuda (Member) left a comment:

Thanks, merging, but please consider updating docs/internal.md to explain the /run files


Successfully merging this pull request may close these issues:

http_proxy et al. not set in a new VM until restart