add fixes to improve boot speed #1809

Merged on Nov 10, 2021 (11 commits)

Contributor

@bcressey bcressey commented Nov 9, 2021

Issue number:
N/A

Description of changes:
This is a collection of fixes to improve boot speed and time to a usable node by at least 5 seconds and at most 8 seconds.

Building kubelet with the "dockerless" tag saves 5 seconds during service startup, as otherwise cadvisor tries for five seconds to connect to the Docker daemon before printing an error.

The fix for the defer timeout in the wicked DHCPv6 client saves 1 second for around half of launches, in cases where the timer fires a little early and would otherwise trigger another 1 second wait.
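The early-timer case can be illustrated with a small model. This is only a sketch under stated assumptions, not the wicked code: the function names and the 10 ms slack are hypothetical, and the model assumes the timer is simply re-armed for another full period when the check fails. It shows why an event that arrives a few milliseconds early costs a whole extra second when the check demands that the full timeout has elapsed.

```python
def naive_defer_time(fire_times, timeout=1.0):
    """Return the time at which the client defers, or None.

    Defers only on a timer event that arrives at or after the timeout,
    so an event at 0.998s is ignored and the next one is awaited.
    """
    for now in fire_times:
        if now >= timeout:
            return now
    return None


def tolerant_defer_time(fire_times, timeout=1.0, slack=0.010):
    """Same check, but an event within `slack` seconds of the timeout counts.

    `slack` is an invented value for illustration only.
    """
    for now in fire_times:
        if now >= timeout - slack:
            return now
    return None


# Hypothetical timer event times in seconds since the request started.
early = [0.998, 1.998]  # first event fires 2 ms early
late = [1.002, 2.002]   # first event fires 2 ms late

print("early timer:", naive_defer_time(early), "vs", tolerant_defer_time(early))
print("late timer: ", naive_defer_time(late), "vs", tolerant_defer_time(late))
```

In this model the two checks agree whenever the timer fires late and differ by one timer period when it fires early, which matches the "around half of launches" observation if early and late events are roughly equally likely.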

Using an overlayfs for the CNI plugin directory saves a variable amount of time by avoiding a potentially slow copy to an unwritten EBS volume in the critical path. systemd-tmpfiles-setup previously took 900 milliseconds or more in most cases, and now takes 100 milliseconds or less, with most of the remaining time spent populating the SELinux modules in /var/lib/selinux.
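To make the "no copy in the critical path" point concrete, here is a toy model of how an overlay directory resolves names. It is a simplified sketch, not how the kernel implements overlayfs: the class name and the sample file names are illustrative assumptions, and copy-up of modified lower files and whiteouts for deletions are deliberately left out.

```python
class OverlayDir:
    """Toy model of an overlay directory with one lower and one upper layer."""

    def __init__(self, lower):
        self._lower = lower  # read-only layer (e.g. files shipped in the image)
        self._upper = {}     # writable layer, empty at mount time

    def read(self, name):
        # The upper layer wins; otherwise fall through to the lower layer.
        if name in self._upper:
            return self._upper[name]
        return self._lower[name]

    def write(self, name, data):
        # New files land in the upper layer only; the lower layer is untouched.
        self._upper[name] = data

    def names(self):
        # The merged view is the union of both layers.
        return sorted(set(self._lower) | set(self._upper))

    def upper_bytes(self):
        # Bytes that actually had to be written to local storage.
        return sum(len(data) for data in self._upper.values())


# Illustrative contents; real plugin binaries are much larger.
image_plugins = {"bridge": b"\x7fELF-bridge", "loopback": b"\x7fELF-loopback"}
cni_bin = OverlayDir(image_plugins)

print(cni_bin.names())        # everything is visible immediately
print(cni_bin.upper_bytes())  # nothing was copied to make that happen

cni_bin.write("custom-plugin", b"\x7fELF-custom")
print(cni_bin.names())
```

The merged view is available as soon as the mount exists, and the writable volume is only touched when something new is installed, which is why the cost moves out of boot.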

Building support for the PS/2 controller, keyboard, and mouse as modules saves around 400 milliseconds during boot under KVM, as otherwise device mapper waits for the configuration to finish before mounting the root filesystem. They are still loaded later, after the root filesystem is mounted, but at that point we can do more work in parallel.

Disabling RAID auto-detect avoids another potential device wait and reduces printk messages.

Writing to the console device at 115200 bits per second speeds up console writes by 12x. Console logging continues to be a drag on overall boot speed. We could turn it off altogether to gain at least 2 seconds, but only at a severe cost to debugging capabilities if anything goes wrong. The higher device speed helps, but its impact is spread across all threads that might draw the short straw after triggering a printk call, so it is difficult to quantify.
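The 12x figure for the console follows from the line speed alone. The sketch below works through the arithmetic under two assumptions that are not stated in this description: that the previous speed was 9600 bits per second (which is what a 12x ratio to 115200 implies), and that the line uses 8N1 framing, so each byte costs 10 bits on the wire. The 64 KiB log volume is an invented figure for illustration.

```python
def console_write_seconds(num_bytes, baud, bits_per_byte=10):
    """Time to push num_bytes through a serial line at `baud` bits per second.

    bits_per_byte=10 assumes 8N1 framing: 1 start bit, 8 data bits, 1 stop bit.
    """
    return num_bytes * bits_per_byte / baud


OLD_BAUD = 9600     # assumed previous speed, inferred from the 12x ratio
NEW_BAUD = 115200   # speed used by this change

log_bytes = 64 * 1024  # hypothetical amount of console output during boot

old_time = console_write_seconds(log_bytes, OLD_BAUD)
new_time = console_write_seconds(log_bytes, NEW_BAUD)

print(f"at {OLD_BAUD}: {old_time:.2f}s, at {NEW_BAUD}: {new_time:.2f}s")
print(f"speedup: {old_time / new_time:.1f}x")
```

This is the total time the serial line is busy, not the boot time saved. As noted above, the wait is spread across whichever threads end up flushing printk output, so the wall-clock gain is smaller and harder to measure.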

Removing the udevadm settle dependency doesn't yield a measurable improvement in boot speed, but does stop systemd from blaming wicked for slowing everything down.

I've kept the two commits that added debug output for systemd-tmpfiles and the wicked clients, since these were instrumental in identifying the underlying issues and confirming the fixes. These logs are all sent to the journal rather than the console, so they don't compete with existing output or slow down the boot.

Testing done:
For the kernel change: verified that the keyboard and mouse modules were still loaded on x86_64 nodes.
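A check like this can be scripted against the contents of `/proc/modules`. The following is a minimal sketch, assuming the relevant modules are named `i8042`, `atkbd`, and `psmouse` (the usual upstream names for the PS/2 controller, keyboard, and mouse drivers; the description does not list them). The sample text is made up for illustration.

```python
# Assumed module names; not taken from this pull request.
EXPECTED_MODULES = ("i8042", "atkbd", "psmouse")


def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text.

    Each line starts with the module name, followed by size, use count,
    dependencies, state, and load address.
    """
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def missing_modules(proc_modules_text, expected=EXPECTED_MODULES):
    """Return the expected modules that are not currently loaded, in order."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in expected if name not in loaded]


# Made-up sample in the /proc/modules format.
sample = """\
psmouse 167936 0 - Live 0xffffffffc0a00000
atkbd 36864 0 - Live 0xffffffffc09f0000
i8042 45056 0 - Live 0xffffffffc09e0000
"""

print(missing_modules(sample))
```

On a running node the text would come from reading `/proc/modules`; an empty result means all expected modules were loaded after the root filesystem was mounted.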

For the changes to kubelet and the CNI plugins directory: verified that sonobuoy runs passed for these versions, and that no Docker-related error messages were logged to the journal.

For the "activate" targets: confirmed that these were no longer blamed by systemd-analyze blame, and that bootstrap containers still worked as expected.
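The blame check can also be automated by parsing the output of `systemd-analyze blame`. This is a rough sketch that assumes the usual layout of that output, a time span followed by a unit name on each line, with spans such as `900ms`, `1.204s`, or `1min 2.500s`. The unit names in the sample are hypothetical and are not the real names of the "activate" units.

```python
import re

# Seconds per time-span suffix; lines with other suffixes are skipped.
_SECONDS = {"h": 3600.0, "min": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6}
_TOKEN = re.compile(r"^(\d+(?:\.\d+)?)(h|min|ms|us|s)$")


def parse_blame(text):
    """Map unit name -> startup time in seconds from blame-style output."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        *span_tokens, unit = parts
        total = 0.0
        for token in span_tokens:
            match = _TOKEN.match(token)
            if match is None:
                break  # unrecognized span, skip this line
            total += float(match.group(1)) * _SECONDS[match.group(2)]
        else:
            result[unit] = total
    return result


def blamed(text, units, threshold=0.0):
    """Return the given units that appear in the output above `threshold` seconds."""
    times = parse_blame(text)
    return [u for u in units if times.get(u, 0.0) > threshold]


# Made-up sample output with hypothetical unit names.
sample = """\
1min 2.500s slow-example.service
     1.204s example-activate.service
      900ms systemd-tmpfiles-setup.service
"""

print(blamed(sample, ["example-activate.service", "not-listed.service"]))
```

Passing a `threshold` lets the same helper confirm that a unit such as `systemd-tmpfiles-setup.service` dropped below a target time rather than merely checking whether it is listed.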

For the wicked changes: confirmed that the DHCPv6 client would defer after the first timeout, whether the timer fired slightly before or slightly after one second elapsed. On instances with DHCPv6 enabled, the lease was successfully acquired.

For the udev settle change: used a hacked-up local build where wicked was set up to manage "eth1" rather than "eth0", and verified that wicked would still configure the device if I renamed it into existence during the wait.

For the serial console changes: verified that console logs were present for AWS variants across a range of instance types - c1.xlarge, t2.large, m3.2xlarge, c3.large, c4.large, c5.large, c6g.large - and for VMware variants running on ESXi 7.0. Note that we're already using 115200 for GRUB as of #1701, so this setting has previously been validated on a smaller set of instance types.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Contributor

@samuelkarp samuelkarp left a comment

One super minor suggestion, but otherwise LGTM!

What=overlay
Where=/opt/cni/bin
Type=overlay
Options=noatime,nosuid,nodev,lowerdir=/usr/libexec/cni/bin,upperdir=/opt/cni/upper,workdir=/opt/cni/work,context=system_u:object_r:local_t:s0
Contributor

Would it be worth adding another directory here to contain the upperdir and workdir to hide them/make it more obvious that they're an implementation detail of the overlay mount? Something like /opt/cni/.overlay/upper and /opt/cni/.overlay/work?

Contributor Author

I opted to use /var/lib/cni-plugins for the overlay directories, partly to match the treatment of /var/lib/kernel-devel (which I also adjusted along these lines), and partly to guard against cases where pods might be mounting in /opt/cni and get confused by the new directories.

Contributor Author

bcressey commented Nov 9, 2021

Rebase; fix the serial console speed commit to account for the removed aws-k8s-1.17 variant.

If they're built in, they can delay mounting the root filesystem.

Signed-off-by: Ben Cressey <[email protected]>

This disables most of the Docker-related functionality, and avoids a
five second delay at startup waiting for the Docker daemon.

Signed-off-by: Ben Cressey <[email protected]>

We use tmpfiles extensively, and the additional output gives a more
complete picture of what happens each boot.

Signed-off-by: Ben Cressey <[email protected]>

Move the upper, lower, and work directories for the writable kernel
development tree into a subdirectory, to better indicate their status
as an implementation detail for the overlayfs mount.

Signed-off-by: Ben Cressey <[email protected]>

This speeds up boot by avoiding the need to copy the binaries to the
local storage volume.

Signed-off-by: Ben Cressey <[email protected]>

Otherwise these units show up as some of the longest running jobs in
`systemd-analyze blame` output.

Signed-off-by: Ben Cressey <[email protected]>

The wicked daemons will wait for expected devices to appear, which is
more reliable than relying on `udevadm settle` and avoids unnecessary
boot delays.

Signed-off-by: Ben Cressey <[email protected]>

We use a one second defer timeout for the DHCPv6 lease essentially to
mark it as optional and minimize the boot delay. One second is longer
than we would like already, but going sub-second is somewhat invasive
because the timeouts are tied to the protocol implementation and can
change the client behavior. It's relatively simple to avoid the extra
wait caused by an early timer event.

Signed-off-by: Ben Cressey <[email protected]>

Existing variant platforms all support the 115200 speed for the guest
serial device.

Signed-off-by: Ben Cressey <[email protected]>

Any use of RAID is left up to containers to handle.

Signed-off-by: Ben Cressey <[email protected]>

Contributor Author

bcressey commented Nov 9, 2021

Adjust overlay directory handling per @samuelkarp

Contributor

@samuelkarp samuelkarp left a comment

LGTM

@bcressey bcressey merged commit cb728c4 into bottlerocket-os:develop Nov 10, 2021
@bcressey bcressey deleted the faster-boot branch November 10, 2021 21:13