Backport systemd patches and configure netlink timeout #3520

Merged 2 commits on Oct 12, 2023

@zmrow zmrow commented Oct 11, 2023

Issue number:
N/A

Description of changes:
This PR backports a few patches from a recent systemd issue. Further details are in the issue; the tl;dr is that on a heavily loaded system (such as one running many containers/pods), the kernel may not respond to RTM_NEWROUTE requests within the current default 25-second timeout. When that happens, network links become unusable. Bottlerocket now sets this timeout to infinity, since we require all links to be usable, especially the main link, even if the system is heavily loaded.

    systemd: Backport patches that allow netlink timeout to be configured

    Backport 2 patches that make the default timeout for netlink replies
    from the kernel configurable.  The default timeout is 25 seconds, and
    it is possible to hit this timeout if the system is under load.  If
    the timeout is hit, network links can become unusable without
    intervention.

    Reference issue: github.com/systemd/systemd/issues/25441

    release: Configure default SYSTEMD_NETLINK_DEFAULT_TIMEOUT

    Write a drop-in for the systemd-networkd service that sets the
    SYSTEMD_NETLINK_DEFAULT_TIMEOUT environment variable to "infinity",
    rather than use the default timeout of 25 seconds.  This ensures that
    systemd-networkd won't put links into an inoperable state if the
    kernel/system is heavily loaded and doesn't respond right away.
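
As a rough illustration of the drop-in described in the second commit, the sketch below writes a `[Service]` override that sets `SYSTEMD_NETLINK_DEFAULT_TIMEOUT`. The function name, the file name `10-netlink-timeout.conf`, and the target directory are hypothetical; the actual file shipped by this PR may be named and installed differently.

```shell
#!/bin/sh
# Sketch only: write a systemd drop-in that sets the netlink reply timeout
# for a service. The function name and the drop-in file name are
# hypothetical and are not taken from the PR.
write_netlink_timeout_dropin() {
    dropin_dir="$1"                 # e.g. <unit dir>/systemd-networkd.service.d
    timeout_value="${2:-infinity}"  # the PR uses "infinity"; default was 25 seconds
    dropin_name="10-netlink-timeout.conf"

    mkdir -p "$dropin_dir" || return 1
    # A drop-in only needs the section and the setting it overrides.
    printf '[Service]\nEnvironment=SYSTEMD_NETLINK_DEFAULT_TIMEOUT=%s\n' \
        "$timeout_value" > "$dropin_dir/$dropin_name"
}
```

On a general-purpose systemd host, the directory would typically be something like `/etc/systemd/system/systemd-networkd.service.d`, followed by `systemctl daemon-reload` and a restart of `systemd-networkd`. In Bottlerocket the drop-in is part of the image, so that step is an assumption that applies only to manual experiments.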

Testing done:
systemd-networkd's environment shows the appropriate variable after this change:

bash-5.1# systemctl status systemd-networkd
...
   Main PID: 2767 (systemd-network)
...

bash-5.1# cat /proc/2767/environ
...SYSTEMD_NETLINK_DEFAULT_TIMEOUT=infinity
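
Entries in `/proc/<pid>/environ` are separated by NUL bytes, so plain `cat` output runs the variables together. A minimal sketch of a more targeted check follows; the helper name `lookup_env_var` is hypothetical, and it assumes values contain no newlines.

```shell
#!/bin/sh
# Sketch only: print the value of one variable from a NUL-separated
# environ file such as /proc/<pid>/environ. The helper name is hypothetical.
lookup_env_var() {
    environ_file="$1"
    name="$2"
    # Turn NUL separators into newlines, then strip the "NAME=" prefix
    # from the matching line and print only the value.
    tr '\0' '\n' < "$environ_file" | sed -n "s/^${name}=//p"
}

# Possible usage on a running system (needs permission to read the file):
#   pid=$(systemctl show --property MainPID --value systemd-networkd)
#   lookup_env_var "/proc/$pid/environ" SYSTEMD_NETLINK_DEFAULT_TIMEOUT
```

With the change from this PR applied, the usage shown in the comments would be expected to print `infinity`.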

I also set a very short timeout of 1ms and confirmed that the network failed to come up:

bash-5.1# journalctl -u systemd-networkd
...
Oct 11 22:29:13 localhost systemd-networkd[1004]: Enumeration completed
Oct 11 22:29:13 localhost systemd-networkd[1004]: eth0: Failed to wait for the interface to be initialized: Connection timed out
Oct 11 22:29:13 localhost systemd-networkd[1004]: eth0: Failed
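
To spot this failure mode without reading the whole journal, a small filter along these lines could list the links that hit the timeout. It is a sketch that matches only the exact message shown above; the function name `timed_out_links` is hypothetical, and the message wording may differ between systemd versions.

```shell
#!/bin/sh
# Sketch only: read systemd-networkd journal lines on stdin and print the
# name of each link that timed out while waiting to be initialized.
# The function name is hypothetical; the pattern matches only the message
# format shown in the transcript above.
timed_out_links() {
    sed -n 's/.*systemd-networkd\[[0-9]*]: \([^:]*\): Failed to wait for the interface to be initialized: Connection timed out.*/\1/p'
}

# Possible usage:
#   journalctl -u systemd-networkd --no-pager | timed_out_links
```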

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@zmrow zmrow requested a review from bcressey October 11, 2023 20:14
zmrow commented Oct 11, 2023

^ Addresses @bcressey's feedback

@zmrow zmrow requested a review from bcressey October 11, 2023 20:38
@zmrow zmrow merged commit c3e7a9e into bottlerocket-os:develop Oct 12, 2023
48 checks passed
@zmrow zmrow deleted the netlink-timeout branch October 12, 2023 19:33
@stmcginnis stmcginnis mentioned this pull request Oct 23, 2023