Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Conditionally remove networkd online dependency on Ubuntu #5772

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

TheRealFalcon
Copy link
Member

@TheRealFalcon TheRealFalcon commented Oct 2, 2024

Proposed Commit Message

feat: Conditionally remove networkd online dependency on Ubuntu

Traditionally, cloud-init-network.service (previously
cloud-init.service) waited for network connectivity (via systemd
service ordering) before running. This has caused
cloud-init-network.service to block boot for a significant amount of
time. For the vast majority of boots, this network connectivity
isn't required.

This commit removes the ordering
After=systemd-networkd-wait-online.service, but checks the datasource
and user data in the init-local timeframe to see if network
connectivity will be necessary in the init network timeframe.
If so, when the init network service starts, it will call
systemd-networkd-wait-online manually in the same manner that the
systemd-networkd-wait-online.service does to wait for network
connectivity.

This commit affects Ubuntu only due to the various number of service
orderings and network renderers possible, along with the downstream
synchronization needed. However, a new overrideable method in the
Distro class should make this optimization trivial to implement for
any other distro.

Additional Context

Test Steps

Merge type

  • Squash merge using "Proposed Commit Message"
  • Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

@TheRealFalcon
Copy link
Member Author

TheRealFalcon commented Oct 3, 2024

Take 2...

I added a commit that instead of using a drop-in will call the systemd-networkd-wait-online binary using the exact same args that the service uses based on 'systemctl cat'. This removes the need for a costly daemon-reload in the cases we need to wait on network.

I haven't touched integration tests since the first commit, so they will have known failures.

Copy link
Collaborator

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @TheRealFalcon for working this. quick first pass review as your are actively coding/developing tests for this functionality

@@ -411,6 +474,9 @@ def main_init(name, args):
mode = sources.DSMODE_LOCAL if args.local else sources.DSMODE_NETWORK

if mode == sources.DSMODE_NETWORK:
if os.path.exists(init.paths.get_runpath(".wait-on-network")):
LOG.debug("Will wait for network connectivity before continuing")
init.distro.wait_for_network()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we now util.del_file("init.paths.get_runpath(".wait-on-network")) now that we've waited and succeeded on the wait_for_network() call?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to? /run gets cleaned on boot, and if a cloud-init collect-logs will contain the file as another piece of evidence. I don't think it would hurt any further invocations of cloud-init on the current boot.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason would be later calls on the command line to cloud-init init per use-cases like -#5726 not needing to actually shell out to run that wait again.

Comment on lines 391 to 396
_should_wait_via_cloud_config(config)
for config in [
datasource.get_userdata_raw(),
datasource.get_vendordata_raw(),
datasource.get_vendordata2_raw(),
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this a generator expression without having to calculate the truthy return values from userdata vedordata and vendordarta2? Then we can early exit on the first case of userdata/vendordata containing write_files or un-decodable file content without having to perform the extra calculations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, good question. A generator expression won't accomplish that though.

I could do something like:

def _get_configs(datasource):
    yield datasource.get_userdata_raw()
    yield datasource.get_vendordata_raw()
    yield datasource.get_vendordata2_raw()

def _should_wait_on_network(datasource):
    ...
    return any(
        _should_wait_via_cloud_config(config)
        for config in _get_configs(datasource)
    )

if _should_wait_on_network(init.datasource):
LOG.debug(
"Network connectivity determined necessary for "
"cloud-init's network stage"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Future me" in triage wants to know the reason we chose to wait on network. Can we log what condition triggered our network wait: no datasource vs non-#cloud-config user-data vs write_files in config. I also want this in logs to make it conspicuous why we are waiting in the event that we shouldn't be in some cases.

@@ -220,6 +221,10 @@ def set_keymap(self, layout: str, model: str, variant: str, options: str):
# be needed
self.manage_service("restart", "console-setup")

def wait_for_network(self):
"""Ensure that cloud-init's network service has network connectivity"""
netplan.wait_for_network()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like this approach will work for broader environments than just netplan and just Ubuntu and/or Debian. I like that you are keeping it specific for the time being to environments which have netplan.

Don't we also want to confirm netplan.available() before calling netplan.wait_for_network() to avoid calling this on debian environments which don't have netplan.io installed?

This makes sense if the debian image contains netplan, but what happens on Debian images when they don't have have netplan?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I was assuming it was fine for debian because of the netplan usage there too.
I think instead I should move this method into ubuntu.py. If netplan isn't available, we have nothing left to wait on. On Ubuntu, we can assume we have netplan available for supported images. If somebody goes off the beaten path in their image there's really no "right" thing to do other than log an error.

@@ -9,7 +9,9 @@ Wants=cloud-init-local.service
Wants=sshd-keygen.service
Wants=sshd.service
After=cloud-init-local.service
{% if variant not in ["debian", "ubuntu"] %}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is "safe to include debian in this conditional here. There are plenty of images, even cloud image containing cloud-init, which also don't contain netplan.io in the image: lxc launch images:debian/bookworm/cloud. In these cases, I'd not want default packaging to drop this ordering on Debian.

LOG.debug("NetworkManager is enabled, skipping networkd wait")
return
wait_online_def: str = subp.subp(
["systemctl", "cat", "systemd-networkd-wait-online.service"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Given that this function isn't really netplan-speciifc, it's more systemd-networkd specific, doesn't this helper function actually belong over in cloudinit/net/systemd.py?
  2. If we happen to run in an environment which doesn't have the systemd-networkd-wait-online.service the systemctl cat will exit 1, we should figure out how we want to behave in that environment. Info or WARNING log and return without waiting?
  3. If called on a system without netplan.io, do we want this function to log that condition and just return without waiting?

Traditionally, cloud-init-network.service (previously
cloud-init.service) waited for network connectivity (via systemd
service ordering) before running. This has caused
cloud-init-network.service to block boot for a significant amount of
time. For the vast majority of boots, this network connectivity
isn't required.

This commit removes the ordering
After=systemd-networkd-wait-online.service, but checks the datasource
and user data in the init-local timeframe to see if network
connectivity will be necessary in the init network timeframe.
If so, when the init network service starts, it will call
systemd-networkd-wait-online manually in the same manner that the
systemd-networkd-wait-online.service does to wait for network
connectivity.

This commit affects Ubuntu only due to the various number of service
orderings and network renderers possible, along with the downstream
synchronization needed. However, a new overrideable method in the
Distro class should make this optimization trivial to implement for
any other distro.
@TheRealFalcon
Copy link
Member Author

@blackboxsw , thanks for the review. I applied your comments and also made a change to move the wait into the activators.

@blackboxsw blackboxsw added the packaging Supplemental package review requested label Oct 3, 2024
return True, "random_seed command found"
if parsed_yaml.get("mounts"):
return True, "mounts found"
return False, "cloud-config does not contain network requiring elements"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good. thanks for the breakdown

return activators.select_activator(priority=priority)
except activators.NoActivatorException:
return None
return activators.select_activator(priority=priority)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change I think you've also missed handling the exception at the call-site in cloudinit/cmd/devel/hotplug_hook.py... Although I think that call-site in hotplug hook is also broken anyway before this PR if cloud-init can't find an activator as activator return value would be None, and we'd try to reference None.bring_up_interface

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, lack of exception handling there was intentional.

@TheRealFalcon TheRealFalcon marked this pull request as ready for review October 8, 2024 19:41
@blackboxsw blackboxsw self-assigned this Oct 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
packaging Supplemental package review requested
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants