Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Warn user of unexpected run mode #5209

Merged

Conversation

holmanb
Copy link
Member

@holmanb holmanb commented Apr 24, 2024

Proposed Commit Message

feat: Warn user of unexpected run mode
  
On systemd, services are started by PID 1. When this doesn't happen, cloud-init
is in an unknown run state and should warn the user.

- Reorder pid log to be able to reuse Distro information.
- Add docstring deprecating util.is_Linux().

Merge type

  • Squash merge using "Proposed Commit Message"
  • Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

Copy link
Collaborator

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking some boxes:

  • ✔ Alpine (openrc-based) will invoke cloud-init boot stages from non-1 PPID, but we are ignoring non-systemd cases
  • ✔non-linux and therefore non-systemd is ignored

Here's a suggested diff to ensure the logs are actually written, and the log message has a bit more data about the launched boot stage. It doesn't resolve either the lack of logging PID on non-linux/non-systemd or the dropping of util.is_Linux.

diff --git a/cloudinit/cmd/main.py b/cloudinit/cmd/main.py
index ad9910e04..aff4d380f 100644
--- a/cloudinit/cmd/main.py
+++ b/cloudinit/cmd/main.py
@@ -72,15 +72,20 @@ def print_exc(msg=""):
     sys.stderr.write("\n")
 
 
-def log_ppid(distro):
+def log_ppid(distro, bootstage_name):
     if distro.is_linux:
         ppid = os.getppid()
         log = LOG.info
         extra_message = ""
         if 1 != ppid and distro.uses_systemd():
             log = LOG.warning
-            extra_message = ("Not a supported configuration.",)
-        log("PID [%s] started cloud-init. %s", ppid, extra_message)
+            extra_message = " Unsupported configuration: boot stage called outside of systemd"
+        log(
+            "PID [%s] started cloud-init '%s'.%s",
+            ppid,
+            bootstage_name,
+            extra_message,
+        )
 
 
 def welcome(action, msg=None):
@@ -321,14 +326,11 @@ def main_init(name, args):
     #    objects config as it may be different from init object
     # 10. Run the modules for the 'init' stage
     # 11. Done!
-    if not args.local:
-        w_msg = welcome_format(name)
-    else:
-        w_msg = welcome_format("%s-local" % (name))
+    bootstage_name = "%s-local" % (name) if args.local else name
+    w_msg = welcome_format(bootstage_name)
     init = stages.Init(ds_deps=deps, reporter=args.reporter)
     # Stage 1
     init.read_cfg(extract_fns(args))
-    log_ppid(init.distro)
     # Stage 2
     outfmt = None
     errfmt = None
@@ -354,6 +356,7 @@ def main_init(name, args):
     # config applied.  We send the welcome message now, as stderr/out have
     # been redirected and log now configured.
     welcome(name, msg=w_msg)
+    log_ppid(init.distro, bootstage_name)
 
     # re-play early log messages before logging was setup
     for lvl, msg in early_logs:
@@ -581,11 +584,11 @@ def main_modules(action_name, args):
     #    the modules objects configuration
     # 5. Run the modules for the given stage name
     # 6. Done!
-    w_msg = welcome_format("%s:%s" % (action_name, name))
+    bootstage_name = "%s:%s" % (action_name, name)
+    w_msg = welcome_format(bootstage_name)
     init = stages.Init(ds_deps=[], reporter=args.reporter)
     # Stage 1
     init.read_cfg(extract_fns(args))
-    log_ppid(init.distro)
     # Stage 2
     try:
         init.fetch(existing="trust")
@@ -620,6 +623,7 @@ def main_modules(action_name, args):
 
     # now that logging is setup and stdout redirected, send welcome
     welcome(name, msg=w_msg)
+    log_ppid(init.distro, bootstage_name)
 
     if name == "init":
         util.deprecate(

@@ -324,6 +328,7 @@ def main_init(name, args):
init = stages.Init(ds_deps=deps, reporter=args.reporter)
# Stage 1
init.read_cfg(extract_fns(args))
log_ppid(init.distro)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot happen until after log.setup_logging has been called.

@@ -580,6 +585,7 @@ def main_modules(action_name, args):
init = stages.Init(ds_deps=[], reporter=args.reporter)
# Stage 1
init.read_cfg(extract_fns(args))
log_ppid(init.distro)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to move this down a bit too after log.setup_logging otherwise no logs are written to /var/log/cloud-init.log. To retain original behavior, we may want to log this just after the welcome(... w_msg) to log it during the boot stage block.

extra_message = ""
if 1 != ppid and distro.uses_systemd():
log = LOG.warning
extra_message = ("Not a supported configuration.",)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add details in the extra message to tell use what's unsupported about it.
Unsupported configuration: boot stage called outside of systemd

@@ -490,6 +490,12 @@ def multi_log(

@lru_cache()
def is_Linux():
"""deprecated: prefer Distro object's `is_linux` property
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could just drop this function altogether if we can figure a better approach to sort cloud-init analyze dump as the only other use-case right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep!

if 1 != ppid and distro.uses_systemd():
log = LOG.warning
extra_message = ("Not a supported configuration.",)
log("PID [%s] started cloud-init. %s", ppid, extra_message)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we intentionally not logging the PID when not distro.is_linux?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what the code did before, but I suppose we should be able to log PPID on non-linux too. I'm curious whether the PPID=1 should hold true on any of the BSDs, and it wouldn't hurt to log it.

@blackboxsw blackboxsw self-assigned this Apr 25, 2024
Copy link
Collaborator

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good and behaves well. One thing I think we need to sort here is integration tests which call cloud-init --local over SSH as they'll now exit non-zero due to the warning.

Tests affected are at least the following which both check response.ok from a cloud-init init --local call.

  • tests/integration_tests/datasources/test_ec2_ipv6.py
  • tests/integration_tests/test_upgrade.py

We might be able to adapt those tests to generally disregard exit 2 or maybe specifically call verify_clean_boot with a non-empty ignore_warnings item to ignore this expected warning log.

@TheRealFalcon
Copy link
Member

TheRealFalcon commented Apr 30, 2024

Because we patched out the exit 2 behavior on old releases, there's no backwards incompatibility here, correct? Just an extra warning in the logs?

We would see an exit 2 when we SRU to Noble though...

@blackboxsw
Copy link
Collaborator

Because we patched out the exit 2 behavior on old releases, there's no backwards incompatibility here, correct? Just an extra warning in the logs?

We would see an exit 2 when we SRU to Noble though...

yep and yep. I think it's worth a followup quilt patch in ubuntu/noble:debian/patches/retain-ppid-unsupported-config-as-info-level.patch to ensure we don't set warning level on the logs in noble.

@holmanb
Copy link
Member Author

holmanb commented May 2, 2024

yep and yep. I think it's worth a followup quilt patch in ubuntu/noble:debian/patches/retain-ppid-unsupported-config-as-info-level.patch to ensure we don't set warning level on the logs in noble.

we could always just make this a deprecation log and deal with this in the same way that we choose to deal with other new deprecations on older series - a patch per log like this seems really expensive to maintain in the long run

@holmanb holmanb mentioned this pull request May 3, 2024
2 tasks
@blackboxsw
Copy link
Collaborator

we could always just make this a deprecation log and deal with this in the same way that we choose to deal with other new deprecations on older series - a patch per log like this seems really expensive to maintain in the long run

That is a much better idea. Good suggestion, let's go with deprecation.

Copy link
Collaborator

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So on stable releases, we'll patch the warning to a deprecation. Only thing left to land this PR is integration test fixes per this comment

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label May 23, 2024
@holmanb holmanb removed the stale-pr Pull request is stale; will be auto-closed soon label May 29, 2024
@canonical canonical deleted a comment from github-actions bot Jun 3, 2024
@blackboxsw blackboxsw added this to the cloud-init-24.3 milestone Jun 3, 2024
@blackboxsw
Copy link
Collaborator

I think this is just waiting on two integration test fixes. But, I'd like this in 24.3 as we are introducing a "breaking change" as far as warnings being emitted where there were not warnings before-hand and I'd like us to avoid trying to quickly stuff quilt patches into /noble for this across the SRU boundary for 24.2

@holmanb
Copy link
Member Author

holmanb commented Jun 4, 2024

I think this is just waiting on two integration test fixes. But, I'd like this in 24.3 as we are introducing a "breaking change" as far as warnings being emitted where there were not warnings before-hand and I'd like us to avoid trying to quickly stuff quilt patches into /noble for this across the SRU boundary for 24.2

Resolving this involves either:

  1. ignore this warning always
  2. do this somehow without using verify_clean_log()

I don't think that we want 1) (it involves muddying verify_clean_log() with even more single-use ignores aka tech debt), but 2) requires implementing what was described here:

One idea that I envision for this helper is a matrix of platforms and series that allows us to specify exactly when an expected warning message that will be seen. This would be more precise than the current "ignore this pattern everywhere" paradigm that verify_clean_log() uses. That said, I'm not sure that I want to include that refactor in this PR, which has already grown sizeable.

In order to make verify_clean_log() truly be a general-purpose drop-in replacement for verify_clean_boot(), I think we need to grow the ability to track tracebacks in cloud-init status --format json (which would be useful for users anyways). Unfortunately, that is not something that will happen quickly, but also I don't think that this is a strong requirement today for implementing this test, since ignoring tracebacks is an oracle-only requirement.

I'll share what I've been working on shortly.

@holmanb
Copy link
Member Author

holmanb commented Jun 4, 2024

@blackboxsw Last commit contains what I had in mind.

With this commit I can successfully run tests/integration_tests/test_upgrade.py::test_clean_boot_of_upgraded_package and get an instance with this log:

# grep WARN /var/log/cloud-init.log 
2024-06-04 01:07:26,672 - main.py[WARNING]: PID [2082] started cloud-init 'init-local'. Unsupported configuration: boot stage called outside of systemd
2024-06-04 01:07:27,299 - main.py[WARNING]: PID [2082] started cloud-init 'init'. Unsupported configuration: boot stage called outside of systemd
2024-06-04 01:07:27,956 - main.py[WARNING]: PID [2082] started cloud-init 'modules:config'. Unsupported configuration: boot stage called outside of systemd
2024-06-04 01:07:28,548 - main.py[WARNING]: PID [2082] started cloud-init 'modules:final'. Unsupported configuration: boot stage called outside of systemd

And it passes.

Thoughts?

@holmanb holmanb requested a review from blackboxsw June 10, 2024 18:10
@blackboxsw
Copy link
Collaborator

Thanks @holmanb . I've added two supplemental commits to your branch for reviiew

  • move from LOG.warning to util.deprecate() so we can better prompt any consumers of this feature that it will go away. It also allows us to reduce this deprectaion from WARNING level to INFO level on stable releases to avoid non-zero exit codes in the future once the feature spec'd at WIP brach deprecation version boundary #5411 lands
  • fixup verify_clean_boot to support ignore_(warnings|errrors|tracebacks)=True params just like verify_clean_log supports. Allowing individual tracebacks to be ignored or all tracebacks if ignore_tracebacks=True.

@blackboxsw blackboxsw force-pushed the holmanb/warn-when-not-running-under-pid-1 branch 2 times, most recently from bb67e92 to 370552d Compare June 19, 2024 19:07
tests/integration_tests/util.py Outdated Show resolved Hide resolved
).stdout.strip()
)
assert expected_traceback_count == traceback_count, (
f"{traceback_count - expected_traceback_count} unexpected traceback(s)"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: This fits the technical requirement for checking for tracebacks, but I'd prefer to have a nicer UX than "we found something we didn't expect, go figure it out", if possible.

Currently the one-line pytest failure summary will say something like:

Unexpected traceback found in /var/log/cloud-init.log

However, if the error message included the exception message I think that would be more useful while iterating during development or debugging.

Unexpected tracebacks: ZeroDivisionError: division by zero

Also, looking back at _format_found(), we only include the unexpected log when there is a single unexpected error / warning / etc. I'd like to change this to include the first unexpected encountered rather than use a generic header - again so that the pytest summary has richer information about what failed.

We don't need to make these changes in this PR, but I just wanted to lay out some UI improvements that I'd like to see eventually which should make developing and debugging easier.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this too, I did also look at processing/extracting tracebacks from the cloud-init.log and thought it a bit more complex than trying to piggy back on this PR. But, agreed a richer UX would be helpful in the event of failures and manually jumping into the logged artifacts to get this data after the fact is painful.

cloudinit/cmd/main.py Outdated Show resolved Hide resolved
DEPRECATE_BOOT_STAGE_MESSAGE = (
"Triggering cloud-init boot stages outside of intial system boot is not a"
" fully supported operation which can lead to incomplete of incorrect"
" configuration due to either race-conditions with other active services"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

race conditions and incomplete triggering are two reasons these may not work, but there are other reasons, for example: non-idempotent behavior

Maybe just drop the "due to ..."?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

" fully supported operation which can lead to incomplete of incorrect"
" configuration due to either race-conditions with other active services"
" or incomplete triggering of all cloud-init boot stages. As such,"
" cloud-init is deprecating this undocumented feature in the future. If"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: these subcommands are documented (man page, docs), we just don't actively encourage their use

I think it would be best if we make them undocumented, or at least make the wording in the man page and docs more clear that these aren't intended to be used manually

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That probably makes sense in this PR too.

cloudinit/cmd/main.py Outdated Show resolved Hide resolved
@blackboxsw blackboxsw force-pushed the holmanb/warn-when-not-running-under-pid-1 branch 2 times, most recently from ee02b64 to 6134d93 Compare June 24, 2024 21:31
@blackboxsw
Copy link
Collaborator

@holmanb I think I resolved your last pass of review comments. Please take a look and make sure I didn't miss anything.

@blackboxsw
Copy link
Collaborator

blackboxsw commented Jun 25, 2024

@holmanb I think we want this #5209 in milestone 24.2 only if #5411 merged because #5491will give us an easy downstream/stable release mechanism to avoid warnings on "new" deprecation messages introduced across SRU boundary. If we think 5491 will be reviewable for milestone 24.2 today/tomorrow, I'll spend some time on that PR now and could provide some feedback here to get #5411 across the line.

@blackboxsw
Copy link
Collaborator

I think now that milestone 24.2 has been released. We can merge this pending your approval @holmanb

@holmanb
Copy link
Member Author

holmanb commented Jul 4, 2024

I think now that milestone 24.2 has been released. We can merge this pending your approval @holmanb

Thanks @blackboxsw taking one last look at this now, I think that I'd like to see two minor changes before merge:

  1. Use the new util deprecation helper rather than ubuntu series to conditionally check for depreciation.

  2. Squash git history. I think that the modifications to verify_clean_boot probably deserve to be a separate (initial) commit, and the rest can probably be squashed into a single coauthored commit.

WDYT?

@blackboxsw
Copy link
Collaborator

I think now that milestone 24.2 has been released. We can merge this pending your approval @holmanb

Thanks @blackboxsw taking one last look at this now, I think that I'd like to see two minor changes before merge:

  1. Use the new util deprecation helper rather than ubuntu series to conditionally check for depreciation.
  2. Squash git history. I think that the modifications to verify_clean_boot probably deserve to be a separate (initial) commit, and the rest can probably be squashed into a single coauthored commit.

WDYT?
@holmanb +1 on using deprecation helper in integration tests and agreed on squash merge intent.

On systemd, services are started by PID 1. When this doesn't happen, cloud-init
is in an unknown run state and should warn the user.

Reorder pid log to be able to reuse Distro information.

Add docstring deprecating util.is_Linux().
holmanb added a commit to holmanb/cloud-init that referenced this pull request Jul 4, 2024
…5209)

Implement verify_clean_boot() to ignore certain expected logs
in a platform-specific way.
@holmanb holmanb force-pushed the holmanb/warn-when-not-running-under-pid-1 branch from 6134d93 to ee44a4e Compare July 4, 2024 16:59
holmanb pushed a commit to holmanb/cloud-init that referenced this pull request Jul 4, 2024
…anonical#5209)

Ensure ignore_warnings=True or ignore_errors=True is honored and
not overridden by supplemental warning texts appended.
holmanb pushed a commit to holmanb/cloud-init that referenced this pull request Jul 4, 2024
…l#5209)

Avoid using warning level messages as there may be some
use-cases in the wild that need to invoke cloud-init boot
stages after boot for some reason unknown to upstream.

Provide a detailed warning message informing admins to file
issues against cloud-init to better represent those feature
needs before dropping this feature altogether.
"Unsupported configuration: boot stage called "
f"by PID [{ppid}] outside of systemd"
),
deprecated_version="24.2",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be 24.3 now

holmanb and others added 3 commits July 10, 2024 11:37
…5209)

Implement verify_clean_boot() to ignore certain expected logs
in a platform-specific way.
…anonical#5209)

Ensure ignore_warnings=True or ignore_errors=True is honored and
not overridden by supplemental warning texts appended.
…l#5209)

Avoid using warning level messages as there may be some
use-cases in the wild that need to invoke cloud-init boot
stages after boot for some reason unknown to upstream.

Provide a detailed warning message informing admins to file
issues against cloud-init to better represent those feature
needs before dropping this feature altogether.
@blackboxsw blackboxsw force-pushed the holmanb/warn-when-not-running-under-pid-1 branch from ee44a4e to 8805555 Compare July 10, 2024 18:30
@blackboxsw blackboxsw merged commit 75add5c into canonical:main Jul 10, 2024
23 checks passed
blackboxsw pushed a commit that referenced this pull request Jul 10, 2024
On systemd, services are started by PID 1. When this doesn't happen, cloud-init
is in an unknown run state and should warn the user.

Reorder pid log to be able to reuse Distro information.

Add docstring deprecating util.is_Linux().
blackboxsw pushed a commit that referenced this pull request Jul 10, 2024
Implement verify_clean_boot() to ignore certain expected logs
in a platform-specific way.
blackboxsw added a commit that referenced this pull request Jul 10, 2024
…5209)

Ensure ignore_warnings=True or ignore_errors=True is honored and
not overridden by supplemental warning texts appended.
holmanb added a commit to holmanb/cloud-init that referenced this pull request Aug 2, 2024
On systemd, services are started by PID 1. When this doesn't happen, cloud-init
is in an unknown run state and should warn the user.

Reorder pid log to be able to reuse Distro information.

Add docstring deprecating util.is_Linux().
holmanb added a commit to holmanb/cloud-init that referenced this pull request Aug 2, 2024
…5209)

Implement verify_clean_boot() to ignore certain expected logs
in a platform-specific way.
holmanb pushed a commit to holmanb/cloud-init that referenced this pull request Aug 2, 2024
…anonical#5209)

Ensure ignore_warnings=True or ignore_errors=True is honored and
not overridden by supplemental warning texts appended.
holmanb pushed a commit to holmanb/cloud-init that referenced this pull request Aug 2, 2024
…l#5209)

Avoid using warning level messages as there may be some
use-cases in the wild that need to invoke cloud-init boot
stages after boot for some reason unknown to upstream.

Provide a detailed warning message informing admins to file
issues against cloud-init to better represent those feature
needs before dropping this feature altogether.
holmanb added a commit that referenced this pull request Aug 6, 2024
On systemd, services are started by PID 1. When this doesn't happen, cloud-init
is in an unknown run state and should warn the user.

Reorder pid log to be able to reuse Distro information.

Add docstring deprecating util.is_Linux().
holmanb added a commit that referenced this pull request Aug 6, 2024
Implement verify_clean_boot() to ignore certain expected logs
in a platform-specific way.
holmanb pushed a commit that referenced this pull request Aug 6, 2024
…5209)

Ensure ignore_warnings=True or ignore_errors=True is honored and
not overridden by supplemental warning texts appended.
holmanb pushed a commit that referenced this pull request Aug 6, 2024
Avoid using warning level messages as there may be some
use-cases in the wild that need to invoke cloud-init boot
stages after boot for some reason unknown to upstream.

Provide a detailed warning message informing admins to file
issues against cloud-init to better represent those feature
needs before dropping this feature altogether.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants