Add support for cloud-init "degraded" state (#4500)
Summary
=======
This commit extends `cloud-init status` to include:

  1. A new exit code (2)
  2. Additional running states, exported under a new key "extended_status"
  3. External representation of all internal errors:
    - aggregate recoverable errors
    - per-stage recoverable errors
    - per-stage non-recoverable errors (aggregate key already exists)

Current state: recoverable errors vs non-recoverable errors
===========================================================

critical failure
----------------
If cloud-init is unable to complete, the service returns with exit code
1, and error messages are visible in the log files and in output of
`cloud-init status --format json` under the top level 'errors' key.

recoverable failure
-------------------
In the case that cloud-init is able to complete yet something goes awry,
the service returns with exit code 0 and messages are visible in the log
files.

Future state: recoverable errors vs non-recoverable errors
==========================================================

critical failure
----------------
If cloud-init is unable to complete, error messages will now
additionally be visible in output of `cloud-init status --format json`
within the 'errors' key nested under the module-level keys: 'init-local',
'init', 'modules-config', 'modules-final'.

recoverable failure
-------------------
In the case that cloud-init is able to complete yet something goes awry,
the service will now return with exit code 2, and error messages will be
visible in the output of `cloud-init status --format json` under the top
level 'recoverable_errors' key as well as within the 'recoverable_errors' key nested
under the module-level keys: 'init-local', 'init', 'modules-config',
'modules-final'.

Implementation
==============

Cloud-init error codes
----------------------
 0 - success
 1 - unrecoverable error
 2 - recoverable error (new)

This new exit code indicates recoverable errors. If cloud-init exits
with code 2, it was able to complete gracefully, but something went
wrong and the user should investigate.
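
A caller such as a provisioning script could branch on these exit codes.
The following is a minimal sketch, not part of this commit: the names
`EXIT_CODE_MEANINGS`, `describe_exit_code` and `check_cloud_init` are
hypothetical, and the only cloud-init behaviour assumed is the exit codes
listed above and the existing `--wait` flag.

```python
import subprocess

# Exit codes as listed in this commit message; the dict name is illustrative.
EXIT_CODE_MEANINGS = {
    0: "success",
    1: "unrecoverable error",
    2: "recoverable error",
}


def describe_exit_code(code: int) -> str:
    """Translate a `cloud-init status` exit code into a short description."""
    return EXIT_CODE_MEANINGS.get(code, f"unexpected exit code {code}")


def check_cloud_init() -> str:
    """Run `cloud-init status --wait` and describe how it exited.

    check=False because exit codes 1 and 2 are expected outcomes here,
    not reasons to raise CalledProcessError.
    """
    result = subprocess.run(
        ["cloud-init", "status", "--wait"],
        capture_output=True,
        text=True,
        check=False,
    )
    return describe_exit_code(result.returncode)
```

Scripts that previously treated any non-zero exit code as fatal may want
to handle 2 separately, since cloud-init did complete in that case.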

Additional states
-----------------
For backwards compatibility, the value of the 'status' key in the output
of `cloud-init status` remains unchanged. A new key 'extended_status' is
included in the output:

$ cloud-init status --format json | jq .status
"done"

$ cloud-init status --format json | jq .extended_status
"degraded done"

See Appendix A for the list of possible states.
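
The relationship between the two keys can be sketched with plain strings.
This is an illustration of the compatibility mapping described above, not
the code from this commit (which uses the `UXAppStatus` enum); the names
`COMPAT_STATUS`, `simplified_status` and `is_degraded` are hypothetical.

```python
# Extended states that collapse to a simpler state in the 'status' key.
# All other states are reported unchanged under both keys.
COMPAT_STATUS = {
    "degraded running": "running",
    "degraded done": "done",
}


def simplified_status(extended_status: str) -> str:
    """Return the backwards-compatible 'status' for an 'extended_status'."""
    return COMPAT_STATUS.get(extended_status, extended_status)


def is_degraded(extended_status: str) -> bool:
    """True if the extended status reports recoverable errors."""
    return extended_status in COMPAT_STATUS
```

For example, `simplified_status("degraded done")` gives `"done"`, which
matches the two `jq` outputs shown above.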

Exported errors: Aggregated errors
----------------------------------
When a recoverable error occurs, the internal cloud-init state
information is made visible under a top level aggregate key
'recoverable_errors' with errors sorted by error level:

$ cloud-init status --format json | jq .recoverable_errors
{
  "WARNING": [
    "Failed at merging in cloud config part from part-001: empty cloud config",
    "No template found in /etc/cloud/templates for template named sources.list.ubuntu.deb822",
    "No template found in /etc/cloud/templates for template named sources.list",
    "No template found, not rendering /etc/apt/sources.list.d/ubuntu.sources"
  ]
}

See Appendix B for the list of possible error levels.
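
The aggregation can be sketched as merging each stage's
'recoverable_errors' dict into one dict keyed by error level. This is a
simplified illustration under the assumption that the input maps stage
names to per-stage dicts; `aggregate_recoverable_errors` is a hypothetical
helper name, not a cloud-init API.

```python
from typing import Any, Dict, List


def aggregate_recoverable_errors(
    stages: Dict[str, Any],
) -> Dict[str, List[str]]:
    """Merge per-stage recoverable errors into one dict keyed by level."""
    aggregated: Dict[str, List[str]] = {}
    # Iterate in sorted key order so the result is deterministic.
    for _name, stage in sorted(stages.items()):
        # Skip entries that are not per-stage dicts (e.g. a plain string).
        if not isinstance(stage, dict):
            continue
        for level, messages in stage.get("recoverable_errors", {}).items():
            # setdefault creates a fresh list, so the input is not mutated.
            aggregated.setdefault(level, []).extend(messages)
    return aggregated
```

Building fresh lists for the aggregate keeps the per-stage data intact, so
both views can be exported side by side.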

Exported errors: Per-stage errors
---------------------------------
The keys 'errors' and 'recoverable_errors' are also exported for each
stage to allow attribution of recoverable and non-recoverable errors
to their source.

$ cloud-init status --format json | jq .init.recoverable_errors
{
  "WARNING": [
    "Failed at merging in cloud config part from part-001: empty cloud config"
  ]
}

Note: Only cloud-init stages which have completed are listed in the
output of `cloud-init status --format json`.

See Appendix C for the list of possible cloud-init stages.
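
A consumer of the JSON output could attribute recoverable errors to their
stage roughly as follows. This is a sketch that assumes the key layout
shown above; `stages_with_recoverable_errors` is a hypothetical helper,
and stages missing from the output are skipped because only completed
stages are listed.

```python
import json
from typing import Dict, List

# Stage keys in run order, as listed in Appendix C.
STAGES = ("init-local", "init", "modules-config", "modules-final")


def stages_with_recoverable_errors(
    status_json: str,
) -> Dict[str, Dict[str, List[str]]]:
    """Map each stage that reported recoverable errors to those errors."""
    status = json.loads(status_json)
    found: Dict[str, Dict[str, List[str]]] = {}
    for stage in STAGES:
        stage_data = status.get(stage)
        # A stage that has not completed is absent from the output.
        if not isinstance(stage_data, dict):
            continue
        errors = stage_data.get("recoverable_errors") or {}
        if errors:
            found[stage] = errors
    return found
```

The input would typically be the text printed by
`cloud-init status --format json`. Because message text is not guaranteed
to be stable between releases, matching on stage and level is likely to be
more robust than matching on message contents.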

Limitations of internal errors
==============================
- Exported recoverable errors represent logged messages, which are not
  guaranteed to be stable between releases. The contents of the
  'errors' and 'recoverable_errors' keys are not guaranteed to have
  stable output!
- Exported errors and recoverable errors may occur at different stages
  since users may reorder configuration modules to run at different
  stages via cloud.cfg.

Appendices
==========

Appendix A: Extended states
---------------------------
  "not running"
  "running"
  "done"
  "error"
  "degraded done"
  "degraded running"
  "disabled"

Appendix B: Error levels
------------------------
Reported recoverable error messages are grouped by the level at which
they are logged. Complete list of levels:

  WARNING
  DEPRECATED
  ERROR
  CRITICAL

Appendix C: Stages of cloud-init
--------------------------------
The JSON representation of cloud-init stages (in run order) is:

  "init-local"
  "init"
  "modules-config"
  "modules-final"

This commit implements design specification US057[1].

[1] https://discourse.ubuntu.com/t/spec-improve-error-and-warning-visibility/39765
holmanb committed Nov 14, 2023
1 parent 3426000 commit 468861a
Showing 4 changed files with 359 additions and 19 deletions.
104 changes: 98 additions & 6 deletions cloudinit/cmd/status.py
@@ -7,11 +7,11 @@
"""Define 'status' utility and handler as part of cloud-init commandline."""

import argparse
import copy
import enum
import json
import os
import sys
from copy import deepcopy
from time import gmtime, sleep, strftime
from typing import Any, Dict, List, NamedTuple, Optional, Tuple, Union

@@ -33,9 +33,24 @@ class UXAppStatus(enum.Enum):
RUNNING = "running"
DONE = "done"
ERROR = "error"
DEGRADED_DONE = "degraded done"
DEGRADED_RUNNING = "degraded running"
DISABLED = "disabled"


# Extend states when degraded
UXAppStatusDegradedMap = {
UXAppStatus.RUNNING: UXAppStatus.DEGRADED_RUNNING,
UXAppStatus.DONE: UXAppStatus.DEGRADED_DONE,
}

# Map extended states back to simplified states
UXAppStatusDegradedMapCompat = {
UXAppStatus.DEGRADED_RUNNING: UXAppStatus.RUNNING,
UXAppStatus.DEGRADED_DONE: UXAppStatus.DONE,
}


@enum.unique
class UXAppBootStatusCode(enum.Enum):
"""Enum representing user-visible cloud-init boot status codes."""
@@ -65,11 +80,14 @@ class StatusDetails(NamedTuple):
boot_status_code: UXAppBootStatusCode
description: str
errors: List[str]
recoverable_errors: Dict[str, List[str]]
last_update: str
datasource: Optional[str]
v1: Dict[str, Dict]


TABULAR_LONG_TMPL = """\
extended_status: {extended_status}
boot_status_code: {boot_code}
{last_update}detail:
{description}"""
@@ -121,7 +139,11 @@ def handle_status_args(name, args) -> int:
paths = read_cfg_paths()
details = get_status_details(paths)
if args.wait:
while details.status in (UXAppStatus.NOT_RUN, UXAppStatus.RUNNING):
while details.status in (
UXAppStatus.NOT_RUN,
UXAppStatus.RUNNING,
UXAppStatus.DEGRADED_RUNNING,
):
if args.format == "tabular":
sys.stdout.write(".")
sys.stdout.flush()
@@ -130,27 +152,64 @@ def handle_status_args(name, args) -> int:
details_dict: Dict[str, Union[None, str, List[str], Dict[str, Any]]] = {
"datasource": details.datasource,
"boot_status_code": details.boot_status_code.value,
"status": details.status.value,
"status": UXAppStatusDegradedMapCompat.get(
details.status, details.status
).value,
"extended_status": details.status.value,
"detail": details.description,
"errors": details.errors,
"recoverable_errors": details.recoverable_errors,
"last_update": details.last_update,
**details.v1,
}

if args.format == "tabular":
prefix = "\n" if args.wait else ""
print(f"{prefix}status: {details.status.value}")

# For backwards compatibility, don't report degraded status here,
# extended_status key reports the complete status (includes degraded)
state = UXAppStatusDegradedMapCompat.get(
details.status, details.status
).value
print(f"{prefix}status: {state}")
if args.long:
if details.last_update:
last_update = f"last_update: {details.last_update}\n"
else:
last_update = ""
print(
TABULAR_LONG_TMPL.format(
extended_status=details.status.value,
prefix=prefix,
boot_code=details.boot_status_code.value,
description=details.description,
last_update=last_update,
)
+ (
"\nerrors:"
+ (
"\n\t- " + "\n\t- ".join(details.errors)
if details.errors
else f" {details.errors}"
)
)
+ (
"\nrecoverable_errors:"
+ (
"\n"
+ "\n".join(
[
f"{k}:\n\t- "
+ "\n\t- ".join(
[i.replace("\n", " ") for i in v]
)
for k, v in details.recoverable_errors.items()
]
)
if details.recoverable_errors
else f" {details.recoverable_errors}"
)
)
)
elif args.format == "json":
print(
@@ -160,7 +219,14 @@ def handle_status_args(name, args) -> int:
)
elif args.format == "yaml":
print(safeyaml.dumps(details_dict))
return 1 if details.status == UXAppStatus.ERROR else 0

# Hard error
if details.status == UXAppStatus.ERROR:
return 1
# Recoverable error
elif details.status in UXAppStatusDegradedMap.values():
return 2
return 0


def get_bootstatus(disable_file, paths) -> Tuple[UXAppBootStatusCode, str]:
@@ -285,6 +351,7 @@ def get_status_details(paths: Optional[Paths] = None) -> StatusDetails:
status = UXAppStatus.RUNNING
status_v1 = load_json(load_file(status_file)).get("v1", {})
latest_event = 0
recoverable_errors = {}
for key, value in sorted(status_v1.items()):
if key == "stage":
if value:
@@ -303,6 +370,18 @@ def get_status_details(paths: Optional[Paths] = None) -> StatusDetails:
errors.extend(value.get("errors", []))
start = value.get("start") or 0
finished = value.get("finished") or 0

# Aggregate recoverable_errors from all stages
current_recoverable_errors = value.get("recoverable_errors", {})
for err_type in current_recoverable_errors.keys():
if err_type not in recoverable_errors:
recoverable_errors[err_type] = deepcopy(
current_recoverable_errors[err_type]
)
else:
recoverable_errors[err_type].extend(
current_recoverable_errors[err_type]
)
if finished == 0 and start != 0:
status = UXAppStatus.RUNNING
event_time = max(start, finished)
@@ -325,8 +404,21 @@ def get_status_details(paths: Optional[Paths] = None) -> StatusDetails:
if latest_event
else ""
)

if recoverable_errors:
status = UXAppStatusDegradedMap.get(status, status)

# this key is a duplicate
status_v1.pop("datasource", None)
return StatusDetails(
status, boot_status_code, description, errors, last_update, datasource
status,
boot_status_code,
description,
errors,
recoverable_errors,
last_update,
datasource,
status_v1,
)


26 changes: 22 additions & 4 deletions tests/integration_tests/cmd/test_status.py
@@ -55,12 +55,30 @@ def test_wait_when_no_datasource(session_cloud: IntegrationCloud, setup_image):
#cloud-config
ca-certs:
remove_defaults: false
invalid_key: true
"""


@pytest.mark.user_data(USER_DATA)
def test_status_json_errors(client):
"""Ensure that deprecated logs end up in the exported errors"""
assert json.loads(
client.execute("cat /run/cloud-init/status.json").stdout
)["v1"]["init"]["recoverable_errors"].get("DEPRECATED")
"""Ensure that deprecated logs end up in the recoverable errors and that
machine readable status contains recoverable errors
"""
status_json = client.execute("cat /run/cloud-init/status.json").stdout
assert json.loads(status_json)["v1"]["init"]["recoverable_errors"].get(
"DEPRECATED"
)

status_json = client.execute("cloud-init status --format json").stdout
assert "Deprecated cloud-config provided:\nca-certs:" in json.loads(
status_json
)["init"]["recoverable_errors"].get("DEPRECATED").pop(0)
assert "Deprecated cloud-config provided:\nca-certs:" in json.loads(
status_json
)["recoverable_errors"].get("DEPRECATED").pop(0)
assert "Invalid cloud-config provided" in json.loads(status_json)["init"][
"recoverable_errors"
].get("WARNING").pop(0)
assert "Invalid cloud-config provided" in json.loads(status_json)[
"recoverable_errors"
].get("WARNING").pop(0)
10 changes: 10 additions & 0 deletions tests/unittests/cmd/test_cloud_id.py
@@ -16,32 +16,40 @@
status.UXAppBootStatusCode.UNKNOWN,
"DataSourceNoCloud somedetail",
[],
{},
"",
"nocloud",
{},
)
STATUS_DETAILS_DISABLED = status.StatusDetails(
status.UXAppStatus.DISABLED,
status.UXAppBootStatusCode.DISABLED_BY_GENERATOR,
"DataSourceNoCloud somedetail",
[],
{},
"",
"",
{},
)
STATUS_DETAILS_NOT_RUN = status.StatusDetails(
status.UXAppStatus.NOT_RUN,
status.UXAppBootStatusCode.UNKNOWN,
"",
[],
{},
"",
"",
{},
)
STATUS_DETAILS_RUNNING = status.StatusDetails(
status.UXAppStatus.RUNNING,
status.UXAppBootStatusCode.UNKNOWN,
"",
[],
{},
"",
"",
{},
)


@@ -50,8 +58,10 @@
status.UXAppBootStatusCode.UNKNOWN,
"",
[],
{},
"",
None,
{},
)

