Add cloud-init degraded state and machine readable degraded errors to cloud-init status #4500

Merged: 6 commits merged on Oct 31, 2023

Member

@holmanb holmanb commented Oct 9, 2023

(Rebase merge)

Proposed Commit Message

See individual commit messages for details.

Additional Context

Design spec

TODO

  • Docs to come in a followup PR, based on the commit message of bbf644d.

@holmanb holmanb force-pushed the holmanb/machine-readable-output-b2 branch 4 times, most recently from 2711654 to bafd9d0 Compare October 10, 2023 01:12
@TheRealFalcon
Member

Don't forget about cloud-init status --wait. If I have an error in my cloud-config, it currently returns status: degraded running, but does not wait.

@TheRealFalcon
Member

What do you think about presenting more information in the cloud-init status --long case? It's using the --long flag, so I'd expect "long" output.

detail:
Degraded functionality in stage [ init init-local modules-config modules-final]

doesn't really tell me anything useful except that something went wrong in each of the stages. At this point I'd have to either jump into the logs or know to run --format json to get all of the errors in json format. Can we present the errors here instead?

@holmanb
Member Author

holmanb commented Oct 10, 2023

What do you think about presenting more information in the cloud-init status --long case? It's using the --long flag, so I'd expect "long" output.

Sure, I can add that.

detail:
Degraded functionality in stage [ init init-local modules-config modules-final]

Just noticed this was buggy and was printing all stages if there was an error in any stage. I'll fix that.
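
The fix described above amounts to listing only the stages that actually recorded something. A minimal sketch of that filtering, assuming each stage is a dict carrying `errors` and `recoverable_errors` keys as in the JSON output shown elsewhere in this thread (the function name `degraded_stages` is hypothetical, not taken from the cloud-init source):

```python
# Stage names in run order, as listed in the commit message's Appendix C.
STAGES = ("init-local", "init", "modules-config", "modules-final")


def degraded_stages(status: dict) -> list:
    """Return only the stages that recorded errors or recoverable errors."""
    found = []
    for name in STAGES:
        # A stage that has not completed may be missing from the output.
        stage = status.get(name) or {}
        if stage.get("errors") or stage.get("recoverable_errors"):
            found.append(name)
    return found
```

With this shape, a warning in `init` alone would yield `["init"]` rather than all four stage names.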

Can we present the errors here instead?

I can add this to --long in addition to the machine-readable output, but I don't want to add it to --long instead of the machine-readable output.

@TheRealFalcon TheRealFalcon self-assigned this Oct 11, 2023
@TheRealFalcon
Member

--long errors look good now!

I'm not sure I understand the structure of --json output:

{
  "_schema_version": "1",
  "boot_status_code": "enabled-by-generator",
  "datasource": "lxd",
  "detail": "DataSourceLXD",
  "errors": [],
  "last_update": "Wed, 11 Oct 2023 20:32:01 +0000",
  "recoverable_errors": {},
  "schemas": {
    "1": {
      "boot_status_code": "enabled-by-generator",
      "datasource": "lxd",
      "detail": "DataSourceLXD",
      "errors": [],
      "last_update": "Wed, 11 Oct 2023 20:32:01 +0000",
      "recoverable_errors": {},
      "status": "done",
      "v1": {
        "datasource": "DataSourceLXD",
        "init": {
          "errors": [],
          "exported_errors": {
            "WARNING": [
              "Failed at merging in cloud config part from part-001: empty cloud config"
            ]
          },
          "finished": 1697056316.532762,
          "start": 1697056316.0571733
        },
        "init-local": {
          "errors": [],
          "exported_errors": {},
          "finished": 1697056315.0196795,
          "start": 1697056314.8316493
        },
        "modules-config": {
          "errors": [],
          "exported_errors": {},
          "finished": 1697056320.82747,
          "start": 1697056320.6863594
        },
        "modules-final": {
          "errors": [],
          "exported_errors": {},
          "finished": 1697056321.1340263,
          "start": 1697056321.057343
        },
        "stage": null
      }
    }
  },
  "status": "done",
  "v1": {
    "datasource": "DataSourceLXD",
    "init": {
      "errors": [],
      "exported_errors": {
        "WARNING": [
          "Failed at merging in cloud config part from part-001: empty cloud config"
        ]
      },
      "finished": 1697056316.532762,
      "start": 1697056316.0571733
    },
    "init-local": {
      "errors": [],
      "exported_errors": {},
      "finished": 1697056315.0196795,
      "start": 1697056314.8316493
    },
    "modules-config": {
      "errors": [],
      "exported_errors": {},
      "finished": 1697056320.82747,
      "start": 1697056320.6863594
    },
    "modules-final": {
      "errors": [],
      "exported_errors": {},
      "finished": 1697056321.1340263,
      "start": 1697056321.057343
    },
    "stage": null
  }
}

Why is there a v1 that lives under schemas["1"]?

Also, this is probably more relevant to one of the previous PRs, but in playing around with how different types of errors are handled, I noticed that not all warnings are being handled. The easiest way to demonstrate is to add a LOG.warning("This isn't showing up...") as the first line of the handle function of cc_bootcmd.py. For some reason this doesn't get reported by cloud-init status --long, though replacing it with a LOG.error does. I think we'll need this fixed in a follow-up PR along with an integration test showing that all errors/warnings from the logs make it into the status json.

@holmanb
Member Author

holmanb commented Oct 11, 2023

Why is there a v1 that lives under schemas["1"]?

v1 groups together the stages and their error/start/end/recoverable_errors content

As for why it is under schemas["1"], I'm really not sure. Everything under schemas["1"] is a copy of the top level keys. I don't know why that approach was chosen initially, but at this point I would rather stay consistent than diverge without reason[1].

Also, this is probably more relevant to one of the previous PRs, but in playing around with how different types of errors are handled, I noticed that not all warnings are being handled. The easiest way to demonstrate is to add a LOG.warning("This isn't showing up...") as the first line of the handle function of cc_bootcmd.py. For some reason this doesn't get reported by cloud-init status --long, though replacing it with a LOG.error does. I think we'll need this fixed in a follow-up PR along with an integration test showing that all errors/warnings from the logs make it into the status json.

Yikes. I'll take a look, thanks for the heads up. That will have to be a follow-up PR.

[1] Aside: Personally I think we could blow away this whole versioning-in-json thing we've got going on in status.json, cloud-init status --format json, etc. (if not for backwards compatibility concerns) and have fewer nested keys and less noise to deal with (when was the last time we removed a v1 of... well anything really?).

@TheRealFalcon
Member

v1 groups together the stages and their error/start/end/recoverable_errors content

But presumably schemas["1"] is there to indicate this is v1, so it's confusing to have an extra layer of versioning.

Per your aside, I was going to comment something similar but thought it not relevant enough to this PR. I have a few thoughts:
If we do need to version things, then we shouldn't have it duplicated at the top-level, and we shouldn't have multiple layers. E.g., rather than (truncated)

{
  "_schema_version": "1",
  "datasource": "lxd",
  "errors": [],
  "schemas": {
    "1": {
      "datasource": "lxd",
      "errors": []
    }
  }
}

it would instead be

{
  "v1": {
    "datasource": "lxd",
    "errors": []
  }
}

If we don't want to version things, and then later need to introduce a breaking change, we can always version at that time, though it adds some ugliness. E.g.:

{
  "datasource": "lxd",
  "errors": [],
  "v2": {
    "datasource": "lxd",
    "errors": {...}
  }
}

The current approach doesn't really buy us anything and just makes the output more cumbersome to work with. I'm fine either way, but I agree that we're unlikely to change the key types and can instead always add more keys. The problem is this has been in the wild for a few releases now, so it'd be a breaking change to change it now. It's probably worth doing, but if we do go ahead with it, we should also look at changing cloud-init query as that output has gotten unwieldy there too.

@cjp256
Contributor

cjp256 commented Oct 18, 2023

Awesome! Access to the spec is restricted.

@holmanb holmanb force-pushed the holmanb/machine-readable-output-b2 branch 8 times, most recently from cd4bd0f to f415f23 Compare October 26, 2023 19:13
@holmanb
Member Author

holmanb commented Oct 26, 2023

If we do need to version things, then we shouldn't have it duplicated at the top-level, and we shouldn't have multiple layers. E.g., rather than (truncated)

Thanks for the discussion on this @TheRealFalcon.

Per our out-of-band discussion, I've gotten rid of the versioned "1": {} output in 61647e4. Ubuntu releases will need that change patched out to avoid the breaking change.

I also dropped the v1 key from status.json and opted to include the contents of that key instead. There was a collision with the datasource key, but the code prunes that before merging, so we have a much cleaner output now:

# cloud-init status --format json
{
  "boot_status_code": "enabled-by-generator",
  "datasource": "nocloud",
  "detail": "DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net]",
  "errors": [],
  "extended_status": "degraded done",
  "init": {
    "errors": [],
    "finished": 1698279442.4062886,
    "recoverable_errors": {
      "WARNING": [
        "Failed at merging in cloud config part from part-001: empty cloud config"
      ]
    },
    "start": 1698279441.3664
  },
  "init-local": {
    "errors": [],
    "finished": 1698279439.4033117,
    "recoverable_errors": {},
    "start": 1698279438.9879673
  },
  "last_update": "Thu, 26 Oct 2023 00:17:31 +0000",
  "modules-config": {
    "errors": [],
    "finished": 1698279450.2446203,
    "recoverable_errors": {
      "WARNING": [
        "No template found in /etc/cloud/templates for template named sources.list.ubuntu.deb822",
        "No template found in /etc/cloud/templates for template named sources.list",
        "No template found, not rendering /etc/apt/sources.list.d/ubuntu.sources"
      ]
    },
    "start": 1698279449.9259806
  },
  "modules-final": {
    "errors": [],
    "finished": 1698279451.0209844,
    "recoverable_errors": {},
    "start": 1698279450.8273187
  },
  "recoverable_errors": {
    "WARNING": [
      "Failed at merging in cloud config part from part-001: empty cloud config",
      "No template found in /etc/cloud/templates for template named sources.list.ubuntu.deb822",
      "No template found in /etc/cloud/templates for template named sources.list",
      "No template found, not rendering /etc/apt/sources.list.d/ubuntu.sources"
    ]
  },
  "stage": null,
  "status": "done"
}
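
The flattening described above (drop the `v1` wrapper, lift its contents to the top level, prune the colliding `datasource` key first) can be sketched roughly as follows. This is an illustration only, not the code from the PR: the name `flatten_status` is hypothetical, and the choice to keep the top-level `datasource` value and discard the one from `v1` is an assumption inferred from the output above, where `"datasource": "nocloud"` survives.

```python
def flatten_status(top_level: dict, v1: dict) -> dict:
    """Merge the contents of the old "v1" key into the top-level keys.

    Assumption: on a key collision for "datasource", the top-level value
    wins and the copy from "v1" is pruned before merging.
    """
    merged = dict(top_level)  # copy so the caller's dict is not mutated
    for key, value in v1.items():
        if key == "datasource":
            continue  # prune the colliding key
        merged[key] = value
    return merged
```

Any other key that appeared in both dicts would be overwritten by the `v1` value in this sketch, so it relies on `datasource` being the only collision.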

@holmanb
Member Author

holmanb commented Oct 26, 2023

Awesome! Access to the spec is restricted.

@cjp256: I've moved the content of the spec to discourse:

https://discourse.ubuntu.com/t/spec-improve-error-and-warning-visibility/39765

Member

@TheRealFalcon TheRealFalcon left a comment

Looks really good now, but I have a few more comments.

It looks like the --long output isn't distinguishing between recoverable errors and hard errors. E.g., when I provide an invalid network-config:

root@me:~# cloud-init status --long
status: error
extended_status: error
boot_status_code: enabled-by-generator
last_update: Fri, 27 Oct 2023 15:19:35 +0000
detail:
DataSourceLXD
recoverable_errors:
WARNING:
	- Failed to rename devices: Failed to apply network config names: Unknown network config version: None
	- failed stage init-local

The network config error appears as a recoverable error, but it appears as both in --format json.

root@me:~# cloud-init status --format json
{
  "boot_status_code": "enabled-by-generator",
  "datasource": "lxd",
  "detail": "DataSourceLXD",
  "errors": [
    "Unknown network config version: None"
  ],
  "extended_status": "error",
  "init": {
    "errors": [],
    "finished": 1698419940.0766742,
    "recoverable_errors": {
      "WARNING": [
        "Failed to rename devices: Failed to apply network config names: Unknown network config version: None"
      ]
    },
    "start": 1698419939.4522545
  },
  "init-local": {
    "errors": [
      "Unknown network config version: None"
    ],
    "finished": 1698419819.0041544,
    "recoverable_errors": {
      "WARNING": [
        "failed stage init-local"
      ]
    },
    "start": 1698419818.9596815
  },
  "last_update": "Fri, 27 Oct 2023 15:19:35 +0000",
  "modules-config": {
    "errors": [],
    "finished": 1698419975.6157393,
    "recoverable_errors": {},
    "start": 1698419975.4693048
  },
  "modules-final": {
    "errors": [],
    "finished": 1698419975.9615448,
    "recoverable_errors": {},
    "start": 1698419975.8828495
  },
  "recoverable_errors": {
    "WARNING": [
      "Failed to rename devices: Failed to apply network config names: Unknown network config version: None",
      "failed stage init-local"
    ]
  },
  "stage": null,
  "status": "error"
}

If there's not an easy way to distinguish them, we should at least print the hard errors in the --long output.
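
One way to picture what is being asked for: render both the top-level `errors` list and the `recoverable_errors` mapping from the JSON above in the tabular style. The sketch below is only an illustration of that idea. The function name and the `errors:` section heading are assumptions, not the format the PR ended up with.

```python
def format_long_errors(status: dict) -> str:
    """Render hard errors and recoverable errors as separate sections."""
    lines = []
    errors = status.get("errors", [])
    if errors:
        lines.append("errors:")  # hypothetical heading for hard errors
        lines.extend(f"\t- {message}" for message in errors)
    recoverable = status.get("recoverable_errors", {})
    if recoverable:
        lines.append("recoverable_errors:")
        for level, messages in recoverable.items():
            lines.append(f"{level}:")
            lines.extend(f"\t- {message}" for message in messages)
    return "\n".join(lines)
```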

Also, I think we should add to the spec how we're distinguishing between hard error vs recoverable error. I couldn't tell until I checked the source that it comes down to service failure.

For integration tests, I would like to see an integration test that intentionally generates a few warnings and verifies that they show up in the status output.
For unit tests, I see that test_status.py has been updated to reflect the new output, but we don't have any tests that generate any kind of errors/warnings and ensure the output is correct. I think we should add a test that does this.


if args.format == "tabular":
    prefix = "\n" if args.wait else ""
    print(f"{prefix}status: {details.status.value}")

    # For backwards compability, don't report degraded status here,
Member

Nit:

Suggested change
# For backwards compability, don't report degraded status here,
# For backwards compatibility, don't report degraded status here,

)

status_json = client.execute("cloud-init status --format json").stdout
assert json.loads(status_json)["v1"]["init"]["recoverable_errors"].get(
Member

I don't think this is right anymore. status command doesn't output v1 anymore, correct?

Member Author

Correct, thank you.

@holmanb
Member Author

holmanb commented Oct 27, 2023

@TheRealFalcon Thanks for the review.

The network config error appears as a recoverable error, but it appears as both in --format json.

Hrm, you're right. I guess I can add both to --long

Also, I think we should add to the spec how we're distinguishing between hard error vs recoverable error. I couldn't tell until I checked the source that it comes down to service failure.

Happy to add that to the spec. I'm planning on basing the followup documentation PR on the commit message that implements this feature. Do you think the content there is sufficient regarding the difference between cloud-init failure and recoverable error, or would you prefer to see more/different content?

For integration tests, I would like to see an integration test that intentionally generates a few warnings and verifies that they show up in the status output.
For unit tests, I see that test_status.py has been updated to reflect the new output, but we don't have any tests that generate any kind of errors/warnings and ensure the output is correct. I think we should add a test that does this.

+1, agreed, will do. I didn't want to write new tests until we were happy with how this looked, to avoid unnecessary rework.

@holmanb
Member Author

holmanb commented Oct 27, 2023

Also, I think we should add to the spec how we're distinguishing between hard error vs recoverable error. I couldn't tell until I checked the source that it comes down to service failure.

See the new current state and future state sections under the Implementation section.

@TheRealFalcon
Member

See the new current state and future state sections under the Implementation section.

Those are great, thanks!

Do you think the content there is sufficient regarding the difference between cloud-init failure and recoverable error, or would you prefer to see more/different content?

I don't see what distinguishes a recoverable error from a non-recoverable one. I think if you make the same distinction that you made in the spec, that will work.

@holmanb
Member Author

holmanb commented Oct 30, 2023

I don't see what distinguishes a recoverable error from a non-recoverable one. I think if you make the same distinction that you made in the spec, that will work.

Thanks for the feedback, I'll add that to the commit message when I'm ready to merge.

Also, I think I've addressed the remaining concerns in ab35812. Better unittest and integration test coverage, fixed the spelling nit, and included both errors and recoverable errors in the output of --long.

Ready for re-review.

@holmanb holmanb force-pushed the holmanb/machine-readable-output-b2 branch from ab35812 to 321126a Compare October 30, 2023 16:25
blackboxsw pushed a commit to blackboxsw/cloud-init that referenced this pull request Nov 6, 2023
Summary
=======
This commit updates `cloud-init status` to include:

  1. A new exit code (2)
  2. Additional running states, exported under a new key "extended_status"
  3. External representation of all internal errors:
    - aggregate recoverable errors
    - per-stage recoverable errors
    - per-stage non-recoverable errors (aggregate key already exists)

Current state: recoverable errors vs non-recoverable errors
===========================================================

critical failure
----------------
If cloud-init is unable to complete, the service returns with exit code
1, and error messages are visible in the log files and in output of
`cloud-init status --format json` under the top level 'errors' key.

recoverable failure
-------------------
In the case that cloud-init is able to complete yet something goes awry,
the service returns with exit code 0 and messages are visible in the log
files.

Future state: recoverable errors vs non-recoverable errors
==========================================================

critical failure
----------------
If cloud-init is unable to complete, error messages will now
additionally be visible in output of `cloud-init status --format json`
within the 'errors' key nested under the stage-level keys: 'init-local',
'init', 'modules-config', 'modules-final'.

recoverable failure
-------------------
In the case that cloud-init is able to complete yet something goes awry,
the service will now return with exit code 2, and error messages will be
visible in the output of `cloud-init status --format json` under the top
level 'recoverable_errors' key as well as within the
'recoverable_errors' key nested under the stage-level keys:
'init-local', 'init', 'modules-config', 'modules-final'.

Implementation
==============

Cloud-init error codes
----------------------
 0 - success
 1 - unrecoverable error
 2 - recoverable error (new)

This new exit code indicates recoverable errors. If cloud-init exits
with exit code (2), cloud-init was able to complete gracefully, however
something went wrong and the user should investigate.
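
A caller could branch on these three codes roughly as sketched below. This is an illustration under stated assumptions: the helper names are hypothetical, and it assumes the `cloud-init` binary is on PATH and that `cloud-init status --wait` (mentioned earlier in this thread) reports the codes listed above.

```python
import subprocess

# Meanings taken from the list of exit codes above.
EXIT_MEANINGS = {
    0: "success",
    1: "unrecoverable error",
    2: "recoverable error",
}


def describe_exit_code(code: int) -> str:
    """Map an exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def check_cloud_init() -> str:
    """Run `cloud-init status --wait` and describe how it exited."""
    result = subprocess.run(
        ["cloud-init", "status", "--wait"],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is expected here, not exceptional
    )
    return describe_exit_code(result.returncode)
```

Scripts that previously treated any non-zero exit as fatal may want to handle 2 separately, since cloud-init completed in that case.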

Additional states
-----------------
For backwards compatibility, the output of `cloud-init status` remains
unchanged. A new key 'extended_status' is included in the output:

$ cloud-init status --format json | jq .status
"done"

$ cloud-init status --format json | jq .extended_status
"degraded done"

See Appendix A for list of possible states.
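
Going by the states in Appendix A, the extended value looks like the plain status with an optional "degraded " prefix. A small sketch of how a consumer might split the two, assuming that prefix convention holds (the function name is hypothetical):

```python
def split_extended_status(extended_status: str) -> tuple:
    """Split an extended status into (is_degraded, base_status)."""
    prefix = "degraded "
    if extended_status.startswith(prefix):
        return True, extended_status[len(prefix):]
    return False, extended_status
```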

Exported errors: Aggregated errors
----------------------------------
When a recoverable error occurs, the internal cloud-init state
information is made visible under a top level aggregate key
'recoverable_errors' with errors sorted by error level:

$ cloud-init status --format json | jq .recoverable_errors
{
  "WARNING": [
    "Failed at merging in cloud config part from part-001: empty cloud config",
    "No template found in /etc/cloud/templates for template named sources.list.ubuntu.deb822",
    "No template found in /etc/cloud/templates for template named sources.list",
    "No template found, not rendering /etc/apt/sources.list.d/ubuntu.sources"
  ]
}

See Appendix B for list of possible error levels.
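
Conceptually, the aggregate key is the union of the per-stage `recoverable_errors` mappings, grouped by level. The sketch below shows that idea only. It is not the cloud-init implementation, the function name is hypothetical, and it walks stages in run order as an assumption, so the message order within a level may differ from real output.

```python
# Stage names in run order, as listed in Appendix C.
STAGES = ("init-local", "init", "modules-config", "modules-final")


def aggregate_recoverable_errors(status: dict) -> dict:
    """Combine per-stage recoverable errors into one mapping by level."""
    aggregate = {}
    for name in STAGES:
        stage = status.get(name) or {}
        for level, messages in stage.get("recoverable_errors", {}).items():
            # Append to the list for this level, creating it on first use.
            aggregate.setdefault(level, []).extend(messages)
    return aggregate
```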

Exported errors: Per-stage errors
---------------------------------
The keys 'errors' and 'recoverable_errors' are also exported for each
stage to allow attribution of recoverable and non-recoverable errors
to their source.

$ cloud-init status --format json | jq .init.recoverable_errors
{
  "WARNING": [
    "Failed at merging in cloud config part from part-001: empty cloud config"
  ]
}

Note: Only cloud-init stages which have completed are listed in the
output of `cloud-init status --format json`.

See Appendix C for list of possible cloud-init stages.
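
To attribute messages to their source stage, a consumer can walk the stage keys and skip any stage that is absent because it has not completed. A minimal sketch, assuming the JSON layout shown above; the function name and the tuple shape are hypothetical:

```python
# Stage names in run order, as listed in Appendix C.
STAGES = ("init-local", "init", "modules-config", "modules-final")


def attribute_errors(status: dict) -> list:
    """Return (stage, level, message) rows for every reported error.

    The level is None for non-recoverable errors, which are not grouped
    by level in the output.
    """
    rows = []
    for name in STAGES:
        stage = status.get(name)
        if not isinstance(stage, dict):
            continue  # stage not listed, e.g. it has not completed yet
        for message in stage.get("errors", []):
            rows.append((name, None, message))
        for level, messages in stage.get("recoverable_errors", {}).items():
            for message in messages:
                rows.append((name, level, message))
    return rows
```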

Limitations of internal errors
==============================
- Exported recoverable errors represent logged messages, which are not
  guaranteed to be stable between releases. The contents of the
  'errors' and 'recoverable_errors' keys are not guaranteed to have
  stable output!
- Exported errors and recoverable errors may occur at different stages
  since users may reorder configuration modules to run at different
  stages via cloud.cfg.

Appendices
==========

Appendix A: Extended states
---------------------------
  "not running"
  "running"
  "done"
  "error"
  "degraded done"
  "degraded running"
  "disabled"

Appendix B: Error levels
------------------------
Reported recoverable error messages are grouped by the level at which
they are logged. Complete list of levels:

  WARNING
  DEPRECATED
  ERROR
  CRITICAL
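
Since the exported messages are log messages grouped by level, the general mechanism can be pictured as a logging handler that records everything at WARNING and above. The sketch below uses only the standard `logging` module and is not how cloud-init implements it. The class name is hypothetical, and DEPRECATED is not a standard-library level, so this sketch does not register it.

```python
import logging
from collections import defaultdict


class RecoverableErrorCollector(logging.Handler):
    """Collect log messages at WARNING and above, grouped by level name."""

    def __init__(self) -> None:
        # Records below WARNING are never passed to this handler.
        super().__init__(level=logging.WARNING)
        self.by_level = defaultdict(list)

    def emit(self, record: logging.LogRecord) -> None:
        # getMessage() applies any %-style arguments to the message.
        self.by_level[record.levelname].append(record.getMessage())
```

Attaching such a handler to a logger and later serializing `by_level` would give a mapping of the same shape as 'recoverable_errors'.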

Appendix C: Stages of cloud-init
--------------------------------
The json representation of cloud-init stages (in run order) is:

  "init-local"
  "init"
  "modules-config"
  "modules-final"

This commit implements design specification US057[1].

[1] https://discourse.ubuntu.com/t/spec-improve-error-and-warning-visibility/39765
blackboxsw pushed a commit to blackboxsw/cloud-init that referenced this pull request Nov 6, 2023
holmanb added a commit that referenced this pull request Nov 14, 2023
Commit f780cf9 removed the modules-init key from status.json v1 key.
Don't use it as example test data.
holmanb added a commit that referenced this pull request Nov 14, 2023
This key had more meaning to a developer than to a user. Replace with
"recoverable_errors", and align internal variable names with external
user UI for code legibility.
holmanb added a commit that referenced this pull request Nov 14, 2023
If different meaning for duplicate keys is required, then a v2 can be
added. Drop versioning scheme and duplicate keys to reduce unnecessary
verbosity.

BREAKING CHANGE: cloud-init status --json output
holmanb added a commit that referenced this pull request Nov 14, 2023
The detail in this key is duplicate, and changing the value of this key
during error condition is neither obvious nor documented. Make this key
behave the same regardless of error condition.

BREAKING CHANGE: status.json
holmanb added a commit that referenced this pull request Nov 14, 2023
holmanb added a commit that referenced this pull request Nov 14, 2023
holmanb added a commit that referenced this pull request Nov 14, 2023
Commit f780cf9 removed the modules-init key from status.json v1 key.
Don't use it as example test data.
holmanb added a commit that referenced this pull request Nov 14, 2023
This key had more meaning to a developer than to a user. Replace with
"recoverable_errors", and align internal variable names with external
user UI for code legibility.
holmanb added a commit that referenced this pull request Nov 14, 2023
If different meaning for duplicate keys is required, then a v2 can be
added. Drop versioning scheme and duplicate keys to reduce unnecessary
verbosity.

BREAKING CHANGE: cloud-init status --json output
holmanb added a commit that referenced this pull request Nov 14, 2023
The detail in this key is duplicate, and changing the value of this key
during error condition is neither obvious nor documented. Make this key
behave the same regardless of error condition.

BREAKING CHANGE: status.json
holmanb added a commit that referenced this pull request Nov 14, 2023
Summary
=======
This commit `cloud-init status` to include:

  1. A new exit code (2)
  2. Additional running states, exported under a new key "extended_status"
  3. External representation of all internal errors:
    - aggregate recoverable errors
    - per-stage recoverable errors
    - per-stage non-recoverable errors (aggregate key already exists)

Current state: recoverable errors vs non-recoverable errors
===========================================================

critical failure
----------------
If cloud-init is unable to complete, the service returns with exit code
1, and error messages are visible in the log files and in output of
`cloud-init status --format json` under the top level 'error' key.

recoverable failure
-------------------
In the case that cloud-init is able to complete yet something goes awry,
the service returns with exit code 0 and messages are visible in the log
files.

Future state: recoverable errors vs non-recoverable errors
==========================================================

critical failure
----------------
If cloud-init is unable to complete, error messages will now
additionally be visible in output of `cloud-init status --format json`
within the 'error' key nested under the module-level keys: 'init-local',
'init', 'modules-config', 'modules-final'.

recoverable failure
-------------------
In the case that cloud-init is able to complete yet something goes awry,
the service will now return with exit code 2, and error messages will be
visible in the output of `cloud-init status --format` json under the top
level 'recoverable_errors' key as well as within the 'error' key nested
under the module-level keys: 'init-local', 'init', 'modules-config',
'modules-final'.

Implementation
==============

Cloud-init error codes
----------------------
 0 - success
 1 - unrecoverable error
 2 - recoverable error (new)

This new exit code indicates recoverable errors. If cloud-init exits
with exit code (2), cloud-init was able to complete gracefully, however
something went wrong and the user should investigate.

Additional states
-----------------
For backwards compatibility, the existing 'status' key keeps its current
values. A new key, 'extended_status', is added to the output:

$ cloud-init status --format json | jq .status
"done"

$ cloud-init status --format json | jq .extended_status
"degraded done"

See Appendix A for the list of possible states.
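
As a rough sketch of how a consumer might use the new key, the snippet
below prefers 'extended_status' and falls back to 'status' when the new
key is absent. The function names and the "unknown" placeholder are
assumptions made for this example, not cloud-init behavior.

```python
import json


def effective_status(status_doc: dict) -> str:
    """Prefer 'extended_status'; fall back to 'status' if it is missing."""
    # "unknown" is a placeholder for this sketch, not a cloud-init state.
    return status_doc.get(
        "extended_status", status_doc.get("status", "unknown")
    )


def is_degraded(status_doc: dict) -> bool:
    """True for the "degraded done" and "degraded running" states."""
    return effective_status(status_doc).startswith("degraded")


# Sample document mirroring the two jq queries above.
sample = json.loads('{"status": "done", "extended_status": "degraded done"}')
print(effective_status(sample), is_degraded(sample))
```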

Exported errors: Aggregated errors
----------------------------------
When a recoverable error occurs, the internal cloud-init state
information is made visible under a top level aggregate key
'recoverable_errors' with errors grouped by error level:

$ cloud-init status --format json | jq .recoverable_errors
{
  "WARNING": [
    "Failed at merging in cloud config part from part-001: empty cloud config",
    "No template found in /etc/cloud/templates for template named sources.list.ubuntu.deb822",
    "No template found in /etc/cloud/templates for template named sources.list",
    "No template found, not rendering /etc/apt/sources.list.d/ubuntu.sources"
  ]
}

See Appendix B for the list of possible error levels.
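
Because messages are grouped under their level, a consumer can flatten
or count them without inspecting the message text. This is a minimal
sketch that assumes the shape shown in the example above (a mapping of
level name to a list of strings); the helper names are hypothetical.

```python
from collections import Counter


def flatten_recoverable_errors(status_doc: dict) -> list:
    """Return (level, message) pairs from the aggregate key."""
    grouped = status_doc.get("recoverable_errors", {})
    return [(level, msg) for level, msgs in grouped.items() for msg in msgs]


def count_by_level(status_doc: dict) -> Counter:
    """Count recoverable error messages per level."""
    return Counter(
        level for level, _ in flatten_recoverable_errors(status_doc)
    )


# Sample data reusing two messages from the example output above.
sample = {
    "recoverable_errors": {
        "WARNING": [
            "Failed at merging in cloud config part from part-001: "
            "empty cloud config",
            "No template found in /etc/cloud/templates for template "
            "named sources.list",
        ]
    }
}
print(count_by_level(sample))
```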

Exported errors: Per-stage errors
---------------------------------
The keys 'errors' and 'recoverable_errors' are also exported for each
stage to allow attribution of recoverable and non-recoverable errors
to their source.

$ cloud-init status --format json | jq .init.recoverable_errors
{
  "WARNING": [
    "Failed at merging in cloud config part from part-001: empty cloud config"
  ]
}

Note: Only cloud-init stages which have completed are listed in the
output of `cloud-init status --format json`.

See Appendix C for the list of possible cloud-init stages.
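
The per-stage keys can be walked in run order to attribute errors to a
stage. The sketch below assumes that each stage entry is a mapping, that
'errors' is a list of strings, and that 'recoverable_errors' has the
level-to-messages shape shown above; stages missing from the output are
skipped. `errors_by_stage` is a hypothetical helper name.

```python
# Stage keys in run order, as listed in Appendix C.
STAGES = ("init-local", "init", "modules-config", "modules-final")


def errors_by_stage(status_doc: dict) -> dict:
    """Collect 'errors' and 'recoverable_errors' for each listed stage."""
    report = {}
    for stage in STAGES:
        stage_doc = status_doc.get(stage)
        if stage_doc is None:
            # Stage is not listed in the output, so there is nothing
            # to attribute to it.
            continue
        report[stage] = {
            "errors": list(stage_doc.get("errors", [])),
            "recoverable_errors": dict(
                stage_doc.get("recoverable_errors", {})
            ),
        }
    return report


# Sample data mirroring the jq query above.
sample = {
    "init": {
        "errors": [],
        "recoverable_errors": {
            "WARNING": [
                "Failed at merging in cloud config part from part-001: "
                "empty cloud config"
            ]
        },
    }
}
print(errors_by_stage(sample))
```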

Limitations of internal errors
==============================
- Exported recoverable errors represent logged messages, which are not
  guaranteed to be stable between releases. The contents of the
  'errors' and 'recoverable_errors' keys are not guaranteed to have
  stable output!
- Exported errors and recoverable errors may occur at different stages
  since users may reorder configuration modules to run at different
  stages via cloud.cfg.

Appendices
==========

Appendix A: Extended states
---------------------------
  "not running"
  "running"
  "done"
  "error"
  "degraded done"
  "degraded running"
  "disabled"

Appendix B: Error levels
------------------------
Reported recoverable error messages are grouped by the level at which
they are logged. Complete list of levels:

  WARNING
  DEPRECATED
  ERROR
  CRITICAL

Appendix C: Stages of cloud-init
--------------------------------
The JSON representation of cloud-init stages (in run order) is:

  "init-local"
  "init"
  "modules-config"
  "modules-final"

This commit implements design specification US057[1].

[1] https://discourse.ubuntu.com/t/spec-improve-error-and-warning-visibility/39765
holmanb added a commit that referenced this pull request Nov 14, 2023
holmanb added a commit that referenced this pull request Nov 14, 2023
Commit f780cf9 removed the modules-init key from status.json v1 key.
Don't use it as example test data.
holmanb added a commit that referenced this pull request Nov 14, 2023
This key had more meaning to a developer than to a user. Replace with
"recoverable_errors", and align internal variable names with external
user UI for code legibility.
holmanb added a commit that referenced this pull request Nov 14, 2023
If different meaning for duplicate keys is required, then a v2 can be
added. Drop versioning scheme and duplicate keys to reduce unnecessary
verbosity.

BREAKING CHANGE: cloud-init status --json output
holmanb added a commit that referenced this pull request Nov 14, 2023
The detail in this key is duplicate, and changing the value of this key
during error condition is neither obvious nor documented. Make this key
behave the same regardless of error condition.

BREAKING CHANGE: status.json
holmanb added a commit that referenced this pull request Nov 14, 2023
holmanb added a commit to holmanb/cloud-init that referenced this pull request Dec 5, 2023
- New page and content describing debugging for users
- New page and content documenting cloud-init's status
- New page and content documenting cloud-init's exported errors
- New page and content documenting cloud-init's failure states
- New page and content documenting how to re-run cloud-init
- New content documenting how to validate user-data
- New content documenting how to use cloud-init with libvirt

Documents canonicalGH-4500
Fixes canonicalGH-4608
holmanb added a commit that referenced this pull request Dec 6, 2023
- New page and content describing debugging for users
- New page and content documenting cloud-init's status
- New page and content documenting cloud-init's exported errors
- New page and content documenting cloud-init's failure states
- New page and content documenting how to re-run cloud-init
- New content documenting how to validate user-data
- New content documenting how to use cloud-init with libvirt

Documents GH-4500
Fixes GH-4608