Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(*)!: surface application errors in status #379

Merged
merged 3 commits into from
Aug 12, 2024

Conversation

brooksmtownsend
Copy link
Member

@brooksmtownsend brooksmtownsend commented Aug 6, 2024

Feature or Problem

This PR adds an additional field to the BackoffWrapper (previously BackoffAwareScaler) that can store a failed message received by wadm in response to a command. Commands in wasmCloud, particularly the ScaleComponentCommand and StartProvider command have corresponding events that indicate whether it successfully scaled/started or it failed. In this PR, I used the same structure that we already used for notifying scalers of events that should come in to also interpret whether or not a newly received event was actually a failure to process the command. In the case where the event was a failure, we now store the status in the BackoffWrapper and report that status for 5 seconds, effectively backing off that scaler from additional reconciliation or status reporting until that status is cleared.

As mentioned in some comments, a followup PR fully fixing #253 should be done as well that exponentially backs this off up to a ceiling duration. In order to keep the discussion focused, I scoped this PR to just the logic that pulls the failed message off of the event.

Requesting the status of an application that, for example, refers to a provider that doesn't exist, will look like this:

{
  "result": "ok",
  "message": "Successfully fetched status for application rust-hello-world",
  "status": {
    "status": {
      "type": "failed",
      "message": "Scaling component on 1 host(s), failed to fetch provider: failed to fetch provider under OCI reference `ghcr.io/wasmcloud/http-server:0.22.1`: failed to fetch OCI path: failed to fetch OCI bytes: Registry error: url https://ghcr.io/v2/wasmcloud/http-server/manifests/0.22.1, envelope: OCI API errors: [OCI API error: manifest unknown]"
    },
    "scalers": [
      {
        "id": "3cc1670aad7ffbdaaa599b1fd2bfbe5877d6df9b57729ee68d89afd8b8a4e2fc",
        "kind": "SpreadScaler",
        "name": "rust_hello_world-http_component",
        "status": {
          "type": "reconciling",
          "message": "Scaling component on 1 host(s)"
        }
      },
      {
        "id": "ea010c3726e248ef3c112bf31f9ed1af75d7127438f43576ff66626890ae74e2",
        "kind": "LinkScaler",
        "name": "rust_hello_world-httpserver -(wasi:http)-> rust_hello_world-http_component",
        "status": {
          "type": "deployed"
        }
      },
      {
        "id": "bd2f7d542c821c5e50fead647f590c86e5c6bf35e75b7cd05fabb40c803e7727",
        "kind": "SpreadScaler",
        "name": "rust_hello_world-httpserver",
        "status": {
          "type": "failed",
          "message": "failed to fetch provider: failed to fetch provider under OCI reference `ghcr.io/wasmcloud/http-server:0.22.1`: failed to fetch OCI path: failed to fetch OCI bytes: Registry error: url https://ghcr.io/v2/wasmcloud/http-server/manifests/0.22.1, envelope: OCI API errors: [OCI API error: manifest unknown]"
        }
      }
    ],
    "version": "",
    "components": []
  }
}

Related Issues

Lays the ground work for #253

Release Information

wadm 0.14.0

Consumer Impact

This didn't require any over-the-wire breaking changes, but operators should ensure that all of their wadm instances are updated to use this version. Running this version and an older version of wadm may result in inconsistent statuses, as the newer wadm scalers will understand backing off status but the older ones will not.

Testing

Unit Test(s)

Acceptance or Integration

Manual Verification

I manually verified this worked for a provider reference, but there's more testing to be done around the general functionality.

@brooksmtownsend brooksmtownsend merged commit 78291c7 into main Aug 12, 2024
4 checks passed
@brooksmtownsend brooksmtownsend deleted the feat/surface-app-errors-in-status branch August 12, 2024 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants