Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix watch crash api unreachable #568

Merged

Conversation

diconico07
Copy link
Contributor

What this PR does / why we need it:
This PR harden the handling of the Watcher mechanisms in order to prevent the controller and the agent to crash when the kubernetes API is unreachable.

Hopefully this will fixes these issues:
Fixes #222
Fixes #557

Special notes for your reviewer:
This PR includes an update of kube-rs and k8s-openapi to be able to use more friendly constructs for backoff mechanisms.

The biggest part change is a refactor of ConfigMap needed to be able to correctly handle Restarted events to even further eliminate agent's restart risks related to watch of resource.

If applicable:

  • this PR has an associated PR with documentation in akri-docs
  • this PR contains unit tests
  • added code adheres to standard Rust formatting (cargo fmt)
  • code builds properly (cargo build)
  • code is free of common mistakes (cargo clippy)
  • all Akri tests succeed (cargo test)
  • inline documentation builds (cargo doc)
  • all commits pass the DCO bot check by being signed off -- see the failing DCO check for instructions on how to retroactively sign commits

@diconico07
Copy link
Contributor Author

/version patch

@github-actions github-actions bot added the version/patch Patch version change is needed label Mar 10, 2023
Copy link
Contributor

@kate-goldenring kate-goldenring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @diconico07

According to Kube-rs documentation, when a watch error occurs (e.g
because of the kube API being currently unavailable), the `Watcher`
still forwards the error, but will retry if the receiver continues
to wait for a new event.

This commit changes how the `Watcher` is used in order to handle those
errors and let the `Watcher` retry. A backoff mechanism is also setup so
we don't end up retrying forever.

Signed-off-by: Nicolas Belouin <[email protected]>
This commit changes the way config are watched and the events handled,
this change have, as main purpose, to ensure the agent won't restart if
it looses the connection to the kube API. Two mechanisms are employed to
do this:
 - Change the Watcher use to make it re-try in case of error (same
   change that was done in 846662e for
   the controller)
 - Handle the `Restarted` event so that in case of de-sync with API we
   are still able to recover without restarting (i.e handle
   creation/modification and deletion of `Configuration resources that
   occured during the connection loss).

The latter required some refactor of the `ConfigMap` to ensure we have
all the required information to do delete operation without having the
related event (after a `Restart` event only the current state of the
resources are available, not their events).

This refactor also fixed a yet undetected issue: We were unable to
handle two `Configuration` resources with the same name even accross
different namespaces.

Signed-off-by: Nicolas Belouin <[email protected]>
@diconico07 diconico07 force-pushed the fix-watch-crash-api-unreachable branch from 38901f9 to ffdfa86 Compare April 5, 2023 06:45
All calls to `handle_config_add` used a newly created `KubeImpl` object
this led to tests using the real Kube client rather than the mocked one.

In order to fix this, use `Arc` (as required by `handle_config_add`) at
creation time and propagate the `Arc` where needed. Also goes from
`impl` to `dyn` wherever needed.

Signed-off-by: Nicolas Belouin <[email protected]>
Signed-off-by: Nicolas Belouin <[email protected]>
@diconico07 diconico07 force-pushed the fix-watch-crash-api-unreachable branch from ffdfa86 to e491948 Compare April 5, 2023 09:41
@yujinkim-msft yujinkim-msft merged commit 3477694 into project-akri:main Apr 5, 2023
@diconico07 diconico07 deleted the fix-watch-crash-api-unreachable branch April 17, 2023 07:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
version/patch Patch version change is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AKRI core components are restarting very often Agent crashes when API server is not available
3 participants