
Liveness endpoint behaviour when unable to check in with fleet server #1157

Closed
michel-laterman opened this issue Sep 12, 2022 · 8 comments · Fixed by #1285
Labels: 8.6-candidate, enhancement, Team:Elastic-Agent, v8.5.0

Comments

@michel-laterman (Contributor)

As part of the fix for the issue in #1148 made in #1152, the elastic-agent will no longer report a degraded state if the check-in to fleet-server fails.

The liveness endpoint relied on this degraded state reporting to ensure that the agent only returned a 200 status when healthy.
The original changes were added as part of #569.

We need to decide what the liveness endpoint should report and how that interacts with what the agent reports to Fleet.
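
For context, a minimal sketch of the behaviour being discussed: a liveness handler that returns 200 only while the locally tracked state is healthy, and a non-200 status otherwise. None of these names (statusReporter, State, the /liveness path, the port) come from the elastic-agent codebase; they are placeholders.

```go
// Hypothetical sketch only, not the elastic-agent implementation.
package main

import "net/http"

type State int

const (
	Healthy State = iota
	Degraded
	Failed
)

type statusReporter struct {
	state func() State // supplies the agent's current local state
}

func (s *statusReporter) liveness(w http.ResponseWriter, _ *http.Request) {
	if s.state() == Healthy {
		w.WriteHeader(http.StatusOK)
		return
	}
	// Anything other than healthy (e.g. degraded because the Fleet
	// check-in failed) makes the probe fail.
	w.WriteHeader(http.StatusServiceUnavailable)
}

func main() {
	r := &statusReporter{state: func() State { return Healthy }}
	http.HandleFunc("/liveness", r.liveness)
	// Port chosen arbitrarily for this sketch.
	_ = http.ListenAndServe("localhost:6789", nil)
}
```

With the #1152 change, a failed Fleet check-in no longer flips the state to degraded, so a probe like this keeps returning 200 even when the agent cannot reach fleet-server.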

michel-laterman added the enhancement, Team:Elastic-Agent, and 8.6-candidate labels on Sep 12, 2022
@michel-laterman (Contributor, Author)

cc @blakerouse, I believe the original liveness endpoint behaviour made it into v2

cmacknz changed the title from Liveness endpoint behaviour to Liveness endpoint behaviour when unable to check in with fleet server on Sep 12, 2022
cmacknz added the v8.5.0 label on Sep 12, 2022
@cmacknz (Member) commented Sep 12, 2022

I think the issue with the original implementation was the possibility for the agent to check in and report a degraded state because of its inability to check in, which is a bit of a paradoxical situation given that you have to check in successfully to report it.

To me it makes the most sense to align the agent's local liveness endpoint with the equivalent of what Fleet considers the agent's offline state.

We released this functionality in v8.4.0, so we shouldn't break it by no longer reporting when the agent cannot connect to Fleet Server. I think we should address this in v8.5.0 by:

  1. Reporting an error from the agent's liveness endpoint when the agent is offline, using the same threshold and logic that Fleet itself uses to determine when an agent is offline. The Fleet logic for offline is defined here: https://github.com/elastic/kibana/blob/342e7a17839dc78a4b29d7770eaad3138a8bddb8/x-pack/plugins/fleet/common/services/agent_status.ts#L24-L42

  2. Only reporting this offline state via the liveness endpoint, and never reporting it to Fleet. Hopefully we can accomplish this by filtering a new offline state out of the state reporter or something similarly easy. (A rough sketch of both points follows this list.)
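
As a rough sketch of both points (not the actual implementation; the 5-minute threshold is only illustrative, the authoritative values live in the agent_status.ts logic linked above), the agent could track the time of its last successful check-in and have only the local liveness handler consult it:

```go
// Hypothetical sketch only; names and threshold are placeholders.
package main

import (
	"net/http"
	"sync"
	"time"
)

// offlineThreshold is illustrative; mirror whatever Fleet's
// agent_status.ts actually uses.
const offlineThreshold = 5 * time.Minute

type checkinTracker struct {
	mu          sync.Mutex
	lastCheckin time.Time
}

// MarkCheckin records a successful check-in with Fleet Server.
func (c *checkinTracker) MarkCheckin() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastCheckin = time.Now()
}

// Offline reports whether the last successful check-in is older than the
// threshold, mirroring how Fleet decides an agent is offline.
func (c *checkinTracker) Offline() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return time.Since(c.lastCheckin) > offlineThreshold
}

// liveness surfaces the offline state locally only; nothing derived from
// this check would be included in what the agent reports to Fleet.
func (c *checkinTracker) liveness(w http.ResponseWriter, _ *http.Request) {
	if c.Offline() {
		http.Error(w, "agent offline: no recent Fleet check-in", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	tracker := &checkinTracker{lastCheckin: time.Now()}
	http.HandleFunc("/liveness", tracker.liveness)
	// Port chosen arbitrarily for this sketch.
	_ = http.ListenAndServe("localhost:6789", nil)
}
```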

@michel-laterman thoughts on this?

@blakerouse (Contributor)

Yes, I think what we need here are 2 different statuses. The first is a local status: the Elastic Agent's own view of its status on the machine. The second is the status that is reported to Fleet Server.

When elastic-agent status is called or the liveness probe is used, the fact that communication with Fleet Server is failing is important information.
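
A hypothetical sketch of that split, with all type and function names invented here: the local status aggregates every component including Fleet connectivity, while the status reported to Fleet Server filters the check-in component out.

```go
// Hypothetical sketch only; not elastic-agent code.
package main

import "fmt"

// ComponentStatus is a made-up representation of a single unit's health.
type ComponentStatus struct {
	Name  string
	State string // "healthy", "degraded", or "failed"
}

// fleetCheckinComponent is a local-only pseudo-component tracking
// check-in health; the name is invented for this sketch.
const fleetCheckinComponent = "fleet-checkin"

func aggregate(components []ComponentStatus, includeFleetCheckin bool) string {
	worst := "healthy"
	for _, c := range components {
		if c.Name == fleetCheckinComponent && !includeFleetCheckin {
			continue
		}
		if c.State != "healthy" {
			worst = c.State
		}
	}
	return worst
}

// LocalState is what elastic-agent status and the liveness probe would show.
func LocalState(components []ComponentStatus) string {
	return aggregate(components, true)
}

// FleetState is what gets reported to Fleet Server: check-in failures are
// filtered out so they can never mark the agent degraded in Fleet.
func FleetState(components []ComponentStatus) string {
	return aggregate(components, false)
}

func main() {
	components := []ComponentStatus{
		{Name: "filebeat", State: "healthy"},
		{Name: fleetCheckinComponent, State: "degraded"},
	}
	fmt.Println("local:", LocalState(components)) // local: degraded
	fmt.Println("fleet:", FleetState(components)) // fleet: healthy
}
```

In this example the local view (elastic-agent status, liveness probe) shows degraded while Fleet still sees healthy, which is the separation described above.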

@michel-laterman (Contributor, Author)

@cmacknz, that sounds good. It should also appear in elastic-agent status, as that would be an easy indicator for customers.

@jlind23 (Contributor) commented Sep 21, 2022

@michel-laterman would you be able to take this one on your plate as it seems you understand the whole picture here?

@michel-laterman (Contributor, Author) commented Sep 22, 2022

@jlind23 Would this need to be backported to 8.5? The changes we're making for 8.6 mean that a fix for 8.5 would be a completely separate fix.

@jlind23 (Contributor) commented Sep 23, 2022

@michel-laterman as the previous fix landed in 8.5, I think it would be great to have a fix backported to 8.5 too.
@cmacknz thoughts?

@cmacknz (Member) commented Sep 23, 2022

Yes, let's backport to 8.5 so as not to break anything in that release.
