Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

health-server: Do not cleanup health checking result on node updates. #30917

Merged
merged 1 commit into from
Mar 4, 2024

Conversation

marseel
Copy link
Contributor

@marseel marseel commented Feb 22, 2024

Whenever a node was updated, healtch-checking was removing and re-adding that node. This caused it to lose information about previously performed probes, which resulted in unknown status for such nodes. This can happen often, especially in ENI mode, where node updates happen each time a new pod is scheduled on the node.

Backstory:
#29566 fixed issue where we were reporting deleted nodes as unhealthy, but it introduced an issue when the updating node was overriding probe status, marking such nodes status as unreachable. Fix in #30504 changed it to mark such nodes as unknown. Now we are preserving probe status on node updates.

Fixes: #29566

Fixed issue when updated nodes were being reported with unknown connectivity status in health report

@marseel marseel added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. area/health Relates to the cilium-health component affects/v1.13 This issue affects v1.13 branch affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch labels Feb 22, 2024
@marseel marseel requested a review from a team as a code owner February 22, 2024 18:32
@marseel marseel requested a review from danehans February 22, 2024 18:32
@marseel
Copy link
Contributor Author

marseel commented Feb 22, 2024

/cc @christarazi
as you are probably most familiar with this part

Copy link
Contributor

@danehans danehans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/LGTM but can a test case be added to TestProbersetNodes() that exercises the code changes?

@marseel marseel force-pushed the fix_health_checking_even_more branch from b53d562 to 0212cf7 Compare February 26, 2024 10:12
@marseel
Copy link
Contributor Author

marseel commented Feb 26, 2024

Good idea @danehans I've added a test. I was already thinking about refactoring that part and totally forgot to add the test.
I've also checked that before the change test was failing.

@marseel
Copy link
Contributor Author

marseel commented Feb 26, 2024

/test

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, this code is so subtle. Do we need better testing / more coverage in this code? We needed 3 separate PRs just to get here and who knows if there will need to be more :)

@marseel
Copy link
Contributor Author

marseel commented Feb 28, 2024

In theory, this code should be simple, I think that we need a better design, I will follow up with the issue and do it as my side-project.
There was also one PR with a fix from Tim in this area in between ours PRs, that was fixing two other issues :)

@marseel marseel force-pushed the fix_health_checking_even_more branch from 0212cf7 to ad51c2b Compare March 1, 2024 13:54
@marseel
Copy link
Contributor Author

marseel commented Mar 1, 2024

Rebased on the main as Cilium L4LB XDP test seems to be complaining without apparent reason.

@marseel
Copy link
Contributor Author

marseel commented Mar 1, 2024

/test

Whenever node was updated, healtch-checking was removing and re-adding
that node. This caused it to lose information about previously performed
probes, which resulted in `unknown` status for such nodes. This
can happen often especially in ENI mode, where node updates happen each
time new pod is scheduled on the node.

Signed-off-by: Marcel Zieba <[email protected]>
@marseel marseel force-pushed the fix_health_checking_even_more branch from ad51c2b to fa1bccf Compare March 1, 2024 14:37
@danehans
Copy link
Contributor

danehans commented Mar 1, 2024

I will follow up with the issue and do it as my side-project.

@marseel when ^ issue is created, can you link it to this PR?

@danehans
Copy link
Contributor

danehans commented Mar 1, 2024

/test

Copy link
Contributor

@danehans danehans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marseel thanks for adding the unit test.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 4, 2024
@joestringer joestringer added this pull request to the merge queue Mar 4, 2024
Merged via the queue into cilium:main with commit d6e7c5d Mar 4, 2024
62 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.13 This issue affects v1.13 branch affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch area/health Relates to the cilium-health component area/metrics Impacts statistics / metrics gathering, eg via Prometheus. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants