health-server: Do not cleanup health checking result on node updates. #30917

marseel · 2024-02-22T18:32:55Z

Whenever a node was updated, health checking removed and re-added that node. This caused it to lose information about previously performed probes, which resulted in an unknown status for such nodes. This can happen often, especially in ENI mode, where node updates happen each time a new pod is scheduled on the node.

Backstory:
#29566 fixed an issue where we were reporting deleted nodes as unhealthy, but it introduced a regression: a node update overrode the probe status and marked the node as unreachable. The fix in #30504 changed this to mark such nodes as unknown. With this PR, the probe status is preserved across node updates.

Fixes: #29566

Fixed an issue where updated nodes were reported with unknown connectivity status in the health report.
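To make the difference between the two update strategies concrete, here is a minimal Go sketch. It is not the code from this PR: the `prober`, `nodeEntry`, and `ConnectivityState` types and all method names are hypothetical, and the per-IP result map is an assumption made for illustration. It only contrasts "remove and re-add" (results reset to unknown) with an in-place update that keeps results for addresses the node still has.

```go
package main

// ConnectivityState is a hypothetical stand-in for a probe result.
// It is not Cilium's actual type.
type ConnectivityState string

const (
	StateUnknown   ConnectivityState = "unknown"
	StateReachable ConnectivityState = "reachable"
)

// nodeEntry holds the latest probe result per IP address of one node.
type nodeEntry struct {
	results map[string]ConnectivityState
}

// prober is a simplified, hypothetical model of the health prober's node table.
type prober struct {
	nodes map[string]*nodeEntry
}

func newProber() *prober {
	return &prober{nodes: make(map[string]*nodeEntry)}
}

// freshEntry builds an entry in which every IP starts as unknown.
func freshEntry(ips []string) *nodeEntry {
	e := &nodeEntry{results: make(map[string]ConnectivityState, len(ips))}
	for _, ip := range ips {
		e.results[ip] = StateUnknown
	}
	return e
}

// recordProbe stores a probe outcome for an IP that the node currently has.
func (p *prober) recordProbe(node, ip string, s ConnectivityState) {
	if e, ok := p.nodes[node]; ok {
		if _, known := e.results[ip]; known {
			e.results[ip] = s
		}
	}
}

// updateByReadding models the old behavior: the node is removed and re-added,
// so every previously recorded result is dropped.
func (p *prober) updateByReadding(node string, ips []string) {
	delete(p.nodes, node)
	p.nodes[node] = freshEntry(ips)
}

// updatePreserving models the intended behavior: results for IPs that remain
// on the node are carried over, and only new IPs start as unknown.
func (p *prober) updatePreserving(node string, ips []string) {
	next := freshEntry(ips)
	if old, ok := p.nodes[node]; ok {
		for _, ip := range ips {
			if s, had := old.results[ip]; had {
				next.results[ip] = s
			}
		}
	}
	p.nodes[node] = next
}

// status returns the last recorded result, or unknown if none exists.
func (p *prober) status(node, ip string) ConnectivityState {
	if e, ok := p.nodes[node]; ok {
		if s, had := e.results[ip]; had {
			return s
		}
	}
	return StateUnknown
}
```

In this model, a node that was probed as reachable and then receives an update (for example, a new address being added) keeps its reachable result under `updatePreserving`, while `updateByReadding` sends it back to unknown until the next probe completes.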

marseel · 2024-02-22T18:34:22Z

/cc @christarazi
as you are probably most familiar with this part

danehans

/LGTM but can a test case be added to TestProbersetNodes() that exercises the code changes?

marseel · 2024-02-26T10:13:20Z

Good idea @danehans, I've added a test. I was already thinking about refactoring that part and totally forgot to add the test.
I've also checked that the test was failing before the change.

marseel · 2024-02-26T10:31:31Z

/test

christarazi

Good catch, this code is so subtle. Do we need better testing / more coverage in this code? We needed 3 separate PRs just to get here and who knows if there will need to be more :)

marseel · 2024-02-28T08:41:55Z

In theory, this code should be simple. I think that we need a better design. I will follow up with the issue and do it as my side-project.
There was also one PR with a fix from Tim in this area in between our PRs, which fixed two other issues :)

marseel · 2024-03-01T13:55:22Z

Rebased on main, as the Cilium L4LB XDP test seems to be complaining for no apparent reason.

marseel · 2024-03-01T13:55:30Z

/test

Whenever a node was updated, health checking removed and re-added that node. This caused it to lose information about previously performed probes, which resulted in `unknown` status for such nodes. This can happen often, especially in ENI mode, where node updates happen each time a new pod is scheduled on the node.

Signed-off-by: Marcel Zieba <[email protected]>

danehans · 2024-03-01T16:21:12Z

I will follow up with the issue and do it as my side-project.

@marseel when ^ issue is created, can you link it to this PR?

danehans · 2024-03-01T16:21:58Z

/test

danehans

@marseel thanks for adding the unit test.

marseel requested a review from a team as a code owner February 22, 2024 18:32

marseel requested a review from danehans February 22, 2024 18:32

danehans reviewed Feb 23, 2024

marseel force-pushed the fix_health_checking_even_more branch from b53d562 to 0212cf7 Compare February 26, 2024 10:12

christarazi approved these changes Feb 28, 2024

marseel force-pushed the fix_health_checking_even_more branch from 0212cf7 to ad51c2b Compare March 1, 2024 13:54

marseel force-pushed the fix_health_checking_even_more branch from ad51c2b to fa1bccf Compare March 1, 2024 14:37

danehans approved these changes Mar 1, 2024

maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 4, 2024

joestringer added this pull request to the merge queue Mar 4, 2024

Merged via the queue into cilium:main with commit d6e7c5d Mar 4, 2024
62 checks passed

marseel mentioned this pull request Apr 10, 2024

Cilium 1.14.9 failing endpoint connectivity everytime a pod is created #31846

Closed

3 tasks
