health-server: Do not cleanup health checking result on node updates. #30917
/cc @christarazi
/LGTM but can a test case be added to TestProbersetNodes()
that exercises the code changes?
b53d562 to 0212cf7
Good idea, @danehans. I've added a test. I was already thinking about refactoring that part and totally forgot to add the test.
/test
Good catch, this code is so subtle. Do we need better testing / more coverage in this code? We needed 3 separate PRs just to get here and who knows if there will need to be more :)
In theory, this code should be simple. I think we need a better design; I will follow up with an issue and do it as my side project.
0212cf7 to ad51c2b
Rebased on the main as
/test
Whenever a node was updated, health-checking was removing and re-adding that node. This caused it to lose information about previously performed probes, which resulted in `unknown` status for such nodes. This can happen often, especially in ENI mode, where node updates happen each time a new pod is scheduled on the node.

Signed-off-by: Marcel Zieba <[email protected]>
ad51c2b to fa1bccf
@marseel when ^ issue is created, can you link it to this PR?
/test
@marseel thanks for adding the unit test.
Whenever a node was updated, health-checking was removing and re-adding that node. This caused it to lose information about previously performed probes, which resulted in `unknown` status for such nodes. This can happen often, especially in ENI mode, where node updates happen each time a new pod is scheduled on the node.

Backstory:
#29566 fixed an issue where we were reporting deleted nodes as unhealthy, but it introduced an issue where a node update overrode the probe status, marking such nodes as unreachable. The fix in #30504 changed this to mark such nodes as unknown. Now we preserve the probe status on node updates.
Fixes: #29566
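
For readers skimming the thread, here is a minimal sketch of the behavior described above. It is not the code from this PR: the `prober` type, the `probeStatus` values, and the string-slice signature of `setNodes` are all hypothetical simplifications, and it assumes a node update arrives as the same node name in both the removed and added sets.

```go
package main

import "fmt"

// probeStatus is a hypothetical stand-in for a stored probe result.
type probeStatus string

const (
	statusUnknown   probeStatus = "unknown"
	statusReachable probeStatus = "reachable"
)

// prober is a hypothetical, simplified health prober keyed by node name.
type prober struct {
	results map[string]probeStatus
}

func newProber() *prober {
	return &prober{results: make(map[string]probeStatus)}
}

// setNodes applies a node change. Assumption for this sketch: an update is
// modeled as the same name appearing in both removed and added.
func (p *prober) setNodes(added, removed []string) {
	addedSet := make(map[string]struct{}, len(added))
	for _, n := range added {
		addedSet[n] = struct{}{}
	}
	for _, n := range removed {
		if _, updated := addedSet[n]; updated {
			continue // node update: keep the previous probe result
		}
		delete(p.results, n) // real deletion: drop the result
	}
	for _, n := range added {
		if _, known := p.results[n]; !known {
			p.results[n] = statusUnknown // new node: not probed yet
		}
	}
}

func main() {
	p := newProber()
	p.setNodes([]string{"node-a", "node-b"}, nil)
	p.results["node-a"] = statusReachable // pretend a probe succeeded

	p.setNodes([]string{"node-a"}, []string{"node-a"}) // node update
	p.setNodes(nil, []string{"node-b"})                // node deletion

	fmt.Println(p.results["node-a"]) // prints "reachable"
	_, stillTracked := p.results["node-b"]
	fmt.Println(stillTracked) // prints "false"
}
```

In this sketch, the old behavior would correspond to deleting every removed node unconditionally, which resets an updated node back to `unknown` until the next probe completes.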