Bug 1873955: Add support for stopping and starting keepalived #2085
Conversation
@cybertron: This pull request references Bugzilla bug 1873955, which is invalid.
When we fail repeatedly to update the keepalived configuration, it is possible that an outdated keepalived config itself is the cause of the failure. Since we can't update the keepalived config without the api, the only thing we can do in this scenario is stop keepalived. That should resolve any keepalived-related networking issues, and then we'll be able to update the keepalived config using the local apiserver (in the case that all instances of keepalived go down and we lose the API VIP). This depends on openshift/machine-config-operator#2085 to provide the start and stop functionality on the keepalived side.
Force-pushed 4db580d to 3649d33
/test e2e-metal-ipi

/test e2e-metal-ipi

/bugzilla refresh
@cybertron: This pull request references Bugzilla bug 1873955, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
manifests/baremetal/keepalived.yaml
Outdated
```shell
stop_keepalived()
{
    if pid=$(pgrep -o keepalived); then
        kill "$pid"
```
Two suggestions:
- Be specific about the signal we are sending
- Wait for the graceful termination.
Do you mean add a kill -9 if it doesn't terminate gracefully (which seems like a good thing to do)? Otherwise there isn't much need to wait since we aren't going to do anything else after it exits.
Yeah, I meant wait a reasonable time for SIGTERM to do its thing (probably we can use the same systemd uses) and then SIGKILL it. Otherwise, couldn't we end up with not starting again due to the pgrep we have in start_keepalived?
Yeah, that's possible. I'll make the change.
Okay, this should be done. I set the timeout to 9 seconds since the monitor runs every 10 and I doubt we want to have multiple stop commands going at once.
Force-pushed a7fa81f to 224ad66

Force-pushed 224ad66 to b71a351
/retest

/retest

Do we expect metal tests to pass on this one?
bcrochet left a comment:
/lgtm
/retest
@cybertron: The following test failed, say /retest to rerun all failed tests.
manifests/baremetal/keepalived.yaml
Outdated
```shell
set -ex
declare -r keepalived_sock="/var/run/keepalived/keepalived.sock"
export -f msg_handler
export -f reload_keepalived
export -f stop_keepalived
export -f start_keepalived
if [ -s "/etc/keepalived/keepalived.conf" ]; then
```
Do we want to also cover the following corner case?
1. The monitor container sends a STOP request.
2. The keepalived container stops the keepalived process.
3. After some time, Kubelet restarts the keepalived container for some reason.
I'm inclined to say no. If the whole keepalived container gets restarted, I think we should let it attempt to run normally. If the monitor continues to fail it will stop it again anyway.
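The `export -f` lines in the snippet above suggest the monitor drives keepalived through a socket-backed message handler. A minimal sketch of what such a dispatcher could look like follows; the command names and stub bodies are assumptions for illustration, not the manifest's actual protocol:

```shell
#!/bin/bash
# Hypothetical sketch of the exported message handler. The real manifest
# functions manipulate the keepalived process; these stubs only log.
reload_keepalived() { echo "would reload keepalived"; }
stop_keepalived()   { echo "would stop keepalived"; }
start_keepalived()  { echo "would start keepalived"; }

msg_handler() {
    # Dispatch each line received on the control socket to a handler
    while read -r cmd; do
        case "$cmd" in
            reload) reload_keepalived ;;
            stop)   stop_keepalived ;;
            start)  start_keepalived ;;
            *)      echo "unknown command: $cmd" >&2 ;;
        esac
    done
}

# Demo: feed commands the way a socket listener would
printf 'stop\nstart\n' | msg_handler
```

Exporting the functions (as the manifest does) makes them visible to any subshell the socket listener spawns per connection.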
```shell
if pid=$(pgrep -o keepalived); then
    kill -s SIGTERM "$pid"
    # The monitor runs every 10 seconds
    sleep 9
```
Why do we need the sleep here?
Is it because the keepalived container processes the messages in parallel?
It's to give the process time to exit before we kill -9 it below.
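Putting the thread's suggestions together, the stop sequence sends SIGTERM, waits a bounded time, then escalates to SIGKILL. This is a sketch, not the manifest's exact code; a background `sleep` stands in for keepalived, and the grace period is shortened for the demo:

```shell
#!/bin/bash
# Sketch: SIGTERM first, bounded wait, SIGKILL as a fallback.
# A background sleep stands in for the keepalived process.
sleep 300 &
target=$!

stop_gracefully() {
    local pid=$1 grace=$2
    kill -s SIGTERM "$pid" 2>/dev/null || return 0
    # Poll once per second until the process exits or the grace period ends
    for _ in $(seq "$grace"); do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
    done
    # Still alive after the grace period: force-kill, per the review
    kill -s SIGKILL "$pid" 2>/dev/null
}

stop_gracefully "$target" 3
wait "$target" 2>/dev/null
kill -0 "$target" 2>/dev/null || echo "stopped"
```

Polling with `kill -0` (which only checks whether the process exists) also avoids the stale-process race mentioned earlier: `start_keepalived`'s `pgrep` check won't see a half-dead process after this returns.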
/hold

This BZ was already resolved by the VIP mask change, do we still want to push the workers/masters keepalived start/stop mechanism for 4.6?
Force-pushed b71a351 to 6a90694
New changes are detected. LGTM label has been removed.
Force-pushed 6a90694 to 53097b9
Sorry, somehow I missed Yossi's comments on this before. I've rebased it and tested it locally and it still seems to be working fine.
Force-pushed 53097b9 to 084bcf0
@cybertron: The following tests failed, say /retest to rerun all failed tests.
Force-pushed 084bcf0 to 488c329
There are circumstances where keepalived can cause issues with the networking on a node, notably when bridging a physical interface. After the address has been moved to the bridge, it is possible for old routes to exist that cause problems talking to other nodes, which breaks the apiserver and prevents us from updating the keepalived config to reflect the networking change. This leaves us in a situation where the code can't recover properly from the bad configuration. In short, the apiserver is waiting for keepalived to update its configuration, but keepalived needs the apiserver in order to do so.

This change addresses the problem by stopping keepalived if the monitor fails to update the config more than 3 times in a row. That will unconfigure any VIPs on the node, which should fix the error described above. Once the bad routes related to the VIP(s) are gone, the apiserver will recover and we'll be able to update the keepalived config again. After that happens, keepalived is restarted.

This is one half of the fix. The other half will be in baremetal-runtimecfg to call the control socket with stop and start commands as appropriate.
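The 3-strikes behavior described in the commit message can be sketched as a small state machine. `update_config`, `send_stop`, and `send_start` are hypothetical stand-ins for the real runtimecfg and control-socket calls:

```shell
#!/bin/bash
# Sketch of the consecutive-failure counter (names are illustrative).
MAX_FAILURES=3
failures=0
stopped=0

monitor_tick() {
    if update_config; then
        failures=0
        if [ "$stopped" -eq 1 ]; then
            send_start          # config is updatable again: restart keepalived
            stopped=0
        fi
    else
        failures=$((failures + 1))
        if [ "$failures" -gt "$MAX_FAILURES" ] && [ "$stopped" -eq 0 ]; then
            send_stop           # more than 3 failures in a row: stop keepalived
            stopped=1
        fi
    fi
}

# Demo with stubs: four failed updates trigger a stop, a later success restarts
update_config() { [ "$tick" -ge 5 ]; }
send_stop()  { echo "stop keepalived"; }
send_start() { echo "start keepalived"; }
for tick in 1 2 3 4 5; do monitor_tick; done
```

Resetting the counter on every success is what makes the threshold mean "in a row" rather than "cumulative".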
Force-pushed 488c329 to d1698a5
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: bcrochet, cybertron. It still needs approval from an approver in each of the owning files.
@cybertron: The following tests failed, say /retest to rerun all failed tests.
Underlying BZ was closed by openshift/baremetal-runtimecfg#100. Closing this PR, reopen if necessary.