Bug 1878905: Stop keepalived when we fail to retrieve its config #96
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cybertron. The full list of commands accepted by this bot can be found here. The pull request process is described here.
@cybertron: This pull request references Bugzilla bug 1878905, which is invalid.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
87a12b8 to 99a7f37 (force-push)
Do we understand why we repeatedly fail to update the keepalived configuration? For the api-vip config we need to be able to render the correct config file even if the api-vip isn't reachable, to avoid the circular dependency: keepalived-monitor --> api-vip --> keepalived-monitor. It feels to me like the root cause is a bug in the code that constructs the api-vip unicast config, and I think we should try to find and fix that bug instead of running a 'reset keepalived' without understanding the root cause. If we do want to add a 'reset keepalived' mechanism to our system, we should first define which conditions should trigger it (maybe something like failing to ping the api-vip?).
I think that #98 might help with this bug.
Yes. There were old routes from the VIP that still pointed at the physical interface after it had been added to a bridge. This messed up the node's ability to talk to the other nodes, which caused its local apiserver to go down. That meant the monitor couldn't talk to either the API VIP or the local apiserver to get updated keepalived config, so it was wedged. Stopping keepalived will drop the VIPs and cause the bad routes to be removed. Then the local apiserver will recover and we can update the keepalived config to point at the bridge like it should.
Right, this relies on a functional local apiserver to recover. Although in this particular instance it may also be able to recover from the api vip since that will move to a different node after keepalived stops. The bug only broke the node holding the VIPs at the time the interface was bridged.
The condition that triggers this is "keepalived did something that messed up the host networking". Being unable to ping the api-vip isn't necessarily a problem, but in this case the local apiserver also went down because it couldn't talk to the rest of the cluster. Everything routing to 192.168.111.X was broken. My thinking is that if a node can't ensure its keepalived config is up to date, we don't want that node holding the VIP anyway. It could have incorrect peers or some other mismatch with the other nodes.
/bugzilla refresh
@cybertron: This pull request references Bugzilla bug 1878905, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
 	newConfig.EnableUnicast = curEnableUnicast
 	}
-	updateUnicastConfig(kubeconfigPath, &newConfig, appliedConfig)
+	err = updateUnicastConfig(kubeconfigPath, &newConfig, appliedConfig)
I think it would be better to have a separate Go function (and thread) for this purpose; it should periodically monitor the condition and trigger the main process using a channel.
That would make it easier to add logic to this 'reset' mechanism in the future.
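For illustration, here is a minimal sketch (not the PR's code) of that suggestion: a dedicated goroutine evaluates a trigger condition on a timer and signals the main monitor loop over a channel. checkAPIVIP, watchCondition, and the channel wiring are hypothetical names, and the trigger itself is just a stand-in.

```go
package main

import (
	"log"
	"time"
)

// checkAPIVIP stands in for whatever trigger condition is chosen,
// e.g. failing to ping the API VIP.
func checkAPIVIP() bool {
	return true // a real implementation would probe the VIP here
}

// watchCondition runs in its own goroutine and asks for a keepalived reset
// whenever the condition check fails.
func watchCondition(interval time.Duration, reset chan<- struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		if !checkAPIVIP() {
			select {
			case reset <- struct{}{}: // signal the main loop
			default: // don't block if a reset is already pending
			}
		}
	}
}

func main() {
	reset := make(chan struct{}, 1)
	go watchCondition(10*time.Second, reset)
	for range reset {
		log.Println("trigger condition failed; this is where keepalived would be reset")
	}
}
```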
I think that #100 should fix the wrong route problem.
99a7f37 to 330c3cf (force-push)
330c3cf to 02d4a81 (force-push)
IIUC, this PR covers only the unicast mode (== baremetal platform): it starts/stops keepalived based on the updateUnicastConfig function status, is that correct? If so, I believe we should use another method to check keepalived health so that other platforms can also benefit from it. Additionally, I think that having this mechanism in a separate thread/Go function would make it easier to add/update logic in the future.
Correct.
It's only relevant for unicast. Non-unicast platforms don't need to talk to the API to generate their configs, so they'll never hit this.
That would just mean we have to duplicate all of the config update logic into a separate thread and make extra calls to the API solely for error handling. It's much simpler and less error-prone to write it as an error handler. I could probably move the error handling logic for this into a separate function, though. I didn't do it that way initially because of the status variables that need to persist, but if I moved them to module level I think that could work. I'll try that out locally.
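A rough sketch of what that could look like, assuming the status variables move to module level and the error handling becomes its own function; handleConfigError, consecutiveErrors, and the threshold are hypothetical names, not the actual baremetal-runtimecfg code:

```go
package monitor

// Module-level state illustrating the "status variables that need to persist"
// across monitor iterations.
var (
	consecutiveErrors int
	stopRequested     bool
)

// handleConfigError isolates the error-handling logic: it counts consecutive
// config-retrieval failures and records whether keepalived should be stopped.
func handleConfigError(err error, threshold int) bool {
	if err == nil {
		consecutiveErrors = 0
		stopRequested = false
		return false
	}
	consecutiveErrors++
	if consecutiveErrors >= threshold {
		stopRequested = true
	}
	return stopRequested
}
```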
Ohh, I thought we wanted to add something more generic here that would stop/start keepalived based on some checks, but I'm good with just handling the unicast issues, and overall it looks OK. I still have one concern: unless I missed something in the code, the start/stop mechanism will send a redundant start message to the keepalived container every polling interval even when everything is green (which will be the case 99.99% of the time). I assume the keepalived container can handle that, but maybe we should consider optimizing this flow by sending the start/stop messages only when needed.
Yeah, that's something I meant to discuss in the commit message. I did intentionally have it send start and stop messages continually, just in case we somehow get out of sync with keepalived itself. The MCO side is set up so that sending a start message when keepalived is already running is a noop, and vice versa for stop. It felt safer to continually sync rather than rely on the communication being perfect; this way, if a start or stop message isn't handled properly, it just gets corrected on the next monitor check. That said, right now it does kind of spam the logs because we log every command sent to the socket. Since start and stop are the two commands sent repeatedly, I had thought of skipping the logging when the socket gets one of those. What do you think?
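A minimal sketch of that log-suppression idea, assuming a simple text protocol on the control socket; sendCommand and the command strings are illustrative, not the real MCO interface:

```go
package main

import (
	"fmt"
	"log"
	"net"
)

// sendCommand writes a single command to the keepalived control socket and
// logs it, except for the repetitive "start"/"stop" commands that are resent
// every monitor interval and would otherwise flood the logs.
func sendCommand(conn net.Conn, cmd string) error {
	if cmd != "start" && cmd != "stop" {
		log.Printf("Sending command to keepalived container: %s", cmd)
	}
	_, err := fmt.Fprintln(conn, cmd)
	return err
}

func main() {
	// net.Pipe stands in for the real unix socket purely so this sketch runs.
	client, server := net.Pipe()
	go func() {
		buf := make([]byte, 64)
		for {
			if _, err := server.Read(buf); err != nil {
				return
			}
		}
	}()
	_ = sendCommand(client, "start")  // not logged
	_ = sendCommand(client, "reload") // logged
	_ = client.Close()
}
```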
02d4a81 to 92caca1 (force-push)
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
/remove-lifecycle stale
	isBootstrap := os.Getenv("IS_BOOTSTRAP") == "yes"
	var err error
	var backends []Backend
	// On the bootstrap we only want to look at the local apiserver. Once that goes
	// down we want to shut down keepalived so the API VIP can move to the masters.
	if !isBootstrap {
		backends, err = getSortedBackends(kubeconfigPath, false)
	}
	if err != nil || isBootstrap {
		if !isBootstrap {
			log.Infof("An error occurred while trying to read master nodes details from api-vip:kube-apiserver: %v", err)
			log.Infof("Trying to read master nodes details from localhost:kube-apiserver")
I'm not sure we need this change: the GetLBConfig function is also used by haproxy-monitor, and the IS_BOOTSTRAP env var isn't set in the haproxy-monitor container (though it should still work). Additionally, the bootstrap's kubeconfig points at localhost.
It's been a while since I wrote this, but IS_BOOTSTRAP was being checked before in the monitor (see line 276 in dynkeepalived). If that's not correct we can remove it, but I think it was necessary here to maintain the same behavior as before.
You make a good point that kubeconfig is already pointed at localhost on bootstrap, but I think part of the reason for this logic was to avoid the log message about the api-vip when the api-vip is not actually what we care about. Maybe I should update the comment though?
pkg/monitor/dynkeepalived.go (outdated)
 		time.Sleep(interval)
 		continue
 	}
+	err = ensureRunning(conn)
I feel a bit uncomfortable with the fact that a start message will be sent to the keepalived container every interval on all nodes other than the bootstrap, even when everything is OK (which would be the case 99.99% of the time).
Do you think there's a way to optimize this? Maybe we could start by sending the command only upon a change and then update it if needed?
We could, but then we have to keep track of the process state in the monitor too. Right now the state is managed entirely on the haproxy container side so there's no chance of getting out of sync. Since the command is sent over a local socket, the only meaningful load to come out of it is the pgrep that happens to check if keepalived is already running. I don't think that's going to be significant.
We might be able to optimize the other side even more to reuse the results of the liveness probe instead of looking for the process, if that would make this more acceptable.
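For comparison, a sketch of the "send only on change" alternative being weighed here (not what the PR implements): the monitor remembers the last command it sent and only writes to the socket when the desired state flips. syncKeepalivedState and lastCommand are hypothetical names.

```go
package monitor

import (
	"fmt"
	"net"
)

// lastCommand remembers the desired keepalived state as last communicated.
var lastCommand string

// syncKeepalivedState only writes to the control socket when the desired
// state actually changes, avoiding the per-interval start messages.
func syncKeepalivedState(conn net.Conn, desired string) error {
	if desired == lastCommand {
		return nil // nothing changed; skip the redundant message
	}
	if _, err := fmt.Fprintln(conn, desired); err != nil {
		return err // keep lastCommand unchanged so we retry next interval
	}
	lastCommand = desired
	return nil
}
```

The trade-off discussed above is that this requires the monitor to track process state itself, whereas continual resync keeps all state on the container side.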
When we fail repeatedly to update the keepalived configuration, it is possible that an outdated keepalived config itself is the cause of the failure. Since we can't update the keepalived config without the api, the only thing we can do in this scenario is stop keepalived. That should resolve any keepalived-related networking issues, and then we'll be able to update the keepalived config using the local apiserver (in the case that all instances of keepalived go down and we lose the API VIP). This depends on openshift/machine-config-operator#2085 to provide the start and stop functionality on the keepalived side.
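Putting the description together, here is a condensed, hypothetical sketch of the overall behaviour: the monitor resyncs keepalived's desired state every interval and, after several consecutive failures to retrieve the config, asks the keepalived container to stop. The threshold, the updateConfig stub, and the exact socket protocol are assumptions; ensureRunning/ensureStopped only mirror the helpers named in the diff.

```go
package monitor

import (
	"errors"
	"log"
	"net"
	"time"
)

// stopAfterFailures is an assumed threshold; the PR's exact policy may differ.
const stopAfterFailures = 3

// ensureRunning and ensureStopped are reduced here to writing a plain-text
// command on the control socket provided by the keepalived container side.
func ensureRunning(conn net.Conn) error {
	_, err := conn.Write([]byte("start\n"))
	return err
}

func ensureStopped(conn net.Conn) error {
	_, err := conn.Write([]byte("stop\n"))
	return err
}

// updateConfig stands in for rendering the keepalived config from the API.
func updateConfig() error {
	return errors.New("api unreachable")
}

// monitorLoop resyncs keepalived's desired state every interval; after enough
// consecutive failures to retrieve the config it asks keepalived to stop.
func monitorLoop(conn net.Conn, interval time.Duration) {
	failures := 0
	for {
		if err := updateConfig(); err != nil {
			failures++
			log.Printf("config update failed (%d consecutive): %v", failures, err)
		} else {
			failures = 0
		}
		// A start or stop command is sent every interval; the keepalived side
		// treats a repeated command as a noop, so this continually re-syncs.
		if failures >= stopAfterFailures {
			_ = ensureStopped(conn)
		} else {
			_ = ensureRunning(conn)
		}
		time.Sleep(interval)
	}
}
```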
92caca1 to 2af33f7 (force-push)
 	}
-	updateUnicastConfig(kubeconfigPath, &newConfig, appliedConfig)
+	err = updateUnicastConfig(kubeconfigPath, &newConfig, appliedConfig)
+	if err != nil {
For the bootstrap case, we don't want to stop keepalived before the kube-apiservers start running (see https://github.com/openshift/baremetal-runtimecfg/blob/master/pkg/monitor/dynkeepalived.go#L132-#L143). Does this PR cover that case?
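To make the concern concrete, a hypothetical guard along these lines would be needed so the bootstrap node keeps the VIP until the apiservers are up; shouldStopOnFailure and apiserverHasStarted are illustrative names, not the PR's code.

```go
package monitor

import "os"

// shouldStopOnFailure: on the bootstrap node we never stop keepalived before
// the cluster's kube-apiservers are serving, because the API VIP has to stay
// on the bootstrap until then.
func shouldStopOnFailure(consecutiveFailures, threshold int, apiserverHasStarted bool) bool {
	isBootstrap := os.Getenv("IS_BOOTSTRAP") == "yes"
	if isBootstrap && !apiserverHasStarted {
		return false
	}
	return consecutiveFailures >= threshold
}
```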
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
@cybertron: PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closed this PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.