
Conversation

@cybertron
Member

When we fail repeatedly to update the keepalived configuration, it is
possible that an outdated keepalived config itself is the cause of
the failure. Since we can't update the keepalived config without the
api, the only thing we can do in this scenario is stop keepalived.
That should resolve any keepalived-related networking issues, and
then we'll be able to update the keepalived config using the local
apiserver (in the case that all instances of keepalived go down and
we lose the API VIP).

This depends on openshift/machine-config-operator#2085
to provide the start and stop functionality on the keepalived side.
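
A minimal sketch of the mechanism described above, assuming a consecutive-failure counter in the monitor loop; the helper names, threshold, and polling interval are illustrative assumptions and not taken from the actual PR code:

package main

import (
    "log"
    "time"
)

// Hypothetical stand-ins for the real monitor helpers; names and signatures are assumptions.
func updateUnicastConfig() error { return nil }
func stopKeepalived()            { log.Println("asking the keepalived container to stop") }
func ensureRunning()             { log.Println("asking the keepalived container to start (no-op if already running)") }

func main() {
    const maxFailures = 3             // assumed threshold, not taken from the PR
    const interval = 10 * time.Second // assumed polling interval

    failures := 0
    for {
        if err := updateUnicastConfig(); err != nil {
            failures++
            // After repeated failures the stale config may itself be what broke
            // networking, so drop the VIPs by stopping keepalived and let the
            // local apiserver recover.
            if failures >= maxFailures {
                stopKeepalived()
            }
        } else {
            failures = 0
            ensureRunning()
        }
        time.Sleep(interval)
    }
}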

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cybertron

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 14, 2020
@cybertron cybertron changed the title Stop keepalived when we fail to retrieve its config Bug 1878905: Stop keepalived when we fail to retrieve its config Sep 14, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Sep 14, 2020
@openshift-ci-robot
Contributor

@cybertron: This pull request references Bugzilla bug 1878905, which is invalid:

  • expected the bug to target the "4.6.0" release, but it targets "4.7.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1878905: Stop keepalived when we fail to retrieve its config

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cybertron cybertron force-pushed the keepalived-error-counter branch from 87a12b8 to 99a7f37 on September 14, 2020 21:21
@yboaron
Contributor

yboaron commented Sep 15, 2020

Do we understand why we fail repeatedly to update the keepalived configuration?

For the api-vip config we need to be able to render the correct config file even if the api-vip isn't reachable, to avoid the circular dependency: Keepalived-monitor --> api-vip --> Keepalived-monitor.

It feels to me like the root cause is a bug in the code that constructs the api-vip unicast config, and I think we should try to find and fix that bug instead of running a 'reset keepalived' without understanding the root cause.

Now, if we would like to add a 'reset keepalived' mechanism to our system, we should first define which conditions should trigger the reset process (maybe something like failing to ping the api-vip?).

@yboaron
Contributor

yboaron commented Sep 15, 2020

I think that #98 might help with this bug

@cybertron
Member Author

Do we understand why we fail repeatedly to update the keepalived configuration?

Yes. There were old routes from the VIP that still pointed at the physical interface after it had been added to a bridge. This messed up the node's ability to talk to the other nodes, which caused its local apiserver to go down. That meant the monitor couldn't talk to either the API VIP or the local apiserver to get updated keepalived config, so it was wedged. Stopping keepalived will drop the VIPs and cause the bad routes to be removed. Then the local apiserver will recover and we can update the keepalived config to point at the bridge like it should.

For the api-vip config we need to be able to render the correct config file even if the api-vip isn't reachable, to avoid the circular dependency: Keepalived-monitor --> api-vip --> Keepalived-monitor.

Right, this relies on a functional local apiserver to recover. Although in this particular instance it may also be able to recover from the api vip since that will move to a different node after keepalived stops. The bug only broke the node holding the VIPs at the time the interface was bridged.

It feels to me like the root cause is a bug in the code that constructs the api-vip unicast config, and I think we should try to find and fix that bug instead of running a 'reset keepalived' without understanding the root cause.

Now, if we would like to add a 'reset keepalived' mechanism to our system, we should first define which conditions should trigger the reset process (maybe something like failing to ping the api-vip?).

The condition that triggers this is "keepalived did something that messed up the host networking". Being unable to ping the api vip isn't necessarily a problem, but in this case the local apiserver also went down because it couldn't talk to the rest of the cluster. Everything routing to 192.168.111.X was broken.

My thinking is that if a node can't ensure its keepalived config is up to date, we don't want that node holding the VIP anyway. It could have incorrect peers or some other mismatch with the other nodes.

@cybertron
Member Author

/bugzilla refresh

@openshift-ci-robot
Contributor

@cybertron: This pull request references Bugzilla bug 1878905, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Sep 16, 2020
     newConfig.EnableUnicast = curEnableUnicast
 }
- updateUnicastConfig(kubeconfigPath, &newConfig, appliedConfig)
+ err = updateUnicastConfig(kubeconfigPath, &newConfig, appliedConfig)
Contributor


I think it would be better to have a separate Go function (and goroutine) for that purpose; this function should periodically monitor the condition and trigger the main process using a channel.
It would be easier to add logic to this 'reset' mechanism in the future.
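
A rough sketch of this separate-goroutine idea, assuming a hypothetical shouldResetKeepalived check and illustrative channel and interval values; none of these names come from the PR itself:

package main

import "time"

// watchResetCondition periodically evaluates the trigger condition and signals
// the main loop over a channel, keeping the reset logic out of the main process.
func watchResetCondition(trigger chan<- struct{}, interval time.Duration) {
    for {
        if shouldResetKeepalived() {
            // non-blocking send so a slow main loop doesn't stall the watcher
            select {
            case trigger <- struct{}{}:
            default:
            }
        }
        time.Sleep(interval)
    }
}

// shouldResetKeepalived is a placeholder for whatever health check (for example
// an api-vip reachability test) would be agreed on.
func shouldResetKeepalived() bool { return false }

func main() {
    trigger := make(chan struct{}, 1)
    go watchResetCondition(trigger, 10*time.Second)

    for {
        select {
        case <-trigger:
            // the main process would stop/reset keepalived here
        case <-time.After(30 * time.Second):
            // normal config-update work continues on its own cadence
        }
    }
}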

@yboaron
Contributor

yboaron commented Sep 17, 2020

I think that #100 should fix the wrong-route problem.
I'll try to reproduce the bug in my local env.

@cybertron cybertron force-pushed the keepalived-error-counter branch from 99a7f37 to 330c3cf on October 21, 2020 17:20
@cybertron cybertron force-pushed the keepalived-error-counter branch from 330c3cf to 02d4a81 on November 5, 2020 17:41
@yboaron
Contributor

yboaron commented Nov 5, 2020

IIUC, this PR covers only unicast mode (== the baremetal platform) and starts/stops keepalived based on the status of the updateUnicastConfig function, is that correct?

If ^ is correct, I believe we should use another method to check keepalived health so that other platforms can also benefit from it.

Additionally, I think that having this mechanism in a separate thread/go function would make it easier to add/update logic in the future.

@cybertron
Member Author

IIUC, this PR covers only unicast mode (== the baremetal platform) and starts/stops keepalived based on the status of the updateUnicastConfig function, is that correct?

Correct.

If ^ is correct, I believe we should use another method to check keepalived health so that other platforms can also benefit from it.

It's only relevant for unicast. Non-unicast platforms don't need to talk to the API to generate their configs, so they'll never hit this.

Additionally, I think that having this mechanism in a separate thread/go function would make it easier to add/update logic in the future.

That would just mean we have to duplicate all of the config update logic into a separate thread and make extra calls to the API solely for error handling. It's much simpler and less error-prone to just write it as an error handler.

I could probably move the error handling logic for this into a separate function. I didn't do it that way initially because of the status variables that need to persist, but if I moved them to be module-level I think that could work. I'll try that out locally.

@yboaron
Contributor

yboaron commented Nov 10, 2020

Ohh, I thought we wanted to add something more generic here that would stop/start Keepalived based on some checks, but I'm good with just handling the unicast issues, and overall it looks OK.

I still have one concern here: unless I missed something in the code, adding the start/stop mechanism will send a redundant start message to the Keepalived container every polling interval even when everything is green (which will be the case 99.99% of the time).

Although I assume the keepalived container should handle that, maybe we should consider optimizing this flow by sending the start/stop messages only when needed.

@cybertron
Member Author

Yeah, that's something I meant to discuss in the commit message. I did intentionally have it send start and stop messages continually, just in case we somehow get out of sync with keepalived itself. The MCO side of this is set up so it's a noop to send a start message when keepalived is already running, and vice versa for stop. It felt safer to continually sync rather than rely on the communication being perfect. This way if anything happened and a start or stop message wasn't handled properly it would just get updated on the next monitor check.

That said, right now it does kind of spam the logs because we log every command sent to the socket. Since start and stop are the two commands sent repeatedly, I had thoughts of skipping the logging when the socket gets one of those. What do you think?
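
For illustration, the monitor-side send could be as small as the sketch below. The "start"/"stop" wire format, the socket path, and the use of a plain Unix socket are assumptions; the real protocol is defined on the machine-config-operator side and is not shown in this PR:

package main

import "net"

// ensureRunning asks the keepalived container to start. Sending it every
// monitor interval is intended to be safe because the receiver treats it as a
// no-op when keepalived is already running.
func ensureRunning(conn net.Conn) error {
    _, err := conn.Write([]byte("start\n")) // assumed message format
    return err
}

// ensureStopped is the symmetric request sent after repeated config failures.
func ensureStopped(conn net.Conn) error {
    _, err := conn.Write([]byte("stop\n")) // assumed message format
    return err
}

func main() {
    // assumed socket path; the real path lives on the MCO side
    conn, err := net.Dial("unix", "/var/run/keepalived/keepalived.sock")
    if err != nil {
        return
    }
    defer conn.Close()
    _ = ensureRunning(conn)
}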

@cybertron cybertron force-pushed the keepalived-error-counter branch from 02d4a81 to 92caca1 on November 12, 2020 22:45
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 10, 2021
@bcrochet
Member

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 11, 2021
Comment on lines +412 to +418
    isBootstrap := os.Getenv("IS_BOOTSTRAP") == "yes"
    var err error
    var backends []Backend
    // On the bootstrap we only want to look at the local apiserver. Once that goes
    // down we want to shut down keepalived so the API VIP can move to the masters.
    if !isBootstrap {
        backends, err = getSortedBackends(kubeconfigPath, false)
    }
    if err != nil || isBootstrap {
        if !isBootstrap {
            log.Infof("An error occurred while trying to read master nodes details from api-vip:kube-apiserver: %v", err)
            log.Infof("Trying to read master nodes details from localhost:kube-apiserver")
Contributor

@yboaron yboaron Feb 24, 2021


I'm not sure we need this change:
the GetLBConfig function is also used by haproxy-monitor, and the IS_BOOTSTRAP env var isn't set in the haproxy-monitor container (though it should still work).
Additionally, the bootstrap's kubeconfig already points to localhost.

Member Author


It's been a while since I wrote this, but IS_BOOTSTRAP was being checked before in the monitor (see line 276 in dynkeepalived). If that's not correct we can remove it, but I think it was necessary here to maintain the same behavior as before.

You make a good point that kubeconfig is already pointed at localhost on bootstrap, but I think part of the reason for this logic was to avoid the log message about the api-vip when the api-vip is not actually what we care about. Maybe I should update the comment though?

        time.Sleep(interval)
        continue
    }
    err = ensureRunning(conn)
Contributor

@yboaron yboaron Feb 25, 2021


I feel a bit uncomfortable with the fact that a start message will be sent every interval to the keepalived container on all nodes other than the bootstrap, even when everything is OK (which would be the case 99.99% of the time).

Do you think there's a way to optimize this? Maybe we could start by sending the command only upon a change and then update it if needed?

Member Author


We could, but then we have to keep track of the process state in the monitor too. Right now the state is managed entirely on the keepalived container side, so there's no chance of getting out of sync. Since the command is sent over a local socket, the only meaningful load to come out of it is the pgrep that happens to check if keepalived is already running. I don't think that's going to be significant.

We might be able to optimize the other side even more to reuse the results of the liveness probe instead of looking for the process, if that would make this more acceptable.
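
On the receiving side, the idempotence described above could look roughly like this; the actual implementation lives in the MCO keepalived container and is not part of this PR, so treat the command handling and the pgrep usage as an assumed illustration only:

package main

import (
    "log"
    "os/exec"
)

// keepalivedRunning mirrors the pgrep check mentioned above: pgrep exits 0
// when a matching process exists.
func keepalivedRunning() bool {
    return exec.Command("pgrep", "-x", "keepalived").Run() == nil
}

// handleCommand makes "start" and "stop" no-ops when keepalived is already in
// the requested state, so the monitor can safely resend them every interval.
func handleCommand(cmd string) {
    switch cmd {
    case "start":
        if !keepalivedRunning() {
            log.Println("starting keepalived") // real start logic omitted
        }
    case "stop":
        if keepalivedRunning() {
            log.Println("stopping keepalived") // real stop logic omitted
        }
    }
}

func main() {
    handleCommand("start")
}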

@cybertron cybertron force-pushed the keepalived-error-counter branch from 92caca1 to 2af33f7 on March 31, 2021 20:05
 }
- updateUnicastConfig(kubeconfigPath, &newConfig, appliedConfig)
+ err = updateUnicastConfig(kubeconfigPath, &newConfig, appliedConfig)
+ if err != nil {
Contributor


For the bootstrap case, we don't want to stop Keepalived before the kube-apiservers start running (see https://github.com/openshift/baremetal-runtimecfg/blob/master/pkg/monitor/dynkeepalived.go#L132-#L143). Does this PR cover this case?

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 7, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jul 7, 2021

@cybertron: PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 7, 2021
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 6, 2021
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci
Contributor

openshift-ci bot commented Sep 5, 2021

@openshift-bot: Closed this PR.


In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this Sep 5, 2021
