
Conversation

@cybertron
Member

When we fail repeatedly to update the keepalived configuration, it is
possible that an outdated keepalived config itself is the cause of
the failure. Since we can't update the keepalived config without the
api, the only thing we can do in this scenario is stop keepalived.
That should resolve any keepalived-related networking issues, and
then we'll be able to update the keepalived config using the local
apiserver (in the case that all instances of keepalived go down and
we lose the API VIP).

This depends on openshift/machine-config-operator#2085
to provide the start and stop functionality on the keepalived side.
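
A minimal sketch of the mechanism described above, assuming a consecutive-failure counter in the monitor loop; the helper names, threshold, and polling interval are illustrative assumptions and not taken from the actual PR code:

package main

import (
    "log"
    "time"
)

// Hypothetical stand-ins for the real monitor helpers; names and signatures are assumptions.
func updateUnicastConfig() error { return nil }
func stopKeepalived()            { log.Println("asking the keepalived container to stop") }
func ensureRunning()             { log.Println("asking the keepalived container to start (no-op if already running)") }

func main() {
    const maxFailures = 3             // assumed threshold, not taken from the PR
    const interval = 10 * time.Second // assumed polling interval

    failures := 0
    for {
        if err := updateUnicastConfig(); err != nil {
            failures++
            // After repeated failures the stale config may itself be what broke
            // networking, so drop the VIPs by stopping keepalived and let the
            // local apiserver recover.
            if failures >= maxFailures {
                stopKeepalived()
            }
        } else {
            failures = 0
            ensureRunning()
        }
        time.Sleep(interval)
    }
}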

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cybertron

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 14, 2020
@cybertron cybertron changed the title Stop keepalived when we fail to retrieve its config Bug 1878905: Stop keepalived when we fail to retrieve its config Sep 14, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Sep 14, 2020
@openshift-ci-robot
Contributor

@cybertron: This pull request references Bugzilla bug 1878905, which is invalid:

  • expected the bug to target the "4.6.0" release, but it targets "4.7.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1878905: Stop keepalived when we fail to retrieve its config

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cybertron cybertron force-pushed the keepalived-error-counter branch from 87a12b8 to 99a7f37 on September 14, 2020 21:21
@yboaron
Contributor

yboaron commented Sep 15, 2020

Do we understand why we fail repeatedly to update the keepalived configuration?

For the api-vip config we need to be able to render the correct config file even if the api-vip isn't reachable, to avoid the circular dependency: Keepalived-monitor --> api-vip --> Keepalived-monitor.

It feels to me like the root cause is a bug in the code that constructs the api-vip unicast config, and I think we should try to find and fix that bug instead of running a 'reset keepalived' without understanding the root cause.

Now, if we would like to add a 'reset keepalived' mechanism to our system, we should first define which conditions should trigger the reset process (maybe something like failing to ping the api-vip?).

@yboaron
Contributor

yboaron commented Sep 15, 2020

I think that #98 might help with this bug

@cybertron
Member Author

Do we understand why we fail repeatedly to update the keepalived configuration?

Yes. There were old routes from the VIP that still pointed at the physical interface after it had been added to a bridge. This messed up the node's ability to talk to the other nodes, which caused its local apiserver to go down. That meant the monitor couldn't talk to either the API VIP or the local apiserver to get updated keepalived config, so it was wedged. Stopping keepalived will drop the VIPs and cause the bad routes to be removed. Then the local apiserver will recover and we can update the keepalived config to point at the bridge like it should.

For the api-vip config we need to be able to render the correct config file even if the api-vip isn't reachable, to avoid the circular dependency: Keepalived-monitor --> api-vip --> Keepalived-monitor.

Right, this relies on a functional local apiserver to recover. Although in this particular instance it may also be able to recover from the api vip since that will move to a different node after keepalived stops. The bug only broke the node holding the VIPs at the time the interface was bridged.

It feels to me like the root cause is a bug in the code that constructs the api-vip unicast config, and I think we should try to find and fix that bug instead of running a 'reset keepalived' without understanding the root cause.

Now, if we would like to add a 'reset keepalived' mechanism to our system, we should first define which conditions should trigger the reset process (maybe something like failing to ping the api-vip?).

The condition that triggers this is "keepalived did something that messed up the host networking". Being unable to ping the api vip isn't necessarily a problem, but in this case the local apiserver also went down because it couldn't talk to the rest of the cluster. Everything routing to 192.168.111.X was broken.

My thinking is that if a node can't ensure its keepalived config is up to date, we don't want that node holding the VIP anyway. It could have incorrect peers or some other mismatch with the other nodes.

@cybertron
Member Author

/bugzilla refresh

@openshift-ci-robot
Contributor

@cybertron: This pull request references Bugzilla bug 1878905, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Sep 16, 2020
     newConfig.EnableUnicast = curEnableUnicast
 }
- updateUnicastConfig(kubeconfigPath, &newConfig, appliedConfig)
+ err = updateUnicastConfig(kubeconfigPath, &newConfig, appliedConfig)
Contributor


I think it would be better to have a separate Go function (and goroutine) for that purpose; this function should periodically monitor the condition and trigger the main process using a channel.
It would be easier to add logic to this 'reset' mechanism in the future.
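
A rough sketch of this separate-goroutine idea, assuming a hypothetical shouldResetKeepalived check and illustrative channel and interval values; none of these names come from the PR itself:

package main

import "time"

// watchResetCondition periodically evaluates the trigger condition and signals
// the main loop over a channel, keeping the reset logic out of the main process.
func watchResetCondition(trigger chan<- struct{}, interval time.Duration) {
    for {
        if shouldResetKeepalived() {
            // non-blocking send so a slow main loop doesn't stall the watcher
            select {
            case trigger <- struct{}{}:
            default:
            }
        }
        time.Sleep(interval)
    }
}

// shouldResetKeepalived is a placeholder for whatever health check (for example
// an api-vip reachability test) would be agreed on.
func shouldResetKeepalived() bool { return false }

func main() {
    trigger := make(chan struct{}, 1)
    go watchResetCondition(trigger, 10*time.Second)

    for {
        select {
        case <-trigger:
            // the main process would stop/reset keepalived here
        case <-time.After(30 * time.Second):
            // normal config-update work continues on its own cadence
        }
    }
}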

@yboaron
Contributor

yboaron commented Sep 17, 2020

I think that #100 should fix the wrong-route problem.
I'll try to reproduce the bug in my local env.

@cybertron cybertron force-pushed the keepalived-error-counter branch from 99a7f37 to 330c3cf on October 21, 2020 17:20
@cybertron cybertron force-pushed the keepalived-error-counter branch from 330c3cf to 02d4a81 on November 5, 2020 17:41
@yboaron
Contributor

yboaron commented Nov 5, 2020

IIUC, this PR covers only unicast mode (== the baremetal platform) and starts/stops keepalived based on the status of the updateUnicastConfig function, is that correct?

If ^ is correct, I believe we should use another method to check keepalived health so that other platforms can also benefit from it.

Additionally, I think that having this mechanism in a separate thread/go function would make it easier to add/update logic in the future.

@cybertron
Member Author

IIUC, this PR covers only unicast mode (== the baremetal platform) and starts/stops keepalived based on the status of the updateUnicastConfig function, is that correct?

Correct.

If ^ is correct, I believe we should use another method to check keepalived health so that other platforms can also benefit from it.

It's only relevant for unicast. Non-unicast platforms don't need to talk to the API to generate their configs, so they'll never hit this.

Additionally, I think that having this mechanism in a separate thread/go function would make it easier to add/update logic in the future.

That would just mean we have to duplicate all of the config update logic into a separate thread and make extra calls to the API solely for error handling. It's much simpler and less error-prone to just write it as an error handler.

I could probably move the error handling logic for this into a separate function. I didn't do it that way initially because of the status variables that need to persist, but if I moved them to be module-level I think that could work. I'll try that out locally.

@yboaron
Contributor

yboaron commented Nov 10, 2020

Ohh, I thought we wanted to add something more generic here that would stop/start Keepalived based on some checks, but I'm good with just handling the unicast issues, and overall it looks OK.

I still have one concern here: unless I missed something in the code, adding the start/stop mechanism will send a redundant start message to the Keepalived container every polling interval even when everything is green (which will be the case 99.99% of the time).

Although I assume the keepalived container should handle that, maybe we should consider optimizing this flow by sending the start/stop messages only when needed.

@cybertron
Member Author

Yeah, that's something I meant to discuss in the commit message. I did intentionally have it send start and stop messages continually, just in case we somehow get out of sync with keepalived itself. The MCO side of this is set up so it's a noop to send a start message when keepalived is already running, and vice versa for stop. It felt safer to continually sync rather than rely on the communication being perfect. This way if anything happened and a start or stop message wasn't handled properly it would just get updated on the next monitor check.

That said, right now it does kind of spam the logs because we log every command sent to the socket. Since start and stop are the two commands sent repeatedly, I had thoughts of skipping the logging when the socket gets one of those. What do you think?
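
For illustration, the monitor-side send could be as small as the sketch below. The "start"/"stop" wire format, the socket path, and the use of a plain Unix socket are assumptions; the real protocol is defined on the machine-config-operator side and is not shown in this PR:

package main

import "net"

// ensureRunning asks the keepalived container to start. Sending it every
// monitor interval is intended to be safe because the receiver treats it as a
// no-op when keepalived is already running.
func ensureRunning(conn net.Conn) error {
    _, err := conn.Write([]byte("start\n")) // assumed message format
    return err
}

// ensureStopped is the symmetric request sent after repeated config failures.
func ensureStopped(conn net.Conn) error {
    _, err := conn.Write([]byte("stop\n")) // assumed message format
    return err
}

func main() {
    // assumed socket path; the real path lives on the MCO side
    conn, err := net.Dial("unix", "/var/run/keepalived/keepalived.sock")
    if err != nil {
        return
    }
    defer conn.Close()
    _ = ensureRunning(conn)
}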

@cybertron cybertron force-pushed the keepalived-error-counter branch from 02d4a81 to 92caca1 on November 12, 2020 22:45
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 10, 2021
@bcrochet
Member

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 11, 2021
Comment on lines +412 to +418
    isBootstrap := os.Getenv("IS_BOOTSTRAP") == "yes"
    var err error
    var backends []Backend
    // On the bootstrap we only want to look at the local apiserver. Once that goes
    // down we want to shut down keepalived so the API VIP can move to the masters.
    if !isBootstrap {
        backends, err = getSortedBackends(kubeconfigPath, false)
    }
    if err != nil || isBootstrap {
        if !isBootstrap {
            log.Infof("An error occurred while trying to read master nodes details from api-vip:kube-apiserver: %v", err)
            log.Infof("Trying to read master nodes details from localhost:kube-apiserver")
Contributor

@yboaron yboaron Feb 24, 2021


I'm not sure we need this change:
the GetLBConfig function is also used by haproxy-monitor, and the IS_BOOTSTRAP env var isn't set in the haproxy-monitor container (though it should still work).
Additionally, the bootstrap's kubeconfig already points to localhost.

Member Author


It's been a while since I wrote this, but IS_BOOTSTRAP was being checked before in the monitor (see line 276 in dynkeepalived). If that's not correct we can remove it, but I think it was necessary here to maintain the same behavior as before.

You make a good point that kubeconfig is already pointed at localhost on bootstrap, but I think part of the reason for this logic was to avoid the log message about the api-vip when the api-vip is not actually what we care about. Maybe I should update the comment though?

        time.Sleep(interval)
        continue
    }
    err = ensureRunning(conn)
Contributor

@yboaron yboaron Feb 25, 2021


I feel a bit uncomfortable with the fact that a start message will be sent every interval to the keepalived container on all nodes other than the bootstrap, even when everything is OK (which would be the case 99.99% of the time).

Do you think there's a way to optimize this? Maybe we could start by sending the command only upon a change and then update it if needed?

Member Author


We could, but then we have to keep track of the process state in the monitor too. Right now the state is managed entirely on the keepalived container side, so there's no chance of getting out of sync. Since the command is sent over a local socket, the only meaningful load to come out of it is the pgrep that happens to check if keepalived is already running. I don't think that's going to be significant.

We might be able to optimize the other side even more to reuse the results of the liveness probe instead of looking for the process, if that would make this more acceptable.
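
On the receiving side, the idempotence described above could look roughly like this; the actual implementation lives in the MCO keepalived container and is not part of this PR, so treat the command handling and the pgrep usage as an assumed illustration only:

package main

import (
    "log"
    "os/exec"
)

// keepalivedRunning mirrors the pgrep check mentioned above: pgrep exits 0
// when a matching process exists.
func keepalivedRunning() bool {
    return exec.Command("pgrep", "-x", "keepalived").Run() == nil
}

// handleCommand makes "start" and "stop" no-ops when keepalived is already in
// the requested state, so the monitor can safely resend them every interval.
func handleCommand(cmd string) {
    switch cmd {
    case "start":
        if !keepalivedRunning() {
            log.Println("starting keepalived") // real start logic omitted
        }
    case "stop":
        if keepalivedRunning() {
            log.Println("stopping keepalived") // real stop logic omitted
        }
    }
}

func main() {
    handleCommand("start")
}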

@cybertron cybertron force-pushed the keepalived-error-counter branch from 92caca1 to 2af33f7 on March 31, 2021 20:05
 }
- updateUnicastConfig(kubeconfigPath, &newConfig, appliedConfig)
+ err = updateUnicastConfig(kubeconfigPath, &newConfig, appliedConfig)
+ if err != nil {
Contributor


For the bootstrap case, we don't want to stop Keepalived before the kube-apiservers start running (see https://github.com/openshift/baremetal-runtimecfg/blob/master/pkg/monitor/dynkeepalived.go#L132-#L143). Does this PR cover this case?

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 7, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jul 7, 2021

@cybertron: PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 7, 2021
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 6, 2021
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci
Contributor

openshift-ci bot commented Sep 5, 2021

@openshift-bot: Closed this PR.


In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this Sep 5, 2021
