
Conversation

@cybertron (Member)

There are circumstances where keepalived can cause issues with the
networking on a node, notably when bridging a physical interface.
After the address has been moved to the bridge, it is possible for
old routes to exist that cause problems talking to other nodes, which
breaks the apiserver and prevents us from updating the keepalived
config to reflect the networking change. This leaves us in a situation
where the code can't recover properly from the bad configuration.
In short, the apiserver is waiting for keepalived to update its
configuration, but keepalived needs the apiserver in order to do so.

This change addresses the problem by stopping keepalived if the
monitor fails to update the config more than 3 times in a row. That
will unconfigure any VIPs on the node, which should fix the error
described above. Once the bad routes related to the VIP(s) are gone,
the apiserver will recover and we'll be able to update the keepalived
config again. After that happens, keepalived is restarted.

This is one half of the fix. The other half will be in
baremetal-runtimecfg to call the control socket with stop and start
commands as appropriate.
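The monitor-side policy described above (stop keepalived after more than 3 consecutive config-update failures, restart once an update succeeds) can be sketched roughly as follows. This is an illustrative sketch only: `next_action`, its argument convention, and the printed action names are hypothetical, not the actual runtimecfg code.

```shell
#!/bin/bash
# Illustrative sketch of the failure-counter policy described above.
# Given the consecutive-failure count so far and the result of the
# latest config-update attempt ("ok" or "fail"), print the new count
# and the action the monitor should take: "none", "reload", or "stop".
next_action()
{
    local failures=$1 result=$2
    if [ "$result" = "ok" ]; then
        # A successful update resets the counter and (re)starts keepalived.
        echo "0 reload"
    else
        failures=$((failures + 1))
        if [ "$failures" -gt 3 ]; then
            # More than 3 failures in a row: tell keepalived to stop,
            # which unconfigures the VIPs so the apiserver can recover.
            echo "$failures stop"
        else
            echo "$failures none"
        fi
    fi
}
```

For example, `next_action 3 fail` prints `4 stop`, while `next_action 5 ok` prints `0 reload`.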


@openshift-ci-robot openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Sep 14, 2020
@openshift-ci-robot (Contributor)

@cybertron: This pull request references Bugzilla bug 1873955, which is invalid:

  • expected the bug to target the "4.6.0" release, but it targets "4.7.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1873955: Add support for stopping and starting keepalived

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cybertron added a commit to cybertron/baremetal-runtimecfg that referenced this pull request Sep 14, 2020
When we fail repeatedly to update the keepalived configuration, it is
possible that an outdated keepalived config itself is the cause of
the failure. Since we can't update the keepalived config without the
api, the only thing we can do in this scenario is stop keepalived.
That should resolve any keepalived-related networking issues, and
then we'll be able to update the keepalived config using the local
apiserver (in the case that all instances of keepalived go down and
we lose the API VIP).

This depends on openshift/machine-config-operator#2085
to provide the start and stop functionality on the keepalived side.
cybertron added a commit to cybertron/baremetal-runtimecfg that referenced this pull request Sep 14, 2020
@cybertron cybertron force-pushed the keepalived-monitor-errors branch from 4db580d to 3649d33 Compare September 14, 2020 21:24
@cybertron (Member Author)

/test e2e-metal-ipi
/test e2e-upgrade

@cybertron (Member Author)

/test e2e-metal-ipi

@cybertron (Member Author)

/bugzilla refresh
/test e2e-metal-ipi

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Sep 16, 2020
@openshift-ci-robot (Contributor)

@cybertron: This pull request references Bugzilla bug 1873955, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

/bugzilla refresh
/test e2e-metal-ipi


@openshift-ci-robot openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Sep 16, 2020
stop_keepalived()
{
    if pid=$(pgrep -o keepalived); then
        kill "$pid"
Contributor:

Two suggestions:

  • Be specific about the signal we are sending
  • Wait for the graceful termination.

Member Author:

Do you mean add a kill -9 if it doesn't terminate gracefully (which seems like a good thing to do)? Otherwise there isn't much need to wait since we aren't going to do anything else after it exits.

@celebdor (Contributor), Sep 16, 2020:

Yeah, I meant wait a reasonable time for SIGTERM to do its thing (probably we can use the same systemd uses) and then SIGKILL it. Otherwise, couldn't we end up with not starting again due to the pgrep we have in start_keepalived?

Member Author:

Yeah, that's possible. I'll make the change.

Member Author:

Okay, this should be done. I set the timeout to 9 seconds since the monitor runs every 10 and I doubt we want to have multiple stop commands going at once.
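The graceful-stop approach agreed on above (SIGTERM first, a 9-second grace period since the monitor runs every 10 seconds, then SIGKILL) could look roughly like this. This is a sketch, not the PR's exact code: the process name is taken as a parameter here so the logic can be exercised against a dummy process, whereas the PR targets keepalived directly.

```shell
#!/bin/bash
# Sketch of the stop logic discussed above: SIGTERM first, wait up to
# 9 seconds for the process to exit, then SIGKILL as a last resort.
stop_process()
{
    local name=${1:-keepalived} pid
    if pid=$(pgrep -o "$name"); then
        kill -s SIGTERM "$pid"
        # Give the process up to 9 seconds to exit on its own.
        for _ in 1 2 3 4 5 6 7 8 9; do
            kill -0 "$pid" 2>/dev/null || return 0
            sleep 1
        done
        # Still alive after the grace period: force it.
        kill -s SIGKILL "$pid" 2>/dev/null || true
    fi
}
```

The early return once `kill -0` fails also avoids sleeping the full grace period when the process exits promptly, which keeps the stop well inside one monitor interval.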

@bcrochet (Member)

/retest

@yuqi-zhang (Contributor)

/retest

Do we expect metal tests to pass on this one?

@bcrochet (Member) left a comment:

/lgtm
/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 21, 2020
@openshift-ci-robot (Contributor)

openshift-ci-robot commented Sep 21, 2020

@cybertron: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/okd-e2e-aws b71a351d574f1b8ebba081dcfb8da156146c88ab link /test okd-e2e-aws

Full PR test history. Your PR dashboard.



set -ex
declare -r keepalived_sock="/var/run/keepalived/keepalived.sock"
export -f msg_handler
export -f reload_keepalived
export -f stop_keepalived
export -f start_keepalived
if [ -s "/etc/keepalived/keepalived.conf" ]; then
Contributor:

Do we also want to cover the following corner case?

  1. The monitor container sends a STOP request
    1.1 The keepalived container stops the keepalived process
  2. Some time later, kubelet restarts the keepalived container for some reason

Member Author:

I'm inclined to say no. If the whole keepalived container gets restarted, I think we should let it attempt to run normally. If the monitor continues to fail it will stop it again anyway.
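The exports in the snippet above suggest a per-message handler fed by some socket listener (e.g. socat on the control socket). A minimal sketch of such a dispatcher follows; the `reload`/`stop`/`start` command names come from the PR description, but the handler body and the stub actions are illustrative placeholders, not the actual code.

```shell
#!/bin/bash
# Minimal sketch of a control-socket message handler: read one command
# from stdin (as a socket listener would deliver it) and dispatch it.
# The *_keepalived stubs stand in for the real exported functions.
reload_keepalived() { echo "reloaded"; }
stop_keepalived()   { echo "stopped"; }
start_keepalived()  { echo "started"; }

msg_handler()
{
    local cmd
    read -r cmd
    case "$cmd" in
        reload) reload_keepalived ;;
        stop)   stop_keepalived ;;
        start)  start_keepalived ;;
        *)      echo "unknown command: $cmd" >&2 ;;
    esac
}
```

With this shape, `echo stop | msg_handler` prints `stopped`, and unknown commands are reported on stderr rather than silently dropped.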

if pid=$(pgrep -o keepalived); then
    kill -s SIGTERM "$pid"
    # The monitor runs every 10 seconds
    sleep 9
Contributor:

Why do we need the sleep here?
Is it because the keepalived container processes the messages in parallel?

Member Author:

It's to give the process time to exit before we kill -9 it below.

@yboaron (Contributor)

yboaron commented Sep 23, 2020

/hold

This BZ was already resolved by the VIP mask change; do we still want to push the workers/masters keepalived start/stop mechanism for 4.6?

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 23, 2020
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 25, 2020
cybertron added a commit to cybertron/baremetal-runtimecfg that referenced this pull request Oct 21, 2020
@cybertron cybertron force-pushed the keepalived-monitor-errors branch from b71a351 to 6a90694 Compare October 21, 2020 17:20
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 21, 2020
@openshift-ci-robot (Contributor)

New changes are detected. LGTM label has been removed.

@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 21, 2020
cybertron added a commit to cybertron/baremetal-runtimecfg that referenced this pull request Nov 5, 2020
@cybertron cybertron force-pushed the keepalived-monitor-errors branch from 6a90694 to 53097b9 Compare November 5, 2020 20:07
@cybertron (Member Author)

Sorry, somehow I missed Yossi's comments on this before. I've rebased it and tested it locally and it still seems to be working fine.

@cybertron cybertron force-pushed the keepalived-monitor-errors branch from 53097b9 to 084bcf0 Compare November 12, 2020 22:45
cybertron added a commit to cybertron/baremetal-runtimecfg that referenced this pull request Nov 12, 2020
@openshift-merge-robot (Contributor)

@cybertron: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-gcp-op 084bcf02f435405453e7a1048ec215855b1811f7 link /test e2e-gcp-op
ci/prow/e2e-agnostic-upgrade 084bcf02f435405453e7a1048ec215855b1811f7 link /test e2e-agnostic-upgrade
ci/prow/e2e-aws-serial 084bcf02f435405453e7a1048ec215855b1811f7 link /test e2e-aws-serial
ci/prow/okd-e2e-aws 084bcf02f435405453e7a1048ec215855b1811f7 link /test okd-e2e-aws
ci/prow/e2e-aws-workers-rhel7 084bcf02f435405453e7a1048ec215855b1811f7 link /test e2e-aws-workers-rhel7
ci/prow/e2e-aws 084bcf02f435405453e7a1048ec215855b1811f7 link /test e2e-aws

Full PR test history. Your PR dashboard.



@cybertron cybertron force-pushed the keepalived-monitor-errors branch from 084bcf0 to 488c329 Compare January 21, 2021 23:21
@cybertron cybertron force-pushed the keepalived-monitor-errors branch from 488c329 to d1698a5 Compare March 31, 2021 20:04
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bcrochet, cybertron
To complete the pull request process, please assign sinnykumari after the PR has been reviewed.
You can assign the PR to them by writing /assign @sinnykumari in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cybertron added a commit to cybertron/baremetal-runtimecfg that referenced this pull request Mar 31, 2021
@openshift-ci (Contributor)

openshift-ci bot commented May 21, 2021

@cybertron: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-workers-rhel7 d1698a5 link /test e2e-aws-workers-rhel7
ci/prow/e2e-ovn-step-registry d1698a5 link /test e2e-ovn-step-registry
ci/prow/okd-e2e-aws d1698a5 link /test okd-e2e-aws
ci/prow/e2e-metal-ipi d1698a5 link /test e2e-metal-ipi
ci/prow/e2e-vsphere-upgrade d1698a5 link /test e2e-vsphere-upgrade
ci/prow/images d1698a5 link /test images
ci/prow/e2e-aws-serial d1698a5 link /test e2e-aws-serial
ci/prow/e2e-agnostic-upgrade d1698a5 link /test e2e-agnostic-upgrade

Full PR test history. Your PR dashboard.



@kikisdeliveryservice (Contributor)

Underlying BZ was closed by openshift/baremetal-runtimecfg#100

Closing this PR, reopen if necessary
