
Conversation

@cybertron (Member)

There are circumstances where keepalived can cause issues with the
networking on a node, notably when bridging a physical interface.
After the address has been moved to the bridge, it is possible for
old routes to exist that cause problems talking to other nodes, which
breaks the apiserver and prevents us from updating the keepalived
config to reflect the networking change. This leaves us in a situation
where the code can't recover properly from the bad configuration.
In short, the apiserver is waiting for keepalived to update its
configuration, but keepalived needs the apiserver in order to do so.

This change addresses the problem by stopping keepalived if the
monitor fails to update the config more than 3 times in a row. That
will unconfigure any VIPs on the node, which should fix the error
described above. Once the bad routes related to the VIP(s) are gone,
the apiserver will recover and we'll be able to update the keepalived
config again. After that happens, keepalived is restarted.

This is one half of the fix. The other half will be in
baremetal-runtimecfg to call the control socket with stop and start
commands as appropriate.
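The monitor-side policy described above (stop keepalived after more than 3 consecutive config-update failures, restart once an update succeeds) can be sketched roughly as follows. This is an illustrative sketch only: `next_action`, its argument convention, and the printed action names are hypothetical, not the actual runtimecfg code.

```shell
#!/bin/bash
# Illustrative sketch of the failure-counter policy described above.
# Given the consecutive-failure count so far and the result of the
# latest config-update attempt ("ok" or "fail"), print the new count
# and the action the monitor should take: "none", "reload", or "stop".
next_action()
{
    local failures=$1 result=$2
    if [ "$result" = "ok" ]; then
        # A successful update resets the counter and (re)starts keepalived.
        echo "0 reload"
    else
        failures=$((failures + 1))
        if [ "$failures" -gt 3 ]; then
            # More than 3 failures in a row: tell keepalived to stop,
            # which unconfigures the VIPs so the apiserver can recover.
            echo "$failures stop"
        else
            echo "$failures none"
        fi
    fi
}
```

For example, `next_action 3 fail` prints `4 stop`, while `next_action 5 ok` prints `0 reload`.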


@openshift-ci-robot openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Sep 14, 2020
@openshift-ci-robot (Contributor)

@cybertron: This pull request references Bugzilla bug 1873955, which is invalid:

  • expected the bug to target the "4.6.0" release, but it targets "4.7.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1873955: Add support for stopping and starting keepalived

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cybertron added a commit to cybertron/baremetal-runtimecfg that referenced this pull request Sep 14, 2020
When we fail repeatedly to update the keepalived configuration, it is
possible that an outdated keepalived config itself is the cause of
the failure. Since we can't update the keepalived config without the
api, the only thing we can do in this scenario is stop keepalived.
That should resolve any keepalived-related networking issues, and
then we'll be able to update the keepalived config using the local
apiserver (in the case that all instances of keepalived go down and
we lose the API VIP).

This depends on openshift/machine-config-operator#2085
to provide the start and stop functionality on the keepalived side.
cybertron added a commit to cybertron/baremetal-runtimecfg that referenced this pull request Sep 14, 2020
@cybertron cybertron force-pushed the keepalived-monitor-errors branch from 4db580d to 3649d33 Compare September 14, 2020 21:24
@cybertron (Member Author)

/test e2e-metal-ipi
/test e2e-upgrade

@cybertron (Member Author)

/test e2e-metal-ipi

@cybertron (Member Author)

/bugzilla refresh
/test e2e-metal-ipi

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Sep 16, 2020
@openshift-ci-robot (Contributor)

@cybertron: This pull request references Bugzilla bug 1873955, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

/bugzilla refresh
/test e2e-metal-ipi


@openshift-ci-robot openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Sep 16, 2020
stop_keepalived()
{
    if pid=$(pgrep -o keepalived); then
        kill "$pid"
Contributor:

Two suggestions:

  • Be specific about the signal we are sending
  • Wait for the graceful termination.

Member Author:

Do you mean add a kill -9 if it doesn't terminate gracefully (which seems like a good thing to do)? Otherwise there isn't much need to wait since we aren't going to do anything else after it exits.

@celebdor (Contributor), Sep 16, 2020:

Yeah, I meant wait a reasonable time for SIGTERM to do its thing (probably we can use the same systemd uses) and then SIGKILL it. Otherwise, couldn't we end up with not starting again due to the pgrep we have in start_keepalived?

Member Author:

Yeah, that's possible. I'll make the change.

Member Author:

Okay, this should be done. I set the timeout to 9 seconds since the monitor runs every 10 and I doubt we want to have multiple stop commands going at once.
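The graceful-stop approach agreed on above (SIGTERM first, a 9-second grace period since the monitor runs every 10 seconds, then SIGKILL) could look roughly like this. This is a sketch, not the PR's exact code: the process name is taken as a parameter here so the logic can be exercised against a dummy process, whereas the PR targets keepalived directly.

```shell
#!/bin/bash
# Sketch of the stop logic discussed above: SIGTERM first, wait up to
# 9 seconds for the process to exit, then SIGKILL as a last resort.
stop_process()
{
    local name=${1:-keepalived} pid
    if pid=$(pgrep -o "$name"); then
        kill -s SIGTERM "$pid"
        # Give the process up to 9 seconds to exit on its own.
        for _ in 1 2 3 4 5 6 7 8 9; do
            kill -0 "$pid" 2>/dev/null || return 0
            sleep 1
        done
        # Still alive after the grace period: force it.
        kill -s SIGKILL "$pid" 2>/dev/null || true
    fi
}
```

The early return once `kill -0` fails also avoids sleeping the full grace period when the process exits promptly, which keeps the stop well inside one monitor interval.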

@bcrochet (Member)

/retest

@yuqi-zhang (Contributor)

/retest

Do we expect metal tests to pass on this one?

@bcrochet (Member) left a comment:

/lgtm
/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 21, 2020
@openshift-ci-robot (Contributor)

openshift-ci-robot commented Sep 21, 2020

@cybertron: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/okd-e2e-aws b71a351d574f1b8ebba081dcfb8da156146c88ab link /test okd-e2e-aws

Full PR test history. Your PR dashboard.



set -ex
declare -r keepalived_sock="/var/run/keepalived/keepalived.sock"
export -f msg_handler
export -f reload_keepalived
export -f stop_keepalived
export -f start_keepalived
if [ -s "/etc/keepalived/keepalived.conf" ]; then
Contributor:

Do we also want to cover the following corner case?

  1. The monitor container sends a STOP request
    1.1 The keepalived container stops the keepalived process
  2. Some time later, kubelet restarts the keepalived container for some reason

Member Author:

I'm inclined to say no. If the whole keepalived container gets restarted, I think we should let it attempt to run normally. If the monitor continues to fail it will stop it again anyway.
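The exports in the snippet above suggest a per-message handler fed by some socket listener (e.g. socat on the control socket). A minimal sketch of such a dispatcher follows; the `reload`/`stop`/`start` command names come from the PR description, but the handler body and the stub actions are illustrative placeholders, not the actual code.

```shell
#!/bin/bash
# Minimal sketch of a control-socket message handler: read one command
# from stdin (as a socket listener would deliver it) and dispatch it.
# The *_keepalived stubs stand in for the real exported functions.
reload_keepalived() { echo "reloaded"; }
stop_keepalived()   { echo "stopped"; }
start_keepalived()  { echo "started"; }

msg_handler()
{
    local cmd
    read -r cmd
    case "$cmd" in
        reload) reload_keepalived ;;
        stop)   stop_keepalived ;;
        start)  start_keepalived ;;
        *)      echo "unknown command: $cmd" >&2 ;;
    esac
}
```

With this shape, `echo stop | msg_handler` prints `stopped`, and unknown commands are reported on stderr rather than silently dropped.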

if pid=$(pgrep -o keepalived); then
    kill -s SIGTERM "$pid"
    # The monitor runs every 10 seconds
    sleep 9
Contributor:

Why do we need the sleep here?
Is it because the keepalived container processes the messages in parallel?

Member Author:

It's to give the process time to exit before we kill -9 it below.

@yboaron (Contributor)

yboaron commented Sep 23, 2020

/hold

This BZ was already resolved by the VIP mask change; do we still want to push the workers/masters keepalived start/stop mechanism for 4.6?

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 23, 2020
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 25, 2020
cybertron added a commit to cybertron/baremetal-runtimecfg that referenced this pull request Oct 21, 2020
@cybertron cybertron force-pushed the keepalived-monitor-errors branch from b71a351 to 6a90694 Compare October 21, 2020 17:20
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 21, 2020
@openshift-ci-robot (Contributor)

New changes are detected. LGTM label has been removed.

@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 21, 2020
cybertron added a commit to cybertron/baremetal-runtimecfg that referenced this pull request Nov 5, 2020
@cybertron cybertron force-pushed the keepalived-monitor-errors branch from 6a90694 to 53097b9 Compare November 5, 2020 20:07
@cybertron (Member Author)

Sorry, somehow I missed Yossi's comments on this before. I've rebased it and tested it locally and it still seems to be working fine.

@cybertron cybertron force-pushed the keepalived-monitor-errors branch from 53097b9 to 084bcf0 Compare November 12, 2020 22:45
cybertron added a commit to cybertron/baremetal-runtimecfg that referenced this pull request Nov 12, 2020
@openshift-merge-robot (Contributor)

@cybertron: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-gcp-op 084bcf02f435405453e7a1048ec215855b1811f7 link /test e2e-gcp-op
ci/prow/e2e-agnostic-upgrade 084bcf02f435405453e7a1048ec215855b1811f7 link /test e2e-agnostic-upgrade
ci/prow/e2e-aws-serial 084bcf02f435405453e7a1048ec215855b1811f7 link /test e2e-aws-serial
ci/prow/okd-e2e-aws 084bcf02f435405453e7a1048ec215855b1811f7 link /test okd-e2e-aws
ci/prow/e2e-aws-workers-rhel7 084bcf02f435405453e7a1048ec215855b1811f7 link /test e2e-aws-workers-rhel7
ci/prow/e2e-aws 084bcf02f435405453e7a1048ec215855b1811f7 link /test e2e-aws

Full PR test history. Your PR dashboard.



@cybertron cybertron force-pushed the keepalived-monitor-errors branch from 084bcf0 to 488c329 Compare January 21, 2021 23:21
@cybertron cybertron force-pushed the keepalived-monitor-errors branch from 488c329 to d1698a5 Compare March 31, 2021 20:04
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bcrochet, cybertron
To complete the pull request process, please assign sinnykumari after the PR has been reviewed.
You can assign the PR to them by writing /assign @sinnykumari in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cybertron added a commit to cybertron/baremetal-runtimecfg that referenced this pull request Mar 31, 2021
@openshift-ci (Contributor)

openshift-ci bot commented May 21, 2021

@cybertron: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-workers-rhel7 d1698a5 link /test e2e-aws-workers-rhel7
ci/prow/e2e-ovn-step-registry d1698a5 link /test e2e-ovn-step-registry
ci/prow/okd-e2e-aws d1698a5 link /test okd-e2e-aws
ci/prow/e2e-metal-ipi d1698a5 link /test e2e-metal-ipi
ci/prow/e2e-vsphere-upgrade d1698a5 link /test e2e-vsphere-upgrade
ci/prow/images d1698a5 link /test images
ci/prow/e2e-aws-serial d1698a5 link /test e2e-aws-serial
ci/prow/e2e-agnostic-upgrade d1698a5 link /test e2e-agnostic-upgrade

Full PR test history. Your PR dashboard.



@kikisdeliveryservice (Contributor)

Underlying BZ was closed by openshift/baremetal-runtimecfg#100

Closing this PR, reopen if necessary
