Skip to content

Conversation

@celebdor
Copy link
Contributor

When using CNV or other operators that modify how the node is connected
to the network, we may end up in the case where the configured VRRP
interface no longer has an address in the network that it is configured
to hold virtual IPs in.

This patch takes a page from what we do for HAProxy and adds a monitor
side car container that checks keepalived and reloads it when necessary.

Fixes: #1751978
Signed-off-by: Antoni Segura Puimedon antoni@redhat.com

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 18, 2019
@celebdor
Copy link
Contributor Author

Depends on openshift/baremetal-runtimecfg#20 being merged

@celebdor
Copy link
Contributor Author

/assign runcom

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is runtimecfg still used by anyone or it can be dropped from https://github.com/openshift/baremetal-runtimecfg/?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used by mdns-publisher

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a possibility so safe one (!) character by dropping the extra space

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You remove it and start listening to it on the next line? Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case the container is restarted the socket file might be there and socat won't be able to start unless it is gone.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a regression. Previously it would fail if the config does not exist.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now instead of failing it will just wait in socat for a reload when the config is written. What's the issue with that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see

@kikisdeliveryservice kikisdeliveryservice changed the title Fix keepalived dysfunction on vrrp iface change Bug 1751978: templates/baremetal: Fix keepalived dysfunction on vrrp iface change Sep 18, 2019
@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Sep 18, 2019
@openshift-ci-robot
Copy link
Contributor

@celebdor: This pull request references Bugzilla bug 1751978, which is valid. The bug has been moved to the POST state.

Details

In response to this:

Bug 1751978: templates/baremetal: Fix keepalived dysfunction on vrrp iface change

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kikisdeliveryservice
Copy link
Contributor

kikisdeliveryservice commented Sep 18, 2019

since this depends on openshift/baremetal-runtimecfg#20 & response to the reviewer's questions above.

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 18, 2019
@kikisdeliveryservice kikisdeliveryservice requested review from kikisdeliveryservice and removed request for ashcrow September 18, 2019 22:54
@kikisdeliveryservice
Copy link
Contributor

Also updated the PR title, please use this format in the future :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@celebdor could you please elaborate about the logic of the LivenessProbe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely :-)

If keepalived is not running, the pgrep will be empty and the kill command will fail with exit code 1. This means the liveness check will consider it failed.

If keepalived is running, sending SIGUSR1 to the parent keepalived process will make keepalived to write /tmp/keepalived.data containing information about each vrrp group including the state. The state can be MASTER, BACKUP or FAULT. If any of the states are FAULT grep will exit with exit code 1 and the liveness test will fail.

Oops. I just realized I forgot to put put the negative. Will fix it now

Copy link
Contributor

@yboaron yboaron Sep 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@celebdor I think that "$pid" could be empty (the keepalived container starts before monitor container and .conf is missing), should we update something in 'kill -s SIGHUP "$pid"' ?

@celebdor celebdor force-pushed the bz1751978 branch 2 times, most recently from 1773fda to b2cf75e Compare September 19, 2019 10:02
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 19, 2019
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@celebdor, not sure that I (still :-) ) fully understand the liveness probe logic.
So, if some component (e.g: CNV ) changed network interfaces configuration, that may lead to keepalived failure (State=FAULT), and this container will be restarted by Kubelet. IIUC, the problem will be fixed only after monitor container updates .conf file, right?
Could u please explain what added value do we get from this Liveness check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The added value is that if any of the VIP management (ingress, api or dns) gets faulty for any cause, we'll restart keepalived.

@celebdor
Copy link
Contributor Author

Verified with success that the IP was moved to brext and the VIPs did as well. The downtime for reconfiguration of the interfaces was of about 90seconds.

@celebdor
Copy link
Contributor Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 20, 2019
When using CNV or other operators that modify how the node is connected
to the network, we may end up in the case where the configured VRRP
interface no longer has an address in the network that it is configured
to hold virtual IPs in.

This patch takes a page from what we do for HAProxy and adds a monitor
side car container that checks keepalived and reloads it when necessary.

Signed-off-by: Antoni Segura Puimedon <antoni@redhat.com>
Copy link
Member

@runcom runcom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 20, 2019
@runcom
Copy link
Member

runcom commented Sep 20, 2019

/retest

@kikisdeliveryservice
Copy link
Contributor

looks good.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 20, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: celebdor, kikisdeliveryservice, phoracek, runcom, yboaron

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [kikisdeliveryservice,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 1571463 into openshift:master Sep 20, 2019
@openshift-ci-robot
Copy link
Contributor

@celebdor: All pull requests linked via external trackers have merged. Bugzilla bug 1751978 has been moved to the MODIFIED state.

Details

In response to this:

Bug 1751978: templates/baremetal: Fix keepalived dysfunction on vrrp iface change

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mandre added a commit to mandre/machine-config-operator that referenced this pull request Sep 11, 2020
When using CNV or other operators that modify how the node is connected
to the network, we may end up in the case where the configured VRRP
interface no longer has an address in the network that it is configured
to hold virtual IPs in.

This patch takes a page from what we do for HAProxy and adds a monitor
side car container that checks keepalived and reloads it when necessary.

This ports openshift#1124 to OpenStack platform, alongside with fixes from openshift#1508
and openshift#1604.
mandre added a commit to mandre/machine-config-operator that referenced this pull request Sep 21, 2020
When using CNV or other operators that modify how the node is connected
to the network, we may end up in the case where the configured VRRP
interface no longer has an address in the network that it is configured
to hold virtual IPs in.

This patch takes a page from what we do for HAProxy and adds a monitor
side car container that checks keepalived and reloads it when necessary.

This ports openshift#1124 to OpenStack platform, alongside with fixes from openshift#1508
and openshift#1604.
vrutkovs pushed a commit to vrutkovs/machine-config-operator that referenced this pull request Oct 13, 2020
When using CNV or other operators that modify how the node is connected
to the network, we may end up in the case where the configured VRRP
interface no longer has an address in the network that it is configured
to hold virtual IPs in.

This patch takes a page from what we do for HAProxy and adds a monitor
side car container that checks keepalived and reloads it when necessary.

This ports openshift#1124 to OpenStack platform, alongside with fixes from openshift#1508
and openshift#1604.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants