Bug 1751978: templates/baremetal: Fix keepalived dysfunction on vrrp iface change #1124
Depends on openshift/baremetal-runtimecfg#20 being merged |
|
/assign runcom |
Is runtimecfg still used by anyone, or can it be dropped from https://github.com/openshift/baremetal-runtimecfg/?
It is used by mdns-publisher.
There is a possibility to save one (!) character by dropping the extra space.
Done
You remove the socket file and then start listening on it on the next line? Why?
If the container is restarted, the socket file from the previous run might still be there, and socat won't be able to start unless it is gone.
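For readers following along, here is a minimal sketch of that remove-then-listen pattern. It is not the template code from this PR: the function names, the socket path argument and the handler command are all hypothetical placeholders.

```shell
#!/bin/sh
# Sketch only: names and arguments are illustrative, not taken from the PR.

# Remove a socket file left behind by a previous container run.
# rm -f succeeds even when the file does not exist.
clear_stale_socket() {
    rm -f "$1"
}

# Hypothetical listener: $1 is the socket path, $2 is the command that
# handles each incoming connection.
listen_for_messages() {
    sock="$1"
    handler="$2"
    clear_stale_socket "$sock"
    # UNIX-LISTEN cannot bind to a path that already exists, which is why
    # the stale file is removed first. "fork" serves each connection in a
    # child process so the listener keeps running.
    socat "UNIX-LISTEN:${sock},fork" "SYSTEM:${handler}"
}
```

Because the removal is idempotent, the same start-up path works for both a fresh container and a restarted one.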
Ack
There is a regression. Previously it would fail if the config did not exist.
Now, instead of failing, it will just wait in socat for a reload message, which arrives once the config is written. What's the issue with that?
I see
|
@celebdor: This pull request references Bugzilla bug 1751978, which is valid. The bug has been moved to the POST state. |
|
Holding, since this depends on openshift/baremetal-runtimecfg#20 and on responses to the reviewer's questions above. /hold |
|
Also updated the PR title, please use this format in the future :) |
I see
Ack
@celebdor could you please elaborate on the logic of the LivenessProbe?
Absolutely :-)
If keepalived is not running, the pgrep output will be empty and the kill command will fail with exit code 1. This means the liveness check will consider it failed.
If keepalived is running, sending SIGUSR1 to the parent keepalived process makes keepalived write /tmp/keepalived.data, which contains information about each VRRP group, including its state. The state can be MASTER, BACKUP or FAULT. If any of the states is FAULT, the negated grep will exit with exit code 1 and the liveness test will fail.
Oops. I just realized I forgot to put the negation in. Will fix it now.
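To make the logic above concrete, here is a rough sketch of such a probe, including the negation. The function names are hypothetical, and the exact `State = FAULT` line format of the dump is an assumption rather than something stated in this thread.

```shell
#!/bin/sh
# Sketch only: function names and the "State = ..." format are assumptions.

# Exit 0 when the dump contains no FAULT state, non-zero otherwise.
vrrp_state_ok() {
    data_file="${1:-/tmp/keepalived.data}"
    # grep -q exits 0 when it finds a FAULT line; "!" inverts that, so a
    # FAULT makes the check fail. Dropping the "!" would make it pass.
    ! grep -q "State = FAULT" "$data_file"
}

# Hypothetical probe entry point.
keepalived_liveness() {
    # pgrep -o selects the oldest matching process (the parent) and exits
    # non-zero when nothing matches, so a missing keepalived fails here.
    pid="$(pgrep -o keepalived)" || return 1
    # SIGUSR1 asks keepalived to dump its data.
    kill -s USR1 "$pid" || return 1
    vrrp_state_ok "$1"
}
```

Two caveats with this sketch: the dump is written asynchronously, so the grep may read the file from the previous probe run, and a missing dump file makes grep exit with 2, which the negation turns into success.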
@celebdor I think that "$pid" could be empty (the keepalived container starts before the monitor container, so the .conf is missing). Should we update something in 'kill -s SIGHUP "$pid"'?
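One possible way to guard against the empty "$pid" case is sketched below. The helper name is hypothetical, and whether an empty pid should count as a failure or be silently skipped is a design choice this thread leaves open; the sketch returns non-zero.

```shell
#!/bin/sh
# Sketch only: reload_if_running is an illustrative name, not PR code.

# Send SIGHUP (reload) to the given pid, refusing to call kill with an
# empty argument.
reload_if_running() {
    pid="$1"
    if [ -z "$pid" ]; then
        echo "keepalived is not running yet; skipping reload" >&2
        return 1
    fi
    # "HUP" without the SIG prefix is the portable spelling for kill -s.
    kill -s HUP "$pid"
}
```

A caller could then use `reload_if_running "$(pgrep -o keepalived)"` and decide for itself how to treat the non-zero return.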
1773fda to b2cf75e
@celebdor, I'm not sure that I (still :-) ) fully understand the liveness probe logic.
So, if some component (e.g. CNV) changes the network interface configuration, that may lead to a keepalived failure (State=FAULT), and this container will be restarted by the kubelet. IIUC, the problem will be fixed only after the monitor container updates the .conf file, right?
Could you please explain what added value we get from this liveness check?
The added value is that if any of the managed VIPs (ingress, API or DNS) becomes faulty for any reason, we'll restart keepalived.
|
Verified successfully that the IP was moved to brext and that the VIPs moved as well. The downtime for reconfiguring the interfaces was about 90 seconds. |
|
/hold cancel |
When using CNV or other operators that modify how the node is connected to the network, we may end up in the case where the configured VRRP interface no longer has an address in the network that it is configured to hold virtual IPs in. This patch takes a page from what we do for HAProxy and adds a monitor side car container that checks keepalived and reloads it when necessary. Signed-off-by: Antoni Segura Puimedon <antoni@redhat.com>
runcom left a comment
/approve
|
/retest |
|
looks good. /lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: celebdor, kikisdeliveryservice, phoracek, runcom, yboaron. |
|
@celebdor: All pull requests linked via external trackers have merged. Bugzilla bug 1751978 has been moved to the MODIFIED state. |
When using CNV or other operators that modify how the node is connected to the network, we may end up in the case where the configured VRRP interface no longer has an address in the network that it is configured to hold virtual IPs in. This patch takes a page from what we do for HAProxy and adds a monitor side car container that checks keepalived and reloads it when necessary. This ports openshift#1124 to OpenStack platform, alongside with fixes from openshift#1508 and openshift#1604.
When using CNV or other operators that modify how the node is connected
to the network, we may end up in the case where the configured VRRP
interface no longer has an address in the network that it is configured
to hold virtual IPs in.
This patch takes a page from what we do for HAProxy and adds a monitor
side car container that checks keepalived and reloads it when necessary.
Fixes: #1751978
Signed-off-by: Antoni Segura Puimedon <antoni@redhat.com>