Bug 2040671: configure-ovs: restart NetworkManager-wait-online #2927
/test e2e-metal-ipi-ovn-dualstack
@zaneb: An error was encountered querying GitHub for users with public email ([email protected]) for bug 2040671 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details. Full error message.
non-200 OK status code: 403 Forbidden body: "{\n \"documentation_url\": \"https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits\",\n \"message\": \"You have exceeded a secondary rate limit. Please wait a few minutes before you try again.\"\n}\n"
Please contact an administrator to resolve this issue, then request a bug refresh.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: zaneb
/assign @trozet
/bugzilla refresh
@zaneb: This pull request references Bugzilla bug 2040671, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.
reload_nm calls nm-online -s -t 60, which is the equivalent of NetworkManager-wait-online.service. I'm wondering if the root cause is that this wait does not cover more than one IP address family unless may-fail is set to no:
By default, connections have the ipv4.may-fail and ipv6.may-fail properties set to yes; this means that NetworkManager waits for one of the two address families to complete configuration before considering the connection activated. If you need a specific address family configured before network-online.target is reached, set the corresponding may-fail property to no.
So unless we set may-fail to no, the connection will pass the wait-online test and proceed as long as it has an IPv4 address. Thoughts?
If that's the case then openshift/cluster-baremetal-operator#239 (+ an equivalent change to the installer) should solve it, by passing ip=dhcp,dhcp6. It doesn't by itself though.
I'm assuming that because we only do activate_nm_conn after calling nm-online, we may not be waiting for those interfaces that weren't active at that time. Then we bring them up and immediately declare success without waiting for DHCP. That's what this is attempting to address.
To date we accidentally passed ip=dhcp, but previously the 5s sleep was doing the trick in CI at least. It's certainly possible that we'd need both patches for it to work.
Wouldn't nmcli conn up (part of activate_nm_conn) take into account if may-fail=no is set on ipv4/ipv6 and not return until they are both up?
I just tested it locally and ^ is correct. If ipv6.may-fail=no and it can't get an IP, it times out and fails to activate the connection. I think we should try setting those attributes if we detect that the original interface had both IPv4 and IPv6 addresses. wdyt?
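A minimal sketch of that detection, assuming the original interface's addresses are read from `ip -o addr show dev <iface>`. The helper name `may_fail_args` is hypothetical and not part of configure-ovs. Only global-scope addresses are counted, so an IPv6 link-local address on its own does not trigger the stricter setting.

```shell
#!/bin/bash
# Hypothetical helper (not from configure-ovs): reads the output of
# `ip -o addr show dev <iface>` on stdin and prints the may-fail
# properties only when both a global IPv4 and a global IPv6 address exist.
may_fail_args() {
  local has_v4=0 has_v6=0 line
  while IFS= read -r line; do
    case "$line" in
      *" inet "*"scope global"*)  has_v4=1 ;;
      *" inet6 "*"scope global"*) has_v6=1 ;;
    esac
  done
  if [ "$has_v4" -eq 1 ] && [ "$has_v6" -eq 1 ]; then
    printf '%s\n' "ipv4.may-fail no ipv6.may-fail no"
  fi
}
```

For example, `ip -o addr show dev enp1s0 | may_fail_args` prints the two properties on a dual-stack interface and nothing otherwise. A non-empty result could then be appended to the `nmcli connection modify` arguments for the new connection.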
NetworkManager-wait-online is repeatedly failing for me, so adding this makes ovs-configuration fail for me locally.
The default behavior for ip=dhcp,dhcp6 is to create a connection that looks like this:
[ipv4]
dhcp-timeout=90
dns-search=
method=auto
required-timeout=20000
[ipv6]
addr-gen-mode=eui64
dhcp-timeout=90
dns-search=
method=auto
Applying this configuration to my local node isn't helping either, so I don't think the CBO patch is going to fix this.
I tried setting may-fail, but it's causing other things to fail. I probably need to try again in a clean environment.
Here's a log from a test with ip=dhcp,dhcp6: https://paste.centos.org/view/86dc24b4
A theory: the interface we're activating does indeed come up:
br-ex:55ed8fd1-0525-416a-8619-ae9520472080:ovs-bridge:1642815247:Sat Jan 22 01\:34\:07 2022:yes:0:no:/org/freedesktop/NetworkManager/Settings/2:yes:br-ex:activated:/org/freedesktop/NetworkManager/ActiveConnection/3::/etc/NetworkManager/system-connections/br-ex.nmconnection
But the default interface (I suspect the only one that the kernel command line applies to) is still activating:
Wired Connection:fe74bdf4-f9f7-484c-b324-d3d221987983:802-3-ethernet:1642815246:Sat Jan 22 01\:34\:06 2022:yes:0:no:/org/freedesktop/NetworkManager/Settings/1:yes:enp1s0:activating:/org/freedesktop/NetworkManager/ActiveConnection/1::/run/NetworkManager/system-connections/default_connection.nmconnection
So we need to wait for all interfaces after bringing up br-ex.
It's a theory.
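One way to act on this theory, sketched under the assumption that polling nmcli from the script is acceptable. Both function names are hypothetical. The parser expects the terse output of `nmcli -t -f NAME,STATE connection show --active` and relies on STATE being the last field on each line.

```shell
#!/bin/bash
# Hypothetical helper: reads "NAME:STATE" lines on stdin and prints the
# names of connections whose state is still "activating".
activating_connections() {
  local line
  while IFS= read -r line; do
    # STATE is the last field; terse mode escapes colons inside NAME as "\:".
    if [ "${line##*:}" = "activating" ]; then
      printf '%s\n' "${line%:*}"
    fi
  done
}

# Hypothetical helper: polls once per second until no active connection is
# still activating, or until the timeout (seconds, default 60) expires.
wait_for_connections() {
  local timeout="${1:-60}" pending=""
  while [ "$timeout" -gt 0 ]; do
    pending=$(nmcli -t -f NAME,STATE connection show --active | activating_connections)
    [ -z "$pending" ] && return 0
    sleep 1
    timeout=$((timeout - 1))
  done
  echo "connections still activating: $pending" >&2
  return 1
}
```

Calling `wait_for_connections 60` after bringing up br-ex would block while a connection such as the "Wired Connection" above is still activating. It only observes connection state and does not check which address families were configured.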
I tried adding extra_brex_args+="ipv4.may-fail no ipv6.may-fail no " to configure-ovs, but for some reason that's causing the ovs ports to fail. o.O
After restarting NetworkManager and bringing up interfaces, also restart NetworkManager-wait-online so that systemd does not declare network-online.target complete at a point where the interfaces are not yet completely online again after the restart (e.g. they are still waiting for addresses from DHCP). Prior to 9cc7ac4 there was an unconditional 5s sleep that resulted in the interfaces usually being up by the time configure-ovs finished, but that was not necessarily robust in the real world. Tests began failing when the sleep was removed, because nodeip-configuration.service runs immediately after network-online.target completes.
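The ordering described above can be sketched as follows. This is an illustration, not the PR's actual diff: the `run` wrapper and the `restart_nm_and_wait` name are hypothetical, and it assumes NetworkManager-wait-online.service is a oneshot unit, so that restarting it re-runs the wait and blocks until it finishes.

```shell
#!/bin/bash
# Hypothetical wrapper: print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical sketch of the sequence described in the PR text.
restart_nm_and_wait() {
  run systemctl restart NetworkManager.service
  # configure-ovs would bring its connections back up at this point
  # (activate_nm_conn in the discussion above); omitted here.
  run systemctl restart NetworkManager-wait-online.service
}
```

Running `DRY_RUN=1 restart_nm_and_wait` prints the two systemctl commands in order without touching the system.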
038f4b1 to e2f4d02
Failed to get the must-gather or logs on that occasion.
@zaneb: The following tests failed, say
Local testing shows this is failing with: Nothing useful in that output:
/close
@zaneb: Closed this PR.
@zaneb: This pull request references Bugzilla bug 2040671. The bug has been updated to no longer refer to the pull request using the external bug tracker.
#2929 looks like a more promising candidate.