@zaneb
Member

@zaneb zaneb commented Jan 24, 2022

After restarting NetworkManager and bringing up interfaces, also restart
NetworkManager-wait-online so that systemd does not declare
network-online.target complete at a point where the interfaces are not
yet completely online again after the restart (e.g. they are still
waiting for addresses from DHCP).

Prior to 9cc7ac4 there was an
unconditional 5s sleep that resulted in the interfaces usually being up
by the time configure-ovs finished, but that was not necessarily robust
in the real world. Tests began failing when the sleep was removed,
because nodeip-configuration.service runs immediately after
network-online.target completes.

@zaneb
Member Author

zaneb commented Jan 24, 2022

/test e2e-metal-ipi-ovn-dualstack

@openshift-ci
Contributor

openshift-ci bot commented Jan 24, 2022

@zaneb: An error was encountered querying GitHub for users with public email ([email protected]) for bug 2040671 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details.

Full error message. non-200 OK status code: 403 Forbidden body: "{\n \"documentation_url\": \"https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits\",\n \"message\": \"You have exceeded a secondary rate limit. Please wait a few minutes before you try again.\"\n}\n"

Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.

In response to this:

Bug 2040671: configure-ovs: restart NetworkManager-wait-online

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Contributor

openshift-ci bot commented Jan 24, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zaneb
To complete the pull request process, please assign yuqi-zhang after the PR has been reviewed.
You can assign the PR to them by writing /assign @yuqi-zhang in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot requested review from jkyros and sinnykumari January 24, 2022 17:47
@kikisdeliveryservice
Contributor

/assign @trozet

@zaneb
Member Author

zaneb commented Jan 24, 2022

/bugzilla refresh

@openshift-ci openshift-ci bot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jan 24, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jan 24, 2022

@zaneb: This pull request references Bugzilla bug 2040671, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.0) matches configured target release for branch (4.10.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.

In response to this:

/bugzilla refresh


Contributor

reload_nm already runs nm-online -s -t 60, which is the equivalent of NetworkManager-wait-online.service. I'm wondering if the root cause is that this wait does not cover more than one IP address family unless may-fail is set to no:

By default, connections have the ipv4.may-fail and ipv6.may-fail properties set to yes; this means that NetworkManager waits for one of the two address families to complete configuration before considering the connection activated. If you need a specific address family configured before network-online.target is reached, set the corresponding may-fail property to no.

So unless we set may-fail to no, the connection will pass the wait-online test and proceed as soon as it has an ipv4 addr. Thoughts?

Member Author

If that's the case then openshift/cluster-baremetal-operator#239 (+ an equivalent change to the installer) should solve it, by passing ip=dhcp,dhcp6. It doesn't by itself though.

I'm assuming that because we only do activate_nm_conn after calling nm-online, we may not be waiting for those interfaces that weren't active at that time. Then we bring them up and immediately declare success without waiting for DHCP. That's what this is attempting to address.

To date we accidentally passed ip=dhcp, but previously the 5s sleep was doing the trick in CI at least. It's certainly possible that we'd need both patches for it to work.

Contributor

Wouldn't nmcli conn up (part of activate_nm_conn) take into account if may-fail=no is set on ipv4/ipv6 and not return until they are both up?

Contributor

I just tested it locally and ^ is correct. If ipv6.may-fail=no is set and the connection can't get an IP, it times out and fails to activate. I think we should try setting those attributes if we detect that the original interface had ipv4 and ipv6 addresses. wdyt?
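
A rough sketch of that idea, for illustration only. The helper name may_fail_args is hypothetical and not part of configure-ovs; ipv4.may-fail and ipv6.may-fail are the real NetworkManager properties quoted above. The caller is assumed to have already worked out which address families the original interface had.

```shell
# Hypothetical helper (not from configure-ovs): print the nmcli properties
# that make an address family mandatory for connection activation.
# $1 = "yes" if the original interface had an IPv4 address
# $2 = "yes" if the original interface had an IPv6 address
may_fail_args() {
  args=""
  if [ "$1" = "yes" ]; then
    args="ipv4.may-fail no"
  fi
  if [ "$2" = "yes" ]; then
    # Add a separating space only when IPv4 arguments are already present.
    args="${args:+$args }ipv6.may-fail no"
  fi
  printf '%s\n' "$args"
}
```

The output is meant to be appended, unquoted so that it splits into words, to an nmcli connection add or nmcli connection modify command line. Whether this plays well with the OVS ports would still need to be tested.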

Member

NetworkManager-wait-online is repeatedly failing for me, so adding this is making ovs-configuration fail for me locally.

The default behavior for ip=dhcp,dhcp6 is to create a connection that looks like this:

[ipv4]
dhcp-timeout=90
dns-search=
method=auto
required-timeout=20000

[ipv6]
addr-gen-mode=eui64
dhcp-timeout=90
dns-search=
method=auto

Applying this configuration to my local node isn't helping either, so I don't think the CBO patch is going to fix this.

I tried setting may-fail, but it's causing other things to fail. I probably need to try again in a clean environment.

Member Author

Here's a log from a test with ip=dhcp,dhcp6: https://paste.centos.org/view/86dc24b4
A theory: the interface we're activating does indeed come up:

br-ex:55ed8fd1-0525-416a-8619-ae9520472080:ovs-bridge:1642815247:Sat Jan 22 01\:34\:07 2022:yes:0:no:/org/freedesktop/NetworkManager/Settings/2:yes:br-ex:activated:/org/freedesktop/NetworkManager/ActiveConnection/3::/etc/NetworkManager/system-connections/br-ex.nmconnection

But the default interface (I suspect the only one that the kernel command line applies to) is still activating:

Wired Connection:fe74bdf4-f9f7-484c-b324-d3d221987983:802-3-ethernet:1642815246:Sat Jan 22 01\:34\:06 2022:yes:0:no:/org/freedesktop/NetworkManager/Settings/1:yes:enp1s0:activating:/org/freedesktop/NetworkManager/ActiveConnection/1::/run/NetworkManager/system-connections/default_connection.nmconnection

So we need to wait for all interfaces after bringing up br-ex.

It's a theory.
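
One way to check that theory from a script, sketched under assumptions: the function name pending_connections is hypothetical, and the input is assumed to be the two-field terse output of nmcli -t -f NAME,STATE connection show --active (the same NAME and STATE fields that appear in the full log lines above).

```shell
# Hypothetical helper: read "NAME:STATE" lines on stdin and print the name of
# every connection whose state is still "activating".
# In nmcli terse output, colons inside a value are escaped as "\:", and the
# state value itself contains no colon, so the state is everything after the
# last ":" on the line.
pending_connections() {
  while IFS= read -r line; do
    state=${line##*:}   # text after the last colon
    name=${line%:*}     # text before the last colon (escapes left as-is)
    if [ "$state" = "activating" ]; then
      printf '%s\n' "$name"
    fi
  done
}
```

Piping the nmcli output into pending_connections after bringing up br-ex and getting an empty result would suggest every active connection has finished activating; a non-empty result names the connections still in progress.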

Member

I tried adding extra_brex_args+="ipv4.may-fail no ipv6.may-fail no " to configure-ovs, but for some reason that's causing the ovs ports to fail. o.O

@zaneb zaneb force-pushed the nm-wait-online-restart branch from 038f4b1 to e2f4d02 Compare January 24, 2022 21:31
@zaneb
Member Author

zaneb commented Jan 24, 2022

try-restart didn't seem to have any effect because the NetworkManager-wait-online.service was in a failed state. Trying with just restart.
/test e2e-metal-ipi-ovn-dualstack

@zaneb
Member Author

zaneb commented Jan 25, 2022

Failed to get the must-gather or logs on that occasion.
/test e2e-metal-ipi-ovn-dualstack

@openshift-ci
Contributor

openshift-ci bot commented Jan 25, 2022

@zaneb: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                             Commit   Details  Required  Rerun command
ci/prow/e2e-aws-upgrade-single-node   e2f4d02  link     false     /test e2e-aws-upgrade-single-node
ci/prow/e2e-aws-disruptive            e2f4d02  link     false     /test e2e-aws-disruptive
ci/prow/e2e-aws-single-node           e2f4d02  link     false     /test e2e-aws-single-node
ci/prow/e2e-metal-ipi-ovn-dualstack   e2f4d02  link     false     /test e2e-metal-ipi-ovn-dualstack
ci/prow/e2e-vsphere-upgrade           e2f4d02  link     false     /test e2e-vsphere-upgrade
ci/prow/e2e-aws-workers-rhel8         e2f4d02  link     false     /test e2e-aws-workers-rhel8
ci/prow/e2e-aws-workers-rhel7         e2f4d02  link     false     /test e2e-aws-workers-rhel7
ci/prow/okd-e2e-aws                   e2f4d02  link     false     /test okd-e2e-aws
ci/prow/e2e-aws                       e2f4d02  link     true      /test e2e-aws

Full PR test history. Your PR dashboard.


@zaneb
Member Author

zaneb commented Jan 25, 2022

Local testing shows this is failing with:

Jan 25 02:48:52 master-0.ostest.test.metalkube.org configure-ovs.sh[1583]: + systemctl restart NetworkManager-wait-online.service
Jan 25 02:49:52 master-0.ostest.test.metalkube.org configure-ovs.sh[1583]: Job for NetworkManager-wait-online.service failed because the control process exited with error code.
Jan 25 02:49:52 master-0.ostest.test.metalkube.org configure-ovs.sh[1583]: See "systemctl status NetworkManager-wait-online.service" and "journalctl -xe" for details.

Nothing useful in that output:

● NetworkManager-wait-online.service - Network Manager Wait Online
   Loaded: loaded (/usr/lib/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2022-01-25 02:49:52 UTC; 14h ago
     Docs: man:nm-online(1)
 Main PID: 3563 (code=exited, status=1/FAILURE)
      CPU: 27ms

Jan 25 02:48:52 master-0.ostest.test.metalkube.org systemd[1]: Starting Network Manager Wait Online...
Jan 25 02:49:52 master-0.ostest.test.metalkube.org systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Jan 25 02:49:52 master-0.ostest.test.metalkube.org systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Jan 25 02:49:52 master-0.ostest.test.metalkube.org systemd[1]: Failed to start Network Manager Wait Online.
Jan 25 02:49:52 master-0.ostest.test.metalkube.org systemd[1]: NetworkManager-wait-online.service: Consumed 27ms CPU time

@zaneb
Member Author

zaneb commented Jan 25, 2022

/close

@openshift-ci openshift-ci bot closed this Jan 25, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jan 25, 2022

@zaneb: Closed this PR.

In response to this:

/close


@openshift-ci
Contributor

openshift-ci bot commented Jan 25, 2022

@zaneb: This pull request references Bugzilla bug 2040671. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

Bug 2040671: configure-ovs: restart NetworkManager-wait-online


@zaneb
Member Author

zaneb commented Jan 25, 2022

#2929 looks like a more promising candidate.
