
Conversation


@squeed squeed commented Jun 16, 2020

Needed to remove some of the readiness probes because they don't work, but this doesn't actually lose much coverage.

And liveness probes are not useful in this context.

This contains a hack to pass a strange case that is an artifact of the CI upgrade tests. It won't happen in the wild, and the hack could be removed later.

Note to reviewers: hide whitespace changes :-)

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 16, 2020
@squeed squeed changed the title openshift-sdn: playing around with host-openvswitch openshift-sdn: quick change to adopt host-level openvswitch Jun 16, 2020

squeed commented Jun 16, 2020

/cc @mccv1r0 @dcbw

@squeed squeed changed the title openshift-sdn: quick change to adopt host-level openvswitch openshift-sdn, ovn-kubernetes: adopt systemd-managed openvswitch if present Jun 16, 2020
echo "openvswitch is running in systemd"
# Don't need to worry about restoring flows; this can only change if we've rebooted
exec tail -F /var/log/openvswitch-host/ovs-vswitchd.log /var/log/openvswitch-host/ovsdb-server.log
fi
Contributor

Do you mean /var/log/openvswitch instead of openvswitch-host?

Is the goal to block here, and never return?

Contributor Author

No, this is correct. /var/log/openvswitch is not a host directory.
The goal is to block here. We need to keep a process running so kubelet doesn't think the pod has died.

Contributor

No, this is correct. /var/log/openvswitch is not a host directory.

Then shouldn't e.g.

tail -F --pid=$(cat /var/run/openvswitch/ovs-vswitchd.pid) /var/log/openvswitch/ovs-vswitchd.log &

below also use /var/log/openvswitch-host?

Contributor Author

No, because that's for the containerized case.
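
For reference, a sketch of how the two branches fit together; host_ovs_is_running is just a placeholder for the real detection check, while the log paths and tail invocations are the ones quoted above.

#!/bin/bash
# Sketch only -- the predicate below is illustrative, not the PR's exact check.
if host_ovs_is_running; then
  # Host (systemd-managed) OVS: the host's log directory is mounted into the
  # pod at /var/log/openvswitch-host. Block here forever so kubelet keeps the
  # pod Running; flows don't need restoring because this only changes on reboot.
  echo "openvswitch is running in systemd"
  exec tail -F /var/log/openvswitch-host/ovs-vswitchd.log /var/log/openvswitch-host/ovsdb-server.log
fi

# Containerized OVS: the daemons run in this pod and log to /var/log/openvswitch
# inside the container; tying tail to ovs-vswitchd's PID makes it exit when the
# daemon does.
tail -F --pid=$(cat /var/run/openvswitch/ovs-vswitchd.pid) /var/log/openvswitch/ovs-vswitchd.log &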


squeed commented Jun 16, 2020

e2e-gcp has a single freaking test failure; the network came up just fine.


squeed commented Jun 16, 2020

/retest

@vishnoianil (Contributor)

@squeed this is the only PR where the gcp-ovn test passed; all other PRs are failing this test because CNO is degraded, since br-int is not created on the nodes.


abhat commented Jun 16, 2020

I think we should revert openshift/machine-config-operator#1830 first.


squeed commented Jun 17, 2020

Test failures are all flakes or infra issues.
/retest


squeed commented Jun 17, 2020

Upgrades are failing, nuts:

● ovsdb-server.service - Open vSwitch Database Unit
   Loaded: loaded (/usr/lib/systemd/system/ovsdb-server.service; static; vendor preset: enabled)
   Active: active (running) since Wed 2020-06-17 08:33:56 UTC; 25min ago
  Process: 1177 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovs-vswitchd --no-monitor --system-id=random ${OVS_USER_OPT} start $OPTIONS (code=exited, status=0/SUCCESS)
  Process: 1170 ExecStartPre=/bin/sh -c if [ "$${OVS_USER_ID/:*/}" != "root" ]; then /usr/bin/echo "OVS_USER_OPT=--ovs-user=${OVS_USER_ID}" >> /run/openvswitch.useropts; fi (code=exited, status=0/SUCCESS)
  Process: 1146 ExecStartPre=/bin/sh -c rm -f /run/openvswitch.useropts; /usr/bin/echo "OVS_USER_ID=${OVS_USER_ID}" > /run/openvswitch.useropts (code=exited, status=0/SUCCESS)
  Process: 1135 ExecStartPre=/usr/bin/chown ${OVS_USER_ID} /var/run/openvswitch /var/log/openvswitch (code=exited, status=1/FAILURE)
 Main PID: 1234 (ovsdb-server)
    Tasks: 1 (limit: 95371)
   Memory: 20.0M
   CGroup: /system.slice/ovsdb-server.service
           └─1234 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private_key --certificate=db:Open_vSwitch,SSL,certificate --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openvswitch/ovsdb-server.log --pidfile=/var/run/openvswitch/ovsdb-server.pid --detach
openvswitch is running in systemd
==> /var/log/openvswitch-host/ovs-vswitchd.log <==
2020-06-17T08:59:30.577Z|09841|fatal_signal|WARN|could not unlink "/var/run/openvswitch/br0.snoop" (Permission denied)
2020-06-17T08:59:30.577Z|09842|stream_unix|ERR|/var/run/openvswitch/br0.snoop: binding failed: Permission denied
2020-06-17T08:59:30.577Z|09843|connmgr|ERR|failed to listen on punix:/var/run/openvswitch/br0.snoop: Permission denied
2020-06-17T08:59:30.583Z|09844|socket_util_unix|WARN|unlinking "/var/run/openvswitch/br0.mgmt": Permission denied
2020-06-17T08:59:30.583Z|09845|fatal_signal|WARN|could not unlink "/var/run/openvswitch/br0.mgmt" (Permission denied)
2020-06-17T08:59:30.583Z|09846|stream_unix|ERR|/var/run/openvswitch/br0.mgmt: binding failed: Permission denied
2020-06-17T08:59:30.583Z|09847|socket_util_unix|WARN|unlinking "/var/run/openvswitch/br0.snoop": Permission denied
2020-06-17T08:59:30.583Z|09848|fatal_signal|WARN|could not unlink "/var/run/openvswitch/br0.snoop" (Permission denied)
2020-06-17T08:59:30.583Z|09849|stream_unix|ERR|/var/run/openvswitch/br0.snoop: binding failed: Permission denied
2020-06-17T08:59:30.583Z|09850|connmgr|ERR|failed to listen on punix:/var/run/openvswitch/br0.snoop: Permission denied


squeed commented Jun 17, 2020

Huh: chown[1135]: /usr/bin/chown: cannot access '/var/run/openvswitch': No such file or directory


squeed commented Jun 17, 2020

Okay, that should fix it.
@mccv1r0 want to take a quick look? This isn't perfect, but it should get us moving forward.


squeed commented Jun 22, 2020

/retest

@squeed squeed closed this Jun 22, 2020
@squeed squeed reopened this Jun 22, 2020

squeed commented Jun 22, 2020

So, this PR was failing because system openvswitch was added first, and we needed to handle an impossible upgrade case.

Now that the MCO PR has been reverted, we can get this in, then enable system openvswitch.


squeed commented Jun 22, 2020

@juanluisvaladas all valid criticisms... but I didn't add that :-) Those are unchanged, left over from the containerized case.


squeed commented Jun 22, 2020

Tested this by manually running systemctl enable openvswitch && reboot on a test cluster; it worked perfectly.

@trozet (Contributor) left a comment

ovn-k8s part looks good to me.
/lgtm


squeed commented Jun 23, 2020

Then, since you mount /host, you can check for the openvswitch unit file existing, or check the return code from systemctl status. On my setup, 3 indicates the service exists but is not up, while 4 means the service does not exist.

The unit file already exists; it's just not enabled by default. So we can't use whether or not the service exists as a predicate.

We could use is-enabled as a predicate, but that's not reliable for a few reasons. First, and this is the case on RHCOS, units without an [Install] section are reported as enabled (even though they're not). Second, we might not enable the unit directly in the future; we could make it a dependency of a meta-service, which would break this predicate again.

I don't want to get in the business of checking for systemd configuration files on disk. They can live in a lot of places, so that's just setting ourselves up for trouble.

In any case, the chances of openvswitch.service being stopped on boot, then later enabled, are pretty slim. If it crashes, systemd will restart it. I think this is a reasonable test.
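
For illustration, a minimal sketch of the runtime check described above (is the service actually running right now, rather than installed or enabled); the chroot /host systemctl invocation is an assumption about how the host's systemd might be reached from the pod, not necessarily what this PR does.

# Sketch only: assumes the host root filesystem is mounted at /host and that the
# host's /run is visible there, so systemctl can reach the host's systemd.
if chroot /host systemctl is-active -q openvswitch.service; then
  echo "openvswitch is running in systemd"    # defer to the host-managed OVS
else
  echo "starting containerized openvswitch"   # hypothetical fallback path
fi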


squeed commented Jun 23, 2020

/retest


squeed commented Jun 23, 2020

All required CI runs are green. Failing jobs are oddities - their network came up.


trozet commented Jun 23, 2020


I see your point now. I didn't realize the OVS RPM was in 4.5 already, so checking unit file existence doesn't really help us. @dcbw FYI

@trozet (Contributor) left a comment

lgtm but would like @mccv1r0 @dcbw to sign off on it


trozet commented Jun 23, 2020

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 23, 2020

trozet commented Jun 23, 2020

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 23, 2020

mccv1r0 commented Jun 23, 2020

I fed this and the MCO PR to cluster-bot and things seemed to behave. The logging of e.g. echo "openvswitch is running in systemd" seemed to get lost. Is this going to be an issue?


squeed commented Jun 23, 2020


Huh. Well, did the PR do the right thing? Do you have a link to the CI run?


mccv1r0 commented Jun 23, 2020


"things seemed to behave" The cluster did come up. I didn't get a link just kubeconfig so I can run oc logs


squeed commented Jun 23, 2020

@mccv1r0 if this looks good to you, can you /lgtm?


mccv1r0 commented Jun 23, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 23, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mccv1r0, squeed, trozet

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mccv1r0 mccv1r0 removed their assignment Jun 23, 2020
@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

7 similar comments


openshift-ci-robot commented Jun 23, 2020

@squeed: The following tests failed, say /retest to rerun all failed tests:

Test name                            Commit   Rerun command
ci/prow/e2e-windows-hybrid-network   80e0098  /test e2e-windows-hybrid-network
ci/prow/e2e-vsphere                  80e0098  /test e2e-vsphere

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit d4cccbf into openshift:master Jun 23, 2020
