
Conversation


@squeed squeed commented Jun 16, 2020

Needed to remove some of the readiness probes because they don't work, but this doesn't actually lose much coverage.

And liveness probes are not useful in this context.

This contains a hack to pass a strange case that is an artifact of the CI upgrade tests. It won't happen in the wild, and the hack could be removed later.

Note to reviewers: hide whitespace changes :-)

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 16, 2020
@squeed squeed changed the title openshift-sdn: playing around with host-openvswitch openshift-sdn: quick change to adopt host-level openvswitch Jun 16, 2020

squeed commented Jun 16, 2020

/cc @mccv1r0 @dcbw

@squeed squeed changed the title openshift-sdn: quick change to adopt host-level openvswitch openshift-sdn, ovn-kubernetes: adopt systemd-managed openvswitch if present Jun 16, 2020
echo "openvswitch is running in systemd"
# Don't need to worry about restoring flows; this can only change if we've rebooted
exec tail -F /var/log/openvswitch-host/ovs-vswitchd.log /var/log/openvswitch-host/ovsdb-server.log
fi
Contributor

Do you mean /var/log/openvswitch instead of openvswitch-host?

Is the goal to block here, and never return?

Contributor Author

No, this is correct. /var/log/openvswitch is not a host directory.
The goal is to block here. We need to keep a process running so kubelet doesn't think the pod has died.

Contributor

No, this is correct. /var/log/openvswitch is not a host directory.

Then shouldn't e.g.

tail -F --pid=$(cat /var/run/openvswitch/ovs-vswitchd.pid) /var/log/openvswitch/ovs-vswitchd.log &

below also use /var/log/openvswitch-host?

Contributor Author

No, because that's for the containerized case.
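
For reference, a sketch of how the two branches fit together; host_ovs_is_running is just a placeholder for the real detection check, while the log paths and tail invocations are the ones quoted above.

#!/bin/bash
# Sketch only -- the predicate below is illustrative, not the PR's exact check.
if host_ovs_is_running; then
  # Host (systemd-managed) OVS: the host's log directory is mounted into the
  # pod at /var/log/openvswitch-host. Block here forever so kubelet keeps the
  # pod Running; flows don't need restoring because this only changes on reboot.
  echo "openvswitch is running in systemd"
  exec tail -F /var/log/openvswitch-host/ovs-vswitchd.log /var/log/openvswitch-host/ovsdb-server.log
fi

# Containerized OVS: the daemons run in this pod and log to /var/log/openvswitch
# inside the container; tying tail to ovs-vswitchd's PID makes it exit when the
# daemon does.
tail -F --pid=$(cat /var/run/openvswitch/ovs-vswitchd.pid) /var/log/openvswitch/ovs-vswitchd.log &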


squeed commented Jun 16, 2020

e2e-gcp has a single freaking test failure; the network came up just fine.


squeed commented Jun 16, 2020

/retest

@vishnoianil (Contributor)

@squeed this is the only PR where the gcp-ovn test passed; all other PRs are failing this test because CNO is degraded, since br-int is not created on the nodes.


abhat commented Jun 16, 2020

I think we should revert openshift/machine-config-operator#1830 first.


squeed commented Jun 17, 2020

Test failures are all flakes or infra issues.
/retest


squeed commented Jun 17, 2020

Upgrades are failing, nuts:

● ovsdb-server.service - Open vSwitch Database Unit
   Loaded: loaded (/usr/lib/systemd/system/ovsdb-server.service; static; vendor preset: enabled)
   Active: active (running) since Wed 2020-06-17 08:33:56 UTC; 25min ago
  Process: 1177 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovs-vswitchd --no-monitor --system-id=random ${OVS_USER_OPT} start $OPTIONS (code=exited, status=0/SUCCESS)
  Process: 1170 ExecStartPre=/bin/sh -c if [ "$${OVS_USER_ID/:*/}" != "root" ]; then /usr/bin/echo "OVS_USER_OPT=--ovs-user=${OVS_USER_ID}" >> /run/openvswitch.useropts; fi (code=exited, status=0/SUCCESS)
  Process: 1146 ExecStartPre=/bin/sh -c rm -f /run/openvswitch.useropts; /usr/bin/echo "OVS_USER_ID=${OVS_USER_ID}" > /run/openvswitch.useropts (code=exited, status=0/SUCCESS)
  Process: 1135 ExecStartPre=/usr/bin/chown ${OVS_USER_ID} /var/run/openvswitch /var/log/openvswitch (code=exited, status=1/FAILURE)
 Main PID: 1234 (ovsdb-server)
    Tasks: 1 (limit: 95371)
   Memory: 20.0M
   CGroup: /system.slice/ovsdb-server.service
           └─1234 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private_key --certificate=db:Open_vSwitch,SSL,certificate --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openvswitch/ovsdb-server.log --pidfile=/var/run/openvswitch/ovsdb-server.pid --detach
openvswitch is running in systemd
==> /var/log/openvswitch-host/ovs-vswitchd.log <==
2020-06-17T08:59:30.577Z|09841|fatal_signal|WARN|could not unlink "/var/run/openvswitch/br0.snoop" (Permission denied)
2020-06-17T08:59:30.577Z|09842|stream_unix|ERR|/var/run/openvswitch/br0.snoop: binding failed: Permission denied
2020-06-17T08:59:30.577Z|09843|connmgr|ERR|failed to listen on punix:/var/run/openvswitch/br0.snoop: Permission denied
2020-06-17T08:59:30.583Z|09844|socket_util_unix|WARN|unlinking "/var/run/openvswitch/br0.mgmt": Permission denied
2020-06-17T08:59:30.583Z|09845|fatal_signal|WARN|could not unlink "/var/run/openvswitch/br0.mgmt" (Permission denied)
2020-06-17T08:59:30.583Z|09846|stream_unix|ERR|/var/run/openvswitch/br0.mgmt: binding failed: Permission denied
2020-06-17T08:59:30.583Z|09847|socket_util_unix|WARN|unlinking "/var/run/openvswitch/br0.snoop": Permission denied
2020-06-17T08:59:30.583Z|09848|fatal_signal|WARN|could not unlink "/var/run/openvswitch/br0.snoop" (Permission denied)
2020-06-17T08:59:30.583Z|09849|stream_unix|ERR|/var/run/openvswitch/br0.snoop: binding failed: Permission denied
2020-06-17T08:59:30.583Z|09850|connmgr|ERR|failed to listen on punix:/var/run/openvswitch/br0.snoop: Permission denied


squeed commented Jun 17, 2020

Huh: chown[1135]: /usr/bin/chown: cannot access '/var/run/openvswitch': No such file or directory


squeed commented Jun 17, 2020

Okay, that should fix it.
@mccv1r0 want to take a quick look? This isn't perfect, but it should get us moving forward.


squeed commented Jun 22, 2020

/retest

@squeed squeed closed this Jun 22, 2020
@squeed squeed reopened this Jun 22, 2020

squeed commented Jun 22, 2020

So, this PR was failing because system openvswitch was added first, and we needed to handle an impossible upgrade case.

Now that the MCO PR has been reverted, we can get this in, then enable system openvswitch.


squeed commented Jun 22, 2020

@juanluisvaladas all valid criticisms... but I didn't add that :-) Those are unchanged, left over from the containerized case.


squeed commented Jun 22, 2020

Tested this by manually running systemctl enable openvswitch && reboot on a test cluster; it worked perfectly.

@trozet (Contributor) left a comment

ovn-k8s part looks good to me.
/lgtm


squeed commented Jun 23, 2020

Then, since you mount /host, you can check for the openvswitch unit file existing, or check the return code from systemctl status. On my setup, 3 indicates the service exists but is not up, while 4 means the service does not exist.

The unit file already exists; it's just not enabled by default. So we can't use whether or not the service exists as a predicate.

We could use is-enabled as a predicate, but that's not reliable for a few reasons. First, and this is the case on RHCOS, units without an [Install] section are reported as enabled (even though they're not). Second, we might not enable the unit directly in the future; we could make it a dependency of a meta-service, which would break this predicate again.

I don't want to get in the business of checking for systemd configuration files on disk. They can live in a lot of places, so that's just setting ourselves up for trouble.

In any case, the chances of openvswitch.service being stopped on boot, then later enabled, are pretty slim. If it crashes, systemd will restart it. I think this is a reasonable test.
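
For illustration, a minimal sketch of the runtime check described above (is the service actually running right now, rather than installed or enabled); the chroot /host systemctl invocation is an assumption about how the host's systemd might be reached from the pod, not necessarily what this PR does.

# Sketch only: assumes the host root filesystem is mounted at /host and that the
# host's /run is visible there, so systemctl can reach the host's systemd.
if chroot /host systemctl is-active -q openvswitch.service; then
  echo "openvswitch is running in systemd"    # defer to the host-managed OVS
else
  echo "starting containerized openvswitch"   # hypothetical fallback path
fi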


squeed commented Jun 23, 2020

/retest


squeed commented Jun 23, 2020

All required CI runs are green. Failing jobs are oddities - their network came up.


trozet commented Jun 23, 2020


I see your point now. I didn't realize the OVS RPM was in 4.5 already, so checking unit file existence doesn't really help us. @dcbw FYI

@trozet (Contributor) left a comment

lgtm but would like @mccv1r0 @dcbw to sign off on it


trozet commented Jun 23, 2020

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 23, 2020

trozet commented Jun 23, 2020

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 23, 2020

mccv1r0 commented Jun 23, 2020

I fed this and the MCO PR to cluster-bot and things seemed to behave. The logging of e.g. echo "openvswitch is running in systemd" seemed to get lost. Is this going to be an issue?


squeed commented Jun 23, 2020


Huh. Well, did the PR do the right thing? Do you have a link to the CI run?


mccv1r0 commented Jun 23, 2020


"things seemed to behave" The cluster did come up. I didn't get a link just kubeconfig so I can run oc logs


squeed commented Jun 23, 2020

@mccv1r0 if this looks good to you, can you /lgtm?


mccv1r0 commented Jun 23, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 23, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mccv1r0, squeed, trozet

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mccv1r0 mccv1r0 removed their assignment Jun 23, 2020
@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

7 similar comments


openshift-ci-robot commented Jun 23, 2020

@squeed: The following tests failed, say /retest to rerun all failed tests:

Test name                            Commit   Rerun command
ci/prow/e2e-windows-hybrid-network   80e0098  /test e2e-windows-hybrid-network
ci/prow/e2e-vsphere                  80e0098  /test e2e-vsphere

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit d4cccbf into openshift:master Jun 23, 2020
