Skip to content

Conversation

@pliurh
Copy link
Contributor

@pliurh pliurh commented Mar 2, 2022

In ovnkube based cluster, the connectivity between MCD pods and
kube-api-server relies on the openflow rules injected by ovnkube.
If due to some reason, the ovnkube-node pod cannot start after
the reboot of applying new MC. The MCD will not be able to reach
the api-server.

This PR lets the kubeclient in MCD use the kube-api-server url of
the node kubeconfig file instead. It eliminates the dependency
on ovnkube-node pod from MCD.

@openshift-ci openshift-ci bot requested review from cgwalters and yuqi-zhang March 2, 2022 08:01
@pliurh
Copy link
Contributor Author

pliurh commented Mar 2, 2022

/retest

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to be aware of is that setenv() is unsafe in the presence of threads that may be executing C code. https://internals.rust-lang.org/t/synchronized-ffi-access-to-posix-environment-variable-functions/15475 is a Rust thread on this; which links to e.g. https://sourceware.org/bugzilla/show_bug.cgi?id=15607

The Go runtime uses a middle ground trick of only calling the C setenv if cgo is in use which...I think will happen with us when linking to openssl at least.

So we should (at least eventually) fix the client API to allow overriding these things without setenv().

But...for now it's probably OK, I would think (hope) that we're not running any other active goroutines at this point.

@cgwalters
Copy link
Member

This is a bit related to #2190 too right?

/approve
of the general idea.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 2, 2022
@pliurh
Copy link
Contributor Author

pliurh commented Mar 2, 2022

This is a bit related to #2190 too right?

Yes. For the same purpose, decouple the MCO from the network provider.

Copy link
Contributor

@kikisdeliveryservice kikisdeliveryservice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one comment

@kikisdeliveryservice
Copy link
Contributor

As this affects OVN, @trozet @jcaamano PTAL

@kikisdeliveryservice
Copy link
Contributor

Also this sounds like a bug? Is there an existing bz?

In ovnkube based cluster, the connectivity between MCD pods and
kube-api-server relies on the openflow rules injected by ovnkube.
If due to some reason, the ovnkube-node pod cannot start after
the reboot of applying new MC. The MCD will not be able to reach
the api-server.

This PR let the kubeclient in MCD use the kube-api-server url of
the node kubeconfig file instead. It eliminates the dependency
on ovnkube-node pod from MCD.
@pliurh
Copy link
Contributor Author

pliurh commented Mar 3, 2022

Also this sounds like a bug? Is there an existing bz?

No. I found this issue in a cluster with DPU (a smart NIC), which we plan to support in a future release. I suppose it might also happen on a regular cluster.

@pliurh
Copy link
Contributor Author

pliurh commented Mar 4, 2022

/retest

@jcaamano
Copy link
Contributor

jcaamano commented Mar 7, 2022

Looks good to me

@pliurh
Copy link
Contributor Author

pliurh commented Mar 9, 2022

/retest

@cgwalters
Copy link
Member

I was looking at logs from an unrelated PR and saw e.g.:

W0308 19:26:11.024920    1727 reflector.go:324] k8s.io/client-go/informers/factory.go:134: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
W0308 19:26:11.024930    1727 reflector.go:324] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

over there, and then realized that this PR will probably fix it, and indeed it seems to. We're no longer depending on the SDN pod for the MCD, which indeed seems like a huge reliability improvement.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 10, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 10, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, pliurh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cgwalters
Copy link
Member

/skip

@cgwalters
Copy link
Member

OK let's give this one more spin of the wheel of fortune, but if that fails I think we should override.
/test e2e-agnostic-upgrade

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

4 similar comments
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 11, 2022

@pliurh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-upgrade-single-node 2b51f26 link false /test e2e-aws-upgrade-single-node
ci/prow/e2e-aws-disruptive 2b51f26 link false /test e2e-aws-disruptive

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit bbe78cf into openshift:master Mar 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants