[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: vrutkovs. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@vrutkovs awesome, thank you so much. You're right - all nodes should be running Calico-node. The node selector was introduced as part of an upgrade process from systemd to self-hosted so that nodes could be upgraded one-by-one. The installation for fresh clusters should see all nodes assigned the label, which it looks like your PR does. The error you posted sounds like a general "pods can't communicate" problem. I usually start by checking whether it's cross-node or same-node pods that cannot communicate. |
Okay, there is no transition in 3.11, so I'll remove it.
I tried on 3 plain AWS VMs, not sure what's happening there. In any case it doesn't seem to be caused by this PR, so let's roll with it - removing the WIP part |
The transition files are still checked into master. So as follow-up, we will also need to remove those, and prevent upgrades to 3.11 if you haven't run them yet. Perhaps we should revert that last commit and look at implementing that in a follow-up? |
|
/hold Agree, let's remove the legacy code there |
|
It's WIP, actually /hold cancel |
c08a1ff to b60465b
|
/test gcp |
|
/test gcp |
|
/test gcp |
| msg: You are running a systemd based installation of Calico. Please run the calico upgrade playbook to upgrade to a self-hosted installation. | ||
| when: sym.stat.exists | ||
|
|
||
| - name: Configure NetworkManager to ignore Calico interfaces |
@vrutkovs This would explain why you're having issues connecting to services. Calico requires that NetworkManager be disabled on its interfaces (cali*/tunl0), and your PR removes the task that does this. If NetworkManager is not disabled for these interfaces you will see frequent connectivity issues to pods (and therefore services).
@kprabhak that's somewhat surprising; what are the issues here and which versions of NM do you typically see them with? OpenShift installations (including ones with many, many nodes like OpenShift Online) typically run NetworkManager and we haven't had problems in those configurations.
Might be worth opening a discussion with the NetworkManager project ( https://github.com/NetworkManager/NetworkManager ) about it, since I'm sure they want to figure out if/how NM is interacting.
@dcbw This is due to the design elements for scale that Calico leverages and that OVS does not exercise in the same way. Specifically, when NetworkManager resets all the pod interfaces on a node every few minutes, Calico has to withdraw the BGP route, and then wait for BGP route propagation and convergence once the interfaces are restored.
This is a design artifact of L3 routing protocols (avoiding route flapping in large-scale topologies), and something that will not be exposed in OVS.
Also, to clarify, the requirement is only that NetworkManager be configured to ignore Calico interfaces; it can continue to operate on other interfaces on the system.
https://docs.projectcalico.org/v3.2/usage/troubleshooting/#configure-networkmanager
We should probably add the NetworkManager config back in so that it will properly ignore any changes to the calico interfaces. This means that we will need to add the calico role that runs on each node back in.
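For reference, a minimal sketch of what such a per-node task might look like. This is not the code from this PR or from the follow-up commit: the destination file name `calico.conf`, the registered variable `nm_calico_conf`, and the reload step are assumptions made for illustration. The `unmanaged-devices` line follows the Calico troubleshooting page linked above, which matches `tunl*` rather than only `tunl0`.

```yaml
# Hypothetical sketch, not the actual role contents.
# Assumed names: calico.conf (file name), nm_calico_conf (registered variable).
- name: Configure NetworkManager to ignore Calico interfaces
  copy:
    dest: /etc/NetworkManager/conf.d/calico.conf
    content: |
      [keyfile]
      unmanaged-devices=interface-name:cali*;interface-name:tunl*
  register: nm_calico_conf

# Assumption: a reload is enough for NetworkManager to pick up the new file;
# a restart may be needed on some versions.
- name: Reload NetworkManager so the new configuration is applied
  systemd:
    name: NetworkManager
    state: reloaded
  when: nm_calico_conf is changed
```

NetworkManager keeps managing every other interface on the node; only devices matching the two patterns are left alone.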
b60465b to d531542
d531542 to c8b0cb6
|
@vrutkovs: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@vrutkovs I added the fixes for the NetworkManager in a commit here: mgleung@521c3f8 . We should also probably merge #9863 and #9867 prior to rebasing since this PR moves all changes from the |
|
@mgleung agree, there's been a lot of changes so I'll leave this and see if any other fixes are required |
|
I created #10226 with these changes recreated as well as the NetworkManager fixes since it was easier than rebasing this branch. |
|
Closing in favor of #10226 - this needs a lot of rebases |
openshift.master.sdn_cluster_network_cidr
projectcalico.org/ds-ready=true nodeselector and node labelling - in 3.10 the nodes are already converted from systemd setup
This should not be cherrypicked to release-3.10 yet, as it might break during 3.9 -> 3.10 upgrade
Status: webconsole/hosted pods won't come up on AWS:
failed to set up sandbox container "ff3633a9fd5e650a5341257e6490be1d6645f84040d83b3eb43716f174fec07e" network for pod "webconsole-7df4f9f689-lcbd4": NetworkPlugin cni failed to set up pod "webconsole-7df4f9f689-lcbd4_openshift-web-console" network: Get https://[172.30.0.1]:443/api/v1/namespaces/openshift-web-console/pods/webconsole-7df4f9f689-lcbd4: dial tcp 172.30.0.1:443: i/o timeout
/cc @dmmcquay
/cc @ozdanborne