Skip to content

calico fixes#9435

Closed
vrutkovs wants to merge 5 commits intoopenshift:masterfrom
vrutkovs:calico-fixes
Closed

calico fixes#9435
vrutkovs wants to merge 5 commits intoopenshift:masterfrom
vrutkovs:calico-fixes

Conversation

@vrutkovs
Copy link
Contributor

@vrutkovs vrutkovs commented Aug 6, 2018

  • avoid using outdated openshift.master.sdn_cluster_network_cidr
  • Remove projectcalico.org/ds-ready=true nodeselector and node labelling - in 3.10 the nodes are already converted from systemd setup

This should not be cherrypicked to release-3.10 yet, as it might break during 3.9 -> 3.10 upgrade

Status: webconsole/hosted pods won't come up on AWS:
failed to set up sandbox container "ff3633a9fd5e650a5341257e6490be1d6645f84040d83b3eb43716f174fec07e" network for pod "webconsole-7df4f9f689-lcbd4": NetworkPlugin cni failed to set up pod "webconsole-7df4f9f689-lcbd4_openshift-web-console" network: Get https://[172.30.0.1]:443/api/v1/namespaces/openshift-web-console/pods/webconsole-7df4f9f689-lcbd4: dial tcp 172.30.0.1:443: i/o timeout

/cc @dmmcquay
/cc @ozdanborne

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 6, 2018
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 6, 2018
@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 6, 2018
@ozdanborne
Copy link
Contributor

ozdanborne commented Aug 7, 2018

@vrutkovs awesome, thank you so much.

You're right - all nodes should be running Calico-node. The Node selector was introduced as part of an upgrade process from systemd to self-hosted so that nodes could be upgraded one-by-one. The installation for fresh clusters should see all nodes assigned the label, which it looks like your PR does.

The error you posted sounds like a general "pods can't communicate". I usually start by seeing if its cross-node or same-node pods that cannot communicate.

@vrutkovs
Copy link
Contributor Author

vrutkovs commented Aug 8, 2018

The Node selector was introduced as part of an upgrade process from systemd to self-hosted so that nodes could be upgraded one-by-one

Okay, there is no transition in 3.11, so I'll remove it.

I usually start by seeing if its cross-node or same-node pods that cannot communicate.

I tried on 3 plain AWS VMs, not sure what's happening there. In any case it doesn't seem to caused by this PR, so let roll with it - removing WIP part

@vrutkovs vrutkovs changed the title WIP calico fixes calico fixes Aug 8, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 8, 2018
@ozdanborne
Copy link
Contributor

Okay, there is no transition in 3.11, so I'll remove it.

The transition files are still checked into master. So as follow-up, we will also need to remove those, and prevent upgrades to 3.11 if you haven't run them yet.

Perhaps we should revert that last commit and look at implementing that in a follow-up?

@vrutkovs
Copy link
Contributor Author

vrutkovs commented Aug 8, 2018

/hold

Agree, lets remove the legacy code there

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 8, 2018
@vrutkovs vrutkovs changed the title calico fixes WIP calico fixes Aug 8, 2018
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 8, 2018
@vrutkovs
Copy link
Contributor Author

vrutkovs commented Aug 8, 2018

Its WIP, actually

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 8, 2018
@vrutkovs vrutkovs force-pushed the calico-fixes branch 2 times, most recently from c08a1ff to b60465b Compare August 9, 2018 11:33
@vrutkovs vrutkovs changed the title WIP calico fixes calico fixes Aug 10, 2018
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 10, 2018
@vrutkovs vrutkovs closed this Aug 10, 2018
@vrutkovs vrutkovs reopened this Aug 10, 2018
@vrutkovs
Copy link
Contributor Author

/test gcp

1 similar comment
@vrutkovs
Copy link
Contributor Author

/test gcp

@vrutkovs
Copy link
Contributor Author

/test gcp

msg: You are running a systemd based installation of Calico. Please run the calico upgrade playbook to upgrade to a self-hosted installation.
when: sym.stat.exists

- name: Configure NetworkManager to ignore Calico interfaces
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vrutkovs This would explain why you're having issues connecting to services. Calico requires that NetworkManager be disabled on it's interfaces (cali*/tunl0), which you've removed in your PR. If NetworkManager is not disabled for these interfaces you will see frequent connectivity issues to pods (and therefore services).

Copy link
Contributor

@dcbw dcbw Aug 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kprabhak that's somewhat surprising; what are the issues here and which versions of NM do you typically see them with? OpenShift installations (including ones with many, many nodes like OpenShift Online) typically run NetworkManager and we haven't had problems in those configurations.

Might we worth opening a discussion with the NetworkManager project ( https://github.com/NetworkManager/NetworkManager ) about it, since I'm sure they want to figure out if/how NM is interacting.

@thom311 @bengal @lkundrak

Copy link
Contributor

@kprabhak kprabhak Aug 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dcbw This is due to the design elements for scale that are leveraged by Calico that are not exercised in the same way by ovs. Specifically, when all the pod interfaces on a node are reset by NetworkManager every few minutes, Calico has to withdraw the bgp route, and then wait for bgp route propagation & convergence on subsequent restore of the interfaces.

This is a design artifact of L3 routing protocols (avoiding route flapping in large-scale topologies), and something that will not be exposed in ovs.

Also, to clarify, the requirement is that NetworkManager be configured to ignore calico interfaces only, it can continue to operate on other interfaces on the system.
https://docs.projectcalico.org/v3.2/usage/troubleshooting/#configure-networkmanager

@ozdanborne @mgleung @dmmcquay

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add the NetworkManager config back in so that it will properly ignore any changes to the calico interfaces. This means that we will need to add the calico role that runs on each node back in.

@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 13, 2018
@openshift-bot
Copy link

@vrutkovs: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mgleung
Copy link
Contributor

mgleung commented Sep 14, 2018

@vrutkovs I added the fixes for the NetworkManager in a commit here: mgleung@521c3f8 .

We should also probably merge #9863 and #9867 prior to rebasing since this PR moves all changes from the calico_master role to the calico role and all of the other PRs do not account for the role changes.

@vrutkovs
Copy link
Contributor Author

@mgleung agree, there's been a lot of changes so I'll leave this and see if any other fixes are required

@kimcie
Copy link

kimcie commented Sep 25, 2018

@vrutkovs Any updates on this pull request. Pull requests #9863 and #9867 have been already merged.

@mgleung
Copy link
Contributor

mgleung commented Sep 26, 2018

I created #10226 with these changes recreated as well as the NetworkManager fixes since it was easier than rebasing this branch.

@vrutkovs
Copy link
Contributor Author

Closing in favor of #10226 - this needs a lot of rebases

@vrutkovs vrutkovs closed this Sep 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants