calico fixes by vrutkovs · Pull Request #9435 · openshift/openshift-ansible

vrutkovs · 2018-08-06T08:41:41Z

avoid using outdated openshift.master.sdn_cluster_network_cidr
Remove projectcalico.org/ds-ready=true nodeselector and node labelling - in 3.10 the nodes are already converted from systemd setup

This should not be cherry-picked to release-3.10 yet, as it might break during a 3.9 -> 3.10 upgrade

Status: webconsole/hosted pods won't come up on AWS:
failed to set up sandbox container "ff3633a9fd5e650a5341257e6490be1d6645f84040d83b3eb43716f174fec07e" network for pod "webconsole-7df4f9f689-lcbd4": NetworkPlugin cni failed to set up pod "webconsole-7df4f9f689-lcbd4_openshift-web-console" network: Get https://[172.30.0.1]:443/api/v1/namespaces/openshift-web-console/pods/webconsole-7df4f9f689-lcbd4: dial tcp 172.30.0.1:443: i/o timeout

/cc @dmmcquay
/cc @ozdanborne

openshift-ci-robot · 2018-08-06T08:41:56Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrutkovs


Needs approval from an approver in each of these files:

~~OWNERS~~ [vrutkovs]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ozdanborne · 2018-08-07T19:46:42Z

@vrutkovs awesome, thank you so much.

You're right - all nodes should be running Calico-node. The Node selector was introduced as part of an upgrade process from systemd to self-hosted so that nodes could be upgraded one-by-one. The installation for fresh clusters should see all nodes assigned the label, which it looks like your PR does.

The error you posted sounds like a general "pods can't communicate" problem. I usually start by seeing if it's cross-node or same-node pods that cannot communicate.
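The triage step suggested here can be sketched as a small grouping of failing connections by locality. This is a minimal illustration in Python, assuming pod-to-node placement and the list of failing source/destination pairs have already been collected by hand (for example from `oc get pods -o wide` and manual connection attempts). The function name and the sample data are hypothetical and are not part of the PR.

```python
from collections import Counter


def classify_failures(pod_to_node, failed_pairs):
    """Count failing pod-to-pod connections as same-node or cross-node."""
    counts = Counter()
    for src, dst in failed_pairs:
        # Two pods are "same-node" when they are scheduled on the same host.
        same_node = pod_to_node[src] == pod_to_node[dst]
        counts["same-node" if same_node else "cross-node"] += 1
    return dict(counts)


# Hypothetical placement and failures, for illustration only.
pod_to_node = {"webconsole": "node-a", "router": "node-a", "registry": "node-b"}
failed_pairs = [("webconsole", "registry"), ("router", "registry")]

result = classify_failures(pod_to_node, failed_pairs)
print(result)
```

If only cross-node pairs fail, the routing between nodes is the more likely suspect; if same-node pairs fail as well, the local network setup on the node is worth checking first. Treat this as a rule of thumb rather than a diagnosis.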

vrutkovs · 2018-08-08T08:46:37Z

> The Node selector was introduced as part of an upgrade process from systemd to self-hosted so that nodes could be upgraded one-by-one

Okay, there is no transition in 3.11, so I'll remove it.

> I usually start by seeing if it's cross-node or same-node pods that cannot communicate.

I tried on 3 plain AWS VMs, not sure what's happening there. In any case it doesn't seem to be caused by this PR, so let's roll with it. Removing the WIP part.

ozdanborne · 2018-08-08T16:41:56Z

> Okay, there is no transition in 3.11, so I'll remove it.

The transition files are still checked into master. So as a follow-up, we will also need to remove those, and prevent upgrades to 3.11 if you haven't run them yet.

Perhaps we should revert that last commit and look at implementing that in a follow-up?

vrutkovs · 2018-08-08T17:00:39Z

/hold

Agreed, let's remove the legacy code there

vrutkovs · 2018-08-08T17:01:05Z

It's WIP, actually

/hold cancel

vrutkovs · 2018-08-14T07:56:29Z

/test gcp

vrutkovs · 2018-08-14T11:55:03Z

/test gcp

vrutkovs · 2018-08-15T08:36:42Z

/test gcp

kprabhak · 2018-08-19T19:13:01Z

roles/calico/tasks/main.yml

-    msg: You are running a systemd based installation of Calico. Please run the calico upgrade playbook to upgrade to a self-hosted installation.
-  when: sym.stat.exists
-
- name: Configure NetworkManager to ignore Calico interfaces


@vrutkovs This would explain why you're having issues connecting to services. Calico requires that NetworkManager be disabled on its interfaces (cali*/tunl0), and your PR removes the task that does this. If NetworkManager is not disabled for these interfaces you will see frequent connectivity issues to pods (and therefore services).

dcbw · 2018-08-30

@kprabhak that's somewhat surprising; what are the issues here and which versions of NM do you typically see them with? OpenShift installations (including ones with many, many nodes like OpenShift Online) typically run NetworkManager and we haven't had problems in those configurations.

Might be worth opening a discussion with the NetworkManager project ( https://github.com/NetworkManager/NetworkManager ) about it, since I'm sure they want to figure out if/how NM is interacting.

@thom311 @bengal @lkundrak

kprabhak · 2018-08-31

@dcbw This is due to the design elements for scale that Calico leverages and that are not exercised in the same way by OVS. Specifically, when all the pod interfaces on a node are reset by NetworkManager every few minutes, Calico has to withdraw the BGP route, and then wait for BGP route propagation and convergence when the interfaces are subsequently restored.

This is a design artifact of L3 routing protocols (avoiding route flapping in large-scale topologies), and something that will not be exposed in OVS.

Also, to clarify, the requirement is that NetworkManager be configured to ignore Calico interfaces only; it can continue to operate on other interfaces on the system.
https://docs.projectcalico.org/v3.2/usage/troubleshooting/#configure-networkmanager

@ozdanborne @mgleung @dmmcquay

mgleung · 2018-09-14

We should probably add the NetworkManager config back in so that it will properly ignore any changes to the Calico interfaces. This means that we will need to add the calico role that runs on each node back in.
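For reference, the NetworkManager setting discussed in this thread is a small drop-in file. The sketch below, in Python, renders its contents using the interface patterns from the Calico troubleshooting page linked above (`cali*` and `tunl*`), and approximates the glob match with `fnmatch` to show which interface names would be left unmanaged. The function names are hypothetical, the drop-in path mentioned in the comment is an assumption, and this is not the task that the role or any commit in this PR actually uses.

```python
from fnmatch import fnmatch

# Interface patterns from the Calico troubleshooting page linked in this thread.
UNMANAGED_PATTERNS = ["cali*", "tunl*"]


def render_nm_dropin(patterns):
    """Build the text of a NetworkManager drop-in that marks matching
    interfaces as unmanaged. Assumed destination (not confirmed by this PR):
    /etc/NetworkManager/conf.d/calico.conf
    """
    devices = ";".join(f"interface-name:{p}" for p in patterns)
    return f"[keyfile]\nunmanaged-devices={devices}\n"


def is_ignored(ifname, patterns):
    """Approximate NetworkManager's glob match with fnmatch (an assumption)."""
    return any(fnmatch(ifname, p) for p in patterns)


conf = render_nm_dropin(UNMANAGED_PATTERNS)
print(conf)

# Hypothetical interface names, for illustration only.
for name in ["cali0123abcd", "tunl0", "eth0"]:
    print(name, "ignored" if is_ignored(name, UNMANAGED_PATTERNS) else "managed")
```

Only the Calico interfaces are matched, so NetworkManager keeps managing everything else on the host, which is the requirement stated earlier in the thread.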

openshift-bot · 2018-09-13T11:33:38Z

@vrutkovs: PR needs rebase.


mgleung · 2018-09-14T01:40:31Z

@vrutkovs I added the NetworkManager fixes in a commit here: mgleung@521c3f8.

We should also probably merge #9863 and #9867 prior to rebasing since this PR moves all changes from the calico_master role to the calico role and all of the other PRs do not account for the role changes.

vrutkovs · 2018-09-14T08:29:49Z

@mgleung agreed, there have been a lot of changes, so I'll leave this and see if any other fixes are required

kimcie · 2018-09-25T10:14:52Z

@vrutkovs Any updates on this pull request? Pull requests #9863 and #9867 have already been merged.

mgleung · 2018-09-26T00:38:12Z

I created #10226 with these changes recreated as well as the NetworkManager fixes since it was easier than rebasing this branch.

vrutkovs · 2018-09-26T08:17:26Z

Closing in favor of #10226 - this needs a lot of rebases

openshift-ci-robot added the do-not-merge/work-in-progress and size/XS labels Aug 6, 2018

openshift-ci-robot requested review from michaelgugino and mtnbikenc August 6, 2018 08:41

openshift-ci-robot added the approved label Aug 6, 2018

vrutkovs mentioned this pull request Aug 6, 2018

apply the container_runtime for calico #9392

Merged

openshift-ci-robot added the size/S label and removed the size/XS label Aug 6, 2018

vrutkovs changed the title ~~WIP calico fixes~~ calico fixes Aug 8, 2018

openshift-ci-robot removed the do-not-merge/work-in-progress label Aug 8, 2018

openshift-ci-robot added the do-not-merge/hold label Aug 8, 2018

vrutkovs changed the title ~~calico fixes~~ WIP calico fixes Aug 8, 2018

openshift-ci-robot added the do-not-merge/work-in-progress label Aug 8, 2018

openshift-ci-robot removed the do-not-merge/hold label Aug 8, 2018

vrutkovs force-pushed the calico-fixes branch 2 times, most recently from c08a1ff to b60465b, August 9, 2018 11:33

vrutkovs changed the title ~~WIP calico fixes~~ calico fixes Aug 10, 2018

vrutkovs closed this Aug 10, 2018

vrutkovs reopened this Aug 10, 2018

vrutkovs mentioned this pull request Aug 16, 2018

kube_proxy_and_dns: add role that runs standalone kube-proxy and DNS #9621

Merged

mgleung mentioned this pull request Aug 17, 2018

Update playbooks for Calico in OpenShift 3.10 #9657

Merged

kprabhak reviewed Aug 19, 2018


vrutkovs force-pushed the calico-fixes branch from b60465b to d531542, September 3, 2018 08:18

Vadim Rutkovsky added 5 commits September 7, 2018 09:55

calico: avoid using outdated openshift.master.sdn_cluster_network_cidr (2e94e8a)

calico: apply node label to run calico-node daemonset (f4ba29f)

calico: skip errors if applied objects already exist (3e87de8)

calico: remove DS nodeselector and avoid labelling nodes on new installs (dbcbd35)

Remove outdated calico role and rename calico_master (c8b0cb6)

vrutkovs force-pushed the calico-fixes branch from d531542 to c8b0cb6, September 7, 2018 07:55

openshift-bot added the needs-rebase label Sep 13, 2018

mgleung mentioned this pull request Sep 20, 2018

Fix Calico launch on multi-master cluster #10119

Merged

mgleung mentioned this pull request Sep 26, 2018

Refactored Calico and updated playbooks to reflect self-hosted Calico installs only #10226

Merged

vrutkovs closed this Sep 26, 2018


Review comments on roles/calico/tasks/main.yml: kprabhak (Aug 19, 2018), dcbw (Aug 30, 2018, edited), kprabhak (Aug 31, 2018, edited), mgleung (Sep 14, 2018)
