[fcos] Bug 1802534: gcp-routes: move to MCO, implement downfile, tweak timing #1741

openshift-cherrypick-robot · 2020-05-20T15:04:15Z

This is an automated cherry-pick of #1670

/assign LorbusChris

The new cluster etcd operator flow is:

1) start bootstrap mcs
2) start etcd on bootstrap
3) wait for bootstrapping to finish, i.e. at least one control-plane is ready and there is an MCS running on the cluster
4) turn down bootstrap mcs

This gives workers a chance to grab the ignition config from the bootstrap server, which now stays up longer. However, by the time they attempt to create a CSR, the kube-apiserver has rotated that bootstrap chain of trust out, which causes the workers to error out with:

Jan 29 19:55:20 ip-10-0-130-205 hyperkube[2623]: E0129 19:55:20.869251 2623 certificate_manager.go:421] Failed while requesting a signed certificate from the master: cannot create certificate signing request: Unauthorized

As a result, workers are ultimately unable to join the cluster. This patch makes the bootstrap server refuse to serve configuration to all pools but master, which keeps workers from grabbing the wrong config from the wrong server. Workers will keep polling for configuration and will eventually grab the correct one from the server running within the new cluster.

Signed-off-by: Antonio Murdaca <runcom@linux.com>

node-ip is a subcommand that shows which IP the node should use when it has multiple interfaces and multiple addresses. This is useful to prevent cases where Container Runtime related services bind to an interface that is not reachable in the control plane.

It has two commands:

* show: Takes one or more Virtual IPs of the control plane and prints one eligible IP on stdout.
* set: Takes one or more Virtual IPs of the control plane and sets systemd service configuration for services like CRI-O or Kubelet that need to bind to the control plane.

Signed-off-by: Antoni Segura Puimedon <antoni@redhat.com>
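As a rough illustration of what `show` might do, the Go sketch below picks the first node address whose subnet contains one of the given VIPs. The selection rule (same subnet as a VIP), the function name `eligibleIP`, and passing addresses as CIDR strings are assumptions made for this example; the commit message does not spell out the actual criteria.

```go
package main

import (
	"fmt"
	"net"
)

// eligibleIP returns the first address in addrs (CIDR notation) whose subnet
// contains at least one of the VIPs. Hypothetical helper for illustration.
func eligibleIP(addrs []string, vips []string) (net.IP, error) {
	for _, a := range addrs {
		ip, subnet, err := net.ParseCIDR(a)
		if err != nil {
			return nil, fmt.Errorf("parsing address %q: %w", a, err)
		}
		for _, v := range vips {
			vip := net.ParseIP(v)
			if vip != nil && subnet.Contains(vip) {
				return ip, nil
			}
		}
	}
	return nil, fmt.Errorf("no address shares a subnet with VIPs %v", vips)
}

func main() {
	// Example addresses as they might appear on a multi-homed node.
	addrs := []string{"10.0.0.5/24", "192.168.1.10/24"}
	ip, err := eligibleIP(addrs, []string{"192.168.1.100"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ip)
}
```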

This patch adds the workaround suggested in [1] to make nodeport work. Instead of ethtool, we use NM to apply the fix for each connection before it is up.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1794714

Signed-off-by: Gal Zaidman <gzaidman@redhat.com>

When keepalived sets the VIP, it triggers a connection event in NetworkManager. Because the hostname we set from DHCP was transient, NM then does a reverse lookup on the connection address. Unfortunately, NM does not filter out the deprecated VIP for the reverse lookup and ends up overriding the hostname with DNS names that map to VIPs configured in the system.

This fix is a workaround for environments where DHCP provides the hostname: it prevents the erroneous NM behavior by setting the DHCP-provided FQDN as the static hostname, which stops NM from doing further address lookups. On hostname-less DHCP environments we'd want to hook into the hostname event in NM and make sure that the first hostname that does not map to a VIP is the one that gets set as static.

Signed-off-by: Antoni Segura Puimedon <antoni@redhat.com>

In PR [1] we added a workaround for bug [2]. It fails when the worker starts for the first time, since openshift-sdn is created only when the sdn pod is starting. Instead, we will disable by default and leave as is only when running with OVNkubernetes.

[1] openshift#1606
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1794714

Signed-off-by: Gal Zaidman <gzaidman@redhat.com>

This patch adds the workaround suggested in [1] to make nodeport work. Instead of ethtool, we use NM to apply the fix for each connection before it is up.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1794714

Brings the following ovirt fixes to the openstack platform:

- openshift#1606
- openshift#1621

A map doesn't guarantee order when we are creating new Ignition configs. When we update the image CR with blocked registries, the ctrcfg controller needs to update two files, registries.conf and policy.json. Since we get an update from the image CR about every 20 minutes, we compare the semantics to see if anything has changed. But since the order is not guaranteed, the controller might decide that the semantics are not equal even if nothing in the data changed.

Hence another MC is created, and every time we get an update the MC applied to the nodes keeps flipping back and forth between the 2 possible orders, causing the nodes to reboot repeatedly. So move to using a struct array to ensure the order is always the same and we don't end up with two similar MCs being created.

Signed-off-by: Urvashi Mohnani <umohnani@redhat.com>

MCO should apply the etcd-quorum-guard deployment instead of CVO. It also controls the number of replicas in this deployment: it scales to 1 replica if CEO's useUnsupportedUnsafeNonHANonProductionUnstableEtcd option is enabled. This allows creating single-node clusters.

OpenShift wants a pretty different version of gcp-routes.service compared to the generic case. So we really should just control it ourselves. This creates a service, openshift-gcp-routes.service, that conflicts with the gcp-routes.service from the RHCOS overlay. It also picks up some pending improvements to it, namely downfile support.

openshift-bot · 2020-10-20T12:56:14Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

LorbusChris · 2020-10-20T22:40:56Z

/close

openshift-ci-robot · 2020-10-20T22:41:14Z

@LorbusChris: Closed this PR.

Details

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Vadim Rutkovsky and others added 30 commits January 18, 2020 06:13

Add vrutkovs and LorbusChris to OWNERS

5d48938

FCOS support

df7d00e

Update code to use Ignition spec3

fa23f0a

Allow tuning cgroupv2 and mitigation arguments

6610e53

This is required for F31

templates: place /etc/pivot/kernel-args on masters/workers

ad3334a

This would disable cgroupsv2 and Spectre mitigations

Disable Zincati on masters/workers

14a90a9

Zincati would hit FCOS servers and update machines

Add a service to pull MCO image and extract MCD to /usr/local/bin

62f3246

Fix update codegen script to be compatible with go mod vendor

afd5896

Run go mod tidy

312e132

Update vendor

d0ff651

with `make go-deps`

install: seed configmap and image-references with ceo images

d6da554

This is necessary to prevent races

*: get etcd CR from kube api to check if it managed/unmanaged

12a3ea9

controller_context.go: terminate MCO pod for transient apiserver errors

bfe5af7

*: undo the change to controller_context

5a585ca

EtcdInformer is only used by the MCO pod to reconcile image names. This pulls out the logic of creating etcd informer from the generic controller context to MCO pod start method.

templates: add wait-for-kube

523f153

Signed-off-by: Sam Batschelet <sbatsche@redhat.com>

setup-etcd-environment/run.go: fix upgrade path

f097d98

On an upgrade from a 4.3 cluster to a cluster-etcd-operator enabled cluster discovery fails because the expected operator environment is not getting created.

Merge pull request openshift#1415 from vrutkovs/fcos-ceo

a8c6f1f

[fcos] Cherry-pick commits to enable cluster-etcd-operator

Merge pull request openshift#1423 from openshift-cherrypick-robot/che…

9de672e

…rry-pick-1421-to-fcos [fcos] pkg/server: serve config only to master in bootstrap server

GCP: add a script and a service which ensure GCP routes are correctly

aa25f77

This would ensure masters / workers would report as healthy to GCP LBs

Merge pull request openshift#1442 from vrutkovs/fcos-gcp

1b5c2d5

[fcos] GCP: add a script and a service which ensure GCP routes are set correctly

pkg/daemon/update.go: check for nil pointers

f1b7da9

Make sure daemon doesn't panic when mask or contents is nil

remove the etcd-member pod because we no longer need it

bac2c90

Merge pull request openshift#1498 from vrutkovs/fcos-pointer-nil

f642ee3

[FCOS] pkg/daemon/update.go: check for nil pointers

Merge pull request openshift#1500 from vrutkovs/fcos-cherrypick-1465

471f728

[fcos] remove the etcd-member pod because we no longer need it

non-virtual-ip: replace the script with podman wrapper

918fdc7

Avoid running python scripts on host and use `podman run --net=host` instead. Cherrypicked to master as openshift#1521.

Merge pull request openshift#1478 from vrutkovs/fcos-non-virtual-ip-c…

1c5ad92

…ontainer [FCOS] non-virtual-ip: replace the script with podman wrapper

Update NM's non_vip prependers

534531b

Merge pull request openshift#1565 from vrutkovs/fcos-node-ip

2d7ff58

[fcos] Use mcd subcommand to determine node ip

yuqi-zhang and others added 21 commits April 13, 2020 10:55

deleteStaleFiles: only check for /usr/etc on *COS systems

becd472

pkg/daemon: Add event for drain failures

9abae6d

OWNERS: add BZ component name

fb52d7d

Signed-off-by: Antonio Murdaca <runcom@linux.com>

cri-o: set log level to info

ffdde2f

as it's difficult to debug with cri-o only set to log level error Signed-off-by: Peter Hunt <pehunt@redhat.com>

sdn: ignore new ovn-kubernetes OVS internal port names ovn-k8s-gw0 an…

9f8fd22

…d ovn-k8s-mp0

Makefile: Drop -mod=vendor flag from tests

49439a0

This is inferred by the golang-1.13 container now. Remove the now-erroring copying of scripts and add a workaround for k8s.io/code-generator not playing nice with go modules.

Makefile: Cleanup after make verify

239447e

Restore code-generator's go.mod file and remove the vendor/k8s.io/code-generator/vendor directory after running verification in `make verify`.

make go-deps

f4b8c5e

Merge pull request openshift#1641 from vrutkovs/fcos-rebase-apr13

1de18b1

[fcos] rebase on recent master

etcdquorumguard_deployment: pass NSS_SDB_USE_CACHE=no to curl

6190dcb

Ensure `curl` command has NSS_SDB_USE_CACHE env var set

Merge pull request openshift#1653 from openshift-cherrypick-robot/che…

a2067c3

…rry-pick-1648-to-fcos etcdquorumguard_deployment: pass NSS_SDB_USE_CACHE=no to curl

Merge pull request openshift#1708 from vrutkovs/fcos-quorum-guard-scale

0e34bdf

[FCOS] Control amount of replicas in etcd-quorum-guard deployment

controller/template tests: more realistic controller configs

14b02a4

They're missing a lot of things, but let's just add a reasonable Infra object for now.

cmd/gcp-routes-controller: use downfile if available, tweak timing

83299a0

A new mode for gcp-routes allows for marking individual VIPs as down. If that's available, use it.

openshift-ci-robot assigned LorbusChris May 20, 2020

openshift-cherrypick-robot mentioned this pull request May 20, 2020

Bug 1802534: gcp-routes: move to MCO, implement downfile, tweak timing #1670

Merged

LorbusChris force-pushed the fcos branch 2 times, most recently from 784081c to 6f5b3c5 Compare June 10, 2020 14:15

openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 20, 2020

openshift-ci-robot closed this Oct 20, 2020
