installer template: add a function to test cluster state restore #3595

vrutkovs · 2019-04-25T12:35:43Z

This adds a new function to the installer template that emulates a cluster state restore. A new optional test, e2e-restore-cluster-state, is added to exercise it via rehearse jobs.

See https://jira.coreos.com/browse/CORS-1062

TODO:

- MCO doesn't seem to switch to Degraded in this case anymore (cc @runcom)
  Previously, the machine-config operator switched to the Degraded state when master configs were updated. This was a bug, and it has been fixed. The correct way to track config rollouts is to watch MachineConfigPool statuses.
- Join other etcd members in the restored cluster
- Install an older version of the cluster so that it can later be updated
  This will be taken care of later.
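
As an illustration of the MachineConfigPool point above, here is a minimal sketch of polling a pool's status from a shell script. It is not taken from this PR: the function name `wait_for_pool_updated`, the retry count, and the delay are hypothetical, and it assumes `oc` is on the PATH and already logged in to the cluster.

```shell
# Hypothetical helper (not part of the template): poll a MachineConfigPool
# until its "Updated" condition reports True, or give up after N attempts.
# Usage: wait_for_pool_updated <pool> [attempts] [delay-seconds]
wait_for_pool_updated() {
  local pool="$1"
  local attempts="${2:-60}"
  local delay="${3:-10}"
  local i=0
  local status
  while [ "$i" -lt "$attempts" ]; do
    # Read only the status of the condition whose type is "Updated".
    status="$(oc get machineconfigpool "$pool" \
      -o 'jsonpath={.status.conditions[?(@.type=="Updated")].status}' 2>/dev/null || true)"
    if [ "$status" = "True" ]; then
      echo "pool $pool is updated"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out waiting for pool $pool" >&2
  return 1
}
```

A caller would run something like `wait_for_pool_updated master` after changing master configs, rather than waiting for the operator itself to report Degraded.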

vrutkovs · 2019-04-26T07:14:48Z

/test pj-rehearse

runcom · 2019-04-29T11:35:12Z

> MCO doesn't seem to switch to Degraded in this case anymore (cc @runcom)

@vrutkovs why would it be Degraded, or was it before? Just asking for context.

patrickdillon

I tested this against an unaffiliated origin PR (which does not run the function) just to make sure editing the e2e template would not break other tests and it was fine. All tests completed successfully.

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml

patrickdillon · 2019-05-01T14:12:32Z

I don't think I have privileges but trying anyway:
/lgtm
/approve
NB: the rehearsal jobs should be failing at this stage

stevekuznetsov · 2019-05-03T00:11:06Z

Do we have consensus here? Please LMK

vrutkovs · 2019-05-10T12:31:36Z

/test pj-rehearse

vrutkovs · 2019-05-10T14:16:46Z

/test pj-rehearse

vrutkovs · 2019-05-10T16:06:47Z

/hold

This breaks the AWS test.

wking · 2019-05-10T16:08:07Z

Some kind of CI hiccup for e2e-aws:

2019/05/10 14:52:31 Container setup in pod e2e-aws completed successfully
failed to open log file "/var/log/pods/7c363a33-732f-11e9-8cec-42010a8e0004/test/0.log": open /var/log/pods/7c363a33-732f-11e9-8cec-42010a8e0004/test/0.log: no such file or directory2019/05/10 15:21:02 Container test in pod e2e-aws failed, exit code 1, reason Error
2019/05/10 15:21:11 Container artifacts in pod e2e-aws completed successfully
2019/05/10 15:21:11 Container teardown in pod e2e-aws completed successfully
2019/05/10 15:21:12 error: unable to signal to artifacts container to terminate in pod e2e-aws, triggering deletion: could not run remote command: pods "e2e-aws" is forbidden: pods "e2e-aws" not found
2019/05/10 15:21:12 error: unable to retrieve artifacts from pod e2e-aws: could not read gzipped artifacts: pods "e2e-aws" is forbidden: pods "e2e-aws" not found
E0510 15:32:06.936001      13 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".159d5c1d41e6b47f", GenerateName:"", Namespace:"ci-op-j2mgkr2k", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"", Namespace:"ci-op-j2mgkr2k", Name:"", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"CiJobFailed", Message:"Running job rehearse-3595-pull-ci-openshift-builder-master-e2e-aws for PRs () in namespace ci-op-j2mgkr2k from authors ()", Source:v1.EventSource{Component:"ci-op-j2mgkr2k", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf2d843db3d3987f, ext:4348204536303, loc:(*time.Location)(0x1d61ae0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf2d843db3d3987f, ext:4348204536303, loc:(*time.Location)(0x1d61ae0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'namespaces "ci-op-j2mgkr2k" not found' (will not retry!)

/retest

vrutkovs · 2019-05-10T16:21:14Z

/test pj-rehearse

vrutkovs · 2019-05-11T07:42:28Z

The failing AWS job was indeed just a hiccup.
/hold cancel
/test pj-rehearse

vrutkovs · 2019-05-13T11:13:27Z

/test pj-rehearse

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml

wking · 2019-05-15T19:23:20Z

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml

I think we want to change exit 1 to return 1, so the caller can decide whether to add additional error handling.

Never mind: the rest of this script runs under set -e, so exit 1 matches that.
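
The difference between the two forms can be seen in a small standalone sketch. The function names below are hypothetical and do not come from the template; the sketch only shows how `return 1`, `exit 1`, and `set -e` interact.

```shell
# Hypothetical stand-ins for a template function that fails.
fail_with_return() { return 1; }
fail_with_exit() { exit 1; }

# A non-zero return leaves the decision to the caller.
fail_with_return || echo "return: caller handled the failure"

# exit ends the whole shell, so it runs in a subshell here to keep the demo alive.
( fail_with_exit; echo "not reached" ) || echo "exit: subshell terminated"

# Under set -e, an unhandled non-zero return also ends the shell.
# A child shell is used so that the || below does not suppress set -e inside it.
sh -c 'set -e; f() { return 1; }; f; echo "not reached"' \
  || echo "set -e: unhandled return terminated the child shell"
```

In a script that runs entirely under set -e and never checks the function's status, both forms therefore stop the script at the same point, which is the reasoning behind the follow-up comment.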

wking · 2019-05-16T13:35:08Z

e2e-aws:

failed: (1s) 2019-05-16T12:49:32 "[k8s.io] [sig-node] Security Context [Feature:SecurityContext] should support seccomp alpha runtime/default annotation [Feature:Seccomp] [Suite:openshift/conformance/parallel] [Suite:k8s]"

/test e2e-aws

vrutkovs · 2019-05-16T13:50:06Z

/test pj-rehearse

hexfusion · 2019-05-16T14:57:01Z

/lgtm
thanks @vrutkovs

wking · 2019-05-16T15:08:08Z

/lgtm

🎉

openshift-ci-robot · 2019-05-16T15:08:26Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, patrickdillon, sdodson, vrutkovs, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~ci-operator/config/openshift/installer/OWNERS~~ [wking]
~~ci-operator/jobs/openshift/installer/OWNERS~~ [wking]
~~ci-operator/templates/openshift/installer/OWNERS~~ [wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2019-05-16T15:15:26Z

@vrutkovs: Updated the following 5 configmaps:

ci-operator-master-configs configmap in namespace ci using the following files:
- key openshift-installer-master.yaml using file ci-operator/config/openshift/installer/openshift-installer-master.yaml
ci-operator-master-configs configmap in namespace ci-stg using the following files:
- key openshift-installer-master.yaml using file ci-operator/config/openshift/installer/openshift-installer-master.yaml
job-config-master configmap in namespace ci using the following files:
- key openshift-installer-master-presubmits.yaml using file ci-operator/jobs/openshift/installer/openshift-installer-master-presubmits.yaml
prow-job-cluster-launch-installer-e2e configmap in namespace ci using the following files:
- key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
prow-job-cluster-launch-installer-e2e configmap in namespace ci-stg using the following files:
- key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) Apr 25, 2019

openshift-ci-robot requested review from abhinavdahiya and flaper87 April 25, 2019 12:36

vrutkovs force-pushed the restore-cluster-state branch 4 times, most recently from 3f55a80 to 45eb608 Compare April 25, 2019 16:33

vrutkovs force-pushed the restore-cluster-state branch 8 times, most recently from d87e882 to 4bf53d4 Compare April 30, 2019 16:41

patrickdillon reviewed Apr 30, 2019

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml (outdated)

openshift-ci-robot assigned patrickdillon May 1, 2019

openshift-ci-robot added and then removed the lgtm label (indicates that a PR is ready to be merged) May 1, 2019

vrutkovs force-pushed the restore-cluster-state branch 2 times, most recently from 90c3eb5 to 47045a4 Compare May 2, 2019 18:47

vrutkovs force-pushed the restore-cluster-state branch 5 times, most recently from c469ee0 to 2b9b38e Compare May 3, 2019 12:11

vrutkovs force-pushed the restore-cluster-state branch 2 times, most recently from e8cfe3d to 28f4a27 Compare May 10, 2019 12:25

vrutkovs force-pushed the restore-cluster-state branch from 28f4a27 to 1b3d15d Compare May 10, 2019 14:17

openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) May 10, 2019

openshift-ci-robot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) May 11, 2019

vrutkovs force-pushed the restore-cluster-state branch from 1b3d15d to c99a3b4 Compare May 13, 2019 08:32

wking reviewed May 15, 2019

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml (outdated)

wking reviewed May 15, 2019

vrutkovs force-pushed the restore-cluster-state branch from c2f0048 to 4866887 Compare May 16, 2019 10:07

installer template: add a function to test cluster state restore

5e02459

vrutkovs force-pushed the restore-cluster-state branch from 4866887 to 5e02459 Compare May 16, 2019 11:57

vrutkovs mentioned this pull request May 16, 2019

installer template: add 'recover from etcd quorum loss' test #3572

Closed

1 task

openshift-ci-robot assigned hexfusion May 16, 2019

openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) May 16, 2019

openshift-ci-robot assigned wking May 16, 2019

openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) May 16, 2019

openshift-merge-robot merged commit 499afed into openshift:master May 16, 2019
