@vrutkovs
Contributor

@vrutkovs vrutkovs commented Apr 25, 2019

This adds a new function to the installer template that emulates a cluster state restore. A new optional test, e2e-restore-cluster-state, is added so the function can be exercised via rehearse jobs.

See https://jira.coreos.com/browse/CORS-1062

TODO:

  • MCO doesn't seem to switch to Degraded in this case anymore (cc @runcom)
    Previously, the machine-config operator switched to the Degraded state when master configs were updated. This was a bug, and it has been fixed. The correct way to track config rollouts is to watch MachineConfigPool statuses.
  • Join other etcd members in the restored cluster
  • Install an older version of the cluster so that it can later be updated
    This will be taken care of later.
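
A minimal sketch of tracking a rollout through the MachineConfigPool status instead of the operator's Degraded state. It assumes `oc` is already logged in to the cluster and that the pool reports an `Updated` condition. The helper name `wait_for_pool_rollout`, the default pool `master`, and the `30m` timeout are illustrative assumptions, not taken from this PR.

```shell
# Hypothetical helper (not from the PR): block until the given
# MachineConfigPool reports the Updated condition, or time out.
wait_for_pool_rollout() {
  local pool="${1:-master}"    # pool name; "master" is an assumed default
  local timeout="${2:-30m}"    # assumed timeout, tune for the cluster
  # `oc wait` returns non-zero if the condition is not met in time,
  # so under `set -e` a stalled rollout fails the calling script.
  oc wait "machineconfigpool/${pool}" --for=condition=Updated --timeout="${timeout}"
}
```

A caller could run `wait_for_pool_rollout master 45m` after changing master configs and treat a non-zero exit status as a failed rollout.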

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 25, 2019
@vrutkovs vrutkovs force-pushed the restore-cluster-state branch 4 times, most recently from 3f55a80 to 45eb608 Compare April 25, 2019 16:33
@vrutkovs
Contributor Author

/test pj-rehearse

@runcom
Member

runcom commented Apr 29, 2019

  • MCO doesn't seem to switch to Degraded in this case anymore (cc @runcom)

@vrutkovs why would it be or was it? just for context

@vrutkovs vrutkovs force-pushed the restore-cluster-state branch 8 times, most recently from d87e882 to 4bf53d4 Compare April 30, 2019 16:41
Contributor

@patrickdillon patrickdillon left a comment

I tested this against an unaffiliated origin PR (which does not run the function) just to make sure editing the e2e template would not break other tests and it was fine. All tests completed successfully.

@patrickdillon
Contributor

I don't think I have privileges but trying anyway:
/lgtm
/approve
NB: the rehearsal jobs should be failing at this stage

@openshift-ci-robot openshift-ci-robot added and then removed the lgtm label (Indicates that a PR is ready to be merged.) May 1, 2019
@vrutkovs vrutkovs force-pushed the restore-cluster-state branch 2 times, most recently from 90c3eb5 to 47045a4 Compare May 2, 2019 18:47
@stevekuznetsov
Contributor

Do we have consensus here? Please LMK

@vrutkovs vrutkovs force-pushed the restore-cluster-state branch 5 times, most recently from c469ee0 to 2b9b38e Compare May 3, 2019 12:11
@vrutkovs vrutkovs force-pushed the restore-cluster-state branch 2 times, most recently from e8cfe3d to 28f4a27 Compare May 10, 2019 12:25
@vrutkovs
Contributor Author

/test pj-rehearse

@vrutkovs
Contributor Author

/test pj-rehearse

@vrutkovs vrutkovs force-pushed the restore-cluster-state branch from 28f4a27 to 1b3d15d Compare May 10, 2019 14:17
@vrutkovs
Contributor Author

/hold

Breaks aws test

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 10, 2019
@wking
Member

wking commented May 10, 2019

Some kind of CI hiccup for e2e-aws:

2019/05/10 14:52:31 Container setup in pod e2e-aws completed successfully
failed to open log file "/var/log/pods/7c363a33-732f-11e9-8cec-42010a8e0004/test/0.log": open /var/log/pods/7c363a33-732f-11e9-8cec-42010a8e0004/test/0.log: no such file or directory
2019/05/10 15:21:02 Container test in pod e2e-aws failed, exit code 1, reason Error
2019/05/10 15:21:11 Container artifacts in pod e2e-aws completed successfully
2019/05/10 15:21:11 Container teardown in pod e2e-aws completed successfully
2019/05/10 15:21:12 error: unable to signal to artifacts container to terminate in pod e2e-aws, triggering deletion: could not run remote command: pods "e2e-aws" is forbidden: pods "e2e-aws" not found
2019/05/10 15:21:12 error: unable to retrieve artifacts from pod e2e-aws: could not read gzipped artifacts: pods "e2e-aws" is forbidden: pods "e2e-aws" not found
E0510 15:32:06.936001      13 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".159d5c1d41e6b47f", GenerateName:"", Namespace:"ci-op-j2mgkr2k", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"", Namespace:"ci-op-j2mgkr2k", Name:"", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"CiJobFailed", Message:"Running job rehearse-3595-pull-ci-openshift-builder-master-e2e-aws for PRs () in namespace ci-op-j2mgkr2k from authors ()", Source:v1.EventSource{Component:"ci-op-j2mgkr2k", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf2d843db3d3987f, ext:4348204536303, loc:(*time.Location)(0x1d61ae0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf2d843db3d3987f, ext:4348204536303, loc:(*time.Location)(0x1d61ae0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'namespaces "ci-op-j2mgkr2k" not found' (will not retry!)

/retest

@vrutkovs
Contributor Author

/test pj-rehearse

@vrutkovs
Contributor Author

Failing AWS job was just a hiccup indeed
/hold cancel
/test pj-rehearse

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 11, 2019
@vrutkovs vrutkovs force-pushed the restore-cluster-state branch from 1b3d15d to c99a3b4 Compare May 13, 2019 08:32
@vrutkovs
Contributor Author

/test pj-rehearse

Member

I think we want exit 1 -> return 1, so the caller can decide if they want to add additional error handling.

Member

never mind, the rest of this script is all set -e, so this matches that.
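
For context, a small sketch of the trade-off discussed in this thread. Under `set -e`, a helper that fails with `return 1` aborts the script just like `exit 1` when it is called as a plain command. The two differ only when the caller wraps the call in its own error handling. The names `fail_step`, `plain_script`, `handled_script`, and `run_demo` are hypothetical and not from the PR's script.

```shell
# Print a script that calls a failing helper with no error handling.
# $1 is the keyword the helper fails with: "return" or "exit".
plain_script() {
  cat <<EOF
set -e
fail_step() { $1 1; }
fail_step
echo "continued"
EOF
}

# Print the same script, but with the caller handling the failure.
handled_script() {
  cat <<EOF
set -e
fail_step() { $1 1; }
if ! fail_step; then echo "handled"; fi
echo "continued"
EOF
}

# Run a generated script in a fresh shell so it cannot abort the caller.
run_demo() { sh -c "$1"; }
```

Running `run_demo "$(plain_script return)"` and `run_demo "$(plain_script exit)"` both stop before "continued" with status 1. With `handled_script`, only the `return` variant reaches the handler and continues, while the `exit` variant still terminates the shell.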

@vrutkovs vrutkovs force-pushed the restore-cluster-state branch from c2f0048 to 4866887 Compare May 16, 2019 10:07
@vrutkovs vrutkovs force-pushed the restore-cluster-state branch from 4866887 to 5e02459 Compare May 16, 2019 11:57
@wking
Member

wking commented May 16, 2019

e2e-aws:

failed: (1s) 2019-05-16T12:49:32 "[k8s.io] [sig-node] Security Context [Feature:SecurityContext] should support seccomp alpha runtime/default annotation [Feature:Seccomp] [Suite:openshift/conformance/parallel] [Suite:k8s]"

/test e2e-aws

@vrutkovs
Contributor Author

/test pj-rehearse

@hexfusion
Contributor

/lgtm
thanks @vrutkovs

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 16, 2019
@wking
Member

wking commented May 16, 2019

/lgtm

🎉

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, patrickdillon, sdodson, vrutkovs, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 16, 2019
@openshift-merge-robot openshift-merge-robot merged commit 499afed into openshift:master May 16, 2019
@openshift-ci-robot
Contributor

@vrutkovs: Updated the following 5 configmaps:

  • ci-operator-master-configs configmap in namespace ci using the following files:
    • key openshift-installer-master.yaml using file ci-operator/config/openshift/installer/openshift-installer-master.yaml
  • ci-operator-master-configs configmap in namespace ci-stg using the following files:
    • key openshift-installer-master.yaml using file ci-operator/config/openshift/installer/openshift-installer-master.yaml
  • job-config-master configmap in namespace ci using the following files:
    • key openshift-installer-master-presubmits.yaml using file ci-operator/jobs/openshift/installer/openshift-installer-master-presubmits.yaml
  • prow-job-cluster-launch-installer-e2e configmap in namespace ci using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
  • prow-job-cluster-launch-installer-e2e configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml