DR snapshot restore: use scripts provided by MCO #3828

vrutkovs · 2019-05-17T10:10:23Z

Most etcd scripts are now controlled by MCO, so the `restore-cluster-state`
function now uses those instead of the vendored scripts.

TODO:

- Merge JIRA PROD-1027: templates/master/00-master: add a script to make etcd snapshot machine-config-operator#775
- Merge JIRA PROD-1027: templates/master/00-master: optionally pass etcd connection string machine-config-operator#776
- Merge DR start_static_pods: reread a list of static pods when starting them machine-config-operator#779
- Fix "Pod openshift-apiserver/apiserver-xxxx is not healthy: container openshift-apiserver has restarted more than 5 times" test
- Fix "Prometheus when installed on the cluster should report less than two alerts in firing or pending state"
- The last node doesn't complete the recovery process and does nothing on rerun, so it doesn't get rebooted correctly
- Merge DR: reload systemd services on disk before starting kubelet in etcd restore script machine-config-operator#788 (some nodes don't start because the kubelet service has changed)

vrutkovs · 2019-05-17T14:37:26Z

oh, that's not good:

Make etcd backup on first master
Creating asset directory ./assets
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
Backing up /etc/kubernetes/manifests/etcd-member.yaml to ./assets/backup/
Error: could not open ./assets/backup/etcd/member/snap/db.part (open ./assets/backup/etcd/member/snap/db.part: no such file or directory)
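The log shows `etcdctl` trying to open `db.part` under a directory that does not exist: the snapshot is first written to `<path>.part`, and the parent directory is not created for it. A minimal sketch of a guard, assuming the backup step is a shell function; `save_etcd_snapshot` is a hypothetical name and this is not the MCO script's actual code:

```shell
# Hypothetical helper: make sure the snapshot's parent directory exists
# before asking etcdctl to write into it.
# Assumes etcdctl is on PATH and that endpoint/TLS settings are supplied
# through the environment (e.g. ETCDCTL_ENDPOINTS, ETCDCTL_CACERT).
save_etcd_snapshot() {
  snapshot_path="$1"
  # The failure above is a missing parent directory, so create it first.
  mkdir -p "$(dirname "$snapshot_path")" || return 1
  ETCDCTL_API=3 etcdctl snapshot save "$snapshot_path"
}
```

Usage would look like `save_etcd_snapshot ./assets/backup/etcd/member/snap/db`.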

vrutkovs · 2019-05-18T07:14:10Z

Failing tests:

[Feature:Platform] Managed cluster should have no crashlooping pods in core namespaces over two minutes [Suite:openshift/conformance/parallel]

It seems we need to add a two-minute pause to ensure this test passes.

Other test failures are flakes

Flaky tests:

[Feature:DeploymentConfig] deploymentconfigs keep the deployer pod invariant valid [Conformance] should deal with config change in case the deployment is still running [Suite:openshift/conformance/parallel/minimal]
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should provide basic identity [Suite:openshift/conformance/parallel] [Suite:k8s]

vrutkovs · 2019-05-20T11:21:45Z

/hold

Waiting for MCO bugfix to land to make sure all tests pass

vrutkovs · 2019-05-21T08:36:23Z

fail [github.com/openshift/origin/test/extended/operators/cluster.go:118]: Expected
    <[]string | len:3, cap:4>: [
        "Pod openshift-apiserver/apiserver-qhcch is not healthy: container openshift-apiserver has restarted more than 5 times",
        "Pod openshift-kube-apiserver/kube-apiserver-ip-10-0-129-7.ec2.internal is not healthy: container kube-apiserver-6 has restarted more than 5 times",
        "Pod openshift-kube-apiserver/kube-apiserver-ip-10-0-140-240.ec2.internal is not healthy: container kube-apiserver-6 has restarted more than 5 times",
    ]
to be empty
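To see which containers trip this check without rerunning the whole suite, the restart counts can be listed by hand. A rough sketch, assuming `oc` is logged in to the cluster; `list_restarting_containers` is a hypothetical helper and not part of the origin test suite:

```shell
# Hypothetical helper: print "<pod> <container> <restarts>" for every
# container in a namespace whose restart count exceeds a threshold
# (default 5, matching the test message above).
list_restarting_containers() {
  ns="$1"
  threshold="${2:-5}"
  # One line per pod: "<pod> <container>=<restarts> <container>=<restarts> ..."
  oc get pods -n "$ns" -o jsonpath='{range .items[*]}{.metadata.name}{range .status.containerStatuses[*]}{" "}{.name}{"="}{.restartCount}{end}{"\n"}{end}' |
    awk -v max="$threshold" '{
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[2] + 0 > max) print $1, kv[1], kv[2]
      }
    }'
}
```

For example, `list_restarting_containers openshift-kube-apiserver` would list only the containers that restarted more than 5 times.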

vrutkovs · 2019-05-21T13:31:25Z

ssh bastion didn't start

/test pj-rehearse

vrutkovs · 2019-05-22T07:04:04Z

/test pj-rehearse

vrutkovs · 2019-05-22T12:09:17Z

/test pj-rehearse

patrickdillon · 2019-05-24T14:32:57Z

> /cc @hexfusion @patrickdillon
>
> Please review the sequence of actions and params to the scripts

Looks good, but I will give time for @hexfusion to take a look.

abhinavdahiya · 2019-05-24T16:12:49Z

/approve

wking · 2019-05-24T22:39:17Z

PR topic references openshift/machine-config-operator#791. Looks like that's been closed in favor of the still-open openshift/machine-config-operator#793? Are we still waiting for that to land? [Edit: sounds like the plan is to land this first to help debug the MCO PR]

wking · 2019-05-24T22:41:26Z

...s/openshift/machine-config-operator/openshift-machine-config-operator-master-presubmits.yaml

@@ -182,7 +182,7 @@ presubmits:
          secretName: sentry-dsn
    trigger: '(?m)^/test (?:.*? )?e2e-aws-upgrade(?: .*?)?$'
  - agent: kubernetes
-    always_run: false
+    always_run: true


I think we want a longer track-record of success before we do this (although it's really up to the MCO team). Currently, the past 24 hours have three failures and no success for this job.

We can't have passing tests before MCO scripts are debugged - and we can't properly test those without having a dedicated test for DR scenarios (e.g. openshift/machine-config-operator#793 (comment))

You can /test e2e-restore-cluster-state in that PR and it will run (and rerun after each bump) to help you debug that PR. No need to run this in all other MCO PRs while you debug that one.

It's not clear which MCO PR would break DR scenarios. Also, if this test is misbehaving, it can be skipped with /skip since it's optional.

wking · 2019-05-24T22:45:42Z

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml

+
+          echo "Remove existing openshift-apiserver pods"
+          # This would ensure "Pod 'openshift-apiserver/apiserver-xxx' is not healthy: container openshift-apiserver has restarted more than 5 times" test won't fail
+          oc delete pod --all -n openshift-apiserver


hmm, this probably also blows away our logs for those pods? Maybe we want to pull down their logs into the shared artifacts volume before doing this?
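One way to keep those logs is to dump them before the delete. A minimal sketch, not the template's actual code: `save_logs_and_delete_pods` is a hypothetical helper, and `ARTIFACT_DIR` is an assumed variable pointing at the shared artifacts volume.

```shell
# Hypothetical helper: save current and previous container logs for every
# pod in a namespace, then delete the pods.
# ARTIFACT_DIR is an assumption; it falls back to /tmp/artifacts if unset.
save_logs_and_delete_pods() {
  ns="$1"
  out_dir="${ARTIFACT_DIR:-/tmp/artifacts}/pods-before-delete/$ns"
  mkdir -p "$out_dir" || return 1
  for pod in $(oc get pods -n "$ns" -o name); do
    name="${pod#pod/}"   # "oc get -o name" prints "pod/<name>"
    # Best effort: a failed log fetch should not block the cleanup.
    oc logs -n "$ns" "$pod" --all-containers > "$out_dir/$name.log" 2>&1 || true
    # Logs of the previous (restarted) container instances, if any exist.
    oc logs -n "$ns" "$pod" --all-containers --previous > "$out_dir/$name.previous.log" 2>&1 || true
  done
  oc delete pod --all -n "$ns"
}
```

Calling `save_logs_and_delete_pods openshift-apiserver` in place of the bare `oc delete pod` would leave one log file per pod in the artifacts directory before the pods go away.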

This commit updates the `restore-cluster-state` function used for DR tests. It leverages scripts that MCO deploys on the masters. This change also makes all MCO PRs run this test so that we can fix the scripts if necessary.

openshift-ci-robot · 2019-05-27T13:47:08Z

@vrutkovs: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/rehearse/openshift/machine-config-operator/master/e2e-restore-cluster-state | 7358884b8c1c539a53ac178e3979b0235364fe5b | link | `/test pj-rehearse` |


runcom · 2019-05-29T13:33:53Z

/approve

hexfusion · 2019-05-30T12:46:19Z

/lgtm

openshift-ci-robot · 2019-05-30T12:46:56Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, hexfusion, runcom, vrutkovs


Needs approval from an approver in each of these files:

~~ci-operator/jobs/openshift/machine-config-operator/OWNERS~~ [runcom]
~~ci-operator/templates/openshift/installer/OWNERS~~ [abhinavdahiya]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2019-05-30T12:53:56Z

@vrutkovs: Updated the following 3 configmaps:

prow-job-cluster-launch-installer-e2e configmap in namespace ci using the following files:
- key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
prow-job-cluster-launch-installer-e2e configmap in namespace ci-stg using the following files:
- key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
job-config-master configmap in namespace ci using the following files:
- key openshift-machine-config-operator-master-presubmits.yaml using file ci-operator/jobs/openshift/machine-config-operator/openshift-machine-config-operator-master-presubmits.yaml


openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 17, 2019

openshift-ci-robot requested review from smarterclayton and staebler May 17, 2019 10:10

vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from 91ef3d0 to 3eac0c5 Compare May 17, 2019 13:26

vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from 1ce5def to 0d9bc25 Compare May 17, 2019 15:22

vrutkovs changed the title ~~WIP DR snapshot restore: use scripts provided by MCO~~ DR snapshot restore: use scripts provided by MCO May 17, 2019

vrutkovs force-pushed the etcd-snapshot-mco-scripts branch 4 times, most recently from 4b86017 to 1aac106 Compare May 17, 2019 21:08

openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 17, 2019

vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from 826929a to f729a7e Compare May 20, 2019 09:08

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 20, 2019

vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from f729a7e to f3f5594 Compare May 21, 2019 06:19

vrutkovs force-pushed the etcd-snapshot-mco-scripts branch 2 times, most recently from 4f99974 to a2e03ae Compare May 21, 2019 12:11

vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from a2e03ae to e0fd47b Compare May 21, 2019 14:51

openshift-ci-robot removed the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 22, 2019

vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from 5717efe to d062872 Compare May 24, 2019 15:42

vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from d062872 to 3f7b01c Compare May 24, 2019 16:48

openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 24, 2019

vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from 3f7b01c to 0ccd9d9 Compare May 24, 2019 18:03

wking reviewed May 24, 2019

wking reviewed May 24, 2019

DR snapshot restore: use scripts provided by MCO

a156af2


vrutkovs force-pushed the etcd-snapshot-mco-scripts branch 2 times, most recently from 07d04f2 to a156af2 Compare May 27, 2019 10:23

vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from 7358884 to a156af2 Compare May 27, 2019 13:46

openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 27, 2019

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 29, 2019

openshift-ci-robot assigned hexfusion May 30, 2019

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 30, 2019

openshift-merge-robot merged commit 1571cfd into openshift:master May 30, 2019
