Skip to content

Conversation

@vrutkovs
Copy link
Contributor

@vrutkovs vrutkovs commented May 17, 2019

Most etcd scripts are now controlled by MCO, so restore-cluster-state
function now uses those instead vendored scripts.

TODO:

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 17, 2019
@vrutkovs vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from 91ef3d0 to 3eac0c5 Compare May 17, 2019 13:26
@vrutkovs
Copy link
Contributor Author

oh, that's not good:

Make etcd backup on first master
Creating asset directory ./assets
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
Backing up /etc/kubernetes/manifests/etcd-member.yaml to ./assets/backup/
Error: could not open ./assets/backup/etcd/member/snap/db.part (open ./assets/backup/etcd/member/snap/db.part: no such file or directory)

@vrutkovs vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from 1ce5def to 0d9bc25 Compare May 17, 2019 15:22
@vrutkovs vrutkovs changed the title WIP DR snapshot restore: use scripts provided by MCO DR snapshot restore: use scripts provided by MCO May 17, 2019
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 17, 2019
@vrutkovs vrutkovs force-pushed the etcd-snapshot-mco-scripts branch 4 times, most recently from 4b86017 to 1aac106 Compare May 17, 2019 21:08
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 17, 2019
@vrutkovs
Copy link
Contributor Author

Failing tests:

[Feature:Platform] Managed cluster should have no crashlooping pods in core namespaces over two minutes [Suite:openshift/conformance/parallel]

Seems we need to add a two minute pause to ensure this test passes

Other test failures are flakes

Flaky tests:

[Feature:DeploymentConfig] deploymentconfigs keep the deployer pod invariant valid [Conformance] should deal with config change in case the deployment is still running [Suite:openshift/conformance/parallel/minimal]
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should provide basic identity [Suite:openshift/conformance/parallel] [Suite:k8s]

@vrutkovs vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from 826929a to f729a7e Compare May 20, 2019 09:08
@vrutkovs
Copy link
Contributor Author

/hold

Waiting for MCO bugfix to land to make sure all tests pass

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 20, 2019
@vrutkovs vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from f729a7e to f3f5594 Compare May 21, 2019 06:19
@vrutkovs
Copy link
Contributor Author

fail [github.com/openshift/origin/test/extended/operators/cluster.go:118]: Expected
    <[]string | len:3, cap:4>: [
        "Pod openshift-apiserver/apiserver-qhcch is not healthy: container openshift-apiserver has restarted more than 5 times",
        "Pod openshift-kube-apiserver/kube-apiserver-ip-10-0-129-7.ec2.internal is not healthy: container kube-apiserver-6 has restarted more than 5 times",
        "Pod openshift-kube-apiserver/kube-apiserver-ip-10-0-140-240.ec2.internal is not healthy: container kube-apiserver-6 has restarted more than 5 times",
    ]
to be empty

@vrutkovs vrutkovs force-pushed the etcd-snapshot-mco-scripts branch 2 times, most recently from 4f99974 to a2e03ae Compare May 21, 2019 12:11
@vrutkovs
Copy link
Contributor Author

ssh bastion didn't start

/test pj-rehearse

@vrutkovs vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from a2e03ae to e0fd47b Compare May 21, 2019 14:51
@vrutkovs
Copy link
Contributor Author

/test pj-rehearse

1 similar comment
@vrutkovs
Copy link
Contributor Author

/test pj-rehearse

@openshift-ci-robot openshift-ci-robot removed the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 22, 2019
@patrickdillon
Copy link
Contributor

/cc @hexfusion @patrickdillon

Please review the sequence of actions and params to the scripts

Looks good, but I will give time for @hexfusion to take a look.

@vrutkovs vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from 5717efe to d062872 Compare May 24, 2019 15:42
@abhinavdahiya
Copy link
Contributor

/approve

@vrutkovs vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from d062872 to 3f7b01c Compare May 24, 2019 16:48
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 24, 2019
@vrutkovs vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from 3f7b01c to 0ccd9d9 Compare May 24, 2019 18:03
@wking
Copy link
Member

wking commented May 24, 2019

PR topic references openshift/machine-config-operator#791. Looks like that's been closed in favor of the still-open openshift/machine-config-operator#793? Are we still waiting for that to land? [Edit: sounds like the plan is to land this first to help debug the MCO PR]

@@ -182,7 +182,7 @@ presubmits:
secretName: sentry-dsn
trigger: '(?m)^/test (?:.*? )?e2e-aws-upgrade(?: .*?)?$'
- agent: kubernetes
always_run: false
always_run: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want a longer track-record of success before we do this (although it's really up to the MCO team). Currently, the past 24 hours have three failures and no success for this job.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't have passing tests before MCO scripts are debugged - and we can't properly test those without having a dedicated test for DR scenarios (e.g. openshift/machine-config-operator#793 (comment))

Copy link
Member

@wking wking May 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can /test e2e-restore-cluster-state in that PR and it will run (and rerun after each bump) to help you debug that PR. No need to run this in all other MCO PRs while you debug that one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not clear which MCO PR would break DR scenarios. Also, if this test is misbehaving it can be skipped with /skip since its optional


echo "Remove existing openshift-apiserver pods"
# This would ensure "Pod 'openshift-apiserver/apiserver-xxx' is not healthy: container openshift-apiserver has restarted more than 5 times" test won't fail
oc delete pod --all -n openshift-apiserver
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, this probably also blows away our logs for those pods? Maybe we want to pull down their logs into the shared artifacts volume before doing this?

This commit updates `restore-cluster-state` function used for DR tests.
It leverages scripts, which MCO deploys on the masters.

This change also makes all MCO PRs run this test so that we could
fix the scripts if necessary
@vrutkovs vrutkovs force-pushed the etcd-snapshot-mco-scripts branch 2 times, most recently from 07d04f2 to a156af2 Compare May 27, 2019 10:23
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 27, 2019
@vrutkovs vrutkovs force-pushed the etcd-snapshot-mco-scripts branch from 7358884 to a156af2 Compare May 27, 2019 13:46
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 27, 2019
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented May 27, 2019

@vrutkovs: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/rehearse/openshift/machine-config-operator/master/e2e-restore-cluster-state 7358884b8c1c539a53ac178e3979b0235364fe5b link /test pj-rehearse

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@runcom
Copy link
Member

runcom commented May 29, 2019

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 29, 2019
@hexfusion
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 30, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, hexfusion, runcom, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 1571cfd into openshift:master May 30, 2019
@openshift-ci-robot
Copy link
Contributor

@vrutkovs: Updated the following 3 configmaps:

  • prow-job-cluster-launch-installer-e2e configmap in namespace ci using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
  • prow-job-cluster-launch-installer-e2e configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
  • job-config-master configmap in namespace ci using the following files:
    • key openshift-machine-config-operator-master-presubmits.yaml using file ci-operator/jobs/openshift/machine-config-operator/openshift-machine-config-operator-master-presubmits.yaml
Details

In response to this:

Most etcd scripts are now controlled by MCO, so restore-cluster-state
function now uses those instead vendored scripts.

TODO:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants