@vrutkovs
Contributor

@vrutkovs vrutkovs commented May 20, 2019

Rework of #3572, which uses MCO-provided scripts instead of bundled ones.

This adds a new e2e-etcd-quorum-loss test which simulates failures of two masters.
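Why losing two masters means quorum loss can be sketched with a little arithmetic. This assumes a three-master control plane with one etcd member per master (the cluster size is not stated in this PR), and the helper names `quorum_size` and `has_quorum` are hypothetical, not part of the test template:

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not taken from the CI template.

# Majority quorum for an etcd cluster with N voting members: floor(N/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit 0) if a cluster of $1 members still has quorum
# after $2 of them have failed.
has_quorum() {
  local members="$1" failed="$2"
  [ $(( members - failed )) -ge "$(quorum_size "$members")" ]
}

# Assumed scenario: 3 masters, 2 of them failed.
if has_quorum 3 2; then
  echo "quorum intact"
else
  echo "quorum lost, restore from snapshot required"
fi
```

With three members the quorum is two, so a single surviving master cannot accept writes on its own.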

TODO:

@openshift-ci-robot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) labels May 20, 2019
@vrutkovs vrutkovs force-pushed the dr-quorum-loss-mco branch 2 times, most recently from 0bfa382 to 17caefe Compare May 21, 2019 16:31
@vrutkovs changed the title from "WIP installer template: add etcd quorum loss scenario" to "installer template: add etcd quorum loss scenario" May 21, 2019
@openshift-ci-robot removed the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) label May 21, 2019
@vrutkovs
Contributor Author

vrutkovs commented May 21, 2019

/hold

Reaches test phase, holding it until tests pass:

Failing tests:

[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should report less than two alerts in firing or pending state [Suite:openshift/conformance/parallel/minimal]

@openshift-ci-robot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label May 21, 2019
@vrutkovs vrutkovs force-pushed the dr-quorum-loss-mco branch from 17caefe to 68837ec Compare May 22, 2019 08:21
@vrutkovs
Contributor Author

/usr/local/bin/etcd-snapshot-restore.sh never ran because the connection to the SSH bastion was broken.

/test pj-rehearse

@vrutkovs
Contributor Author

/test pj-rehearse

One of the quorum loss tests passed

@vrutkovs vrutkovs force-pushed the dr-quorum-loss-mco branch 5 times, most recently from dce3aef to a7b8dc8 Compare May 23, 2019 14:00
@vrutkovs
Contributor Author

/hold cancel

Tests are ready for review.
/cc @wking

@openshift-ci-robot removed the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label May 23, 2019
@openshift-ci-robot openshift-ci-robot requested a review from wking May 23, 2019 14:02
@vrutkovs vrutkovs force-pushed the dr-quorum-loss-mco branch 2 times, most recently from dd7092a to c850f43 Compare May 23, 2019 14:04
@vrutkovs vrutkovs force-pushed the dr-quorum-loss-mco branch 10 times, most recently from 4ab87b8 to a7f9e76 Compare May 28, 2019 15:09
@vrutkovs
Contributor Author

/test pj-rehearse

@vrutkovs
Contributor Author

Route53 issue

/test pj-rehearsal

@vrutkovs
Contributor Author

/test pj-rehearsal

@vrutkovs vrutkovs force-pushed the dr-quorum-loss-mco branch from a7f9e76 to 53d37e2 Compare May 28, 2019 19:11
@vrutkovs
Contributor Author

/test pj-rehearse

@runcom
Member

runcom commented May 29, 2019

/approve

@openshift-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) label May 29, 2019
Contributor

@hexfusion hexfusion May 30, 2019

I am about to break this and make etcd_name a required param for etcd-member-recover.sh. Can we add this now? For AWS you can do something like:

      HOSTNAME=$(hostname)
      HOSTDOMAIN=$(hostname -d)
      ETCD_NAME=etcd-member-${HOSTNAME}.${HOSTDOMAIN}

The above will fail for vsphere and bare metal so we will need to figure out what works best there.

https://github.com/openshift/machine-config-operator/blob/ca2e5c541f3bd1dc1ec82e4fabafbb4f5228ca63/templates/master/00-master/_base/files/usr-local-bin-etcd-member-recover-sh.yaml#L22
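A minimal sketch of the derivation suggested above, written defensively for platforms where `hostname -d` returns nothing. The function name `derive_etcd_name` is hypothetical and is not part of etcd-member-recover.sh; how the resulting name is passed to that script is not shown in this thread, so it is left out here:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the MCO-provided scripts.
# Builds "etcd-member-<hostname>.<domain>" and refuses to guess
# when either part is empty (the vsphere / bare metal case).
derive_etcd_name() {
  local host="$1" domain="$2"
  if [ -z "$host" ] || [ -z "$domain" ]; then
    echo "cannot derive etcd name: hostname or DNS domain is empty" >&2
    return 1
  fi
  printf 'etcd-member-%s.%s\n' "$host" "$domain"
}

# Example call on a node (commented out so the snippet has no side effects):
# ETCD_NAME=$(derive_etcd_name "$(hostname)" "$(hostname -d)") || exit 1
```

Failing loudly when the domain is missing keeps a malformed member name from reaching the recovery script on platforms where this scheme does not apply.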

Contributor Author

> Can we add this now?

Sure

> The above will fail for vsphere and bare metal so we will need to figure out what works best there.

I don't think we plan to run DR scenarios on vSphere CI, so it just needs to be properly documented.

@vrutkovs vrutkovs force-pushed the dr-quorum-loss-mco branch 2 times, most recently from b757e39 to c1e8f71 Compare May 30, 2019 12:48
@hexfusion
Contributor

@vrutkovs looks like we need a rebase now

@vrutkovs vrutkovs force-pushed the dr-quorum-loss-mco branch from c1e8f71 to 1f85eac Compare May 30, 2019 12:57
@hexfusion
Contributor

/lgtm

Nice work @vrutkovs, thank you.

@openshift-ci-robot added the lgtm (Indicates that a PR is ready to be merged.) label May 30, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, hexfusion, runcom, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 881b8de into openshift:master May 30, 2019
@openshift-ci-robot
Contributor

@vrutkovs: Updated the following 5 configmaps:

  • ci-operator-master-configs configmap in namespace ci-stg using the following files:
    • key openshift-installer-master.yaml using file ci-operator/config/openshift/installer/openshift-installer-master.yaml
    • key openshift-machine-config-operator-master.yaml using file ci-operator/config/openshift/machine-config-operator/openshift-machine-config-operator-master.yaml
  • job-config-master configmap in namespace ci using the following files:
    • key openshift-installer-master-presubmits.yaml using file ci-operator/jobs/openshift/installer/openshift-installer-master-presubmits.yaml
    • key openshift-machine-config-operator-master-presubmits.yaml using file ci-operator/jobs/openshift/machine-config-operator/openshift-machine-config-operator-master-presubmits.yaml
  • prow-job-cluster-launch-installer-e2e configmap in namespace ci using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
  • prow-job-cluster-launch-installer-e2e configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
  • ci-operator-master-configs configmap in namespace ci using the following files:
    • key openshift-installer-master.yaml using file ci-operator/config/openshift/installer/openshift-installer-master.yaml
    • key openshift-machine-config-operator-master.yaml using file ci-operator/config/openshift/machine-config-operator/openshift-machine-config-operator-master.yaml

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Contributor

@vrutkovs: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/rehearse/openshift/machine-config-operator/master/e2e-rhel-scaleup | dd7092a5352b05aeebf3b5a4121b899663458217 | link | /test pj-rehearse |
| ci/rehearse/openshift/machine-config-operator/master/e2e-etcd-quorum-loss | 1f85eac | link | /test pj-rehearse |
| ci/rehearse/openshift/installer/master/e2e-etcd-quorum-loss | 1f85eac | link | /test pj-rehearse |
| ci/prow/pj-rehearse | 1f85eac | link | /test pj-rehearse |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.



Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • lgtm: Indicates that a PR is ready to be merged.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.


7 participants