@runcom
Member

@runcom runcom commented Oct 16, 2019

There are two main reasons for this patch. The first one is obvious: it
avoids spending time retemplating when there's no need, since the
ControllerConfig's spec hasn't changed.
The second reason is kind of tricky, but what can happen is the
following:

  1. between 4.1 and 4.2 MCO changed the etcd-member image name
    placeholder from setupEtcdEnv to setupEtcdEnvKey (along with other
    changes to image names)
  2. when an upgrade from 4.1 starts the following happens:
    a) the new MCO is rolled out
    b) the new MCC is rolled out (note the template controller is also
    rolled out here and can start before the new CC is rolled out with the
    new image name placeholder!)
    c) RACE! the template controller starts running before the new
    ControllerConfig with the new image name placeholder is created
    d) the MCC generates templates w/o the image name for the etcd member
    e) cluster dies

The patch does a simple thing but ensures that the above scenario isn't
hit since we're not going to retemplate anymore if the CC hasn't changed
(yet).

Signed-off-by: Antonio Murdaca [email protected]
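
The guard described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: `ControllerConfig`, `ControllerConfigSpec`, and `needsRetemplate` are hypothetical stand-ins, and the image references are placeholders. The idea is only that an update event whose spec is unchanged should not trigger a retemplate.

```go
package main

import (
	"fmt"
	"reflect"
)

// ControllerConfigSpec is a hypothetical stand-in for the real spec type;
// only the image map matters for this sketch.
type ControllerConfigSpec struct {
	Images map[string]string
}

// ControllerConfig is a hypothetical stand-in for the ControllerConfig object.
type ControllerConfig struct {
	ResourceVersion string
	Spec            ControllerConfigSpec
}

// needsRetemplate reports whether an update event actually changed the spec.
// Updates that only touch metadata (for example a periodic resync) return false.
func needsRetemplate(old, cur *ControllerConfig) bool {
	return !reflect.DeepEqual(old.Spec, cur.Spec)
}

func main() {
	// Placeholder image references, not real pull specs.
	oldCC := &ControllerConfig{
		ResourceVersion: "1",
		Spec: ControllerConfigSpec{
			Images: map[string]string{"setupEtcdEnv": "example.invalid/etcd:4.1"},
		},
	}
	// Same spec, different metadata: the old CC seen again by the new controller.
	resync := &ControllerConfig{ResourceVersion: "2", Spec: oldCC.Spec}
	// The new CC with the renamed image key.
	newCC := &ControllerConfig{
		ResourceVersion: "3",
		Spec: ControllerConfigSpec{
			Images: map[string]string{"setupEtcdEnvKey": "example.invalid/etcd:4.2"},
		},
	}

	fmt.Println("resync needs retemplate:", needsRetemplate(oldCC, resync))
	fmt.Println("new CC needs retemplate:", needsRetemplate(oldCC, newCC))
}
```

With a check like this, the new template controller does not render anything from the stale 4.1 ControllerConfig; it only retemplates once a ControllerConfig with a different spec shows up.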

@openshift-ci-robot added the size/XS (denotes a PR that changes 0-9 lines, ignoring generated files) and approved (indicates a PR has been approved by an approver from all required OWNERS files) labels Oct 16, 2019
@runcom
Member Author

runcom commented Oct 16, 2019

/skip

@runcom
Member Author

runcom commented Oct 16, 2019

/retest

@runcom
Member Author

runcom commented Oct 16, 2019

lemme check if something is wrong on upgrade now that this fails consistently

@runcom
Member Author

runcom commented Oct 16, 2019

uhm:

level=info msg="Cluster operator {} {} is {} with {}: {}%!(EXTRA string=insights, v1.ClusterStatusConditionType=Disabled, v1.ConditionStatus=False, string=, string=)"
level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console"

/retest

@alaypatel07
Contributor

alaypatel07 commented Oct 16, 2019

@runcom I don't fully understand how the MCO works, but from what I observed in the repro, even the node that was supposed to be drained and restarted had image: ''. Is that something this PR will cover?

Should the workflow be: drain and reboot all the master nodes, and then roll out the new manifests based on the new ControllerConfig?

Contributor

@michaelgugino michaelgugino left a comment

Need some logic to abort if an image key is not found or empty.
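
One possible shape for such a check, as a rough sketch only: `renderStrict` is a hypothetical helper (not MCO code), the template text and image references are placeholders, and it assumes templates are rendered with Go's `text/template` against a map of image keys. It rejects empty image values up front and uses the standard `missingkey=error` option so that a lookup of an absent key aborts rendering instead of producing output.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderStrict renders tmplText against the image map and fails, rather than
// emitting a unit without an image, when a key is empty or missing.
func renderStrict(tmplText string, images map[string]string) (string, error) {
	// Reject keys that are present but carry an empty image reference.
	for key, img := range images {
		if img == "" {
			return "", fmt.Errorf("image key %q has an empty value", key)
		}
	}

	// missingkey=error makes a lookup of an absent map key stop execution.
	tmpl, err := template.New("unit").Option("missingkey=error").Parse(tmplText)
	if err != nil {
		return "", err
	}

	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, images); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Placeholder template and image references.
	const unit = "image: {{.setupEtcdEnvKey}}"

	// Old-style map: the key the new template expects is absent.
	oldImages := map[string]string{"setupEtcdEnv": "example.invalid/etcd:4.1"}
	if _, err := renderStrict(unit, oldImages); err != nil {
		fmt.Println("aborting render:", err)
	}

	// New-style map: rendering succeeds.
	newImages := map[string]string{"setupEtcdEnvKey": "example.invalid/etcd:4.2"}
	out, err := renderStrict(unit, newImages)
	fmt.Println(out, err)
}
```

The empty-value loop is separate from the template option because `missingkey=error` only catches absent keys, not keys that exist with an empty string.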

@kikisdeliveryservice
Contributor

Approach makes sense to me, doing some more runs to confirm, but so far seems like it avoided the bug.
Though we're now hitting failed to sync secret cache issues..

@runcom
Member Author

runcom commented Oct 16, 2019

> Need some logic to abort if an image key is not found or empty.

that would be a nice addition indeed :)

@rphillips
Contributor

/lgtm

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) Oct 16, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rphillips, runcom


@rphillips
Contributor

/cherrypick release-4.2

@openshift-cherrypick-robot

@rphillips: once the present PR merges, I will cherry-pick it on top of release-4.2 in a new PR and assign it to you.


@smarterclayton
Contributor

/test e2e-gcp-upgrade

@openshift-ci-robot
Contributor

openshift-ci-robot commented Oct 16, 2019

@runcom: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-scaleup-rhel7 | 6e091e8 | link | /test e2e-aws-scaleup-rhel7 |


@mrunalp
Member

mrunalp commented Oct 16, 2019

/test e2e-gcp-upgrade

@smarterclayton
Contributor

/override ci/prow/e2e-gcp-upgrade

@openshift-ci-robot
Contributor

@smarterclayton: Overrode contexts on behalf of smarterclayton: ci/prow/e2e-gcp-upgrade


@kikisdeliveryservice
Contributor

/cherrypick release-4.2

@openshift-cherrypick-robot

@kikisdeliveryservice: once the present PR merges, I will cherry-pick it on top of release-4.2 in a new PR and assign it to you.


@openshift-merge-robot merged commit 18c9e83 into openshift:master Oct 16, 2019
@openshift-cherrypick-robot

@rphillips: new pull request created: #1182

