Skip to content

Conversation

@wking
Copy link
Member

@wking wking commented Oct 19, 2021

Before running the post-update unpause monitor. Without this, 4.8->4.9->4.10 jobs are updating successfully to 4.10, unpausing compute, running the post-unpause compute-settling monitor, and failing on:

: [sig-arch][Early] Managed cluster should start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]	0s
fail [github.com/onsi/[email protected]+incompatible/internal/leafnodes/runner.go:113]: Oct 17 23:28:57.284: Some cluster operators are not ready: kube-apiserver (Upgradeable=False KubeletMinorVersion_KubeletMinorVersionUnsupportedNextUpgrade: KubeletMinorVersionUpgradeable: Kubelet minor versions on nodes ip-10-0-135-91.ec2.internal, ip-10-0-168-151.ec2.internal, and ip-10-0-192-244.ec2.internal will not be supported in the next OpenShift minor version upgrade.)

With this change, that skew guard from openshift/cluster-kube-apiserver-operator#1199 is allowed to trip early in the suite. But if it trips for long enough to set off alerts, we'd still fail on that.

if cond.Get("status").String() != string(expected) {
reason := cond.Get("reason").String()
if conditionType == configv1.OperatorUpgradeable && (name == "kube-storage-version-migrator" || // https://bugzilla.redhat.com/show_bug.cgi?id=1928141
name == "openshift-controller-manager" || // https://bugzilla.redhat.com/show_bug.cgi?id=1948011
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be removed now, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That bug shipped with 4.8, and was not backported. We have chained-update tests that install 4.7 and update through 4.10, and attempt all the hops in a single test-suite invocation, so we can't drop this from the 4.10 suite unless we backport the fix to 4.7.z. We'll be able to drop it from 4.11, unless we extend the number of hops in our multi-hop chain jobs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update the code comment to tell me which openshift release I can remove this in or someone (like me) will "fix" it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ca2b8ad adds a comment like what you request.

if conditionType == configv1.OperatorUpgradeable && (name == "kube-storage-version-migrator" || // https://bugzilla.redhat.com/show_bug.cgi?id=1928141
name == "openshift-controller-manager" || // https://bugzilla.redhat.com/show_bug.cgi?id=1948011
name == "service-ca") { // https://bugzilla.redhat.com/show_bug.cgi?id=1948012
name == "service-ca") || // https://bugzilla.redhat.com/show_bug.cgi?id=1948012
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be removed now, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shipped in 4.8 with no backports, so we need to keep it for the same multi-hop reason discussed here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update the code comment to tell me which openshift release I can remove this in or someone (like me) will "fix" it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ca2b8ad adds a comment like what you request.

}
if cond.Get("status").String() != string(expected) {
reason := cond.Get("reason").String()
if conditionType == configv1.OperatorUpgradeable && (name == "kube-storage-version-migrator" || // https://bugzilla.redhat.com/show_bug.cgi?id=1928141
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you get this fix merged?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just went MODIFIED in the past few hours 🎉. But we'll need to keep this line for many 4.y to avoid issues with chained-update jobs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just went MODIFIED in the past few hours tada. But we'll need to keep this line for many 4.y to avoid issues with chained-update jobs.

update the comment to tell me which openshift release I can remove this in .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ca2b8ad adds a comment like what you request.

name == "openshift-controller-manager" || // https://bugzilla.redhat.com/show_bug.cgi?id=1948011
name == "service-ca") { // https://bugzilla.redhat.com/show_bug.cgi?id=1948012
name == "service-ca") || // https://bugzilla.redhat.com/show_bug.cgi?id=1948012
(name == "kube-apiserver" && reason == "KubeletMinorVersion_KubeletMinorVersionUnsupportedNextUpgrade") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to comment why this is an acceptable upgradeable=false condition or future me will not remember.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit messages and git blame ... for the win ;). If you need something inline, tell me how much of that commit message you want copied into the file, or suggest the alternative text you would like to see.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I'd like something inline. much easier to review.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you like the whole commit message inline?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the whole commit message is way too wordy for an inline comment, but you haven't clarified how you'd like it compacted, and obviously you think "folks who care can blame to get the details" is too compact. Rather than guessing at a middle ground, I've just inlined the relevant motivation from the commit message in a comment with c72bb0c. Let me know if you have particular text you'd like to see instead.

wking added a commit to wking/origin that referenced this pull request Oct 20, 2021
David asked for comments pointing out that we cannot remove these
exceptions yet [1].  For example, the fixed-in-4.8 bugs still impact
the 4.7 -> ... -> 4.10 jobs [2], so we'd need to wait until we were no
longer running 4.10 jobs with this suite code, before we can drop
those exceptions from the master test suite.  And if we added 4.7 ->
... -> 4.11 jobs, we'd need to wait until 4.11 forked off.  But even
if these comments go stale, they'll at least remind folks about some
of the issues they should think through before dropping the
exceptions.

[1]: openshift#26531 (comment)
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#release-openshift-origin-installer-e2e-aws-upgrade-4.7-to-4.8-to-4.9-to-4.10-ci
wking added a commit to wking/origin that referenced this pull request Oct 20, 2021
…xtUpgrade

Before running the post-update unpause monitor.  Without this,
4.8->4.9->4.10 jobs are updating successfully to 4.10, unpausing
compute, running the post-unpause compute-settling monitor, and
failing on [1,2]:

  : [sig-arch][Early] Managed cluster should start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]	0s
  fail [github.com/onsi/[email protected]+incompatible/internal/leafnodes/runner.go:113]: Oct 17 23:28:57.284: Some cluster operators are not ready: kube-apiserver (Upgradeable=False KubeletMinorVersion_KubeletMinorVersionUnsupportedNextUpgrade: KubeletMinorVersionUpgradeable: Kubelet minor versions on nodes ip-10-0-135-91.ec2.internal, ip-10-0-168-151.ec2.internal, and ip-10-0-192-244.ec2.internal will not be supported in the next OpenShift minor version upgrade.)

With this change, that skew guard from [3] is allowed to trip early in
the suite.  But if it trips for long enough to set off alerts, we'd
still fail on that.

Personally, I'd much rather have the motivational message in the
commit, and not in the inline comment.  Folks could answer "why is
this line here?" with 'git blame ...'.  But David wanted an inline
comment [4], so that's what I've done here.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.8-e2e-aws-upgrade-paused
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.8-e2e-aws-upgrade-paused/1449821870344900608
[3]: openshift/cluster-kube-apiserver-operator#1199
[4]: openshift#26531 (comment)
@wking wking force-pushed the allow-compute-skew-upgradeable-false branch from 4179fb3 to c72bb0c Compare October 20, 2021 18:03
@deads2k
Copy link
Contributor

deads2k commented Oct 20, 2021

/lgtm

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 20, 2021
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

wking added 2 commits October 20, 2021 13:00
David asked for comments pointing out that we cannot remove these
exceptions yet [1].  For example, the fixed-in-4.8 bugs still impact
the 4.7 -> ... -> 4.10 jobs [2], so we'd need to wait until we were no
longer running 4.10 jobs with this suite code, before we can drop
those exceptions from the master test suite.  And if we added 4.7 ->
... -> 4.11 jobs, we'd need to wait until 4.11 forked off.  But even
if these comments go stale, they'll at least remind folks about some
of the issues they should think through before dropping the
exceptions.

[1]: openshift#26531 (comment)
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#release-openshift-origin-installer-e2e-aws-upgrade-4.7-to-4.8-to-4.9-to-4.10-ci
…xtUpgrade

Before running the post-update unpause monitor.  Without this,
4.8->4.9->4.10 jobs are updating successfully to 4.10, unpausing
compute, running the post-unpause compute-settling monitor, and
failing on [1,2]:

  : [sig-arch][Early] Managed cluster should start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]	0s
  fail [github.com/onsi/[email protected]+incompatible/internal/leafnodes/runner.go:113]: Oct 17 23:28:57.284: Some cluster operators are not ready: kube-apiserver (Upgradeable=False KubeletMinorVersion_KubeletMinorVersionUnsupportedNextUpgrade: KubeletMinorVersionUpgradeable: Kubelet minor versions on nodes ip-10-0-135-91.ec2.internal, ip-10-0-168-151.ec2.internal, and ip-10-0-192-244.ec2.internal will not be supported in the next OpenShift minor version upgrade.)

With this change, that skew guard from [3] is allowed to trip early in
the suite.  But if it trips for long enough to set off alerts, we'd
still fail on that.

Personally, I'd much rather have the motivational message in the
commit, and not in the inline comment.  Folks could answer "why is
this line here?" with 'git blame ...'.  But David wanted an inline
comment [4], so that's what I've done here.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.8-e2e-aws-upgrade-paused
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.8-e2e-aws-upgrade-paused/1449821870344900608
[3]: openshift/cluster-kube-apiserver-operator#1199
[4]: openshift#26531 (comment)
@wking wking force-pushed the allow-compute-skew-upgradeable-false branch from c72bb0c to 619200e Compare October 20, 2021 20:02
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 20, 2021
@deads2k
Copy link
Contributor

deads2k commented Oct 21, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 21, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 21, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

5 similar comments
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 21, 2021

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-csi 619200e link false /test e2e-aws-csi
ci/prow/e2e-agnostic-cmd 619200e link false /test e2e-agnostic-cmd
ci/prow/e2e-aws-single-node 619200e link false /test e2e-aws-single-node

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

8 similar comments
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit dc1ee70 into openshift:master Oct 21, 2021
@wking wking deleted the allow-compute-skew-upgradeable-false branch November 3, 2021 07:25
jsafrane pushed a commit to jsafrane/origin that referenced this pull request Nov 11, 2021
David asked for comments pointing out that we cannot remove these
exceptions yet [1].  For example, the fixed-in-4.8 bugs still impact
the 4.7 -> ... -> 4.10 jobs [2], so we'd need to wait until we were no
longer running 4.10 jobs with this suite code, before we can drop
those exceptions from the master test suite.  And if we added 4.7 ->
... -> 4.11 jobs, we'd need to wait until 4.11 forked off.  But even
if these comments go stale, they'll at least remind folks about some
of the issues they should think through before dropping the
exceptions.

[1]: openshift#26531 (comment)
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#release-openshift-origin-installer-e2e-aws-upgrade-4.7-to-4.8-to-4.9-to-4.10-ci
jsafrane pushed a commit to jsafrane/origin that referenced this pull request Nov 11, 2021
…xtUpgrade

Before running the post-update unpause monitor.  Without this,
4.8->4.9->4.10 jobs are updating successfully to 4.10, unpausing
compute, running the post-unpause compute-settling monitor, and
failing on [1,2]:

  : [sig-arch][Early] Managed cluster should start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]	0s
  fail [github.com/onsi/[email protected]+incompatible/internal/leafnodes/runner.go:113]: Oct 17 23:28:57.284: Some cluster operators are not ready: kube-apiserver (Upgradeable=False KubeletMinorVersion_KubeletMinorVersionUnsupportedNextUpgrade: KubeletMinorVersionUpgradeable: Kubelet minor versions on nodes ip-10-0-135-91.ec2.internal, ip-10-0-168-151.ec2.internal, and ip-10-0-192-244.ec2.internal will not be supported in the next OpenShift minor version upgrade.)

With this change, that skew guard from [3] is allowed to trip early in
the suite.  But if it trips for long enough to set off alerts, we'd
still fail on that.

Personally, I'd much rather have the motivational message in the
commit, and not in the inline comment.  Folks could answer "why is
this line here?" with 'git blame ...'.  But David wanted an inline
comment [4], so that's what I've done here.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.8-e2e-aws-upgrade-paused
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.8-e2e-aws-upgrade-paused/1449821870344900608
[3]: openshift/cluster-kube-apiserver-operator#1199
[4]: openshift#26531 (comment)
@wking wking changed the title test/extended/operators: Allow early KubeletMinorVersionUnsupportedNextUpgrade Bug 2028930: test/extended/operators: Allow early KubeletMinorVersionUnsupportedNextUpgrade Dec 3, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 3, 2021

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 2028930 has been moved to the MODIFIED state.

Details

In response to this:

Bug 2028930: test/extended/operators: Allow early KubeletMinorVersionUnsupportedNextUpgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants