@jkyros
Member

@jkyros jkyros commented Mar 17, 2022

A payload was rejected due to upgrade test failures for the machine-config-controller, and PR #2802 was reverted.

{  fail [github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Mar 17 07:40:40.823: Pods in platform namespaces are not following resource request/limit rules or do not have an exception granted:
  apps/v1/Deployment/openshift-machine-config-operator/machine-config-controller/container/oauth-proxy does not have a cpu request (rule: "apps/v1/Deployment/openshift-machine-config-operator/machine-config-controller/container/oauth-proxy/request[cpu]")
  apps/v1/Deployment/openshift-machine-config-operator/machine-config-controller/container/oauth-proxy does not have a memory request (rule: "apps/v1/Deployment/openshift-machine-config-operator/machine-config-controller/container/oauth-proxy/request[memory]")}
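For anyone unfamiliar with that rule, here is a rough sketch of the kind of check that produces messages shaped like the ones above. This is not the actual test code; the `Container` type, the `missingRequests` helper, and the plain-string quantities are all hypothetical simplifications so the example runs with only the standard library.

```go
package main

import "fmt"

// Container is a hypothetical, stripped-down stand-in for a pod container;
// Requests stands in for Resources.Requests with plain string quantities.
type Container struct {
	Name     string
	Requests map[string]string
}

// missingRequests returns one message per container that lacks a cpu or
// memory request. The message format imitates the failure output above.
func missingRequests(workload string, containers []Container) []string {
	var violations []string
	for _, c := range containers {
		for _, resource := range []string{"cpu", "memory"} {
			// Indexing a nil map is safe in Go and reports ok == false.
			if _, ok := c.Requests[resource]; !ok {
				rule := fmt.Sprintf("%s/container/%s/request[%s]", workload, c.Name, resource)
				violations = append(violations, fmt.Sprintf(
					"%s/container/%s does not have a %s request (rule: %q)",
					workload, c.Name, resource, rule))
			}
		}
	}
	return violations
}

func main() {
	workload := "apps/v1/Deployment/openshift-machine-config-operator/machine-config-controller"
	containers := []Container{
		// Illustrative values only: one container with requests, one without.
		{Name: "machine-config-controller", Requests: map[string]string{"cpu": "20m", "memory": "50Mi"}},
		{Name: "oauth-proxy"},
	}
	for _, v := range missingRequests(workload, containers) {
		fmt.Println(v)
	}
}
```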

This PR "un-reverts" the reversion from #3027 and fixes the underlying issue that was causing the failures.

How the original problem happened:

  • We added the oauth-proxy to machine-config-controller as part of Send alert when MCO can't safely apply updated Kubelet CA on nodes in paused pool #2802.
  • The machine-config-controller manifest which contained it existed in earlier versions, but the oauth-proxy "sidecar" was new.
  • On upgrade we don't reapply the machine-config-controller manifest; instead, we merge the old and the new container settings and update the existing object.
  • The oauth-proxy fields got merged into the new object, but the oauth-proxy container's requested resources got scraped off because our fork of the resourcemerge library didn't know how to merge Resources.Requests.
  • Payload tests failed because those fields were missing after an update (even though they were present in the manifest, they were not present on the object).

How this fixes it:

  • This makes sure we update the container's Resources.Requests structure when we merge the container object, so required fields are no longer missing after an upgrade.
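To make the before/after concrete, here is a minimal sketch of the merge behavior described above. It is not the real resourcemerge code: the types are hypothetical stand-ins for the corev1 ones (quantities are plain strings), the `mergeRequests` flag exists only so both behaviors can be run side by side, the image names and request values are illustrative, and it assumes a container that is new on upgrade starts out empty and is filled in field by field.

```go
package main

import (
	"fmt"
	"reflect"
)

// ResourceList is a simplified, hypothetical stand-in for corev1.ResourceList.
type ResourceList map[string]string

// Container keeps only the fields needed to illustrate the merge.
type Container struct {
	Name     string
	Image    string
	Requests ResourceList // stands in for Resources.Requests
}

// ensureContainer copies required fields onto existing and sets *modified
// when anything changed. mergeRequests toggles the step this PR adds.
func ensureContainer(modified *bool, existing *Container, required Container, mergeRequests bool) {
	if existing.Name != required.Name {
		*modified = true
		existing.Name = required.Name
	}
	if existing.Image != required.Image {
		*modified = true
		existing.Image = required.Image
	}
	if mergeRequests && !reflect.DeepEqual(existing.Requests, required.Requests) {
		*modified = true
		existing.Requests = required.Requests
	}
}

// ensureContainers merges required containers into existing by name.
// Assumption: a container missing from existing is appended empty and then
// populated by ensureContainer, so any field ensureContainer skips is lost.
func ensureContainers(modified *bool, existing *[]Container, required []Container, mergeRequests bool) {
	for _, req := range required {
		var target *Container
		for i := range *existing {
			if (*existing)[i].Name == req.Name {
				target = &(*existing)[i]
				break
			}
		}
		if target == nil {
			*modified = true
			*existing = append(*existing, Container{})
			target = &(*existing)[len(*existing)-1]
		}
		ensureContainer(modified, target, req, mergeRequests)
	}
}

func main() {
	required := []Container{
		{Name: "machine-config-controller", Image: "mcc:new"},
		{Name: "oauth-proxy", Image: "oauth-proxy:new",
			Requests: ResourceList{"cpu": "20m", "memory": "50Mi"}},
	}
	for _, mergeRequests := range []bool{false, true} {
		// The pre-upgrade object has no oauth-proxy sidecar yet.
		existing := []Container{{Name: "machine-config-controller", Image: "mcc:old"}}
		modified := false
		ensureContainers(&modified, &existing, required, mergeRequests)
		fmt.Printf("mergeRequests=%v oauth-proxy requests=%v\n", mergeRequests, existing[1].Requests)
	}
}
```

With the flag off, the sidecar shows up on the merged object but without its requests, which matches the failure mode in the payload test; with the flag on, the requests come along with it.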

jkyros added 2 commits March 17, 2022 11:54
…2802-mco-74-controller-alert-certificate"

This reverts commit b80e6a1, reversing
changes made to 57267b7.

This "un-reverts" the reversion so we can put PR 2802 back in with the
fix to resourcemerge.

Resourcemerge did not previously merge a container's Resources.Requests
in ensureContainer(), which meant that during upgrade cases where we update
the container object directly with changes (instead of applying/re-applying
the manifests), Resources.Requests changes would not propagate to the
updated object.

This makes ensureContainer update Resources.Requests if it has changed,
which keeps that structure from getting scraped off when we update (and
keeps us from failing tests, since at least cpu and memory in that
structure are required fields).
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label Mar 17, 2022
@openshift-ci
Contributor

openshift-ci bot commented Mar 17, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@mkenigs
Contributor

mkenigs commented Mar 17, 2022

Grr wish we could have gotten #2882 in and avoided this

@mkenigs
Contributor

mkenigs commented Mar 17, 2022

I have a weak grasp of this but the fix makes sense to me

@cgwalters
Member

/approve

@openshift-ci openshift-ci bot added the approved label Mar 17, 2022
@jkyros jkyros marked this pull request as ready for review March 17, 2022 20:45
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label Mar 17, 2022
@openshift-ci openshift-ci bot requested review from cgwalters and mkenigs March 17, 2022 20:46
@mkenigs
Contributor

mkenigs commented Mar 17, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Mar 17, 2022
@openshift-ci
Contributor

openshift-ci bot commented Mar 17, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, jkyros, mkenigs

@jkyros
Member Author

jkyros commented Mar 17, 2022

I think our deal was in order to re-merge we needed a clean payload test (after we passed CI). e2e-agnostic-upgrade will probably flake at least once, but even so I'm going to hold this so it doesn't try to auto-merge. I'll trigger the payload test after we get a green CI run.
/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold label Mar 17, 2022
@jkyros
Member Author

jkyros commented Mar 17, 2022

/retest

@jkyros
Member Author

jkyros commented Mar 18, 2022

I'm not just blindly mashing on it, I promise. That last run was sandbox creation failures because it couldn't reach the container registry.
/retest

@jkyros
Member Author

jkyros commented Mar 18, 2022

/retest

@jkyros
Member Author

jkyros commented Mar 18, 2022

/retest

@openshift-ci
Contributor

openshift-ci bot commented Mar 18, 2022

@jkyros: all tests passed!

@deads2k
Contributor

deads2k commented Mar 18, 2022

/payload 4.11 ci blocking

@openshift-ci
Contributor

openshift-ci bot commented Mar 18, 2022

@deads2k: trigger 5 jobs of type blocking for the ci release of OCP 4.11

  • periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade
  • periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade
  • periodic-ci-openshift-release-master-ci-4.11-e2e-aws-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c1e8e190-a6bc-11ec-87d5-de4f6e72f000-0

@jkyros
Member Author

jkyros commented Mar 18, 2022

@deads2k it looks like that passed everything in the payload test except for those failures in e2e-aws-serial and that looks like etcd problems (and I am not seeing a way this code could be affecting that).

There were a couple of lines under "Multi-stage test e2e-aws-serial - e2e-aws-serial-openshift-e2e-test container test" that briefly gave me pause, since we did change those ClusterRoles as part of this PR, but it didn't look like it was saying that's what caused the failure, just that those roles changed (I also might be interpreting that wrong):

{  0.000 I ns/openshift-machine-config-operator deployment/machine-config-operator reason/ClusterRoleUpdated Updated ClusterRole.rbac.authorization.k8s.io/machine-config-controller -n openshift-machine-config-operator because it changed
Mar 18 15:10:50.000 I ns/openshift-machine-config-operator deployment/machine-config-operator reason/ClusterRoleUpdated Updated ClusterRole.rbac.authorization.k8s.io/machine-config-controller-events -n openshift-machine-config-operator because it changed
Mar 18 15:10:50.709 E e2e-test/"[sig-instrumentation][Late] OpenShift alerting rules should have a runbook_url annotation if the alert is critical [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" finishedStatus/Flaked

Failing invariants:

[bz-etcd][invariant] alert/etcdHighNumberOfLeaderChanges should not be at or above info

With that e2e-aws-serial failure, is that "back to the drawing board" for this, or does this still have a shot at getting in?

(I didn't want to just try retesting again without asking because these tests are expensive)

@jkyros
Member Author

jkyros commented Mar 21, 2022

I talked to David, and he said the pass level on the test appeared to be sufficient to let it back in (since it passed the conformance test that it had previously failed), so I'm unholding.
/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold label Mar 21, 2022
@openshift-merge-robot openshift-merge-robot merged commit 5ad20c3 into openshift:master Mar 21, 2022