Make our resourcemerge fork update a container's Resources.Requests, un-revert #2802 #3028

jkyros · 2022-03-17T19:19:07Z

A payload was rejected due to upgrade test failures for the machine-config-controller, and PR #2802 was reverted.

{  fail [github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Mar 17 07:40:40.823: Pods in platform namespaces are not following resource request/limit rules or do not have an exception granted:
  apps/v1/Deployment/openshift-machine-config-operator/machine-config-controller/container/oauth-proxy does not have a cpu request (rule: "apps/v1/Deployment/openshift-machine-config-operator/machine-config-controller/container/oauth-proxy/request[cpu]")
  apps/v1/Deployment/openshift-machine-config-operator/machine-config-controller/container/oauth-proxy does not have a memory request (rule: "apps/v1/Deployment/openshift-machine-config-operator/machine-config-controller/container/oauth-proxy/request[memory]")}

This PR "un-reverts" the reversion from #3027 and fixes the underlying issue that was causing the failures.

How the original problem happened:

We added the oauth-proxy to machine-config-controller as part of "Send alert when MCO can't safely apply updated Kubelet CA on nodes in paused pool" #2802.
The machine-config-controller manifest that contains it existed in earlier versions, but the oauth-proxy "sidecar" was new.
On upgrade we don't reapply the machine-config-controller manifest; we merge the old and the new container settings and update the object.
The oauth-proxy fields got merged into the new object, but the oauth-proxy container's requested resources got scraped off because our fork of the resourcemerge library didn't know how to merge Resources.Requests.
Payload tests failed because those fields were missing after an update (even though they were present in the manifest, they were not present on the object).
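
The failure mode described above can be illustrated with a minimal sketch. It assumes the merge updates each container field by field, so any field the merge function does not handle never reaches the live object. The types and the `mergeContainer` and `missingRequests` helpers are hypothetical stand-ins, not the Kubernetes API types or the actual resourcemerge code, and the image name and request values are made up for illustration.

```go
package main

import "fmt"

// Hypothetical stand-ins for the Kubernetes container types.
type ResourceRequirements struct {
	Requests map[string]string
}

type Container struct {
	Name      string
	Image     string
	Resources ResourceRequirements
}

// mergeContainer copies only the fields it knows about from required into
// existing. It deliberately ignores Resources.Requests to mimic the bug.
func mergeContainer(modified *bool, existing *Container, required Container) {
	if existing.Name != required.Name {
		*modified = true
		existing.Name = required.Name
	}
	if existing.Image != required.Image {
		*modified = true
		existing.Image = required.Image
	}
}

// missingRequests lists the required resource names the container does not request.
func missingRequests(c Container) []string {
	var missing []string
	for _, name := range []string{"cpu", "memory"} {
		if _, ok := c.Resources.Requests[name]; !ok {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// The manifest declares requests for the new sidecar (illustrative values).
	required := Container{
		Name:  "oauth-proxy",
		Image: "example/oauth-proxy:latest",
		Resources: ResourceRequirements{
			Requests: map[string]string{"cpu": "20m", "memory": "50Mi"},
		},
	}

	// Assumption: the sidecar is new, so the merge starts from an empty container.
	var live Container
	modified := false
	mergeContainer(&modified, &live, required)

	fmt.Println("missing requests on live object:", missingRequests(live))
}
```

In this sketch the manifest carries both requests, but the merged object ends up with neither, which is the shape of the complaint in the payload test output.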

How this fixes it:

This makes sure we update the container's Resources.Requests structure when we merge the container object, so required fields are no longer missing when we upgrade.

…2802-mco-74-controller-alert-certificate" This reverts commit b80e6a1, reversing changes made to 57267b7. This "un-reverts" the reversion so we can put PR 2802 back in with the fix to resourcemerge.

Resourcemerge did not previously merge a container's Resources.Requests in ensureContainer(), which meant that in upgrade cases where we update the container object directly with changes (instead of applying/re-applying the manifests), Resources.Requests changes would not propagate to the updated object. This makes ensureContainer update Resources.Requests if it has changed, which keeps that structure from getting scraped off when we update (and keeps us from failing tests, since at least cpu and memory in that structure are required fields).
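
A minimal sketch of the kind of check described above, not the actual change in this PR. The types are hypothetical stand-ins for the Kubernetes ones, this `ensureContainer` handles only Resources.Requests, and `reflect.DeepEqual` is an assumed substitute for whatever equality check the fork uses (Kubernetes code often uses `equality.Semantic.DeepEqual` from k8s.io/apimachinery for this kind of comparison).

```go
package main

import (
	"fmt"
	"reflect"
)

// Hypothetical stand-ins for the Kubernetes resource types.
type ResourceList map[string]string

type ResourceRequirements struct {
	Requests ResourceList
}

type Container struct {
	Name      string
	Resources ResourceRequirements
}

// ensureContainer is a simplified sketch: it only reconciles
// Resources.Requests and flags the object as modified when they differ.
func ensureContainer(modified *bool, existing *Container, required Container) {
	// Assumption: a plain deep comparison is good enough for this illustration.
	if !reflect.DeepEqual(existing.Resources.Requests, required.Resources.Requests) {
		*modified = true
		existing.Resources.Requests = required.Resources.Requests
	}
}

func main() {
	// Live object as it looked after the buggy merge: no requests at all.
	existing := Container{Name: "oauth-proxy"}

	// Requests as declared in the manifest (illustrative values).
	required := Container{
		Name: "oauth-proxy",
		Resources: ResourceRequirements{
			Requests: ResourceList{"cpu": "20m", "memory": "50Mi"},
		},
	}

	modified := false
	ensureContainer(&modified, &existing, required)
	fmt.Println("first pass modified:", modified, "requests:", existing.Resources.Requests)

	// A second pass finds nothing to change, so the merge is idempotent.
	modified = false
	ensureContainer(&modified, &existing, required)
	fmt.Println("second pass modified:", modified)
}
```

Setting `modified` only when the requests differ matters because that flag is what tells the caller whether the object needs to be updated at all.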

openshift-ci · 2022-03-17T19:19:23Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

mkenigs · 2022-03-17T19:45:06Z

Grr wish we could have gotten #2882 in and avoided this

mkenigs · 2022-03-17T19:59:50Z

I have a weak grasp of this but the fix makes sense to me

cgwalters · 2022-03-17T20:36:39Z

/approve

mkenigs · 2022-03-17T21:00:54Z

/lgtm

openshift-ci · 2022-03-17T21:01:28Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, jkyros, mkenigs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [cgwalters]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jkyros · 2022-03-17T21:10:18Z

I think our deal was that, in order to re-merge, we needed a clean payload test (after we passed CI). e2e-agnostic-upgrade will probably flake at least once, but even so I'm going to hold this so it doesn't try to auto-merge. I'll trigger the payload test after we get a green CI run.
/hold

jkyros · 2022-03-17T23:14:00Z

/retest

jkyros · 2022-03-18T01:16:44Z

I'm not just blindly mashing on it, I promise. That last run was sandbox creation failures because it couldn't reach the container registry.
/retest

jkyros · 2022-03-18T04:48:07Z

/retest

jkyros · 2022-03-18T06:59:11Z

/retest

openshift-ci · 2022-03-18T08:56:24Z

@jkyros: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

deads2k · 2022-03-18T13:10:17Z

/payload 4.11 ci blocking

openshift-ci · 2022-03-18T13:10:20Z

@deads2k: trigger 5 jobs of type blocking for the ci release of OCP 4.11

periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade
periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade
periodic-ci-openshift-release-master-ci-4.11-e2e-aws-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c1e8e190-a6bc-11ec-87d5-de4f6e72f000-0

jkyros · 2022-03-18T19:21:26Z

@deads2k it looks like that passed everything in the payload test except for those failures in e2e-aws-serial and that looks like etcd problems (and I am not seeing a way this code could be affecting that).

There were a couple of lines under "Multi-stage test e2e-aws-serial - e2e-aws-serial-openshift-e2e-test container test" that briefly gave me pause, since we did change those ClusterRoles as part of this PR, but it didn't look like it was saying that's what caused the failure, just that those roles changed (I also might be interpreting that wrong):

{  0.000 I ns/openshift-machine-config-operator deployment/machine-config-operator reason/ClusterRoleUpdated Updated ClusterRole.rbac.authorization.k8s.io/machine-config-controller -n openshift-machine-config-operator because it changed
Mar 18 15:10:50.000 I ns/openshift-machine-config-operator deployment/machine-config-operator reason/ClusterRoleUpdated Updated ClusterRole.rbac.authorization.k8s.io/machine-config-controller-events -n openshift-machine-config-operator because it changed
Mar 18 15:10:50.709 E e2e-test/"[sig-instrumentation][Late] OpenShift alerting rules should have a runbook_url annotation if the alert is critical [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" finishedStatus/Flaked

Failing invariants:

[bz-etcd][invariant] alert/etcdHighNumberOfLeaderChanges should not be at or above info

With that e2e-aws-serial failure, is that "back to the drawing board" for this, or does this still have a shot at getting in?

(I didn't want to just try retesting again without asking because these tests are expensive)

jkyros · 2022-03-21T16:11:56Z

I talked to David, and he said the pass level on the test appeared to be sufficient to let it back in (since it passed the conformance test that it had failed before), so I'm unholding.
/hold cancel

jkyros added 2 commits March 17, 2022 11:54

Revert "Merge pull request openshift#3027 from DennisPeriquet/revert-…

a0c0b2e

…2802-mco-74-controller-alert-certificate" This reverts commit b80e6a1, reversing changes made to 57267b7. This "un-reverts" the reversion so we can put PR 2802 back in with the fix to resourcemerge.

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2022

openshift-ci bot requested review from cheesesashimi and kikisdeliveryservice March 17, 2022 19:22

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2022

jkyros marked this pull request as ready for review March 17, 2022 20:45

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2022

openshift-ci bot requested review from cgwalters and mkenigs March 17, 2022 20:46

openshift-ci bot assigned mkenigs Mar 17, 2022

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 17, 2022

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 17, 2022

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 21, 2022

openshift-merge-robot merged commit 5ad20c3 into openshift:master Mar 21, 2022
