OCPBUGS-11565: High API requests due to allowlist and operconfig reconcilers running too often #1788

donpenney · 2023-04-21T11:10:27Z

The allowlist and operconfig reconcilers are running too often, which significantly drives up the number of kube-apiserver requests made by the cluster-network-operator.

The allowlist controller creates a watcher that triggers the reconciler for any configmap change in any namespace. The first thing the reconciler does is issue a GET request for the single configmap it manages to check whether it exists, creating it if it does not. The second thing it does is exit if the object being reconciled is not the configmap it manages. As a result, the reconciler runs almost constantly, up to a few thousand times per hour, with a configmap GET request every time. Changing the controller to use a cmInformer that limits the watcher to a specific namespace, as is done in other controllers that watch configmaps, means the reconciler runs only when needed.
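
The effect of scoping the watcher can be illustrated with a minimal, self-contained sketch. It does not use the operator's real informer code: the `Event` type, the filter functions, and the namespace and configmap names are hypothetical stand-ins, and the sample events are made up for illustration.

```go
package main

import "fmt"

// Event is a hypothetical stand-in for a configmap change notification.
type Event struct {
	Namespace string
	Name      string
}

// countReconciles returns how many events pass the watch filter. In this
// model, each accepted event triggers one reconcile and one configmap GET.
func countReconciles(events []Event, accept func(Event) bool) int {
	n := 0
	for _, e := range events {
		if accept(e) {
			n++
		}
	}
	return n
}

// watchAll models a watcher on configmaps in every namespace.
func watchAll(Event) bool { return true }

// watchNamespace models a watcher limited to a single namespace.
func watchNamespace(ns string) func(Event) bool {
	return func(e Event) bool { return e.Namespace == ns }
}

func main() {
	// Assumed sample traffic: only the first event concerns the managed configmap.
	events := []Event{
		{Namespace: "managed-ns", Name: "managed-configmap"},
		{Namespace: "other-ns-a", Name: "unrelated-1"},
		{Namespace: "other-ns-b", Name: "unrelated-2"},
		{Namespace: "other-ns-b", Name: "unrelated-3"},
	}
	fmt.Println("cluster-wide watch:", countReconciles(events, watchAll))
	fmt.Println("namespace-scoped watch:", countReconciles(events, watchNamespace("managed-ns")))
}
```

In this toy model the cluster-wide watch reconciles on all four events, while the namespace-scoped watch reconciles only on the one event in the managed namespace.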

The operconfig reconciler accesses and manages a larger number of resources, and requeues itself every 3 minutes to update node status if needed. The requeue functionality relies on the uniqueness of the reconcile request object, i.e., its namespace and name. Originally, the controller launched this recurring reconciler with just a watcher on the network type. A later update added configmap and node watchers as additional triggers, with a request object transformer that sets the request name to match the default network name. However, this transformer also set the namespace, whereas the network reconcile request is unnamespaced. This created a second unique recurring reconciler, as can be seen in the logs by pairing the timestamp of each "Operconfig Controller complete" log with the start log of the subsequent requeued reconciler, "Reconciling Network.operator.openshift.io cluster", three minutes later. As a result, the reconciler currently runs twice every three minutes rather than once. With the request object transformer updated to leave the namespace unset, the reconcile requests triggered by configmap and node changes match the network request object, and the requeue treats them as the same trigger, leaving a single recurring reconciler.
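
The duplicate recurrence follows from requests being identified by namespace plus name. The sketch below is a simplified model rather than the operator's code: the `Request` type, the two mapping functions, and the `example-namespace` value are assumptions made for illustration, and only the name `cluster` is taken from the log message quoted above.

```go
package main

import "fmt"

// Request is a simplified stand-in for a reconcile request, which is
// identified by its namespace and name.
type Request struct {
	Namespace string
	Name      string
}

// recurringLoops counts distinct requests. In this model, every distinct
// request that requeues itself keeps its own periodic reconcile alive.
func recurringLoops(reqs []Request) int {
	seen := make(map[Request]struct{})
	for _, r := range reqs {
		seen[r] = struct{}{}
	}
	return len(seen)
}

// mapWithNamespace models the transformer before the fix: it sets the name
// and also a namespace ("example-namespace" is a placeholder).
func mapWithNamespace() Request {
	return Request{Namespace: "example-namespace", Name: "cluster"}
}

// mapWithoutNamespace models the transformer after the fix: name only.
func mapWithoutNamespace() Request {
	return Request{Name: "cluster"}
}

func main() {
	// The watcher on the network type produces an unnamespaced request.
	network := Request{Name: "cluster"}

	// Two mapped requests stand in for configmap and node triggers.
	before := []Request{network, mapWithNamespace(), mapWithNamespace()}
	after := []Request{network, mapWithoutNamespace(), mapWithoutNamespace()}

	fmt.Println("recurring reconcilers before:", recurringLoops(before))
	fmt.Println("recurring reconcilers after:", recurringLoops(after))
}
```

Before the change the model has two distinct request keys and therefore two recurring reconcilers; once the mapped requests leave the namespace empty, they collapse onto the network request and one remains.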

openshift-ci-robot · 2023-04-21T11:10:33Z

@donpenney: This pull request references Jira Issue OCPBUGS-11565, which is invalid:

expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

donpenney · 2023-04-21T11:22:31Z

/cc @flavio-fernandes @browsell

donpenney · 2023-04-21T11:23:46Z

/jira refresh

openshift-ci-robot · 2023-04-21T11:23:53Z

@donpenney: This pull request references Jira Issue OCPBUGS-11565, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.14.0) matches configured target version for branch (4.14.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

flavio-fernandes · 2023-04-21T14:45:20Z

/assign @mariomac

openshift-ci · 2023-04-21T14:45:21Z

@flavio-fernandes: GitHub didn't allow me to assign the following users: mariomac.

Note that only openshift members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

donpenney · 2023-04-21T16:44:46Z

/retest

donpenney · 2023-04-22T02:24:01Z

/retest

donpenney · 2023-04-22T12:19:29Z

/retest

donpenney · 2023-04-22T14:44:26Z

/retest

donpenney · 2023-04-23T01:08:25Z

/retest

donpenney · 2023-04-23T11:46:31Z

/retest

donpenney · 2023-04-23T14:31:44Z

/retest

donpenney · 2023-04-23T17:22:06Z

The e2e-openstack-ovn test is failing, but the failures are not consistent. Is this a flaky test, or could this somehow be related to the operconfig_controller change? Given that the reconciler handles the unnamespaced network object and deals with objects in various namespaces, it doesn't seem like this change should impact anything other than recurrence.

If this isn't a flaky test, however, and this change is somehow triggering the failures, I can either split the change out of this PR, or undo the object transformation change and instead add a namespace check when returning ReconcileAfter at the end of the Reconcile function (triggering the recurring reconciler only for the namespaced reconcile request, and not for the unnamespaced network object). This is how I originally coded the change, and I have already verified that it resolves the recurrence issue without changing how the reconciler is called, but updating the transformation, as in the current implementation, seemed like the cleaner approach.

@flavio-fernandes @browsell

flavio-fernandes · 2023-04-24T14:50:51Z

/assign @pperiyasamy
Hi Peri. Can you please track this as part of https://issues.redhat.com/browse/OCPBUGS-11565 ?

dougbtv · 2023-04-25T18:09:30Z

I'm quite willing to approve, but I want to get a review from @mlguerrero12 first please, and thanks!

pkg/controller/allowlist/allowlist_controller.go

donpenney · 2023-05-02T12:14:35Z

/assign @dougbtv

dougbtv · 2023-05-02T16:22:23Z

/approve

openshift-ci · 2023-05-02T16:23:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: donpenney, dougbtv, mlguerrero12

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [dougbtv]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2023-05-02T18:13:31Z

/retest-required

Remaining retests: 0 against base HEAD 7174d73 and 2 for PR HEAD 0d04760 in total

donpenney · 2023-05-03T00:44:34Z

/retest

donpenney · 2023-05-03T04:12:39Z

/retest

donpenney · 2023-05-03T04:58:27Z

/retest

donpenney · 2023-05-03T11:21:36Z

/retest

donpenney · 2023-05-03T12:44:55Z

/retest

donpenney · 2023-05-03T14:04:50Z

/retest

donpenney · 2023-05-03T16:48:04Z

@dougbtv Job history shows that e2e-hypershift-ovn has been failing repeatedly for almost a week, with the last successful run on April 27. Could this be due to AWS issues?

donpenney · 2023-05-03T22:01:44Z

/retest

donpenney · 2023-05-04T01:37:24Z

/retest

donpenney · 2023-05-04T05:25:02Z

/retest

donpenney · 2023-05-04T11:19:29Z

/retest

donpenney · 2023-05-04T14:32:46Z

/retest

donpenney · 2023-05-04T16:19:37Z

/retest

dougbtv · 2023-05-04T17:55:19Z

/override ci/prow/e2e-hypershift-ovn

openshift-ci · 2023-05-04T18:08:41Z

@dougbtv: Overrode contexts on behalf of dougbtv: ci/prow/e2e-hypershift-ovn

openshift-ci-robot · 2023-05-04T18:10:51Z

@donpenney: Jira Issue OCPBUGS-11565: All pull requests linked via external trackers have merged:

openshift/cluster-network-operator#1788

Jira Issue OCPBUGS-11565 has been moved to the MODIFIED state.

openshift-ci · 2023-05-04T18:15:12Z

@donpenney: all tests passed!

Full PR test history. Your PR dashboard.

donpenney · 2023-05-31T16:07:26Z

/cherry-pick release-4.13

openshift-cherrypick-robot · 2023-05-31T16:08:09Z

@donpenney: new pull request created: #1824

openshift-ci bot requested review from dcbw and tssurya April 21, 2023 11:10

openshift-ci bot requested review from browsell and flavio-fernandes April 21, 2023 11:22

openshift-ci bot requested a review from anuragthehatter April 21, 2023 11:23

openshift-ci bot assigned pperiyasamy Apr 24, 2023

openshift-merge-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Apr 25, 2023

mlguerrero12 reviewed Apr 26, 2023

pkg/controller/allowlist/allowlist_controller.go (outdated)

openshift-ci bot assigned mlguerrero12 Apr 28, 2023

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) Apr 28, 2023

openshift-ci bot assigned dougbtv May 2, 2023

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) May 2, 2023

openshift-merge-robot merged commit 0a4d1fa into openshift:master May 4, 2023

donpenney deleted the update-reconcilers branch May 31, 2023 16:07

openshift-cherrypick-robot mentioned this pull request May 31, 2023

[release-4.13] OCPBUGS-14367: High API requests due to allowlist and operconfig reconcilers running too often #1824

Merged
