OCPBUGS-11565: High API requests due to allowlist and operconfig reconcilers running too often #1788
|
@donpenney: This pull request references Jira Issue OCPBUGS-11565, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/jira refresh |
|
@donpenney: This pull request references Jira Issue OCPBUGS-11565, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: |
|
/assign @mariomac |
|
@flavio-fernandes: GitHub didn't allow me to assign the following users: mariomac. Note that only openshift members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. |
|
/retest |
6 similar comments
|
The e2e-openstack-ovn test is failing, but the failures are not consistent. Is this a flaky test, or could this somehow be related to the operconfig_controller change? Given that the reconciler handles the unnamespaced network object and deals with objects in various namespaces, it doesn't seem like this change should impact anything other than recurrence. If this isn't a flaky test, however, and this change is somehow triggering the failures, I can either split the change out of this PR, or undo the object transformation change and instead do a namespace check when returning ReconcileAfter at the end of the Reconcile function (only triggering the recurring reconciler for the namespaced reconcile request, and not for the unnamespaced network object). That is how I originally coded the change, and I have already verified that it resolves the recurrence issue without changing how the reconcile is called, but the current implementation of updating the transformation seemed like the cleaner approach. |
|
/assign @pperiyasamy |
|
I'm quite willing to approve, but I want to get a review from @mlguerrero12 first please, and thanks! |
|
/assign @dougbtv |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: donpenney, dougbtv, mlguerrero12 |
|
/retest |
5 similar comments
|
@dougbtv Job history shows that e2e-hypershift-ovn has been failing repeatedly for almost a week, with the last successful run on April 27. Seems to be AWS issues? |
|
/retest |
5 similar comments
|
/override ci/prow/e2e-hypershift-ovn |
|
@dougbtv: Overrode contexts on behalf of dougbtv: ci/prow/e2e-hypershift-ovn |
|
@donpenney: Jira Issue OCPBUGS-11565: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-11565 has been moved to the MODIFIED state. |
|
@donpenney: all tests passed! |
|
/cherry-pick release-4.13 |
|
@donpenney: new pull request created: #1824 |
The allowlist and operconfig reconcilers are running too often, which significantly drives up the number of kube-apiserver requests made by the cluster-network-operator.
The allowlist controller creates a watcher that triggers the reconciler for any configmap change in any namespace. The first thing the reconciler does is issue a GET request for the one configmap it manages, to check that it exists, creating it if it doesn't. The second thing it does is exit if the object being reconciled is not the configmap it manages. As a result, the reconciler runs almost constantly, up to a few thousand times per hour, with a configmap GET request every time. By changing the controller to use a cmInformer that limits the watcher to a specific namespace, as is done with other controllers watching configmaps, the reconciler runs only when needed.
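To illustrate the difference, here is a minimal, self-contained simulation of the two watch scopes. It does not use the real controller-runtime or cmInformer APIs: `Event`, `reconcile`, `clusterWideWatch`, `namespacedWatch`, and the `example-*` names are all hypothetical stand-ins, and the GET is modelled as a simple counter.

```go
package main

import "fmt"

// Event is a hypothetical stand-in for a configmap change notification.
type Event struct {
	Namespace string
	Name      string
}

// Hypothetical identifiers for the single configmap the reconciler manages.
const (
	managedNamespace = "example-namespace"
	managedName      = "example-allowlist"
)

// reconcile mimics the behaviour described above: it first "GETs" the managed
// configmap (counted in gets), then exits early if the event is for some
// other object. It reports whether any real work would have been done.
func reconcile(ev Event, gets *int) bool {
	*gets++ // existence check for the managed configmap, issued on every run
	if ev.Namespace != managedNamespace || ev.Name != managedName {
		return false // not our object: nothing else to do
	}
	return true
}

// clusterWideWatch triggers the reconciler for every configmap event in every
// namespace and returns the number of GET requests issued.
func clusterWideWatch(events []Event) int {
	gets := 0
	for _, ev := range events {
		reconcile(ev, &gets)
	}
	return gets
}

// namespacedWatch only delivers events from a single namespace, which is the
// effect a namespace-scoped informer is assumed to have here.
func namespacedWatch(events []Event, namespace string) int {
	gets := 0
	for _, ev := range events {
		if ev.Namespace != namespace {
			continue // never reaches the reconciler
		}
		reconcile(ev, &gets)
	}
	return gets
}

func main() {
	events := []Event{
		{Namespace: "kube-system", Name: "a"},
		{Namespace: "default", Name: "b"},
		{Namespace: managedNamespace, Name: managedName},
		{Namespace: "openshift-monitoring", Name: "c"},
	}
	fmt.Println("GETs with cluster-wide watch:", clusterWideWatch(events))
	fmt.Println("GETs with namespaced watch:  ", namespacedWatch(events, managedNamespace))
}
```

In this sketch the cluster-wide watch issues one GET per event regardless of relevance, while the namespaced watch issues a GET only for events in the managed namespace.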
The operconfig reconciler accesses and manages a larger number of resources, and requeues itself every 3 minutes in order to update node status if needed. The requeue functionality relies on the uniqueness of the reconcile request object, i.e. its namespace and name. Originally, the controller launched this recurring reconciler with just a watcher on the network type. A later update added configmap and node watchers as additional triggers, with a request object transformer to set the request name to match the default network name. However, this transformer also set the namespace, whereas the network reconcile request was unnamespaced. This created a second unique recurring reconciler, as can be seen in the logs by pairing the timestamps of "Operconfig Controller complete" logs with the subsequent requeued reconciler start log "Reconciling Network.operator.openshift.io cluster" three minutes later. As a result, the reconciler runs twice every three minutes, rather than once. By updating the request object transformer to leave the namespace unset, the reconcile requests triggered by configmap and node changes now match the network request object, and the requeue sees these as the same trigger, leaving a single recurring reconciler requeued.
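The deduplication effect can be sketched with a small simulation in which the requeue mechanism is modelled as a set keyed on namespace and name. This is an illustration only, not the operator's code: `Request`, `recurringLoops`, `withNamespace`, and `withoutNamespace` are hypothetical names, `example-namespace` is a placeholder for whatever namespace the old transformer set, and `cluster` is taken from the log message quoted above.

```go
package main

import "fmt"

// Request is a hypothetical stand-in for a reconcile request, identified by
// its namespace and name.
type Request struct {
	Namespace string
	Name      string
}

// recurringLoops models the requeue as a set keyed on the whole Request:
// identical requests collapse into one entry, distinct ones each keep their
// own recurring loop. It returns the number of distinct loops.
func recurringLoops(requests []Request) int {
	seen := make(map[Request]struct{})
	for _, r := range requests {
		seen[r] = struct{}{}
	}
	return len(seen)
}

// withNamespace models the earlier transformer: it sets the name to the
// default network name but also sets a namespace.
func withNamespace(name, namespace string) Request {
	return Request{Namespace: namespace, Name: name}
}

// withoutNamespace models the updated transformer: the namespace is left
// unset so the request matches the unnamespaced network request.
func withoutNamespace(name string) Request {
	return Request{Name: name}
}

func main() {
	// Request produced by the watcher on the unnamespaced network object.
	network := Request{Name: "cluster"}

	before := []Request{network, withNamespace("cluster", "example-namespace")}
	after := []Request{network, withoutNamespace("cluster")}

	fmt.Println("recurring loops before:", recurringLoops(before))
	fmt.Println("recurring loops after: ", recurringLoops(after))
}
```

Because Go struct values with equal fields are equal map keys, the request with an empty namespace collapses into the network request, while the namespaced one is tracked separately.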