@donpenney
Member

The allowlist and operconfig reconcilers are running too often, which significantly drives up the number of kube-apiserver requests made by the cluster-network-operator.

The allowlist controller creates a watcher that triggers the reconciler for any configmap change in any namespace. The first thing the reconciler does is issue a GET request for the single configmap it manages to check that it exists, creating it if it does not. The second thing it does is exit if the object being reconciled is not the configmap it manages. As a result, the reconciler runs almost constantly, up to a few thousand times per hour, with a configmap GET request every time. Changing the controller to use a cmInformer that limits the watcher to a specific namespace, as is done with other controllers watching configmaps, means the reconciler runs only when needed.
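
To illustrate the effect, here is a minimal, standard-library-only sketch that counts the GET requests issued under a cluster-wide watch versus a namespace-scoped one. It is a simplified model, not the operator's code: the `event` and `reconciler` types, the filter functions, and the namespace and configmap names are all hypothetical.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a configmap change notification.
type event struct {
	Namespace string
	Name      string
}

// Hypothetical identifiers for the single configmap the controller manages.
const (
	managedNamespace = "example-namespace"
	managedName      = "example-allowlist"
)

// reconciler counts the GET requests it would send to the API server
// and the requests that actually concern the managed configmap.
type reconciler struct {
	gets    int
	handled int
}

// reconcile mirrors the order described above: GET the managed configmap
// first, then return early if the request is for a different object.
func (r *reconciler) reconcile(e event) {
	r.gets++ // existence check for the managed configmap
	if e.Namespace != managedNamespace || e.Name != managedName {
		return
	}
	r.handled++
}

// dispatch delivers every event accepted by filter to a fresh reconciler
// and returns the number of GET requests that resulted.
func dispatch(events []event, filter func(event) bool) int {
	r := &reconciler{}
	for _, e := range events {
		if filter(e) {
			r.reconcile(e)
		}
	}
	return r.gets
}

// watchAll models a watcher on configmaps in every namespace.
func watchAll(event) bool { return true }

// watchNamespace models a watcher limited to the managed namespace.
func watchNamespace(e event) bool { return e.Namespace == managedNamespace }

// sampleEvents returns made-up configmap changes across several namespaces.
func sampleEvents() []event {
	return []event{
		{Namespace: "kube-system", Name: "a"},
		{Namespace: "default", Name: "b"},
		{Namespace: managedNamespace, Name: "other"},
		{Namespace: managedNamespace, Name: managedName},
		{Namespace: "openshift-monitoring", Name: "c"},
	}
}

func main() {
	events := sampleEvents()
	fmt.Println("cluster-wide watch GETs:", dispatch(events, watchAll))
	fmt.Println("namespace-scoped watch GETs:", dispatch(events, watchNamespace))
}
```

In this model the namespace-scoped watcher still sees unrelated configmaps in its own namespace, so the early exit in the reconciler remains useful; the saving comes from never being triggered by the rest of the cluster.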

The operconfig reconciler accesses and manages a larger number of resources, and requeues itself every 3 minutes in order to update node status if needed. The requeue functionality relies on the uniqueness of the reconcile request object, i.e. its namespace and name. Originally, the controller launched this recurring reconciler with just a watcher on the network type. A later update added configmap and node watchers as additional triggers, with a request object transformer that sets the request name to match the default network name. However, this transformer also set the namespace, whereas the network reconcile request is unnamespaced. This created a second unique recurring reconciler, as can be seen in the logs by pairing the timestamps of "Operconfig Controller complete" logs with the subsequent requeued reconciler start log "Reconciling Network.operator.openshift.io cluster" three minutes later. As a result, the reconciler currently runs twice every three minutes rather than once. With the request object transformer updated to leave the namespace unset, the reconcile requests triggered by configmap and node changes match the network request object, and the requeue sees these as the same trigger, leaving a single recurring reconciler requeued.
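
The role of request uniqueness can be sketched with a small standard-library-only model. It assumes that pending requeues are deduplicated by the (namespace, name) pair; the `request` type, the helper functions, and the namespace value are hypothetical, and only the name `cluster` is taken from the log line quoted above.

```go
package main

import "fmt"

// request mimics the shape of a reconcile request key: namespace plus name.
type request struct {
	Namespace string
	Name      string
}

// networkName is the object name seen in the reconciler start log.
const networkName = "cluster"

// networkRequest is the unnamespaced request produced by the network watcher.
func networkRequest() request { return request{Name: networkName} }

// namespacedTransform models the earlier transformer, which set a namespace
// on requests triggered by configmap and node changes. The namespace value
// here is a placeholder.
func namespacedTransform() request {
	return request{Namespace: "example-namespace", Name: networkName}
}

// unnamespacedTransform models the updated transformer, which leaves the
// namespace unset so the request equals the network request.
func unnamespacedTransform() request { return request{Name: networkName} }

// recurringLoops returns how many distinct requeue loops a set of triggers
// produces, assuming pending requeues are deduplicated by request key.
func recurringLoops(triggers ...request) int {
	pending := make(map[request]struct{})
	for _, t := range triggers {
		pending[t] = struct{}{}
	}
	return len(pending)
}

func main() {
	before := recurringLoops(networkRequest(), namespacedTransform(), namespacedTransform())
	after := recurringLoops(networkRequest(), unnamespacedTransform(), unnamespacedTransform())
	fmt.Println("recurring loops with namespaced transform:", before)
	fmt.Println("recurring loops with unnamespaced transform:", after)
}
```

However many configmap or node events arrive, the model settles on two loops when the keys differ and one when they match, which is consistent with the twice-per-three-minutes pattern described above.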

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 21, 2023
@openshift-ci-robot
Contributor

@donpenney: This pull request references Jira Issue OCPBUGS-11565, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci openshift-ci bot requested review from dcbw and tssurya April 21, 2023 11:10
@donpenney
Member Author

/cc @flavio-fernandes @browsell

@donpenney
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 21, 2023
@openshift-ci-robot
Contributor

@donpenney: This pull request references Jira Issue OCPBUGS-11565, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

@openshift-ci openshift-ci bot requested a review from anuragthehatter April 21, 2023 11:23
@flavio-fernandes
Contributor

/assign @mariomac

@openshift-ci
Contributor

openshift-ci bot commented Apr 21, 2023

@flavio-fernandes: GitHub didn't allow me to assign the following users: mariomac.

Note that only openshift members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

@donpenney
Member Author

/retest

6 similar comments

@donpenney
Member Author

The e2e-openstack-ovn test is failing, but the failures are not consistent. Is this a flaky test, or could this somehow be related to the operconfig_controller change? Given that the reconciler handles the unnamespaced network object and deals with objects in various namespaces, it doesn't seem like this change should impact anything other than recurrence.

If this isn't a flaky test, however, and this change is somehow triggering the failures, I can either split the change out of this PR, or undo the object transformation change and instead do a namespace check when returning ReconcileAfter at the end of the Reconcile function (only triggering the recurring reconciler for the namespaced reconcile request, and not for the unnamespaced network object). This is how I originally coded the change, and I have already verified that it resolves the recurrence issue without changing how the reconcile is called, but the current implementation of updating the transformation seemed like the cleaner approach.

@flavio-fernandes @browsell

@flavio-fernandes
Contributor

/assign @pperiyasamy
Hi Peri. Can you please track this as part of https://issues.redhat.com/browse/OCPBUGS-11565 ?

@dougbtv
Contributor

dougbtv commented Apr 25, 2023

I'm quite willing to approve, but I want to get a review from @mlguerrero12 first please, and thanks!

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 25, 2023
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 28, 2023
@donpenney
Member Author

/assign @dougbtv

@dougbtv
Contributor

dougbtv commented May 2, 2023

/approve

@openshift-ci
Contributor

openshift-ci bot commented May 2, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: donpenney, dougbtv, mlguerrero12

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 2, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 7174d73 and 2 for PR HEAD 0d04760 in total

@donpenney
Member Author

/retest

5 similar comments

@donpenney
Member Author

@dougbtv Job history shows that e2e-hypershift-ovn has been failing repeatedly for almost a week, with the last successful run on April 27. This seems to be caused by AWS issues?

@donpenney
Member Author

/retest

5 similar comments

@dougbtv
Contributor

dougbtv commented May 4, 2023

/override ci/prow/e2e-hypershift-ovn

@openshift-ci
Contributor

openshift-ci bot commented May 4, 2023

@dougbtv: Overrode contexts on behalf of dougbtv: ci/prow/e2e-hypershift-ovn

@openshift-merge-robot openshift-merge-robot merged commit 0a4d1fa into openshift:master May 4, 2023
@openshift-ci-robot
Contributor

@donpenney: Jira Issue OCPBUGS-11565: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-11565 has been moved to the MODIFIED state.

@openshift-ci
Contributor

openshift-ci bot commented May 4, 2023

@donpenney: all tests passed!

@donpenney
Member Author

/cherry-pick release-4.13

@openshift-cherrypick-robot

@donpenney: new pull request created: #1824
