Bug 1984414: Log resource diffs on update only in reconcile mode #628

arjunrn · 2021-07-13T11:19:52Z

Logging of resource diffs was introduced to detect hotlooping in the CVO. However, it also increases log verbosity during updates, when the resources have changed, which is not useful. The diffs are more useful in reconcile mode. Also, the diff is generated with ObjectDiff, which is deprecated. ObjectDiff now just uses cmp.Diff internally, so all instances of ObjectDiff have been replaced with cmp.Diff.

wking · 2021-07-13T17:06:58Z

lib/resourceapply/apiext.go

 )

-func ApplyCustomResourceDefinitionv1(ctx context.Context, client apiextclientv1.CustomResourceDefinitionsGetter, required *apiextv1.CustomResourceDefinition) (*apiextv1.CustomResourceDefinition, bool, error) {
+func ApplyCustomResourceDefinitionv1(ctx context.Context, client apiextclientv1.CustomResourceDefinitionsGetter, required *apiextv1.CustomResourceDefinition, reconciling bool) (*apiextv1.CustomResourceDefinition, bool, error) {


Can we pass in the mode instead of passing in a reconciling bool? Like we do for ClusterOperators here. That way we don't have to bump cluster signatures if we need to make some other mode distinction later on.

I considered that and decided against it because it would leak abstractions. Mode is part of the resourcebuilder package, whereas the resource creation/update logic is in the internal. If the full mode is required in the future, more refactoring can be done: the mode can be moved to a common package and then passed to the apply function.
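To illustrate the trade-off, here is a minimal sketch of the bool-based approach. Everything in it is a hypothetical stand-in (the `Mode` type and its constants, `applyResource`, `build`), not the actual CVO code. It assumes the builder side owns the mode and reduces it to a bool at the package boundary, so the apply side never needs to know the mode type.

```go
package main

import "fmt"

// Mode is a hypothetical stand-in for the mode type owned by the builder
// package. The names here are illustrative assumptions only.
type Mode int

const (
	UpdatingMode Mode = iota
	ReconcilingMode
	InitializingMode
)

// applyResource is a hypothetical stand-in for an apply function. It takes
// a plain bool, so it does not depend on the package that defines Mode.
// It reports whether an update happened and whether the diff was logged.
func applyResource(name string, changed, reconciling bool) (updated, logged bool) {
	if !changed {
		return false, false
	}
	if reconciling {
		// An update while reconciling is unexpected, so it is worth logging.
		fmt.Printf("Updating %s due to diff\n", name)
		logged = true
	}
	return true, logged
}

// build shows the caller side: it owns the Mode and converts it to a bool
// before crossing into the apply code.
func build(mode Mode, name string, changed bool) (updated, logged bool) {
	return applyResource(name, changed, mode == ReconcilingMode)
}

func main() {
	updated, logged := build(UpdatingMode, "example-crd", true)
	fmt.Println(updated, logged)

	updated, logged = build(ReconcilingMode, "example-crd", true)
	fmt.Println(updated, logged)
}
```

The cost of this design is the one raised above: if another mode distinction is needed later, every apply signature has to change again, or the mode type has to move to a shared package.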

wking · 2021-07-13T17:09:29Z

lib/resourceapply/apps.go

-}
-
-// ApplyDeploymentFromCache applies the required deployment to the cluster.
-func ApplyDeploymentFromCache(ctx context.Context, lister appslisterv1.DeploymentLister, client appsclientv1.DeploymentsGetter, required *appsv1.Deployment) (*appsv1.Deployment, bool, error) {


note to self, this is catching up with 4b485ca (#10):

$ git show 4b485ca109cd271a10d32e47d3657331be8160fd | grep ApplyDeploymentFromCache
- _, updated, err := resourceapply.ApplyDeploymentFromCache(optr.deployLister, optr.kubeClient.AppsV1(), cvo)

which removed the last consumer.

wking · 2021-07-13T17:11:37Z

lib/resourceapply/cv.go

 	}

-	klog.V(2).Infof("Updating ClusterVersion %s due to diff: %v", required.Name, diff.ObjectDiff(existing, required))
+	klog.V(2).Infof("Updating ClusterVersion %s due to diff: %v", required.Name, cmp.Diff(existing, required))


no mode guard here?

Unlike the other resources, the CV is created/updated from multiple controllers and the mode is not present there. So it cannot be used.

lib/resourceapply/imagestream.go

vrutkovs · 2021-07-13T17:25:16Z

lib/resourceapply/apiext.go


-	klog.V(2).Infof("Updating CRD %s due to diff: %v", required.Name, diff.ObjectDiff(existing, required))
+	if reconciling {
+		klog.V(2).Infof("Updating CRD %s due to diff: %v", required.Name, cmp.Diff(existing, required))


@LalatenduMohanty argued that we should use V(4) to avoid logging too many lines on production clusters (during development we could manually bump the level and check for hotloops, as that doesn't need to be done too often).

I could reduce the log level, but the logs will now be emitted only when there are resource updates in reconcile mode, and that would be indicative of a hot loop. So these logs could help with diagnostics in production clusters. Changing the log level to 4 would mean that before every release someone would have to increase the log level and check for hot loops. This would also not help in clusters where there are hot loops due to configuration which is not tested in CI.
In summary, this change basically eliminates all the logs which are considered verbose in the bug report, while at the same time keeping the logs for debugging in production clusters.

I'm fine with moving this to a different PR.

Let's file a bug for this change so that we can backport it. Other than that, LGTM.

This would also not help in clusters where there are hot loops due to configuration which is not tested in CI.

This is not true, as we are only concerned about hotloops caused by manifests in the release payload, which the CVO is responsible for. We cannot track every hotloop in the cluster.

/hold

klog.V(2).Infof("Updating CRD %s due to diff: %v", required.Name, cmp.Diff(existing, required))

This will only catch hotloops from manifests which are managed by the CVO, so it won't help if there is a hot loop somewhere else. Even if there is a hot loop in a production cluster, it does not mean that the cluster's availability is impacted or that this is a serious issue which needs to be fixed ASAP. Also, I do not see much use for customers in this information because they cannot change the manifests from the release payload.

However, if we move the log level to V(4), then we can easily get it in our CI (by bumping the log level). So IMHO it should be at least V(4).

Updated to V(4)
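For reference, the combined behavior agreed on above (log the diff only when reconciling, and only at verbosity 4 or higher) can be sketched as follows. This is a minimal illustration, not the PR's code: `logger` is a hypothetical in-memory stand-in for a leveled logger such as klog, and `logDiff` is a made-up helper name. The diff is passed in as a precomputed string.

```go
package main

import "fmt"

// logger is a hypothetical stand-in for a leveled logger such as klog.
// It records lines in memory so the behavior is easy to inspect.
type logger struct {
	verbosity int
	lines     []string
}

// V reports whether messages at the given level are enabled.
func (l *logger) V(level int) bool { return level <= l.verbosity }

// logDiff records the diff only when reconciling and when level 4 is enabled.
func logDiff(l *logger, reconciling bool, name, diff string) {
	if !reconciling || !l.V(4) {
		return
	}
	l.lines = append(l.lines, fmt.Sprintf("Updating CRD %s due to diff: %v", name, diff))
}

func main() {
	prod := &logger{verbosity: 2} // assumed default verbosity on a cluster
	ci := &logger{verbosity: 4}   // verbosity bumped, e.g. in a CI job

	for _, l := range []*logger{prod, ci} {
		logDiff(l, false, "example", "-old +new") // updating: never logged
		logDiff(l, true, "example", "-old +new")  // reconciling: logged at V(4)
		fmt.Printf("verbosity=%d lines=%d\n", l.verbosity, len(l.lines))
	}
}
```

With these assumptions, the logger at verbosity 2 records nothing, and the logger at verbosity 4 records a single line for the reconcile-mode update.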

arjunrn · 2021-07-19T16:05:24Z

/retest

openshift-ci · 2021-07-21T12:00:28Z

@arjunrn: This pull request references Bugzilla bug 1984414, which is invalid:

expected the bug to target the "4.9.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

Bug 1984414: Log resource diffs on update only in reconcile mode

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

arjunrn · 2021-07-21T12:01:38Z

/bugzilla refresh

openshift-ci · 2021-07-21T12:01:53Z

@arjunrn: This pull request references Bugzilla bug 1984414, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.9.0) matches configured target release for branch (4.9.0)
bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jianlinliu

Details

In response to this:

/bugzilla refresh


vrutkovs

/lgtm

Logging of resource diffs was introduced to detect hotlooping in the CVO[1]. However, it also increases log verbosity during updates, when the resources have changed, which is not useful. Hence the diffs are more useful in reconcile mode. Also, the diff is generated with ObjectDiff[2], which is deprecated. ObjectDiff now just uses cmp.Diff[3] internally, so all instances of ObjectDiff have been replaced with cmp.Diff.

[1] - openshift#561
[2] - https://github.com/kubernetes/apimachinery/blob/235edae7dd90601011bbe3bcd6f84f7dc857b034/pkg/util/diff/diff.go#L57
[3] - https://pkg.go.dev/github.com/google/go-cmp/cmp#Diff

LalatenduMohanty · 2021-07-21T16:07:42Z

/hold cancel

LalatenduMohanty

/lgtm

openshift-ci · 2021-07-21T16:08:39Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: arjunrn, LalatenduMohanty, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [LalatenduMohanty,vrutkovs]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

arjunrn · 2021-07-21T18:18:17Z

/retest

openshift-bot · 2021-07-22T03:04:52Z

/retest-required