-
Notifications
You must be signed in to change notification settings - Fork 213
Bug 1984414: Log resource diffs on update only in reconcile mode #628
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1984414: Log resource diffs on update only in reconcile mode #628
Conversation
| ) | ||
|
|
||
| func ApplyCustomResourceDefinitionv1(ctx context.Context, client apiextclientv1.CustomResourceDefinitionsGetter, required *apiextv1.CustomResourceDefinition) (*apiextv1.CustomResourceDefinition, bool, error) { | ||
| func ApplyCustomResourceDefinitionv1(ctx context.Context, client apiextclientv1.CustomResourceDefinitionsGetter, required *apiextv1.CustomResourceDefinition, reconciling bool) (*apiextv1.CustomResourceDefinition, bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we pass in the mode instead of passing in a reconciling bool? Like we do for ClusterOperators here. That way we don't have to bump cluster signatures if we need to make some other mode distinction later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered that and decided against it because it would leak abstractions. Mode is part of the resourcebuilder package, where as the resource creation/update logic is in the internal. If in the future the full mode is required more refactoring can be done where the mode is moved to a common package and then it can be passed to the apply function.
| } | ||
|
|
||
| // ApplyDeploymentFromCache applies the required deployment to the cluster. | ||
| func ApplyDeploymentFromCache(ctx context.Context, lister appslisterv1.DeploymentLister, client appsclientv1.DeploymentsGetter, required *appsv1.Deployment) (*appsv1.Deployment, bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lib/resourceapply/cv.go
Outdated
| } | ||
|
|
||
| klog.V(2).Infof("Updating ClusterVersion %s due to diff: %v", required.Name, diff.ObjectDiff(existing, required)) | ||
| klog.V(2).Infof("Updating ClusterVersion %s due to diff: %v", required.Name, cmp.Diff(existing, required)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no mode guard here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unlike the other resources, the CV is created/updated from multiple controllers and the mode is not present there. So it cannot be used.
lib/resourceapply/apiext.go
Outdated
|
|
||
| klog.V(2).Infof("Updating CRD %s due to diff: %v", required.Name, diff.ObjectDiff(existing, required)) | ||
| if reconciling { | ||
| klog.V(2).Infof("Updating CRD %s due to diff: %v", required.Name, cmp.Diff(existing, required)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LalatenduMohanty argued that we should use v(4) to avoid logging too many lines on production clusters (during development we could manually bump level and check for hotloops, as it doesn't need to be done too often)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could reduce the log level but the logs will now be emitted only when there are resource updates when in reconcile mode. And this would be indicative of a hot loop. So these logs could help with diagnostics in production clusters. Changing the log level to 4 would be mean that before every release someone would have to increase the log level and check for hot loops. This would also not help in clusters where there are hot loops due to configuration which is not tested in CI.
In summary, this change basically eliminates all the logs which is considered verbose in the bug report. While at the same time keeping the logs for debugging in production clusters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with moving this to a different PR.
Lets file a bug for this change so that we could backport it? Other than that LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would also not help in clusters where there are hot loops due to configuration which is not tested in CI.
This is not true as we are only concerned about hotloops because of manifests in release payload which CVO is responsible for. We can not track every hotloops in the cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/hold
klog.V(2).Infof("Updating CRD %s due to diff: %v", required.Name, cmp.Diff(existing, required))
This will only catch hotloops from manifests which is managed by CVO. So it wont help if there is a hot loop somewhere else. Even if there is a hot loop in a production cluster it does not mean that the cluster's availability is impacted or this is a serious issue which needs to be fixed ASAP. Also I do not see much use for customers from this information because they can not change the manifests from release payload.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However if we move the log level to V(4) then we can easily get it in our CI (by bumping the log level). So IMHO it should be atleast V(4).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to V(4)
0ec0ad8 to
f5b62a5
Compare
|
/retest |
|
@arjunrn: This pull request references Bugzilla bug 1984414, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/bugzilla refresh |
|
@arjunrn: This pull request references Bugzilla bug 1984414, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
vrutkovs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Logging of resource diffs was introduced to detect hotlooping in the CVO[1]. But this increases the verbosity of logging during updates when the resources have changed which is not useful. Hence logging of the diffs is more useful in the reconcile mode. Also the diff is generated with ObjectDiff[2] which is deprecated. The ObjectDiff function now just uses cmp.Diff[3] internally and all instances of ObjectDiff have been replaced with cmp.Diff. [1] - openshift#561 [2] - https://github.com/kubernetes/apimachinery/blob/235edae7dd90601011bbe3bcd6f84f7dc857b034/pkg/util/diff/diff.go#L57 [3] - https://pkg.go.dev/github.com/google/go-cmp/cmp#Diff
f5b62a5 to
05e1af7
Compare
|
/hold cancel |
LalatenduMohanty
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: arjunrn, LalatenduMohanty, vrutkovs The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
5 similar comments
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
@arjunrn: All pull requests linked via external trackers have merged: Bugzilla bug 1984414 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
We've been using --v=5 since 88c222c (install/0000_00_cluster-version-operator_03_deployment: Bump to --v=5, 2020-08-30, openshift#448). But it leads to large quantities of noise from client-side throttling [1], and that throttling is V(3): $ grep -n . vendor/k8s.io/client-go/rest/request.go | grep -B7 -A6 '^597:' 589: switch { 590: case len(retryInfo) > 0: 591: message = fmt.Sprintf("Waited for %v, %s - request: %s:%s", latency, retryInfo, r.verb, r.URL().String()) 592: default: 593: message = fmt.Sprintf("Waited for %v due to client-side throttling, not priority and fairness, request: %s:%s", latency, r.verb, r.URL().String()) 594: } 596: if latency > longThrottleLatency { 597: klog.V(3).Info(message) 598: } 599: if latency > extraLongThrottleLatency { 600: // If the rate limiter latency is very high, the log message should be printed at a higher log level, 601: // but we use a throttled logger to prevent spamming. 602: globalThrottledLogger.Infof("%s", message) 603: } Auditing: $ git --no-pager grep 'klog.V([3-5])' vendor I don't see much that I'd miss. I liked having vendor/github.com/openshift/library-go/pkg/verify's logging back when we made that pivot, but now that that code is old and stable, I'm ok losing it. I've shifted the lib/ stuff down to V(2) using: $ sed -i 's/klog[.]V([3-5])/klog.V(2)/g' $(git grep -l klog.V lib) It's mostly hotloop-detection since 40d0a4e (Log object updates and show existing/required diff, 2021-06-03, openshift#561, [2]), and isn't all that noisy since 05e1af7 (Bug 1984414: Log resource diffs on update only in reconcile mode, 2021-07-22, openshift#628, [3]). I've shifted the pkg/ stuff down to V(2) using: $ sed -i 's/klog[.]V([3-5])/klog.V(2)/g' $(git grep -l klog.V pkg) It's mostly fairly core stuff, and low-noise except for the per-manifest "Running sync for ..." and "Done syncing for ..." messages [1]. The per-manifest messages can be useful to identify where the CVO is in the sync cycle, so I'm keeping them for now. We may be able to punt them up to higher levels if we get an alternative mechanism for identifying sync cycle positions (e.g. events for task-node completion or task failure). [1]: https://bugzilla.redhat.com/show_bug.cgi?id=2034493#c1 [2]: https://bugzilla.redhat.com/show_bug.cgi?id=1879184 [3]: https://bugzilla.redhat.com/show_bug.cgi?id=1984414
We've been using --v=5 since 88c222c (install/0000_00_cluster-version-operator_03_deployment: Bump to --v=5, 2020-08-30, openshift#448). But it leads to large quantities of noise from client-side throttling [1], and that throttling is V(3): $ grep -n . vendor/k8s.io/client-go/rest/request.go | grep -B7 -A6 '^597:' 589: switch { 590: case len(retryInfo) > 0: 591: message = fmt.Sprintf("Waited for %v, %s - request: %s:%s", latency, retryInfo, r.verb, r.URL().String()) 592: default: 593: message = fmt.Sprintf("Waited for %v due to client-side throttling, not priority and fairness, request: %s:%s", latency, r.verb, r.URL().String()) 594: } 596: if latency > longThrottleLatency { 597: klog.V(3).Info(message) 598: } 599: if latency > extraLongThrottleLatency { 600: // If the rate limiter latency is very high, the log message should be printed at a higher log level, 601: // but we use a throttled logger to prevent spamming. 602: globalThrottledLogger.Infof("%s", message) 603: } Auditing: $ git --no-pager grep 'klog.V([3-5])' vendor I don't see much that I'd miss. I liked having vendor/github.com/openshift/library-go/pkg/verify's logging back when we made that pivot, but now that that code is old and stable, I'm ok losing it. I've shifted the lib/ stuff down to V(2) using: $ sed -i 's/klog[.]V([3-5])/klog.V(2)/g' $(git grep -l klog.V lib) It's mostly hotloop-detection since 40d0a4e (Log object updates and show existing/required diff, 2021-06-03, openshift#561, [2]), and isn't all that noisy since 05e1af7 (Bug 1984414: Log resource diffs on update only in reconcile mode, 2021-07-22, openshift#628, [3]). I've shifted the pkg/ stuff down to V(2) using: $ sed -i 's/klog[.]V([3-5])/klog.V(2)/g' $(git grep -l klog.V pkg) It's mostly fairly core stuff, and low-noise except for the per-manifest "Running sync for ..." and "Done syncing for ..." messages [1]. The per-manifest messages can be useful to identify where the CVO is in the sync cycle, so I'm keeping them for now. We may be able to punt them up to higher levels if we get an alternative mechanism for identifying sync cycle positions (e.g. events for task-node completion or task failure). [1]: https://bugzilla.redhat.com/show_bug.cgi?id=2034493#c1 [2]: https://bugzilla.redhat.com/show_bug.cgi?id=1879184 [3]: https://bugzilla.redhat.com/show_bug.cgi?id=1984414
We've been using --v=5 since 88c222c (install/0000_00_cluster-version-operator_03_deployment: Bump to --v=5, 2020-08-30, openshift#448). But it leads to large quantities of noise from client-side throttling [1], and that throttling is V(3): $ grep -n . vendor/k8s.io/client-go/rest/request.go | grep -B7 -A6 '^597:' 589: switch { 590: case len(retryInfo) > 0: 591: message = fmt.Sprintf("Waited for %v, %s - request: %s:%s", latency, retryInfo, r.verb, r.URL().String()) 592: default: 593: message = fmt.Sprintf("Waited for %v due to client-side throttling, not priority and fairness, request: %s:%s", latency, r.verb, r.URL().String()) 594: } 596: if latency > longThrottleLatency { 597: klog.V(3).Info(message) 598: } 599: if latency > extraLongThrottleLatency { 600: // If the rate limiter latency is very high, the log message should be printed at a higher log level, 601: // but we use a throttled logger to prevent spamming. 602: globalThrottledLogger.Infof("%s", message) 603: } Auditing: $ git --no-pager grep 'klog.V([3-5])' vendor I don't see much that I'd miss. I liked having vendor/github.com/openshift/library-go/pkg/verify's logging back when we made that pivot, but now that that code is old and stable, I'm ok losing it. I've shifted the lib/ stuff down to V(2) using: $ sed -i 's/klog[.]V([3-5])/klog.V(2)/g' $(git grep -l klog.V lib) It's mostly hotloop-detection since 40d0a4e (Log object updates and show existing/required diff, 2021-06-03, openshift#561, [2]), and isn't all that noisy since 05e1af7 (Bug 1984414: Log resource diffs on update only in reconcile mode, 2021-07-22, openshift#628, [3]). I've shifted the pkg/ stuff down to V(2) using: $ sed -i 's/klog[.]V([3-5])/klog.V(2)/g' $(git grep -l klog.V pkg) It's mostly fairly core stuff, and low-noise except for the per-manifest "Running sync for ..." and "Done syncing for ..." messages [1]. The per-manifest messages can be useful to identify where the CVO is in the sync cycle, so I'm keeping them for now. We may be able to punt them up to higher levels if we get an alternative mechanism for identifying sync cycle positions (e.g. events for task-node completion or task failure). [1]: https://bugzilla.redhat.com/show_bug.cgi?id=2034493#c1 [2]: https://bugzilla.redhat.com/show_bug.cgi?id=1879184 [3]: https://bugzilla.redhat.com/show_bug.cgi?id=1984414
We've been using --v=5 since 88c222c (install/0000_00_cluster-version-operator_03_deployment: Bump to --v=5, 2020-08-30, openshift#448). But it leads to large quantities of noise from client-side throttling [1], and that throttling is V(3): $ grep -n . vendor/k8s.io/client-go/rest/request.go | grep -B7 -A6 '^597:' 589: switch { 590: case len(retryInfo) > 0: 591: message = fmt.Sprintf("Waited for %v, %s - request: %s:%s", latency, retryInfo, r.verb, r.URL().String()) 592: default: 593: message = fmt.Sprintf("Waited for %v due to client-side throttling, not priority and fairness, request: %s:%s", latency, r.verb, r.URL().String()) 594: } 596: if latency > longThrottleLatency { 597: klog.V(3).Info(message) 598: } 599: if latency > extraLongThrottleLatency { 600: // If the rate limiter latency is very high, the log message should be printed at a higher log level, 601: // but we use a throttled logger to prevent spamming. 602: globalThrottledLogger.Infof("%s", message) 603: } Auditing: $ git --no-pager grep 'klog.V([3-5])' vendor I don't see much that I'd miss. I liked having vendor/github.com/openshift/library-go/pkg/verify's logging back when we made that pivot, but now that that code is old and stable, I'm ok losing it. I've shifted the lib/ stuff down to V(2) using: $ sed -i 's/klog[.]V([3-5])/klog.V(2)/g' $(git grep -l klog.V lib) It's mostly hotloop-detection since 40d0a4e (Log object updates and show existing/required diff, 2021-06-03, openshift#561, [2]), and isn't all that noisy since 05e1af7 (Bug 1984414: Log resource diffs on update only in reconcile mode, 2021-07-22, openshift#628, [3]). I've shifted the pkg/ stuff down to V(2) using: $ sed -i 's/klog[.]V([3-5])/klog.V(2)/g' $(git grep -l klog.V pkg) It's mostly fairly core stuff, and low-noise except for the per-manifest "Running sync for ..." and "Done syncing for ..." messages [1]. The per-manifest messages can be useful to identify where the CVO is in the sync cycle, so I'm keeping them for now. We may be able to punt them up to higher levels if we get an alternative mechanism for identifying sync cycle positions (e.g. events for task-node completion or task failure). [1]: https://bugzilla.redhat.com/show_bug.cgi?id=2034493#c1 [2]: https://bugzilla.redhat.com/show_bug.cgi?id=1879184 [3]: https://bugzilla.redhat.com/show_bug.cgi?id=1984414
We've been using --v=5 since 88c222c (install/0000_00_cluster-version-operator_03_deployment: Bump to --v=5, 2020-08-30, openshift#448). But it leads to large quantities of noise from client-side throttling [1], and that throttling is V(3): $ grep -n . vendor/k8s.io/client-go/rest/request.go | grep -B7 -A6 '^597:' 589: switch { 590: case len(retryInfo) > 0: 591: message = fmt.Sprintf("Waited for %v, %s - request: %s:%s", latency, retryInfo, r.verb, r.URL().String()) 592: default: 593: message = fmt.Sprintf("Waited for %v due to client-side throttling, not priority and fairness, request: %s:%s", latency, r.verb, r.URL().String()) 594: } 596: if latency > longThrottleLatency { 597: klog.V(3).Info(message) 598: } 599: if latency > extraLongThrottleLatency { 600: // If the rate limiter latency is very high, the log message should be printed at a higher log level, 601: // but we use a throttled logger to prevent spamming. 602: globalThrottledLogger.Infof("%s", message) 603: } Auditing: $ git --no-pager grep 'klog.V([3-5])' vendor I don't see much that I'd miss. I liked having vendor/github.com/openshift/library-go/pkg/verify's logging back when we made that pivot, but now that that code is old and stable, I'm ok losing it. I've shifted the lib/ stuff down to V(2) using: $ sed -i 's/klog[.]V([3-5])/klog.V(2)/g' $(git grep -l klog.V lib) It's mostly hotloop-detection since 40d0a4e (Log object updates and show existing/required diff, 2021-06-03, openshift#561, [2]), and isn't all that noisy since 05e1af7 (Bug 1984414: Log resource diffs on update only in reconcile mode, 2021-07-22, openshift#628, [3]). I've shifted the pkg/ stuff down to V(2) using: $ sed -i 's/klog[.]V([3-5])/klog.V(2)/g' $(git grep -l klog.V pkg) It's mostly fairly core stuff, and low-noise except for the per-manifest "Running sync for ..." and "Done syncing for ..." messages [1]. The per-manifest messages can be useful to identify where the CVO is in the sync cycle, so I'm keeping them for now. We may be able to punt them up to higher levels if we get an alternative mechanism for identifying sync cycle positions (e.g. events for task-node completion or task failure). [1]: https://bugzilla.redhat.com/show_bug.cgi?id=2034493#c1 [2]: https://bugzilla.redhat.com/show_bug.cgi?id=1879184 [3]: https://bugzilla.redhat.com/show_bug.cgi?id=1984414
I'd accidentally broken this in e8ffccb (lib: Add autogeneration for some resource* functionality, 2020-07-29, openshift#420), when I moved the health checks out of the Do methods and began passing typedObject into them. typedObject is the release manifest, which lacks the status attributes we want the health checks to cover. This also catches the generation script up with some past manual changes: * 0afb8a8 (Add a manifest annotation to be used for object deletion, 2020-08-17, openshift#438). * 05e1af7 (Log resource diffs on update only in reconcile mode, 2021-07-13, openshift#628). The always-fatal ApplyAPIServicev1 function gives us the required symmetry with 0afb8a8's DeleteAPIServicev1, although I'm a bit fuzzy on who would have been using an APIService manifest. I'm not excited about supporting a new manifest kind I know nothing about, and failing all attempts at non-delete reconciling should protect us from accidentally encouraging manifest authors to feed us this kind.
I'd accidentally broken this in e8ffccb (lib: Add autogeneration for some resource* functionality, 2020-07-29, openshift#420), when I moved the health checks out of the Do methods and began passing typedObject into them. typedObject is the release manifest, which lacks the status attributes we want the health checks to cover. This also catches the generation script up with some past manual changes: * 0afb8a8 (Add a manifest annotation to be used for object deletion, 2020-08-17, openshift#438). * 05e1af7 (Log resource diffs on update only in reconcile mode, 2021-07-13, openshift#628). The always-fatal ApplyAPIServicev1 function gives us the required symmetry with 0afb8a8's DeleteAPIServicev1, although I'm a bit fuzzy on who would have been using an APIService manifest. I'm not excited about supporting a new manifest kind I know nothing about, and failing all attempts at non-delete reconciling should protect us from accidentally encouraging manifest authors to feed us this kind.
I'd accidentally broken this in e8ffccb (lib: Add autogeneration for some resource* functionality, 2020-07-29, openshift#420), when I moved the health checks out of the Do methods and began passing typedObject into them. typedObject is the release manifest, which lacks the status attributes we want the health checks to cover. This also catches the generation script up with some past manual changes: * 0afb8a8 (Add a manifest annotation to be used for object deletion, 2020-08-17, openshift#438). * 05e1af7 (Log resource diffs on update only in reconcile mode, 2021-07-13, openshift#628). The always-fatal ApplyAPIServicev1 function gives us the required symmetry with 0afb8a8's DeleteAPIServicev1, although I'm a bit fuzzy on who would have been using an APIService manifest. I'm not excited about supporting a new manifest kind I know nothing about, and failing all attempts at non-delete reconciling should protect us from accidentally encouraging manifest authors to feed us this kind.
I'd accidentally broken this in e8ffccb (lib: Add autogeneration for some resource* functionality, 2020-07-29, openshift#420), when I moved the health checks out of the Do methods and began passing typedObject into them. typedObject is the release manifest, which lacks the status attributes we want the health checks to cover. This also catches the generation script up with some past manual changes: * 0afb8a8 (Add a manifest annotation to be used for object deletion, 2020-08-17, openshift#438). * 05e1af7 (Log resource diffs on update only in reconcile mode, 2021-07-13, openshift#628).
Logging of resource diffs was introduced to detect hotlooping in the CVO. But this increases the verbosity of logging during updates when the resources have changed which is not useful. The logging of the diffs is more useful in the reconcile mode. Also the diff is generated with ObjectDiff which is deprecated. The
ObjectDifffunction now just uses cmp.Diff internally and all instances ofObjectDiffhave been replaced withcmp.Diff.