Skip to content

Conversation

@wking
Copy link
Member

@wking wking commented Jan 4, 2022

We've been using --v=5 since 88c222c (#448). But it leads to large quantities of noise from client-side throttling, and that throttling is V(3):

$ grep -n . vendor/k8s.io/client-go/rest/request.go | grep -B7 -A6 '^597:'
589:    switch {
590:    case len(retryInfo) > 0:
591:            message = fmt.Sprintf("Waited for %v, %s - request: %s:%s", latency, retryInfo, r.verb, r.URL().String())
592:    default:
593:            message = fmt.Sprintf("Waited for %v due to client-side throttling, not priority and fairness, request: %s:%s", latency, r.verb, r.URL().String())
594:    }
596:    if latency > longThrottleLatency {
597:            klog.V(3).Info(message)
598:    }
599:    if latency > extraLongThrottleLatency {
600:            // If the rate limiter latency is very high, the log message should be printed at a higher log level,
601:            // but we use a throttled logger to prevent spamming.
602:            globalThrottledLogger.Infof("%s", message)
603:    }

Auditing:

$ git --no-pager grep 'klog.V([3-5])' vendor

I don't see much that I'd miss. I liked having vendor/github.com/openshift/library-go/pkg/verify's logging back when we made that pivot, but now that that code is old and stable, I'm ok losing it.

I've shifted the lib/ stuff down to V(2) using:

$ sed -i 's/klog[.]V([3-5])/klog.V(2)/g' $(git grep -l klog.V lib)

It's mostly hotloop-detection since 40d0a4e (#561, rhbz#1879184), and isn't all that noisy since 05e1af7 (#628, rhbz#1984414).

I've shifted the pkg/ stuff down to V(2) using:

$ sed -i 's/klog[.]V([3-5])/klog.V(2)/g' $(git grep -l klog.V pkg)

It's mostly fairly core stuff, and low-noise except for the per-manifest Running sync for ... and Done syncing for ... messages. The per-manifest messages can be useful to identify where the CVO is in the sync cycle, so I'm keeping them for now. We may be able to punt them up to higher levels if we get an alternative mechanism for identifying sync cycle positions (e.g. events for task-node completion or task failure).

We've been using --v=5 since 88c222c
(install/0000_00_cluster-version-operator_03_deployment: Bump to
--v=5, 2020-08-30, openshift#448).  But it leads to large quantities of noise
from client-side throttling [1], and that throttling is V(3):

  $ grep -n . vendor/k8s.io/client-go/rest/request.go | grep -B7 -A6 '^597:'
  589:    switch {
  590:    case len(retryInfo) > 0:
  591:            message = fmt.Sprintf("Waited for %v, %s - request: %s:%s", latency, retryInfo, r.verb, r.URL().String())
  592:    default:
  593:            message = fmt.Sprintf("Waited for %v due to client-side throttling, not priority and fairness, request: %s:%s", latency, r.verb, r.URL().String())
  594:    }
  596:    if latency > longThrottleLatency {
  597:            klog.V(3).Info(message)
  598:    }
  599:    if latency > extraLongThrottleLatency {
  600:            // If the rate limiter latency is very high, the log message should be printed at a higher log level,
  601:            // but we use a throttled logger to prevent spamming.
  602:            globalThrottledLogger.Infof("%s", message)
  603:    }

Auditing:

  $ git --no-pager grep 'klog.V([3-5])' vendor

I don't see much that I'd miss.  I liked having
vendor/github.com/openshift/library-go/pkg/verify's logging back when
we made that pivot, but now that that code is old and stable, I'm ok
losing it.

I've shifted the lib/ stuff down to V(2) using:

  $ sed -i 's/klog[.]V([3-5])/klog.V(2)/g' $(git grep -l klog.V lib)

It's mostly hotloop-detection since 40d0a4e (Log object updates and
show existing/required diff, 2021-06-03, openshift#561, [2]), and isn't all
that noisy since 05e1af7 (Bug 1984414: Log resource diffs on update
only in reconcile mode, 2021-07-22, openshift#628, [3]).

I've shifted the pkg/ stuff down to V(2) using:

  $ sed -i 's/klog[.]V([3-5])/klog.V(2)/g' $(git grep -l klog.V pkg)

It's mostly fairly core stuff, and low-noise except for the
per-manifest "Running sync for ..." and "Done syncing for ..."
messages [1].  The per-manifest messages can be useful to identify
where the CVO is in the sync cycle, so I'm keeping them for now.  We
may be able to punt them up to higher levels if we get an alternative
mechanism for identifying sync cycle positions (e.g. events for
task-node completion or task failure).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2034493#c1
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1879184
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1984414
@openshift-ci openshift-ci bot added the bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. label Jan 4, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 4, 2022

@wking: This pull request references Bugzilla bug 2034493, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.0) matches configured target release for branch (4.10.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jianlinliu

Details

In response to this:

Bug 2034493: *: Use --v=2 logging to drop client-side throttling noise

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jan 4, 2022
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 4, 2022
Copy link
Member

@LalatenduMohanty LalatenduMohanty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 4, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [LalatenduMohanty,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 4, 2022
@wking
Copy link
Member Author

wking commented Jan 4, 2022

/hold

I want to peak at the CI results before this lands.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 4, 2022
@wking
Copy link
Member Author

wking commented Jan 4, 2022

Three rounds of msg=Error: error creating/updating Private DNS Zone...The gateway did not receive a response from 'Microsoft.Network' within the specified time period..

/retest

@wking
Copy link
Member Author

wking commented Jan 5, 2022

/retest

@wking
Copy link
Member Author

wking commented Jan 5, 2022

Looking at the e2e run:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/721/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic/1478469497282105344/artifacts/e2e-agnostic/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5b44c85bd4-qgwdw_cluster-version-operator.log | sed -n 's/.* \([^ ]*[.]go:[0-9]*\)].*/\1/p' | sort | uniq -c | sort -n | tail -n7
    288 cvo.go:514
    288 cvo.go:516
    352 apps.go:38
   1512 sync_worker.go:393
   1558 task_graph.go:477
  14548 sync_worker.go:771
  14551 sync_worker.go:759

We still see a handful of the extraLongThrottleLatency messages:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/721/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic/1478469497282105344/artifacts/e2e-agnostic/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5b44c85bd4-qgwdw_cluster-version-operator.log | grep 'Waited for' | head -n3
I0104 21:41:18.122651       1 request.go:665] Waited for 1.006875105s due to client-side throttling, not priority and fairness, request: GET:https://127.0.0.1:6443/api/v1/namespaces/openshift-cluster-samples-operator
I0104 21:41:19.148129       1 request.go:665] Waited for 1.497948234s due to client-side throttling, not priority and fairness, request: GET:https://127.0.0.1:6443/apis/operators.coreos.com/v1/namespaces/openshift-monitoring/operatorgroups/openshift-cluster-monitoring
I0104 21:41:20.172157       1 request.go:665] Waited for 1.370806868s due to client-side throttling, not priority and fairness, request: GET:https://127.0.0.1:6443/apis/operator.openshift.io/v1/csisnapshotcontrollers/cluster

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/721/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic/1478469497282105344/artifacts/e2e-agnostic/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5b44c85bd4-qgwdw_cluster-version-operator.log | grep 'Waited for' | wc -l
26

but that's acceptably quiet.

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 5, 2022
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

8 similar comments
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@sdodson
Copy link
Member

sdodson commented Jan 5, 2022

About a 13% reduction in log volume compared to an unrelated but identical CI job

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/720/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic/1476323230380724224/artifacts/e2e-agnostic/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-66b54655f7-5qtp9_cluster-version-operator.log | sed -n 's/.* \([^ ]*[.]go:[0-9]*\)].*/\1/p' | sort | uniq -c | sort -n | tail -n7
    274 task_graph.go:474
    308 apps.go:38
   1394 task_graph.go:477
   1512 sync_worker.go:393
   8519 request.go:597
  13014 sync_worker.go:771
  13017 sync_worker.go:759

We grew ~ 3000 lines for sync_worker.go but got rid of the 8519 request.go lines. I think we should still look hard at whether or not we need those other log entries in the near future.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@wking
Copy link
Member Author

wking commented Jan 5, 2022

Hrmm, update job seems to be wedged on:

level=fatal msg=failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": api.ci.azure.devcluster.openshift.com CNAME record already exists in ci-op-twz79rwz-7ee27 and might be in use by another cluster, please remove it to continue

I could probably bump the commit or some such to get a new job hash and avoid that issue, but I'm personally confident enough in this change and the other green presubmits to just override the job:

/override ci/prow/e2e-agnostic-upgrade

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 5, 2022

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade

Details

In response to this:

Hrmm, update job seems to be wedged on:

level=fatal msg=failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": api.ci.azure.devcluster.openshift.com CNAME record already exists in ci-op-twz79rwz-7ee27 and might be in use by another cluster, please remove it to continue

I could probably bump the commit or some such to get a new job hash and avoid that issue, but I'm personally confident enough in this change and the other green presubmits to just override the job:

/override ci/prow/e2e-agnostic-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 5, 2022

@wking: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 3db74eb into openshift:master Jan 5, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 5, 2022

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 2034493 has been moved to the MODIFIED state.

Details

In response to this:

Bug 2034493: *: Use --v=2 logging to drop client-side throttling noise

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the shuffle-log-levels branch January 5, 2022 19:50
@sdodson
Copy link
Member

sdodson commented Jan 7, 2022

/cherry-pick release-4.9
Probably conflicts, who knows.

@openshift-cherrypick-robot

@sdodson: #721 failed to apply on top of branch "release-4.9":

Applying: *: Use --v=2 logging to drop client-side throttling noise
Using index info to reconstruct a base tree...
M	bootstrap/bootstrap-pod.yaml
M	install/0000_00_cluster-version-operator_03_deployment.yaml
M	pkg/cincinnati/cincinnati.go
A	pkg/clusterconditions/cache/cache.go
A	pkg/clusterconditions/clusterconditions.go
A	pkg/clusterconditions/promql/promql.go
M	pkg/cvo/availableupdates.go
M	pkg/cvo/cvo.go
M	pkg/cvo/cvo_test.go
M	pkg/cvo/sync_worker.go
M	pkg/cvo/updatepayload.go
M	pkg/payload/payload.go
M	pkg/payload/precondition/clusterversion/upgradeable.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/payload/precondition/clusterversion/upgradeable.go
CONFLICT (content): Merge conflict in pkg/payload/precondition/clusterversion/upgradeable.go
Auto-merging pkg/payload/payload.go
Auto-merging pkg/cvo/updatepayload.go
Auto-merging pkg/cvo/sync_worker.go
CONFLICT (content): Merge conflict in pkg/cvo/sync_worker.go
Auto-merging pkg/cvo/cvo_test.go
Auto-merging pkg/cvo/cvo.go
Auto-merging pkg/cvo/availableupdates.go
CONFLICT (content): Merge conflict in pkg/cvo/availableupdates.go
CONFLICT (modify/delete): pkg/clusterconditions/promql/promql.go deleted in HEAD and modified in *: Use --v=2 logging to drop client-side throttling noise. Version *: Use --v=2 logging to drop client-side throttling noise of pkg/clusterconditions/promql/promql.go left in tree.
CONFLICT (modify/delete): pkg/clusterconditions/clusterconditions.go deleted in HEAD and modified in *: Use --v=2 logging to drop client-side throttling noise. Version *: Use --v=2 logging to drop client-side throttling noise of pkg/clusterconditions/clusterconditions.go left in tree.
CONFLICT (modify/delete): pkg/clusterconditions/cache/cache.go deleted in HEAD and modified in *: Use --v=2 logging to drop client-side throttling noise. Version *: Use --v=2 logging to drop client-side throttling noise of pkg/clusterconditions/cache/cache.go left in tree.
Auto-merging pkg/cincinnati/cincinnati.go
Auto-merging install/0000_00_cluster-version-operator_03_deployment.yaml
Auto-merging bootstrap/bootstrap-pod.yaml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 *: Use --v=2 logging to drop client-side throttling noise
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherry-pick release-4.9
Probably conflicts, who knows.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking added a commit to wking/cluster-version-operator that referenced this pull request Mar 15, 2022
We moved from --v=5 to --v=2 with c4a1004 (*: Use --v=2 logging to
drop client-side throttling noise, 2022-01-04, openshift#721).  But this V(5)
logging slipped through via the parallel 5d343a5 (pkg/payload: Log
manifest exclusion, 2021-12-10, openshift#712).  Catch it up, to restore
exclusion logging.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants