@p0lyn0mial
Contributor

This PR improves the "API LBs follow /readyz of kube-apiserver and stop sending requests before server shutdowns for external clients" test. In particular, it:

  1. Processes all available audit logs, not just the last one.
  2. No longer closes the audit log file prematurely, so that the entire file can be processed.
  3. Checks scanner.Err().
  4. Ensures that opened files are always closed, even if the test fails partway through.

Note that because the audit logs were not fully processed before this PR, we might start seeing some failures.

…it files

before this PR audit logs were not fully read because the files
were already closed.
before this PR only a single audit log could be read.
@p0lyn0mial
Contributor Author

/assign @tkashem

@p0lyn0mial changed the title from "improve test/apiserver/graceful_termination" to "NO-JIRA: improve test/apiserver/graceful_termination" on Aug 14, 2024
@openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) on Aug 14, 2024
@openshift-ci-robot

@p0lyn0mial: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

}
o.Expect(text).To(o.HavePrefix(`{"kind":"Event",`))

if strings.Contains(text, "openshift.io/during-graceful") && strings.Contains(text, "openshift-origin-external-backend-sampler") {
Contributor

Have a look at https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-serial-ovn-ipv6/1820973338286100480

The gathered audit logs (including termination.log) show:

grep -r "through a connection created very late in the graceful termination process" *
2 hits

grep -r "openshift.io/during-graceful"
0 hits

I'm not saying I don't want this. I am noting that something appears off, since the audit log isn't showing what you expect.

Contributor Author

@p0lyn0mial Aug 21, 2024

I'm glad you checked it. It turns out that this was changed between versions 4.13 and 4.14 :(

The annotation was replaced by new filters in this pull request, which added a new annotation https://apiserver.k8s.io/shutdown, but the test wasn't updated. :(

Later on, the new annotation was removed and replaced by sending an HTTP response header in this pull request, and no test was added. (?)

This morning, I also realized that there is a test (API LBs follow /readyz of kube-apiserver and stop sending requests) that checks for the LateConnections event.

The issue is that in the run you linked, the event wasn't persisted in the database because the server had issues connecting to the database! For instance:

[-]etcd failed: error getting data from etcd: context deadline exceeded
[-]etcd-readiness failed: error getting data from etcd: context deadline exceeded
etcd retry - counter: 4, lastErrLabel: Unavailable lastError: etcdserver: request timed out, error: context deadline exceeded

That's why we should not rely solely on events.

I'm fine with replacing this test with the one that is going to read the termination log files. (cc @sanchezl)

Contributor Author

The openshift.io/during-graceful annotation was brought back in openshift/kubernetes#2077

Contributor Author

OK, I have checked https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/29006/pull-ci-openshift-origin-master-e2e-metal-ipi-ovn/1833121043204542464/artifacts/e2e-metal-ipi-ovn/gather-audit-logs/artifacts/

The gathered audit logs show some results (the test didn't fail because the number of hits was below the threshold):

grep -r "openshift.io/during-graceful"
./master-0-audit-2024-09-09T14-17-24.568.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"58f32bd7-808e-42f5-a08d-5f731844a2dc","stage":"ResponseComplete","requestURI":"/apis/coordination.k8s.io/v1/namespaces/openshift-marketplace/leases/marketplace-operator-lock","verb":"get","user":{"username":"system:serviceaccount:openshift-marketplace:marketplace-operator","uid":"a8ab4437-6ff2-4947-b718-58c0010fc837","groups":["system:serviceaccounts","system:serviceaccounts:openshift-marketplace","system:authenticated"],"extra":{"authentication.kubernetes.io/credential-id":["JTI=2b6ed9b6-755b-4458-86d7-a80964178dc1"],"authentication.kubernetes.io/node-name":["master-0"],"authentication.kubernetes.io/node-uid":["0ed94f1d-c7c5-4be0-9b82-b1dff79b1ec6"],"authentication.kubernetes.io/pod-name":["marketplace-operator-85c678694c-qpqpf"],"authentication.kubernetes.io/pod-uid":["d9ea82a1-1caf-4657-83a3-7b57e2acd59f"]}},"sourceIPs":["10.128.0.23"],"userAgent":"marketplace-operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"leases","namespace":"openshift-marketplace","name":"marketplace-operator-lock","apiGroup":"coordination.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","message":"context deadline exceeded","code":500},"requestReceivedTimestamp":"2024-09-09T13:26:41.930788Z","stageTimestamp":"2024-09-09T13:27:41.934549Z","annotations":{"apiserver.latency.k8s.io/etcd":"1m0.000435665s","apiserver.latency.k8s.io/response-write":"5.726µs","apiserver.latency.k8s.io/serialize-response-object":"2.481828ms","apiserver.latency.k8s.io/total":"1m0.003761653s","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by RoleBinding \"marketplace-operator/openshift-marketplace\" of Role \"marketplace-operator\" to ServiceAccount \"marketplace-operator/openshift-marketplace\"","openshift.io/during-graceful":"loopback=false,,readyz=false"}}

./master-1-audit-2024-09-09T14-17-59.588.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"775c595d-125c-46a5-b6b2-2ab65a12ffeb","stage":"ResponseComplete","requestURI":"/apis/operator.openshift.io/v1/authentications/cluster/status","verb":"update","user":{"username":"system:serviceaccount:openshift-authentication-operator:authentication-operator","uid":"36d42c5c-bfe9-4fff-8fef-055db4f6df4d","groups":["system:serviceaccounts","system:serviceaccounts:openshift-authentication-operator","system:authenticated"],"extra":{"authentication.kubernetes.io/credential-id":["JTI=ecfaae00-a5d0-4ca3-b3dd-91edcf397651"],"authentication.kubernetes.io/node-name":["master-0"],"authentication.kubernetes.io/node-uid":["0ed94f1d-c7c5-4be0-9b82-b1dff79b1ec6"],"authentication.kubernetes.io/pod-name":["authentication-operator-7df766656f-g7c6p"],"authentication.kubernetes.io/pod-uid":["b19ee6fa-872f-4c58-bad2-f8a897a165a6"]}},"sourceIPs":["192.168.111.20"],"userAgent":"authentication-operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"authentications","name":"cluster","uid":"c6b81673-1467-411f-a9d2-372101ab73fc","apiGroup":"operator.openshift.io","apiVersion":"v1","resourceVersion":"20474","subresource":"status"},"responseStatus":{"metadata":{},"status":"Failure","message":"Timeout: request did not complete within requested timeout - context deadline exceeded","reason":"Timeout","details":{},"code":504},"requestReceivedTimestamp":"2024-09-09T13:34:48.835708Z","stageTimestamp":"2024-09-09T13:35:22.851223Z","annotations":{"apiserver.latency.k8s.io/etcd":"33.090109659s","apiserver.latency.k8s.io/response-write":"2.871µs","apiserver.latency.k8s.io/serialize-response-object":"141.511µs","apiserver.latency.k8s.io/total":"34.015514604s","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:openshift:operator:authentication\" of ClusterRole \"cluster-admin\" to ServiceAccount \"authentication-operator/openshift-authentication-operator\"","openshift.io/during-graceful":"loopback=false,,readyz=false"}}

./master-1-audit-2024-09-09T14-17-59.588.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"f155e7ff-b027-44c6-8fdb-6a3c6bd0e2d7","stage":"ResponseComplete","requestURI":"/apis/coordination.k8s.io/v1/namespaces/openshift-marketplace/leases/marketplace-operator-lock","verb":"get","user":{"username":"system:serviceaccount:openshift-marketplace:marketplace-operator","uid":"a8ab4437-6ff2-4947-b718-58c0010fc837","groups":["system:serviceaccounts","system:serviceaccounts:openshift-marketplace","system:authenticated"],"extra":{"authentication.kubernetes.io/credential-id":["JTI=2b6ed9b6-755b-4458-86d7-a80964178dc1"],"authentication.kubernetes.io/node-name":["master-0"],"authentication.kubernetes.io/node-uid":["0ed94f1d-c7c5-4be0-9b82-b1dff79b1ec6"],"authentication.kubernetes.io/pod-name":["marketplace-operator-85c678694c-qpqpf"],"authentication.kubernetes.io/pod-uid":["d9ea82a1-1caf-4657-83a3-7b57e2acd59f"]}},"sourceIPs":["192.168.111.20"],"userAgent":"marketplace-operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"leases","namespace":"openshift-marketplace","name":"marketplace-operator-lock","apiGroup":"coordination.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2024-09-09T13:34:39.919997Z","stageTimestamp":"2024-09-09T13:35:26.540072Z","annotations":{"apiserver.latency.k8s.io/etcd":"44.402019535s","apiserver.latency.k8s.io/response-write":"5.499µs","apiserver.latency.k8s.io/serialize-response-object":"198.405µs","apiserver.latency.k8s.io/total":"46.62007528s","apiserver.latency.k8s.io/transform-response-object":"247ns","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by RoleBinding \"marketplace-operator/openshift-marketplace\" of Role \"marketplace-operator\" to ServiceAccount \"marketplace-operator/openshift-marketplace\"","openshift.io/during-graceful":"loopback=false,,readyz=false"}}

@p0lyn0mial
Contributor Author

/test all

@deads2k
Contributor

deads2k commented Sep 17, 2024

Please move to a monitortest since monitor tests are run for all job types and are able to create intervals for consumption by other analysis tools (this cannot do so efficiently).

@p0lyn0mial
Contributor Author

> Please move to a monitortest since monitor tests are run for all job types and are able to create intervals for consumption by other analysis tools (this cannot do so efficiently).

This PR fixes the existing test, and moving the code would require creating a new test. While it may not fully align with the monitor test framework, having a test in place is better than having no test at all, as it still provides some coverage and ensures the functionality is being validated.

@vrutkovs
Contributor

vrutkovs commented Oct 17, 2024

/retitle OCPBUGS-43483: improve test/apiserver/graceful_termination

I think we still want this, the test can be converted into monitortest in a separate PR

@openshift-ci bot changed the title from "NO-JIRA: improve test/apiserver/graceful_termination" to "OCPBUGS-43483: improve test/apiserver/graceful_termination" on Oct 17, 2024
@openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) on Oct 17, 2024
@openshift-ci-robot

@p0lyn0mial: This pull request references Jira Issue OCPBUGS-43483, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci openshift-ci bot requested a review from wangke19 October 17, 2024 06:53
@vrutkovs
Contributor

/test e2e-metal-ipi-ovn-kube-apiserver-rollout e2e-aws-ovn-kube-apiserver-rollout

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 16, 2025
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 16, 2025
@openshift-merge-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 29, 2025
@openshift-ci
Contributor

openshift-ci bot commented May 2, 2025

@p0lyn0mial: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-single-node-upgrade | ea43552 | link | false | /test e2e-aws-ovn-single-node-upgrade |
| ci/prow/e2e-aws-ovn-single-node-serial | ea43552 | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/e2e-aws-ovn-ipsec-serial | ea43552 | link | false | /test e2e-aws-ovn-ipsec-serial |
| ci/prow/e2e-agnostic-ovn-cmd | ea43552 | link | false | /test e2e-agnostic-ovn-cmd |
| ci/prow/e2e-gcp-ovn-builds | ea43552 | link | true | /test e2e-gcp-ovn-builds |
| ci/prow/e2e-aws-ovn-kube-apiserver-rollout | ea43552 | link | false | /test e2e-aws-ovn-kube-apiserver-rollout |
| ci/prow/e2e-aws-ovn-microshift | ea43552 | link | true | /test e2e-aws-ovn-microshift |
| ci/prow/images | ea43552 | link | true | /test images |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | ea43552 | link | true | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-aws-ovn-fips | ea43552 | link | true | /test e2e-aws-ovn-fips |
| ci/prow/unit | ea43552 | link | true | /test unit |
| ci/prow/e2e-vsphere-ovn | ea43552 | link | true | /test e2e-vsphere-ovn |
| ci/prow/e2e-aws-ovn-serial | ea43552 | link | true | /test e2e-aws-ovn-serial |
| ci/prow/verify | ea43552 | link | true | /test verify |
| ci/prow/e2e-gcp-ovn-upgrade | ea43552 | link | true | /test e2e-gcp-ovn-upgrade |
| ci/prow/e2e-aws-ovn-edge-zones | ea43552 | link | true | /test e2e-aws-ovn-edge-zones |
| ci/prow/e2e-aws-ovn-microshift-serial | ea43552 | link | true | /test e2e-aws-ovn-microshift-serial |
| ci/prow/lint | ea43552 | link | true | /test lint |
| ci/prow/verify-deps | ea43552 | link | true | /test verify-deps |
| ci/prow/e2e-vsphere-ovn-upi | ea43552 | link | true | /test e2e-vsphere-ovn-upi |
| ci/prow/okd-scos-images | ea43552 | link | true | /test okd-scos-images |
| ci/prow/e2e-gcp-ovn | ea43552 | link | true | /test e2e-gcp-ovn |
| ci/prow/e2e-aws-ovn-serial-1of2 | ea43552 | link | true | /test e2e-aws-ovn-serial-1of2 |
| ci/prow/e2e-aws-ovn-serial-2of2 | ea43552 | link | true | /test e2e-aws-ovn-serial-2of2 |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-trt openshift-trt bot deleted a comment from openshift-trt-bot May 2, 2025
@openshift-trt

openshift-trt bot commented May 2, 2025

Job Failure Risk Analysis for sha: ea43552

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-main-e2e-aws-ovn-serial-1of2 | IncompleteTests. Tests for this run (2) are below the historical average (1133): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2 | IncompleteTests. Tests for this run (2) are below the historical average (1055): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |

@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Jun 2, 2025
@openshift-ci-robot

@p0lyn0mial: An error was encountered getting issue for bug OCPBUGS-43483 on the Jira server at https://issues.redhat.com/. No known errors were detected, please see the full error message for details.

Full error message. No response returned: Get "https://issues.redhat.com/rest/api/2/issue/OCPBUGS-43483": GET https://issues.redhat.com/rest/api/2/issue/OCPBUGS-43483 giving up after 5 attempt(s)

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.

@openshift-ci
Contributor

openshift-ci bot commented Jun 2, 2025

@openshift-bot: Closed this PR.

Labels

jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.

7 participants