OCPBUGS-43483: improve test/apiserver/graceful_termination #29006
…it files before this PR audit logs were not fully read because the files were already closed.
before this PR only a single audit log could be read.
/assign @tkashem
@p0lyn0mial: This pull request explicitly references no jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
}
o.Expect(text).To(o.HavePrefix(`{"kind":"Event",`))

if strings.Contains(text, "openshift.io/during-graceful") && strings.Contains(text, "openshift-origin-external-backend-sampler") {
the gathered audit logs (including termination.log) show me
grep -r "through a connection created very late in the graceful termination process" *
2 hits
grep -r "openshift.io/during-graceful"
0 hits
I'm not saying I don't want this. I am noting that something appears off since the audit log isn't showing what you expect.
I'm glad you checked it. It turns out that this was changed between versions 4.13 and 4.14 :(
The annotation was replaced by new filters in this pull request, which added a new annotation https://apiserver.k8s.io/shutdown, but the test wasn't updated. :(
Later on, the new annotation was removed and replaced by sending an HTTP response header in this pull request, and no test was added. (?)
This morning, I also realized that there is a test (API LBs follow /readyz of kube-apiserver and stop sending requests) that checks for the LateConnections event.
The issue is that in the run you linked, the event wasn't persisted in the database because the server had issues connecting to the database! For instance:
[-]etcd failed: error getting data from etcd: context deadline exceeded
[-]etcd-readiness failed: error getting data from etcd: context deadline exceeded
etcd retry - counter: 4, lastErrLabel: Unavailable lastError: etcdserver: request timed out, error: context deadline exceeded
That's why we should not rely solely on events.
I'm fine with replacing this test with the one that is going to read the termination log files. (cc @sanchezl)
The openshift.io/during-graceful annotation was brought back in openshift/kubernetes#2077
the gathered audit logs show me some results (the test didn't fail because it was below the threshold):
grep -r "openshift.io/during-graceful"
./master-0-audit-2024-09-09T14-17-24.568.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"58f32bd7-808e-42f5-a08d-5f731844a2dc","stage":"ResponseComplete","requestURI":"/apis/coordination.k8s.io/v1/namespaces/openshift-marketplace/leases/marketplace-operator-lock","verb":"get","user":{"username":"system:serviceaccount:openshift-marketplace:marketplace-operator","uid":"a8ab4437-6ff2-4947-b718-58c0010fc837","groups":["system:serviceaccounts","system:serviceaccounts:openshift-marketplace","system:authenticated"],"extra":{"authentication.kubernetes.io/credential-id":["JTI=2b6ed9b6-755b-4458-86d7-a80964178dc1"],"authentication.kubernetes.io/node-name":["master-0"],"authentication.kubernetes.io/node-uid":["0ed94f1d-c7c5-4be0-9b82-b1dff79b1ec6"],"authentication.kubernetes.io/pod-name":["marketplace-operator-85c678694c-qpqpf"],"authentication.kubernetes.io/pod-uid":["d9ea82a1-1caf-4657-83a3-7b57e2acd59f"]}},"sourceIPs":["10.128.0.23"],"userAgent":"marketplace-operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"leases","namespace":"openshift-marketplace","name":"marketplace-operator-lock","apiGroup":"coordination.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","message":"context deadline exceeded","code":500},"requestReceivedTimestamp":"2024-09-09T13:26:41.930788Z","stageTimestamp":"2024-09-09T13:27:41.934549Z","annotations":{"apiserver.latency.k8s.io/etcd":"1m0.000435665s","apiserver.latency.k8s.io/response-write":"5.726µs","apiserver.latency.k8s.io/serialize-response-object":"2.481828ms","apiserver.latency.k8s.io/total":"1m0.003761653s","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by RoleBinding \"marketplace-operator/openshift-marketplace\" of Role \"marketplace-operator\" to ServiceAccount \"marketplace-operator/openshift-marketplace\"","openshift.io/during-graceful":"loopback=false,,readyz=false"}}
./master-1-audit-2024-09-09T14-17-59.588.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"775c595d-125c-46a5-b6b2-2ab65a12ffeb","stage":"ResponseComplete","requestURI":"/apis/operator.openshift.io/v1/authentications/cluster/status","verb":"update","user":{"username":"system:serviceaccount:openshift-authentication-operator:authentication-operator","uid":"36d42c5c-bfe9-4fff-8fef-055db4f6df4d","groups":["system:serviceaccounts","system:serviceaccounts:openshift-authentication-operator","system:authenticated"],"extra":{"authentication.kubernetes.io/credential-id":["JTI=ecfaae00-a5d0-4ca3-b3dd-91edcf397651"],"authentication.kubernetes.io/node-name":["master-0"],"authentication.kubernetes.io/node-uid":["0ed94f1d-c7c5-4be0-9b82-b1dff79b1ec6"],"authentication.kubernetes.io/pod-name":["authentication-operator-7df766656f-g7c6p"],"authentication.kubernetes.io/pod-uid":["b19ee6fa-872f-4c58-bad2-f8a897a165a6"]}},"sourceIPs":["192.168.111.20"],"userAgent":"authentication-operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"authentications","name":"cluster","uid":"c6b81673-1467-411f-a9d2-372101ab73fc","apiGroup":"operator.openshift.io","apiVersion":"v1","resourceVersion":"20474","subresource":"status"},"responseStatus":{"metadata":{},"status":"Failure","message":"Timeout: request did not complete within requested timeout - context deadline exceeded","reason":"Timeout","details":{},"code":504},"requestReceivedTimestamp":"2024-09-09T13:34:48.835708Z","stageTimestamp":"2024-09-09T13:35:22.851223Z","annotations":{"apiserver.latency.k8s.io/etcd":"33.090109659s","apiserver.latency.k8s.io/response-write":"2.871µs","apiserver.latency.k8s.io/serialize-response-object":"141.511µs","apiserver.latency.k8s.io/total":"34.015514604s","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:openshift:operator:authentication\" of ClusterRole \"cluster-admin\" to ServiceAccount \"authentication-operator/openshift-authentication-operator\"","openshift.io/during-graceful":"loopback=false,,readyz=false"}}
./master-1-audit-2024-09-09T14-17-59.588.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"f155e7ff-b027-44c6-8fdb-6a3c6bd0e2d7","stage":"ResponseComplete","requestURI":"/apis/coordination.k8s.io/v1/namespaces/openshift-marketplace/leases/marketplace-operator-lock","verb":"get","user":{"username":"system:serviceaccount:openshift-marketplace:marketplace-operator","uid":"a8ab4437-6ff2-4947-b718-58c0010fc837","groups":["system:serviceaccounts","system:serviceaccounts:openshift-marketplace","system:authenticated"],"extra":{"authentication.kubernetes.io/credential-id":["JTI=2b6ed9b6-755b-4458-86d7-a80964178dc1"],"authentication.kubernetes.io/node-name":["master-0"],"authentication.kubernetes.io/node-uid":["0ed94f1d-c7c5-4be0-9b82-b1dff79b1ec6"],"authentication.kubernetes.io/pod-name":["marketplace-operator-85c678694c-qpqpf"],"authentication.kubernetes.io/pod-uid":["d9ea82a1-1caf-4657-83a3-7b57e2acd59f"]}},"sourceIPs":["192.168.111.20"],"userAgent":"marketplace-operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"leases","namespace":"openshift-marketplace","name":"marketplace-operator-lock","apiGroup":"coordination.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2024-09-09T13:34:39.919997Z","stageTimestamp":"2024-09-09T13:35:26.540072Z","annotations":{"apiserver.latency.k8s.io/etcd":"44.402019535s","apiserver.latency.k8s.io/response-write":"5.499µs","apiserver.latency.k8s.io/serialize-response-object":"198.405µs","apiserver.latency.k8s.io/total":"46.62007528s","apiserver.latency.k8s.io/transform-response-object":"247ns","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by RoleBinding \"marketplace-operator/openshift-marketplace\" of Role \"marketplace-operator\" to ServiceAccount \"marketplace-operator/openshift-marketplace\"","openshift.io/during-graceful":"loopback=false,,readyz=false"}}
/test all
Please move to a monitortest, since monitor tests are run for all job types and are able to create intervals for consumption by other analysis tools (this cannot do so efficiently).
This PR fixes the existing test, and moving the code would require creating a new test. While it may not fully align with the monitor test framework, having a test in place is better than having no test at all, as it still provides some coverage and ensures the functionality is being validated.
/retitle OCPBUGS-43483: improve test/apiserver/graceful_termination
I think we still want this; the test can be converted into a monitortest in a separate PR.
@p0lyn0mial: This pull request references Jira Issue OCPBUGS-43483, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
/test e2e-metal-ipi-ovn-kube-apiserver-rollout e2e-aws-ovn-kube-apiserver-rollout
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
@p0lyn0mial: The following tests failed, say
Job Failure Risk Analysis for sha: ea43552
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /close |
@p0lyn0mial: An error was encountered getting issue for bug OCPBUGS-43483 on the Jira server at https://issues.redhat.com/. No known errors were detected, please see the full error message for details. Full error message.
No response returned: Get "https://issues.redhat.com/rest/api/2/issue/OCPBUGS-43483": GET https://issues.redhat.com/rest/api/2/issue/OCPBUGS-43483 giving up after 5 attempt(s)
Please contact an administrator to resolve this issue, then request a bug refresh with In response to this:
@openshift-bot: Closed this PR. In response to this:
This PR improves the "API LBs follow /readyz of kube-apiserver and stop sending requests before server shutdowns for external clients" test. In particular:
scanner.Err()
Note that, because the audit logs were not fully processed before this PR, we might start seeing some failures.