
OCPBUGS-2435: TestRouterCompressionOperation: Nil-pointer fix#843

Merged
openshift-merge-robot merged 1 commit into openshift:master from
Miciah:OCPBUGS-2435-TestRouterCompressionOperation-nil-pointer-fix
Oct 21, 2022

Conversation

@Miciah
Contributor

@Miciah Miciah commented Oct 15, 2022

Fix a nil-pointer dereference in the getHttpHeaders helper for TestRouterCompressionOperation.

Follow-up to #679.

  • test/e2e/router_compression_test.go (getHttpHeaders): Don't dereference the response value from client.Do if it returned a non-nil error value.

Fix a nil-pointer dereference in the getHttpHeaders helper for
TestRouterCompressionOperation.

Follow-up to commit 211b9c1.

This commit fixes OCPBUGS-2435.

https://issues.redhat.com/browse/OCPBUGS-2435

* test/e2e/router_compression_test.go (getHttpHeaders): Don't dereference
the response value from client.Do if it returned a non-nil error value.
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Oct 15, 2022
@openshift-ci-robot
Contributor

@Miciah: This pull request references Jira Issue OCPBUGS-2435, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.12.0) matches configured target version for branch (4.12.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.


In response to this:

Fix a nil-pointer dereference in the getHttpHeaders helper for TestRouterCompressionOperation.

Follow-up to #679.

  • test/e2e/router_compression_test.go (getHttpHeaders): Don't dereference the response value from client.Do if it returned a non-nil error value.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Miciah
Contributor Author

Miciah commented Oct 15, 2022

/test e2e-azure-operator
/test e2e-gcp-operator

@Miciah
Contributor Author

Miciah commented Oct 15, 2022

/test e2e-aws-operator

@Miciah
Contributor Author

Miciah commented Oct 15, 2022

e2e-gcp-operator and e2e-azure-operator both failed because TestUnmanagedDNSToManagedDNSInternalIngressController, which is known to be flaky, failed in both runs. e2e-aws-operator failed because of API issues.

/retest

@Miciah
Contributor Author

Miciah commented Oct 16, 2022

e2e-aws-operator failed because of kube-apiserver problems. This job failure shows an interesting new failure mode that results from #837: The operator could not fetch the ingresses.config.openshift.io/cluster object, and so the operator did not even try to create the default ingresscontroller.
/test e2e-aws-operator

e2e-azure-operator failed and e2e-gcp-operator failed because kube-apiserver and TestUnmanagedDNSToManagedDNSInternalIngressController flaked.
/test e2e-azure-operator
/test e2e-gcp-operator

@Miciah
Contributor Author

Miciah commented Oct 16, 2022

The operator could not fetch the ingresses.config.openshift.io/cluster object

Following up on that here: openshift/installer#6478 (comment)

@frobware
Contributor

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 17, 2022
@openshift-ci
Contributor

openshift-ci bot commented Oct 17, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: frobware

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 17, 2022
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 6d63b3c and 2 for PR HEAD a658ba2 in total

@frobware
Contributor

/retest

@Miciah
Contributor Author

Miciah commented Oct 17, 2022

/hold
No point in retesting when we know AWS installs are broken. We can resume testing once openshift/installer#6490 is merged.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 17, 2022
@candita
Contributor

candita commented Oct 17, 2022

/retest-required

@Miciah
Contributor Author

Miciah commented Oct 17, 2022

/hold cancel
now that openshift/installer#6490 is merged.
/test all
/test e2e-azure-operator
/test e2e-gcp-operator

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 17, 2022
@Miciah
Contributor Author

Miciah commented Oct 17, 2022

e2e-gcp-operator failed on TestUnmanagedDNSToManagedDNSInternalIngressController, which is known to be flaky.
/test e2e-gcp-operator

@Miciah
Contributor Author

Miciah commented Oct 17, 2022

/test all
because #779 merged.
/test e2e-azure-operator
/test e2e-gcp-operator
These keep failing on TestUnmanagedDNSToManagedDNSInternalIngressController and, very often, on kube-apiserver NodeInstallerProgressing failures.

@Miciah
Contributor Author

Miciah commented Oct 18, 2022

/hold
because I don't want this PR to merge without e2e-azure-operator and e2e-gcp-operator passing.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 18, 2022
@Miciah
Contributor Author

Miciah commented Oct 18, 2022

Installer error:

 level=error msg=Error: error reading Route Table (rtb-0841b02723ba42394): couldn't find resource 

/test e2e-aws-ovn

@Miciah
Contributor Author

Miciah commented Oct 18, 2022

Same installer error for e2e-aws-ovn-upgrade.
/test e2e-aws-ovn-upgrade

@frobware
Contributor

/retest-required

@gcs278
Contributor

gcs278 commented Oct 19, 2022

FYI #845 is mostly failing on this E2E error now.

@gcs278
Contributor

gcs278 commented Oct 19, 2022

/retest
Looks like disruption-type issues, OVN issues, and DNSUnmanagedToManagedInternal-type issues fixed by #845, but nothing related to TestRouterCompressionOperation

@Miciah
Contributor Author

Miciah commented Oct 20, 2022

/test all
now that #845 has merged.
/test e2e-azure-operator
/test e2e-gcp-operator

@Miciah
Contributor Author

Miciah commented Oct 20, 2022

Tests passed in e2e-gcp-operator:

INFO[2022-10-20T18:44:03Z] Running step e2e-gcp-operator-test.
INFO[2022-10-20T19:08:13Z] Step e2e-gcp-operator-test succeeded after 24m10s.

Tests passed in e2e-azure-operator:

INFO[2022-10-20T19:07:00Z] Running step e2e-azure-operator-test.
INFO[2022-10-20T19:28:00Z] Step e2e-azure-operator-test succeeded after 21m0s.

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 20, 2022
@Miciah
Contributor Author

Miciah commented Oct 20, 2022

e2e-azure-ovn failed because must-gather failed.
/test e2e-azure-ovn

@Miciah
Contributor Author

Miciah commented Oct 20, 2022

e2e-aws-operator failed because must-gather failed and because kube-apiserver and kube-scheduler were reporting NodeInstallerProgressing.
/test e2e-aws-operator

@Miciah
Contributor Author

Miciah commented Oct 20, 2022

e2e-gcp-ovn-serial failed because disruption/ingress-to-oauth-server connection/new and disruption/ingress-to-console connection/new failed.
/test e2e-tcp-ovn-serial

@openshift-ci
Contributor

openshift-ci bot commented Oct 20, 2022

@Miciah: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test e2e-aws-operator
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-upgrade
  • /test e2e-gcp-ovn-serial
  • /test images
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test e2e-aws-ovn-single-node
  • /test e2e-azure-operator
  • /test e2e-azure-ovn
  • /test e2e-gcp-operator

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-single-node
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-azure-ovn
  • pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-ovn-serial
  • pull-ci-openshift-cluster-ingress-operator-master-images
  • pull-ci-openshift-cluster-ingress-operator-master-unit
  • pull-ci-openshift-cluster-ingress-operator-master-verify

In response to this:

e2e-gcp-ovn-serial failed because disruption/ingress-to-oauth-server connection/new and disruption/ingress-to-console connection/new failed.
/test e2e-tcp-ovn-serial

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Miciah
Contributor Author

Miciah commented Oct 20, 2022

/test e2e-gcp-ovn-serial

@gcs278
Contributor

gcs278 commented Oct 20, 2022

/test e2e-gcp-ovn-serial
Disruption failures again.

@Miciah
Contributor Author

Miciah commented Oct 20, 2022

e2e-aws-operator failed because TestUnmanagedDNSToManagedDNSInternalIngressController got EOFs. Also, kube-apiserver and kube-scheduler were reporting NodeInstallerProgressing.
/test e2e-aws-operator

@Miciah
Contributor Author

Miciah commented Oct 21, 2022

e2e-aws-operator failed because TestUpdateDefaultIngressController failed.

In addition, kube-apiserver reported NodeInstallerProgressing, image-registry reported NodeCADaemonProgressing, and ingress reported DeploymentRollingOut.

The failed deployment rollout for ingress may be related to the following MalscheduledPod event with timestamp 00:41:23: pod/router-default-5d98b8ddb6-zvm96 pod/router-default-68d8559557-lv24z should be one per node, but all were placed on node/ip-10-0-198-248.us-west-1.compute.internal; evicting pod/router-default-5d98b8ddb6-zvm96. The router-default-5d98b8ddb6-zvm96 pod is not in pods.json, but router-default-68d8559557-lv24z is.

I noticed in the kube-apiserver audit logs there is a single eviction request, by ingress-operator, for router-default-5d98b8ddb6-zvm96, as well as two deletes:

ip-10-0-142-11.us-west-1.compute.internal-audit-2022-10-21T00-52-08.826.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"2283a8ef-9132-4345-9980-2530e8d87ee2","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-ingress/pods/router-default-5d98b8ddb6-zvm96","verb":"delete","user":{"username":"system:node:ip-10-0-198-248.us-west-1.compute.internal","groups":["system:nodes","system:authenticated"]},"sourceIPs":["10.0.167.134"],"userAgent":"kubelet/v1.25.2+495fa99 (linux/amd64) kubernetes/e313ec2","objectRef":{"resource":"pods","namespace":"openshift-ingress","name":"router-default-5d98b8ddb6-zvm96","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2022-10-21T00:42:36.176533Z","stageTimestamp":"2022-10-21T00:42:36.196551Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}
ip-10-0-232-241.us-west-1.compute.internal-audit-2022-10-21T00-44-52.075.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"604ec47b-a1bf-4078-8516-4a631a584f7b","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-ingress/pods/router-default-5d98b8ddb6-zvm96/eviction","verb":"create","user":{"username":"system:serviceaccount:openshift-ingress-operator:ingress-operator","uid":"bc667cf5-757c-432f-8cac-a0d2dd6a7b63","groups":["system:serviceaccounts","system:serviceaccounts:openshift-ingress-operator","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["ingress-operator-545ccb9685-pl2sc"],"authentication.kubernetes.io/pod-uid":["590af6d6-06ee-4823-a35d-4e42fa0db3d1"]}},"sourceIPs":["10.0.142.11"],"userAgent":"ingress-operator/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"pods","namespace":"openshift-ingress","name":"router-default-5d98b8ddb6-zvm96","apiVersion":"v1","subresource":"eviction"},"responseStatus":{"metadata":{},"status":"Success","code":201},"requestReceivedTimestamp":"2022-10-21T00:41:23.091911Z","stageTimestamp":"2022-10-21T00:41:23.166917Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"openshift-ingress-operator\" of ClusterRole \"openshift-ingress-operator\" to ServiceAccount \"ingress-operator/openshift-ingress-operator\""}}
ip-10-0-232-241.us-west-1.compute.internal-audit-2022-10-21T00-44-52.075.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"0e1c2602-832f-47dc-a544-0a60be7ba190","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-ingress/pods/router-default-5d98b8ddb6-zvm96","verb":"delete","user":{"username":"system:serviceaccount:kube-system:replicaset-controller","uid":"240367d2-d201-4e96-bcea-c3002b1ba9fd","groups":["system:serviceaccounts","system:serviceaccounts:kube-system","system:authenticated"]},"sourceIPs":["10.0.167.134"],"userAgent":"kube-controller-manager/v1.25.0 (linux/amd64) kubernetes/4bd0702/system:serviceaccount:kube-system:replicaset-controller","objectRef":{"resource":"pods","namespace":"openshift-ingress","name":"router-default-5d98b8ddb6-zvm96","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2022-10-21T00:41:23.153632Z","stageTimestamp":"2022-10-21T00:41:23.180243Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:controller:replicaset-controller\" of ClusterRole \"system:controller:replicaset-controller\" to ServiceAccount \"replicaset-controller/kube-system\""}}

Subsequently, machine-config-controller tried and failed over a hundred times to evict the router-default-68d8559557-lv24z pod. For example:

ip-10-0-142-11.us-west-1.compute.internal-audit-2022-10-21T00-52-08.826.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"8634515a-8fcb-40ea-8ac8-f12c3d218f42","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-ingress/pods/router-default-68d8559557-lv24z/eviction","verb":"create","user":{"username":"system:serviceaccount:openshift-machine-config-operator:machine-config-controller","uid":"d94da2fa-39a4-4aea-8d48-e18eee481602","groups":["system:serviceaccounts","system:serviceaccounts:openshift-machine-config-operator","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["machine-config-controller-7cf476fb49-ll984"],"authentication.kubernetes.io/pod-uid":["f25ace7b-9298-47c0-bf09-39a70aee017d"]}},"sourceIPs":["10.129.0.33"],"userAgent":"machine-config-controller/v0.0.0 (linux/amd64) kubernetes/$Format/node-update-controller","objectRef":{"resource":"pods","namespace":"openshift-ingress","name":"router-default-68d8559557-lv24z","apiVersion":"v1","subresource":"eviction"},"responseStatus":{"metadata":{},"status":"Failure","message":"Cannot evict pod as it would violate the pod's disruption budget.","reason":"TooManyRequests","details":{"causes":[{"reason":"DisruptionBudget","message":"The disruption budget router-default needs 1 healthy pods and has 1 currently"}]},"code":429},"requestReceivedTimestamp":"2022-10-21T00:47:02.273260Z","stageTimestamp":"2022-10-21T00:47:02.280643Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"machine-config-controller\" of ClusterRole \"machine-config-controller\" to ServiceAccount \"machine-config-controller/openshift-machine-config-operator\""}}

I wonder whether the ingress operator's eviction broke the subsequent rollout. This may warrant some investigation.

Anyway, that seems to be a one-off failure, so I'll assume it isn't caused by the changes in this PR.
/test e2e-aws-operator

@Miciah
Contributor Author

Miciah commented Oct 21, 2022

e2e-aws-operator failed because kube-apiserver reported NodeInstallerProgressing and because the TestUnmanagedDNSToManagedDNSInternalIngressController test failed with EOFs.
/test e2e-aws-operator

@Miciah
Contributor Author

Miciah commented Oct 21, 2022

=== RUN   TestAll/serial/TestRouterCompressionOperation
    router_compression_test.go:178: failed to observe deployment completion: failed to observe deployment rollout complete; deployment specifies 2 replicas and has generation 14; last observed 2 updated, 0 available, 3 total replicas, with observed generation 14

/test e2e-aws-operator

@Miciah
Contributor Author

Miciah commented Oct 21, 2022

Seems like e2e-aws-operator is consistently failing, for various reasons unrelated to this PR. Since e2e-azure-operator and e2e-gcp-operator passed (#843 (comment)), only a few of the e2e-aws-operator job failures are from failures in our E2E tests, and none of the failures are attributable to changes in this PR, I'm going to override e2e-aws-operator so we can at least eliminate one cause for test failures for other PRs.
/override ci/prow/e2e-aws-operator

@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2022

@Miciah: Overrode contexts on behalf of Miciah: ci/prow/e2e-aws-operator


In response to this:

Seems like e2e-aws-operator is consistently failing, for various reasons unrelated to this PR. Since e2e-azure-operator and e2e-gcp-operator passed (#843 (comment)), only a few of the e2e-aws-operator job failures are from failures in our E2E tests, and none of the failures are attributable to changes in this PR, I'm going to override e2e-aws-operator so we can at least eliminate one cause for test failures for other PRs.
/override ci/prow/e2e-aws-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2022

@Miciah: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 9742afc into openshift:master Oct 21, 2022
@openshift-ci-robot
Contributor

@Miciah: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-2435 has been moved to the MODIFIED state.


In response to this:

Fix a nil-pointer dereference in the getHttpHeaders helper for TestRouterCompressionOperation.

Follow-up to #679.

  • test/e2e/router_compression_test.go (getHttpHeaders): Don't dereference the response value from client.Do if it returned a non-nil error value.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
