OCPBUGS-6661, OCPBUGS-9464: Move mTLS CRL handling into the router, and fix accidental duplication of CRLs #930
Conversation
@rfredette: This pull request references Jira Issue OCPBUGS-9464, which is valid. 3 validations were run on this bug. Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@rfredette: This pull request references Jira Issue OCPBUGS-9464, which is valid. 3 validations were run on this bug. Requesting review from QA contact.
Operator tests failed on unrelated flakes.
Force-pushed from 95eea14 to 67f6a68.
/retest-required
And a reminder to myself: [some of] the tests are failing because we're hard-coding a reference to @rfredette's router image.
- Leave a stub of the CRL controller to clean up any existing configmaps. The stub controller will need to be removed in a future release.
- Use the cluster-wide proxy for CRL downloads when available (see the sketch after this list).
- Add a test with several test cases to exercise CRL management.
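For illustration, a minimal sketch of what a proxy-aware CRL download might look like, assuming the router honors the standard HTTP_PROXY/HTTPS_PROXY/NO_PROXY environment variables. This is not the actual router code, and the `fetchCRL` name is hypothetical:

```go
package crl

import (
	"crypto/x509"
	"io"
	"net/http"
	"time"
)

// fetchCRL downloads and parses a DER-encoded CRL from the given URL.
// http.ProxyFromEnvironment makes the request honor the cluster-wide
// proxy settings when they are injected as environment variables.
func fetchCRL(url string) (*x509.RevocationList, error) {
	client := &http.Client{
		Timeout:   30 * time.Second,
		Transport: &http.Transport{Proxy: http.ProxyFromEnvironment},
	}
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	der, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	// x509.ParseRevocationList expects DER; a PEM-wrapped CRL would need
	// to be decoded first.
	return x509.ParseRevocationList(der)
}
```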
Force-pushed from 91040f9 to 210072a.
Operator suites failed on
This PR has passed aws-ovn-upgrade before, and I don't see any obvious reason it should be failing now.
/retest
/lgtm
@rfredette: all tests passed! Full PR test history. Your PR dashboard.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Miciah.
@rfredette: Jira Issue OCPBUGS-9464: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-9464 has been moved to the MODIFIED state.
/cherry-pick release-4.13
@rfredette: #930 failed to apply on top of branch "release-4.13".
This broke HyperShift. The new behavior of CIO passing its own proxy config down to operands is causing router pods to fail. We are trying to gate CIO so this does not happen: openshift/release#39246. We do have a blocking job on the CI release stream, so it is blocked at the moment.
That is surprising. Are you saying that HyperShift always configures a cluster-wide egress proxy for cluster-ingress-operator? What connection is the router making that is erroneously using the proxy? Is there a bug report?
Not sure, but after this PR we see proxy envvars set in our router pods. Those are the proxy envvars we set for the cluster-ingress-operator on the hosted control plane; they are not meant to be passed down to operands, which run on the hosted cluster nodes. Before this PR, we did not see them.
We configure the ingress-operator with a sidecar that runs konnectivity so that it can communicate with things that run on the cluster nodes (in a different service network), and we use proxy envvars to direct operator traffic through the konnectivity sidecar.
This was merged in openshift#930. In HyperShift, the operator pod runs in a different cluster than the router pod, so they might have different proxy configs. We set out-of-the-box proxy settings for the operator so that it goes through konnectivity. Because of the change introduced in this PR (https://github.com/openshift/cluster-ingress-operator/pull/930/files#diff-03760f03eeac1174e38e1ec5fecadcab4cb38d05e00509f201d7ae64dc1c7c36R1044-R1048), those proxy settings are now being propagated down to the router pod, breaking HyperShift. See an example of how we differentiate in the cluster-network-operator: https://github.com/openshift/hypershift/blob/f03113aec2545aa08f85ab0e2fa0a790f709aacb/control-plane-operator/controllers/hostedcontrolplane/cno/clusternetworkoperator.go#L341-L347 and https://github.com/openshift/cluster-network-operator/blob/7df32ec5213e2abfcff18cd22a63e7f20a5cccdc/pkg/network/cloud_network.go#L98
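For clarity, a hypothetical sketch (not the actual operator code) of the pattern being described: the operator copies its own proxy environment variables into the operand's Deployment, which is wrong on HyperShift because the operator runs on the management side:

```go
package sketch

import (
	"os"

	corev1 "k8s.io/api/core/v1"
)

// proxyEnvFromOperator (hypothetical name) copies the operator's own proxy
// environment variables into env vars for the router Deployment. On a
// standalone cluster the operator and router share one proxy config; on
// HyperShift the operator's env points at the management-side konnectivity
// proxy, which is wrong for a router running on the hosted cluster's nodes.
func proxyEnvFromOperator() []corev1.EnvVar {
	var env []corev1.EnvVar
	for _, name := range []string{"HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"} {
		if value := os.Getenv(name); value != "" {
			env = append(env, corev1.EnvVar{Name: name, Value: value})
		}
	}
	return env
}
```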
I'd suggest we revert that particular change: #937.
Policy required us to open a full revert rather than a partial one; see #938 for details.
According to the design of the cluster-wide egress proxy feature, operators should propagate proxy configuration to their operands so that the operands use the proxy when communicating externally. This PR fixes a critical customer bug, and it follows the design specifications for the cluster-wide egress proxy feature in doing so.

Can you provide more details about what was failing on HyperShift? Specifically, what failed when the proxy was configured on the router pods, what were the specific error messages, and what specific URLs were involved? Is HyperShift not setting …?

If, instead of copying the environment variables, cluster-ingress-operator read the cluster-wide egress proxy (using the injected kubeconfig for the hosted cluster) and injected environment variables based on those values, would that avoid the HyperShift problem (while still configuring the proxy on the operand, as required by the cluster-wide egress proxy design)?
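A minimal sketch of the alternative suggested above, assuming the operator has a client for the hosted cluster's config API; the function name is illustrative:

```go
package sketch

import (
	configv1 "github.com/openshift/api/config/v1"
	corev1 "k8s.io/api/core/v1"
)

// proxyEnvFromClusterProxy derives the operand's proxy env vars from the
// cluster-wide Proxy object (proxies.config.openshift.io "cluster"), read
// via the hosted cluster's kubeconfig, rather than from the operator's own
// environment, so the operand gets the hosted cluster's proxy settings.
func proxyEnvFromClusterProxy(proxy *configv1.Proxy) []corev1.EnvVar {
	return []corev1.EnvVar{
		{Name: "HTTP_PROXY", Value: proxy.Status.HTTPProxy},
		{Name: "HTTPS_PROXY", Value: proxy.Status.HTTPSProxy},
		{Name: "NO_PROXY", Value: proxy.Status.NoProxy},
	}
}
```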
HyperShift makes this assumption obsolete. We need to come up with an updated one.
You can see the router pod log failures here, which prevent the ingress ClusterOperator from going Available.
That sounds reasonable to me. I think this probably deserves a discussion and possibly an enhancement update to make sure we consolidate on the approach for all components that have an operator/operand split. cc @csrwng @sjenning
💥.
In my opinion, it would be better to amend the design before invalidating it, or find a way for HyperShift to meet its requirements without violating existing designs and API abstractions.
The following logs are key. The proxy config must not have the service network in …
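To illustrate why the service network belongs in NO_PROXY, a small sketch using golang.org/x/net/http/httpproxy; all values here are hypothetical:

```go
package main

import (
	"fmt"
	"net/url"

	"golang.org/x/net/http/httpproxy"
)

func main() {
	cfg := &httpproxy.Config{
		HTTPSProxy: "http://proxy.example.com:3128",     // hypothetical egress proxy
		NoProxy:    "172.30.0.0/16,.svc,.cluster.local", // service network CIDR + service DNS suffixes
	}
	proxyFn := cfg.ProxyFunc()
	u, _ := url.Parse("https://kubernetes.default.svc:443")
	proxyURL, _ := proxyFn(u)
	// Prints true: with the service network in NO_PROXY, in-cluster
	// traffic bypasses the proxy instead of failing through it.
	fmt.Println(proxyURL == nil)
}
```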
This PR moves CRL lifecycle management to the router pod to avoid hitting the configmap max size (1MB), and fixes a bug that caused duplicate CRLs to be downloaded.
This PR includes two new e2e tests:

- TestMTLSWithCRLs creates an ingress controller with mTLS enabled and client CA certificates that include CRLs. It verifies that certificates signed by each of the client CAs are accepted, that certificates revoked by each of the CAs are rejected, and that certificates not signed by any of the CAs are also rejected.
- TestCRLUpdate creates an ingress controller with mTLS enabled and client CA certificates that include CRLs. It verifies that the CRLs downloaded on startup match the CRLs specified in the CA bundle, then waits until every CRL has expired once, making sure that each update only refreshes the required CRLs and that every update downloads the correct CRLs (see the revocation-check sketch below).
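For reference, a minimal sketch of the revocation check these tests exercise, using the standard library (Go 1.21+ for RevokedCertificateEntries); `isRevoked` is an illustrative name, not the router's actual implementation:

```go
package crl

import "crypto/x509"

// isRevoked reports whether cert's serial number appears in the given CRL.
// A real implementation would also verify the CRL's signature against the
// issuing CA and check the CRL's NextUpdate time before trusting it.
func isRevoked(crl *x509.RevocationList, cert *x509.Certificate) bool {
	for _, entry := range crl.RevokedCertificateEntries {
		if entry.SerialNumber.Cmp(cert.SerialNumber) == 0 {
			return true
		}
	}
	return false
}
```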