
OCPBUGS-6661, OCPBUGS-9464: Move mTLS CRL handling into the router, and fix accidental duplication of CRLs#930

Merged
openshift-merge-robot merged 2 commits into openshift:master from rfredette:ocpbugs-9464-duplicate-crls
May 23, 2023

Conversation

@rfredette
Contributor

@rfredette rfredette commented May 11, 2023

This PR moves CRL lifecycle management to the router pod to avoid hitting the configmap max size (1MB), and fixes a bug that caused duplicate CRLs to be downloaded.

This PR includes 2 new e2e tests:

  • TestMTLSWithCRLs creates an ingress controller with mTLS enabled and client CA certificates that include CRLs. It verifies that certificates that are signed by each of the client CAs are accepted, certificates revoked by each of the CAs are rejected, and certificates not signed by any of the CAs are also rejected.
  • TestCRLUpdate creates an ingress controller with mTLS enabled and client CA certificates that include CRLs. It verifies that the CRLs that are downloaded on startup match the CRLs specified in the CA bundle, and waits until all CRLs have expired once, making sure that each update only updates the required CRLs, and that all updates download the correct CRLs.

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 11, 2023
@openshift-ci-robot
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-9464, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

This PR contains one new test with multiple test cases for the fix in openshift/router#472.

The new test creates an ingress controller with mTLS enabled and client CA certificates that include CRLs. It verifies that the CRLs that are downloaded on startup match the CRLs specified in the CA bundle, and waits until all CRLs have expired once, making sure that each update only updates the required CRLs, and that all updates download the correct CRLs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label May 11, 2023
@openshift-ci openshift-ci bot requested review from alebedev87, gcs278 and lihongan May 11, 2023 04:37
@frobware frobware self-assigned this May 17, 2023
@rfredette rfredette changed the title OCPBUGS-9464: Verify CRLs are updated correctly when they expire OCPBUGS-6661, OCPBUGS-9464: Move mTLS CRL handling into the router, and fix accidental duplication of CRLs May 17, 2023
@rfredette
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. and removed jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. labels May 17, 2023
@openshift-ci-robot
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-9464, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rfredette
Contributor Author

operator tests failed on unrelated flakes.
/retest-required

@rfredette rfredette force-pushed the ocpbugs-9464-duplicate-crls branch from 95eea14 to 67f6a68 Compare May 17, 2023 23:09
@frobware
Contributor

/retest-required

@frobware
Contributor

/retest-required

And a reminder to myself, [some of] the tests are failing because we're hard coding a reference to @rfredette's router image.

rfredette added 2 commits May 19, 2023 15:19
Leave a stub of the CRL controller to clean up any existing configmaps.
The stub controller will need to be removed in a future release

Use cluster-wide proxy for CRL downloads when available

Add a test with several test cases to test CRL management
@rfredette
Contributor Author

Operator suites failed on TestMTLSWithCRLs and TestCRLUpdate, but they were kicked off before the router PR was merged, so likely they didn't have the required router version.
/retest-required

@rfredette
Contributor Author

This PR has passed aws-ovn-upgrade before, and I don't see any obvious reason it should be failing now.
/retest-required

@frobware
Contributor

/retest

@frobware
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 23, 2023
@openshift-ci
Contributor

openshift-ci bot commented May 23, 2023

@rfredette: all tests passed!

Full PR test history. Your PR dashboard.


@Miciah
Contributor

Miciah commented May 23, 2023

/approve

@openshift-ci
Contributor

openshift-ci bot commented May 23, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 23, 2023
@openshift-merge-robot openshift-merge-robot merged commit 108acb3 into openshift:master May 23, 2023
@openshift-ci-robot
Contributor

@rfredette: Jira Issue OCPBUGS-9464: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-9464 has been moved to the MODIFIED state.

@rfredette
Contributor Author

/cherry-pick release-4.13
/cherry-pick release-4.12
/cherry-pick release-4.11

@openshift-cherrypick-robot

@rfredette: #930 failed to apply on top of branch "release-4.13":

Applying: Remove CRL management code from ingress operator
Using index info to reconstruct a base tree...
M	go.mod
M	go.sum
M	vendor/modules.txt
Falling back to patching base and 3-way merge...
Auto-merging vendor/modules.txt
Removing vendor/github.com/operator-framework/api/pkg/operators/zz_generated.deepcopy.go
Removing vendor/github.com/operator-framework/api/pkg/operators/subscription_types.go
Removing vendor/github.com/operator-framework/api/pkg/operators/operatorgroup_types.go
Removing vendor/github.com/operator-framework/api/pkg/operators/installplan_types.go
Removing vendor/github.com/operator-framework/api/pkg/operators/clusterserviceversion_types.go
Removing vendor/github.com/operator-framework/api/pkg/operators/catalogsource_types.go
Removing vendor/github.com/blang/semver/sql.go
Removing vendor/github.com/blang/semver/sort.go
Removing vendor/github.com/blang/semver/semver.go
Removing vendor/github.com/blang/semver/range.go
Removing vendor/github.com/blang/semver/package.json
Removing vendor/github.com/blang/semver/json.go
Removing vendor/github.com/blang/semver/README.md
Removing vendor/github.com/blang/semver/LICENSE
Removing vendor/github.com/blang/semver/.travis.yml
Auto-merging go.sum
Auto-merging go.mod
CONFLICT (content): Merge conflict in go.mod
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Remove CRL management code from ingress operator
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

@sjenning
Contributor

sjenning commented May 23, 2023

This broke hypershift

https://github.com/openshift/cluster-ingress-operator/pull/930/files#diff-03760f03eeac1174e38e1ec5fecadcab4cb38d05e00509f201d7ae64dc1c7c36R1044-R1048

New behavior: CIO passes its own proxy config down to operands, causing router pods to fail.

We are trying to gate CIO so this doesn't happen: openshift/release#39246

We do have a blocking job on the CI release stream, so it is blocked at the moment.

@Miciah
Contributor

Miciah commented May 23, 2023

This broke hypershift

https://github.com/openshift/cluster-ingress-operator/pull/930/files#diff-03760f03eeac1174e38e1ec5fecadcab4cb38d05e00509f201d7ae64dc1c7c36R1044-R1048

New behavior: CIO passes its own proxy config down to operands, causing router pods to fail.

That is surprising. Are you saying that HyperShift always configures a cluster-wide egress proxy for cluster-ingress-operator? What connection is the router making that is erroneously using the proxy? Is there a bug report?

@sjenning
Contributor

sjenning commented May 24, 2023

Are you saying that HyperShift always configures a cluster-wide egress proxy for cluster-ingress-operator?

We configure the ingress-operator with a sidecar that runs konnectivity so that it can communicate with things that run on the cluster nodes (in a different service network) and use proxy envvars to direct operator traffic through the konnectivity sidecar.

enxebre added a commit to enxebre/cluster-ingress-operator that referenced this pull request May 24, 2023
This was merged in openshift#930. In HyperShift, the operator pod runs in a different cluster than the router pod, so the two might have different proxy configs.
We set out-of-the-box proxy settings for the operator so that its traffic goes through konnectivity.
I think that because of the change introduced in this PR, https://github.com/openshift/cluster-ingress-operator/pull/930/files#diff-03760f03eeac1174e38e1ec5fecadcab4cb38d05e00509f201d7ae64dc1c7c36R1044-R1048,
those proxy settings are now being propagated down to the router pod, which breaks HyperShift.

See an example of how we differentiate in the cluster network operator: https://github.com/openshift/hypershift/blob/f03113aec2545aa08f85ab0e2fa0a790f709aacb/control-plane-operator/controllers/hostedcontrolplane/cno/clusternetworkoperator.go#L341-L347

https://github.com/openshift/cluster-network-operator/blob/7df32ec5213e2abfcff18cd22a63e7f20a5cccdc/pkg/network/cloud_network.go#L98
@enxebre
Member

enxebre commented May 24, 2023

That is surprising. Are you saying that HyperShift always configures a cluster-wide egress proxy for cluster-ingress-operator? What connection is the router making that is erroneously using the proxy? Is there a bug report?

In HyperShift, the operator pod runs in a different cluster than the router pod, so the two might have different proxy configs.
We set out-of-the-box proxy settings for the operator so that its traffic goes through konnectivity.
I think that because of the change introduced in this PR, https://github.com/openshift/cluster-ingress-operator/pull/930/files#diff-03760f03eeac1174e38e1ec5fecadcab4cb38d05e00509f201d7ae64dc1c7c36R1044-R1048,
those proxy settings are now being propagated down to the router pod, which breaks HyperShift.

See an example of how we differentiate in the cluster network operator: https://github.com/openshift/hypershift/blob/f03113aec2545aa08f85ab0e2fa0a790f709aacb/control-plane-operator/controllers/hostedcontrolplane/cno/clusternetworkoperator.go#L341-L347

https://github.com/openshift/cluster-network-operator/blob/7df32ec5213e2abfcff18cd22a63e7f20a5cccdc/pkg/network/cloud_network.go#L98

@enxebre
Member

enxebre commented May 24, 2023

I'd suggest we revert that particular change #937.
Then we merge openshift/release#39246 and gate on a green run.
Then we re-introduce the proxy change with a test driven safety check.

@dgoodwin
Contributor

Policy required us to open a full revert rather than partial, see #938 for details.

@Miciah
Contributor

Miciah commented May 24, 2023

Those are the proxy envvars we set for the cluster-ingress-operator on the hosted control plane, not meant to be passed down to operands, which run on the hosted cluster nodes.

According to the design of the cluster-wide egress proxy feature, operators should propagate proxy configuration to their operands so that the operands use the proxy when communicating externally. This PR fixes a critical customer bug, and it follows the design specifications for the cluster-wide egress proxy feature in doing so.

Can you provide more details about what was failing on HyperShift? Specifically, what failed when the proxy was configured on the router pods, what were the specific error messages, and what specific URLs were involved? Is HyperShift not setting NO_PROXY to an appropriate value?

If instead of copying the environment variables, cluster-ingress-operator read the cluster-wide egress proxy (using the injected kubeconfig for the hosted cluster) and injected environment variables based on those values, would that avoid the HyperShift problem (while still configuring the proxy on the operand as required by the cluster-wide egress proxy design)?
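A minimal Go sketch of the alternative proposed here, deriving the operand's proxy environment variables from the cluster-wide proxy values rather than from the operator's own environment. The `proxyStatus` and `envVar` types and the `proxyEnvFor` helper are hypothetical stand-ins, not the operator's actual code; the three fields are assumed to correspond to the `httpProxy`, `httpsProxy`, and `noProxy` values of the cluster proxy status.

```go
package main

import "fmt"

// proxyStatus is a hypothetical local stand-in for the cluster-wide proxy
// values (assumed fields: httpProxy, httpsProxy, noProxy).
type proxyStatus struct {
	HTTPProxy  string
	HTTPSProxy string
	NoProxy    string
}

// envVar is a hypothetical stand-in for a container environment variable.
type envVar struct {
	Name  string
	Value string
}

// proxyEnvFor builds the operand's proxy environment from the cluster-wide
// proxy values only. It never reads the operator process's own HTTP_PROXY,
// HTTPS_PROXY, or NO_PROXY, which on HyperShift point at the konnectivity
// sidecar.
func proxyEnvFor(status proxyStatus) []envVar {
	var env []envVar
	add := func(name, value string) {
		if value == "" {
			return
		}
		env = append(env, envVar{Name: name, Value: value})
	}
	add("HTTP_PROXY", status.HTTPProxy)
	add("HTTPS_PROXY", status.HTTPSProxy)
	// Assumption of this sketch: NO_PROXY is only meaningful when a proxy is set.
	if len(env) > 0 {
		add("NO_PROXY", status.NoProxy)
	}
	return env
}

func main() {
	// No cluster-wide proxy configured: the operand gets no proxy variables.
	fmt.Println(len(proxyEnvFor(proxyStatus{})))

	// Cluster-wide proxy configured: the operand gets those values.
	for _, e := range proxyEnvFor(proxyStatus{
		HTTPSProxy: "http://proxy.example:3128",
		NoProxy:    ".cluster.local,.svc",
	}) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

Under these assumptions, a hosted cluster with no cluster-wide proxy would leave the router pod's environment untouched, regardless of what proxy variables the operator pod itself runs with.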

@enxebre
Member

enxebre commented May 24, 2023

According to the design of the cluster-wide egress proxy feature, operators should propagate proxy configuration to their operands so that the operands use the proxy when communicating externally

Hypershift makes this assumption obsolete. We need to come up with an updated one.

Can you provide more details about what was failing on HyperShift? Specifically, what failed when the proxy was configured on the router pods, what were the specific error messages, and what specific URLs were involved? Is HyperShift not setting NO_PROXY to an appropriate value?

You can see router pod log failures here which prevent the ingress CO from going available
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_hypershift/2595/pull-ci-openshift-hypershift-main-e2e-aws/1661091472633499648/artifacts/e2e-aws/run-e2e/artifacts/TestCreateCluster_PreTeardownClusterDump/hostedcluster-example-qpj62/namespaces/openshift-ingress/pods/router-default-794cf955c9-fjdxw/router/router/logs/current.log

If instead of copying the environment variables, cluster-ingress-operator read the cluster-wide egress proxy (using the injected kubeconfig for the hosted cluster) and injected environment variables based on those values, would that avoid the HyperShift problem (while still configuring the proxy on the operand as required by the cluster-wide egress proxy design)?

That sounds reasonable to me. I think this probably deserves a discussion and possibly an enhancement update to make sure we consolidate on the approach for all components that have operator/operands split. cc @csrwng @sjenning

@Miciah
Contributor

Miciah commented May 25, 2023

Hypershift makes this assumption obsolete. We need to come up with an updated one.

💥.

If instead of copying the environment variables, cluster-ingress-operator read the cluster-wide egress proxy (using the injected kubeconfig for the hosted cluster) and injected environment variables based on those values, would that avoid the HyperShift problem (while still configuring the proxy on the operand as required by the cluster-wide egress proxy design)?

That sounds reasonable to me. I think this probably deserves a discussion and possibly an enhancement update to make sure we consolidate on the approach for all components that have operator/operands split. cc @csrwng @sjenning

In my opinion, it would be better to amend the design before invalidating it, or find a way for HyperShift to meet its requirements without violating existing designs and API abstractions.

@Miciah
Contributor

Miciah commented May 25, 2023

You can see router pod log failures here which prevent the ingress CO from going available
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_hypershift/2595/pull-ci-openshift-hypershift-main-e2e-aws/1661091472633499648/artifacts/e2e-aws/run-e2e/artifacts/TestCreateCluster_PreTeardownClusterDump/hostedcluster-example-qpj62/namespaces/openshift-ingress/pods/router-default-794cf955c9-fjdxw/router/router/logs/current.log

The following logs are key:

2023-05-23T20:30:42.394914893Z W0523 20:30:42.394839       1 reflector.go:424] github.com/openshift/router/pkg/router/controller/factory/factory.go:124: failed to list *v1.Route: Get "https://172.31.0.1:443/apis/route.openshift.io/v1/routes?limit=500&resourceVersion=0": proxyconnect tcp: dial tcp 127.0.0.1:8090: connect: connection refused
2023-05-23T20:30:42.394914893Z W0523 20:30:42.394835       1 reflector.go:424] github.com/openshift/router/pkg/router/controller/factory/factory.go:124: failed to list *v1.EndpointSlice: Get "https://172.31.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": proxyconnect tcp: dial tcp 127.0.0.1:8090: connect: connection refused

The proxy config evidently does not have the service network in noProxy, so the router controller attempts to connect to the kubernetes.default.svc address through the proxy. Usually, cluster-network-operator would inject the service network(s) into noProxy: https://github.com/openshift/cluster-network-operator/blob//pkg/util/proxyconfig/no_proxy.go#L78-L80
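To make the failure mode concrete, here is a simplified Go sketch of a noProxy check. The `bypassesProxy` helper is hypothetical and handles only exact hosts, leading-dot domain suffixes, and CIDR entries; it is not the matching logic used by Go's HTTP client or by the router. The `172.31.0.0/16` service network is an assumption chosen to cover the `172.31.0.1` address in the logs above.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// bypassesProxy reports whether host is covered by a comma-separated noProxy
// list. Simplified for illustration: supports exact matches, domain suffixes
// written with a leading dot, and CIDR ranges for IP hosts.
func bypassesProxy(host, noProxy string) bool {
	ip := net.ParseIP(host)
	for _, entry := range strings.Split(noProxy, ",") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		// CIDR entries only apply when the host is an IP address.
		if _, cidr, err := net.ParseCIDR(entry); err == nil {
			if ip != nil && cidr.Contains(ip) {
				return true
			}
			continue
		}
		if entry == host {
			return true
		}
		if strings.HasPrefix(entry, ".") && strings.HasSuffix(host, entry) {
			return true
		}
	}
	return false
}

func main() {
	apiServer := "172.31.0.1" // address seen in the router logs

	// Without the service network in noProxy, the request goes to the proxy.
	fmt.Println(bypassesProxy(apiServer, ".cluster.local,.svc"))

	// With the (assumed) service network CIDR added, it connects directly.
	fmt.Println(bypassesProxy(apiServer, ".cluster.local,.svc,172.31.0.0/16"))
}
```

With the first list the API server address is not exempt, which matches the `proxyconnect tcp: dial tcp 127.0.0.1:8090` errors; once the service network is listed, the same address is exempt.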

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

9 participants