
Conversation

@wking
Member

@wking wking commented Apr 17, 2023

Sometimes folks point their ClusterVersion spec.upstream at a local OpenShift Update Service running in the same cluster.

a. Our docs recommend using the UpdateService status.policyEngineURI in the ClusterVersion spec.upstream.

b. The cluster-version operator loads Proxy information, as described here. Before this commit, it assumed that the set of trusted CAs in the trusted-ca-bundle ConfigMap was the complete trust set, both for requests through the proxy and requests to no-proxy endpoints.

c. The network operator's Proxy controller combined any user-requested trustedCA content with the system default trust bundle, but this did not include the current Ingress wildcard certificate CA.

d. From our docs:

The Ingress Operator generates a default certificate for an Ingress Controller to serve as a placeholder until you configure a custom default certificate. Do not use Operator-generated default certificates in production clusters.

When cluster admins:

  • Neglect the docs-recommended trusted Ingress certificate for (d), e.g. because they are testing in a non-production channel, and
  • Use the docs-recommended status.policyEngineURI for (a),

neither (c) nor (d) would recover for them, and the Cincinnati request would fail on missing TLS/X.509 trust.
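To make the trust assumption in (b) concrete, here is a generic Go illustration (not the cluster-version operator's actual code; the newClient name is made up): once RootCAs is populated only from a ConfigMap-style PEM bundle, nothing outside that bundle is trusted, so the operator-generated Ingress certificate is rejected unless its CA was explicitly added.

  package main

  import (
  	"crypto/tls"
  	"crypto/x509"
  	"net/http"
  )

  // newClient mirrors the pre-commit assumption: a configured PEM bundle is
  // treated as the complete trust set, excluding both the system roots and
  // the Ingress wildcard CA.
  func newClient(pemBundle []byte) *http.Client {
  	tlsConfig := &tls.Config{}
  	if len(pemBundle) > 0 {
  		pool := x509.NewCertPool() // empty pool; system roots are not included
  		pool.AppendCertsFromPEM(pemBundle)
  		tlsConfig.RootCAs = pool
  	}
  	return &http.Client{Transport: &http.Transport{TLSClientConfig: tlsConfig}}
  }

  func main() {
  	_ = newClient(nil) // with no bundle, Go falls back to the system trust store
  }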

It seems like this would be a generic issue that could be sorted at the network operator's (c), for all Proxy-consuming resources that might want to reach back into the cluster via the Ingress router. But as a cheap hack until something like that happens, users can use:

http://${POLICY_ENGINE_SERVICE_NAME}.${NAMESPACE}.svc.cluster.local/api/upgrades_info/graph

to bypass TLS. Or, if they'd rather use encrypted communication, this commit allows them to use:

https://${POLICY_ENGINE_SERVICE_NAME}.${NAMESPACE}.svc.cluster.local/api/upgrades_info/graph
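For example, assuming an UpdateService whose policy-engine Service is named sample-policy-engine in an openshift-update-service namespace (both names are hypothetical; substitute your own), pointing ClusterVersion at the in-cluster Service could look like:

  # hypothetical names; adjust to your UpdateService's policy-engine Service and namespace
  POLICY_ENGINE_SERVICE_NAME=sample-policy-engine
  NAMESPACE=openshift-update-service
  oc patch clusterversion version --type merge --patch \
    "{\"spec\":{\"upstream\":\"https://${POLICY_ENGINE_SERVICE_NAME}.${NAMESPACE}.svc.cluster.local/api/upgrades_info/graph\"}}"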

This commit breaks a few layers of abstraction:

  • Hard-coding .cluster.local as a known no-proxy path, although that's formally documented here.
  • Neglecting similar handling for release signatures, because getTransport -> HTTPClient -> loadConfigMapVerifierDataFromUpdate doesn't currently allow for "but if the target URI is noProxy, skip the TLS customization, and fall back to default trust stores".
  • Requiring consumers to reach around status.policyEngineURI and talk directly to the underlying local Service.

But it's only a few lines, even if it's incomplete, awkward, and brittle. Folks concerned about any of those limitations can advocate for changes to the network operator's (c) so there's a generic decision around how to handle this kind of problem. Or they can explain why the network operator's current handling is appropriate, and only the cluster-version operator needs Ingress CA injection (and if so, whether that should happen only for Cincinnati requests, or also for signature requests).
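For reference, the substance of the change is a hostname-suffix check before the Cincinnati request is made; a minimal, self-contained sketch (the cincinnatiTransport name and the surrounding setup are illustrative, and only the suffix check reflects the actual diff):

  package main

  import (
  	"crypto/tls"
  	"net/http"
  	"net/url"
  	"strings"
  )

  // cincinnatiTransport drops the Proxy-derived TLS customization when the
  // upstream is an in-cluster *.cluster.local Service, falling back to the
  // default trust stores for that request.
  func cincinnatiTransport(upstream string, tlsConfig *tls.Config) (*http.Transport, error) {
  	upstreamURI, err := url.Parse(upstream)
  	if err != nil {
  		return nil, err
  	}
  	transport := &http.Transport{TLSClientConfig: tlsConfig}
  	if strings.HasSuffix(upstreamURI.Hostname(), ".cluster.local") {
  		transport.TLSClientConfig = nil
  	}
  	return transport, nil
  }

  func main() {
  	_, _ = cincinnatiTransport("https://sample-policy-engine.openshift-update-service.svc.cluster.local/api/upgrades_info/graph", &tls.Config{})
  }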

…nati

Sometimes folks point their ClusterVersion spec.upstream at a local
OpenShift Update Service running in the same cluster.

a. Our docs recommend using the UpdateService status.policyEngineURI
   as the ClusterVersion spec.upstream [1].

b. The cluster-version operator loads Proxy information, as described
   in [2].  Before this commit, it assumed that the set of trusted CAs
   in the trusted-ca-bundle ConfigMap was the complete trust set, both
   for requests through the proxy and requests to no-proxy endpoints.

c. The network operator's Proxy controller combined any user-requested
   trustedCA content with the system default trust bundle [3], but
   this did not include the current Ingress wildcard certificate CA.

d. From our docs [4,5]:

     The Ingress Operator generates a default certificate for an
     Ingress Controller to serve as a placeholder until you configure
     a custom default certificate. Do not use Operator-generated
     default certificates in production clusters.

When cluster admins:

* Neglect the docs-recommended trusted Ingress certificate for (d),
  e.g. because they are testing in a non-production channel, and
* Use the docs-recommended status.policyEngineURI for (a),

neither (c) nor (d) would recover for them, and the Cincinnati request
would fail on missing TLS/X.509 trust [6].

It seems like this would be a generic issue that could be sorted at
the network operator's (c), for all Proxy-consuming resources that
might want to reach back into the cluster via the Ingress router.  But
as a cheap hack until something like that happens, users can use:

  http://${POLICY_ENGINE_SERVICE_NAME}.${NAMESPACE}.svc.cluster.local/api/upgrades_info/graph

to bypass TLS.  Or, if they'd rather use encrypted communication, this
commit allows them to use:

  https://${POLICY_ENGINE_SERVICE_NAME}.${NAMESPACE}.svc.cluster.local/api/upgrades_info/graph

This commit breaks a few layers of abstraction:

* Hard-coding .cluster.local as a known no-proxy path, although that's
  formally documented in [3].
* Neglecting similar handling for release signatures, because
  getTransport -> HTTPClient -> loadConfigMapVerifierDataFromUpdate
  doesn't currently allow for "but if the target URI is noProxy, skip
  the TLS customization, and fall back to default trust stores".
* Requiring consumers to reach around status.policyEngineURI and talk
  directly to the underlying local Service.

But it's only a few lines, even if it's incomplete, awkward, and
brittle.  Folks concerned about any of those limitations can advocate
for changes to the network operator's (c) so there's a generic
decision around how to handle this kind of problem.  Or they can
explain why the network operator's current handling is appropriate,
and only the cluster-version operator needs Ingress CA injection (and
if so, whether that should happen only for Cincinnati requests, or
also for signature requests).

[1]: https://docs.openshift.com/container-platform/4.12/updating/updating-restricted-network-cluster/restricted-network-update-osus.html#update-service-create-service-cli_updating-restricted-network-cluster-osus
[2]: https://github.com/openshift/enhancements/blob/6b3209fa18ab3161429743550eed36391efc785f/enhancements/proxy/global-cluster-egress-proxy.md
[3]: https://github.com/openshift/api/blob/1b2161d23365fb5918167b2ba73e90ff80ca1805/config/v1/types_proxy.go#L50-L58
[4]: https://docs.openshift.com/container-platform/4.11/security/certificate_types_descriptions/ingress-certificates.html#location
[5]: https://github.com/openshift/openshift-docs/blame/53aa6335eb28cdc9ac0888b05002b374342f7b1e/security/certificate_types_descriptions/ingress-certificates.adoc#L24
[6]: https://issues.redhat.com/browse/OCPBUGS-1694
@openshift-ci-robot
Contributor

@wking: Jira Issue OCPBUGS-1694 is in a security level that is not in the allowed security levels for this repo.
Allowed security levels for this repo are:

  • Red Hat Employee
  • default

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Contributor

openshift-ci bot commented Apr 17, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 17, 2023
@LalatenduMohanty
Member

/assign @LalatenduMohanty

@LalatenduMohanty
Member

/hold Adding a hold because I want to review it before it is merged

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 18, 2023
@petr-muller
Member

Code is clear but I need to think a bit about the impact ;)

@DavidHurta
Contributor

/cc

@openshift-ci openshift-ci bot requested a review from DavidHurta April 19, 2023 17:41
@DavidHurta
Contributor

/hold I want to double check a few things.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 31, 2023
@LalatenduMohanty
Member

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 1, 2023
@LalatenduMohanty
Member

lifecycle frozen

@petr-muller
Member

/lifecycle frozen

@openshift-ci
Contributor

openshift-ci bot commented Aug 1, 2023

@petr-muller: The lifecycle/frozen label cannot be applied to Pull Requests.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Comment on lines +237 to +240
if strings.HasSuffix(upstreamURI.Hostname(), ".cluster.local") {
transport.TLSClientConfig = nil
}

Contributor


Instead of this "hack", we could use the default-ingress-cert ConfigMap from the openshift-config-managed namespace. For more information, here's the enhancement describing the ConfigMap (for even more curious folks, the Ingress certificates documentation contains some general information and workflow).

From the enhancement:

The ingress operator publishes the default certificate of the default IngressController in a ConfigMap for other operators to consume. This ConfigMap is named default-ingress-cert and exists in the openshift-config-managed namespace. The intended consumers are other operators that need to incorporate the default certificate into their trust bundles in order to connect to Route resources.

Regarding a potential fix to our bug:

The standard way to validate a certificate is to verify that the certificate is signed by a trusted CA certificate. Consumers therefore may expect the default-ingress-cert ConfigMap to include the CA certificate that signed the default certificate rather than the default certificate itself.

For Go-based clients, this is not a problem as the Go TLS implementation has looser certificate validation that can be satisfied by configuring the certificate itself in the trusted certificates pool. As the ConfigMap is not intended to be used outside of OpenShift's own operators, which are Go-based, publishing the certificate itself should not pose a problem. Furthermore, the default-ingress-cert ConfigMap is an internal API, and to the extent that we document it at all, we should document that it has the default certificate, not the signing CA certificate.

So maybe we can configure the func (optr *Operator) getTLSConfig() (*tls.Config, error) method to include the default ingress CA bundle in the root CAs.

Something like this could be added to the method (not tested):

	cm, err = optr.cmConfigManagedLister.Get("default-ingress-cert")
	if apierrors.IsNotFound(err) {
		return nil, nil
	}
	if err != nil {
		return nil, err
	}

	if cm.Data["ca-bundle.crt"] != "" {
		if ok := certPool.AppendCertsFromPEM([]byte(cm.Data["ca-bundle.crt"])); !ok {
			return nil, fmt.Errorf("unable to add ca-bundle.crt certificates")
		}
	} else {
		return nil, nil
	}

Testing the idea out using curl seems to work.

$ # setup env variables
$ NAMESPACE=openshift-update-service
$ NAME=sample
$ POLICY_ENGINE_GRAPH_URI="$(oc -n "${NAMESPACE}" get -o jsonpath='{.status.policyEngineURI}/api/upgrades_info/v1/graph{"\n"}' updateservice "${NAME}")"
$
$ # extract the default ingress CA bundle to the local machine
$ oc extract -n openshift-config-managed configmap/default-ingress-cert 
$
$ # copy the default ingress's CA bundle to the CVO container
$ oc cp ca-bundle.crt "openshift-cluster-version/$(oc get po -n openshift-cluster-version -o=name | sed "s/pod\///g"):/tmp/default-ingress-cert-ca-bundle.crt"
$
$ # running curl without specifying a CA bundle - self signed certificate in certificate chain error
$ oc exec -n openshift-cluster-version deployments/cluster-version-operator -- curl   "$POLICY_ENGINE_GRAPH_URI?channel=stable-4.10" 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (60) SSL certificate problem: self signed certificate in certificate chain
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
command terminated with exit code 60
$ 
$ # running curl using the default ingress CA bundle - success
$ oc exec -n openshift-cluster-version deployments/cluster-version-operator -- curl --cacert /tmp/default-ingress-cert-ca-bundle.crt  "$POLICY_ENGINE_GRAPH_URI?channel=stable-4.10" 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0{"version":1,"nodes":[{"version":"4.9.28","payload":"quay.io/openshift-release-dev/ocp-release@sha256:4084d94969b186e20189649b5affba7da59f7d1943e4e5bc7ef78b981eafb7a8","metadata":{"io.openshift.upgrades.graph.release.channels":"candidate-4.10,fast-4.10,stable-4.10,candidate-4.9,fast-4.9,stable-4.9","io.openshift.upgrades.graph.release.manifestref":"sha256:4084d94969b186e20189649b5affba7da59f7d1943e4e5bc7ef78b981eafb7a8","url":"https://access.redhat.com/errata/RHBA-2022:1245"}},{"version":"4.10.20","payload":"quay.io/openshift-release-dev/ocp-release@sha256:b89ada9261a1b257012469e90d7d4839d0d2f99654f5ce76394fa3f06522b600","metadata":{"io.openshift.upgrades.graph.release.channels":"candidate-4.10,eus-4.10,fast-4.10,stable-4.10,candidate-4.11,fast-4.11,stable-4.11,eus-4.12","io.openshift.upgrades.graph.release.manifestref":"sha256:b89ada9261a1b257012469e90d7d4839d0d2f99654f5ce76394fa3f06522b600","url":"https://access.redhat.com/errata/RHBA-2022:5172"}},{"version":"4.9.21","payload":"quay.io/openshift-release-dev/ocp-release@sha256:fd96300600f9585e5847f5855ca14e2b3cafbce12aefe3b3f52c5da10c4476eb","metadata":{"io.openshift.upgrades.graph.previous.remove_regex":"4\\.8\\..*","io.openshift.upgrades.graph.release.channels":"candidate-4.10,fast-4.10,stable-4.10,candidate-4.9...

Worth pointing out from the commit message:

It seems like this would be a generic issue that could be sorted at the network operator's (c), for all Proxy-consuming resources who might want to reach back into the cluster via the Ingress router. But as a cheap hack until something like that happens, users can use:

http://${POLICY_ENGINE_SERVICE_NAME}.${NAMESPACE}.svc.cluster.local/api/upgrades_info/graph

to bypass TLS. Or, if they'd rather use encrypted communication, this commit allows them to use:

 https://${POLICY_ENGINE_SERVICE_NAME}.${NAMESPACE}.svc.cluster.local/api/upgrades_info/graph

We have lost the ability to resolve service DNS names in #920. Unfortunately, this probably wouldn't work now 😢.

$ oc exec -n openshift-cluster-version deployments/cluster-version-operator -- curl   "https://sample-policy-engine.openshift-update-service.svc.cluster.local:80/api/upgrades_info/graph?channel=stable-4.10" 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: sample-policy-engine.openshift-update-service.svc.cluster.local
command terminated with exit code 6

@openshift-ci
Contributor

openshift-ci bot commented Sep 26, 2023

@wking: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 26, 2023
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 25, 2024
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Feb 25, 2024
@openshift-ci
Contributor

openshift-ci bot commented Feb 25, 2024

@openshift-bot: Closed this PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
