OCPBUGS-9037, OCPBUGS-64565: Use cluster wildcard certificate for ingress canary#1155

Closed
rfredette wants to merge 1 commit into openshift:master from rfredette:ocpbugs-9037-use-cluster-wildcard

@rfredette
Contributor

Utilize the existing ingress controller certificate management controller to also manage the certificate for the ingress canary, and use that certificate when serving the canary endpoint.

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 15, 2024
@openshift-ci-robot
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-9037, which is invalid:

  • expected the bug to target either version "4.18." or "openshift-4.18.", but it targets "4.17.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Utilize the existing ingress controller certificate management controller to also manage the certificate for the ingress canary, and use that certificate when serving the canary endpoint.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from Miciah and knobunc October 15, 2024 15:34
@rfredette rfredette force-pushed the ocpbugs-9037-use-cluster-wildcard branch from 4f7f29d to a7792fa Compare October 15, 2024 16:38
@rfredette
Contributor Author

Test failures appear unrelated.
/test e2e-gcp-operator
/test e2e-hypershift

@Miciah Miciah added priority/backlog Higher priority than priority/awaiting-more-evidence. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Nov 13, 2024
@candita
Contributor

candita commented Nov 20, 2024

/assign @Miciah
/assign

@candita
Contributor

candita commented Nov 20, 2024

/retest-required

@candita candita left a comment

@rfredette I've had these comments pending for some time. Please let me know if I've misunderstood the assignment here, but in some places it looks like we are just reusing the default cert. Was that the plan?

	UID:        daemonset.UID,
	Controller: &trueVar,
}
if _, err := r.ensureDefaultCertificateForIngress(ca, "openshift-ingress-canary", canaryRef, ingress); err != nil {

Doesn't there need to be a different function to ensure a canary cert rather than ensure a default cert? Does this ensure the correct cert?

The intent is that the canary application should use the default IngressController's default certificate. This way, as long as the default IngressController has a correctly configured default certificate, so too will the canary application. Because #978 changed the canary application to use a TLS passthrough route, the only way to have the canary application use the default IngressController's default certificate is to copy that certificate to the canary application's namespace and configure the application to use that copy of the certificate.

If I understand correctly, ensureDefaultCertificateForIngress actually generates a new server certificate, using the existing CA certificate, so this logic doesn't quite implement the intent. We could use r.ensureDefaultCertificateForIngress(ca, ingress.Namespace, ref, ingress) (note the namespace) to get the existing certificate (or create it if it's missing), but it seems simpler just to do a Get from ingress.Namespace and then Create in "openshift-ingress-canary".
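The "Get from one namespace, then Create in the canary namespace" idea can be sketched without any Kubernetes dependency. This is only an illustration under stated assumptions: `Secret` and `Store` are hypothetical in-memory stand-ins for the real secret type and API client, and the namespace and secret names in `main` are illustrative values, not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Secret is a simplified, hypothetical stand-in for a Kubernetes secret.
type Secret struct {
	Namespace, Name string
	Type            string
	Data            map[string][]byte
}

// Store is a hypothetical in-memory stand-in for the API client.
type Store map[string]Secret

// ErrNotFound is returned by Get when the requested secret is absent.
var ErrNotFound = errors.New("not found")

func key(ns, name string) string { return ns + "/" + name }

// IsNotFound reports whether err wraps ErrNotFound.
func IsNotFound(err error) bool { return errors.Is(err, ErrNotFound) }

// Get returns the named secret, or an error wrapping ErrNotFound.
func (s Store) Get(ns, name string) (Secret, error) {
	sec, ok := s[key(ns, name)]
	if !ok {
		return Secret{}, fmt.Errorf("secret %s: %w", key(ns, name), ErrNotFound)
	}
	return sec, nil
}

// CopySecret reads the source secret and writes a copy of its Type and
// Data under a new namespace and name. It returns early if the source
// cannot be read, so no empty copy is ever written.
func CopySecret(s Store, srcNS, srcName, dstNS, dstName string) error {
	src, err := s.Get(srcNS, srcName)
	if err != nil {
		return fmt.Errorf("failed to get certificate for canary: %w", err)
	}
	data := make(map[string][]byte, len(src.Data))
	for k, v := range src.Data {
		data[k] = append([]byte(nil), v...) // copy the bytes, not the slice header
	}
	s[key(dstNS, dstName)] = Secret{Namespace: dstNS, Name: dstName, Type: src.Type, Data: data}
	return nil
}

func main() {
	store := Store{}
	// Illustrative names only; the real names depend on the cluster.
	store[key("openshift-ingress", "router-certs-default")] = Secret{
		Namespace: "openshift-ingress",
		Name:      "router-certs-default",
		Type:      "kubernetes.io/tls",
		Data:      map[string][]byte{"tls.crt": []byte("CERT"), "tls.key": []byte("KEY")},
	}
	err := CopySecret(store, "openshift-ingress", "router-certs-default",
		"openshift-ingress-canary", "canary-serving-cert")
	fmt.Println("copy error:", err)
	copied, _ := store.Get("openshift-ingress-canary", "canary-serving-cert")
	fmt.Println(copied.Namespace, copied.Name, copied.Type, len(copied.Data))
}
```

Only the type and data travel with the copy; the destination name is independent of the source name.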

@candita
Contributor

candita commented Jan 24, 2025

/restest-required

@candita
Contributor

candita commented Feb 5, 2025

Failure in infra:
--- FAIL: TestNodePool/HostedCluster2/Main/TestAdditionalTrustBundlePropagation (2271.02s)

/test e2e-hypershift

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 15, 2025
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 14, 2025
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Oct 27, 2025
@openshift-ci
Contributor

openshift-ci bot commented Oct 27, 2025

@openshift-bot: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci-robot
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-9037. The bug has been updated to no longer refer to the pull request using the external bug tracker.

@rfredette
Contributor Author

/reopen

@openshift-ci openshift-ci bot reopened this Nov 4, 2025
@rfredette rfredette changed the title OCPBUGS-9037: Use cluster wildcard certificate for ingress canary OCPBUGS-9037, OCPBUGS-64565: Use cluster wildcard certificate for ingress canary Nov 13, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. and removed jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. labels Nov 13, 2025
@openshift-ci-robot
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-9037, which is invalid:

  • expected the bug to target either version "4.21." or "openshift-4.21.", but it targets "4.17.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

This pull request references Jira Issue OCPBUGS-64565, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@Miciah
Contributor

Miciah commented Nov 19, 2025

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 19, 2025
@openshift-ci-robot
Contributor

@Miciah: This pull request references Jira Issue OCPBUGS-9037, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

This pull request references Jira Issue OCPBUGS-64565, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

@openshift-ci openshift-ci bot requested a review from lihongan November 19, 2025 00:30
@candita
Contributor

candita commented Nov 19, 2025

/retest e2e-aws-operator

@openshift-ci
Contributor

openshift-ci bot commented Nov 19, 2025

@candita: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test e2e-aws-operator
/test e2e-aws-ovn
/test e2e-aws-ovn-hypershift-conformance
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-aws-ovn-upgrade
/test e2e-azure-operator
/test e2e-gcp-operator
/test e2e-hypershift
/test hypershift-e2e-aks
/test images
/test okd-scos-images
/test unit
/test verify
/test verify-deps

The following commands are available to trigger optional jobs:

/test e2e-aws-gatewayapi-conformance
/test e2e-aws-operator-techpreview
/test e2e-aws-ovn-single-node
/test e2e-aws-ovn-techpreview
/test e2e-aws-pre-release-ossm
/test e2e-azure-manual-oidc
/test e2e-azure-ovn
/test e2e-gcp-ovn
/test e2e-ibmcloud-operator
/test e2e-openstack-operator
/test okd-scos-e2e-aws-ovn

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-hypershift-conformance
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-serial-1of2
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-serial-2of2
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-upgrade
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-pre-release-ossm
pull-ci-openshift-cluster-ingress-operator-master-e2e-azure-operator
pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator
pull-ci-openshift-cluster-ingress-operator-master-e2e-hypershift
pull-ci-openshift-cluster-ingress-operator-master-hypershift-e2e-aks
pull-ci-openshift-cluster-ingress-operator-master-images
pull-ci-openshift-cluster-ingress-operator-master-okd-scos-images
pull-ci-openshift-cluster-ingress-operator-master-unit
pull-ci-openshift-cluster-ingress-operator-master-verify
pull-ci-openshift-cluster-ingress-operator-master-verify-deps

@candita
Contributor

candita commented Nov 19, 2025

/retest

@lihongan
Contributor

The bug OCPBUGS-9037 also mentions the console health check, but this fix seems to cover only the ingress canary. How should the console route be fixed?

@rfredette rfredette force-pushed the ocpbugs-9037-use-cluster-wildcard branch from 768f6b8 to 595b17c Compare December 1, 2025 18:53
	return true
}

func (r *reconciler) canarySecretName(Namespace string) (types.NamespacedName, error) {

nit: it's a little weird that namespace is capitalized here.

}

volumes := daemonset.Spec.Template.Spec.Volumes
secretMode := int32(0420)

@candita candita Dec 1, 2025

Do you know why the defaultMode has to be set for the test but not in the function desiredCanaryDaemonSet?

@rfredette rfredette Dec 4, 2025

I'm setting the default mode on the expected result here. For the desired daemonset, the default mode is set in the daemonset manifest.
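One detail worth keeping in mind when reading `int32(0420)`: in Go, an integer literal with a leading zero is octal, so `0420` is decimal 272, whereas decimal `420` is octal `0644`. Which value the manifest actually uses is not shown in this thread, so the sketch below only illustrates the difference between the two spellings. `ModeString` is a hypothetical helper written for this example.

```go
package main

import "fmt"

// ModeString renders the low nine permission bits of mode as rwx triplets.
func ModeString(mode int32) string {
	const letters = "rwxrwxrwx"
	out := make([]byte, 9)
	for i := 0; i < 9; i++ {
		if mode&(1<<uint(8-i)) != 0 {
			out[i] = letters[i]
		} else {
			out[i] = '-'
		}
	}
	return string(out)
}

func main() {
	octal := int32(0420)  // leading zero: octal literal
	decimal := int32(420) // no leading zero: decimal literal
	fmt.Printf("int32(0420) = decimal %d, octal %#o, %s\n", octal, octal, ModeString(octal))
	fmt.Printf("int32(420)  = decimal %d, octal %#o, %s\n", decimal, decimal, ModeString(decimal))
}
```

If the test and the manifest are meant to agree, they need to agree on the numeric value, not just on the digits as written.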

	Namespace:       canaryCertName.Namespace,
	OwnerReferences: []metav1.OwnerReference{canaryRef},
}
if err := r.client.Create(ctx, canaryCert); err != nil {

I might be way off, but I'm not sure why you can't just store the name of the default cert instead of creating a copy of it? I defer to @Miciah on this.

We need to copy the secret's data so that the canary application uses the exact certificate that is used as the default IngressController's default certificate. That is, we specifically need canaryCert.Data to match defaultCert.Data (and it makes sense to set canaryCert.Type to defaultCert.Type as well). The name doesn't matter, other than that the copy needs to be in the same namespace as the canary daemonset and needs to match whatever name the canary daemonset specifies in its volume.

@rfredette rfredette force-pushed the ocpbugs-9037-use-cluster-wildcard branch from 595b17c to 3ecac04 Compare December 4, 2025 20:28
@openshift-ci
Contributor

openshift-ci bot commented Dec 4, 2025

@rfredette: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-serial | a7792fa | link | true | /test e2e-aws-ovn-serial |
| ci/prow/e2e-aws-pre-release-ossm | 3ecac04 | link | false | /test e2e-aws-pre-release-ossm |
| ci/prow/e2e-aws-operator | 3ecac04 | link | true | /test e2e-aws-operator |

Full PR test history. Your PR dashboard.

@Miciah Miciah left a comment

The commit message is missing a body. Please copy and paste the PR description into the commit message and add a link to https://issues.redhat.com/browse/OCPBUGS-9037 (feel free to do more, but I think that that would be good enough).

The comment at the top of this file should be updated to document that the controller copies the certificate for the canary... though the more I think about it, the more I think we should have a separate controller for this purpose.

Comment on lines +110 to +117
// The ingress canary verifies that the default ingress controller is functioning. Since it uses a
// passthrough route, we mirror the default ingress controller's certificate in the canary to detect any
// issues with that certificate, so if the default controller's cert is updated, update the canary's cert as
// well.
if ingress.Name == manifests.DefaultIngressControllerName {
	log.Info("Ensuring canary certificate")
	daemonset := &appsv1.DaemonSet{}
	err = r.client.Get(ctx, controller.CanaryDaemonSetName(), daemonset)

Do we need a watch on secrets so that we update the canary's copy when the cluster-admin updates the original secret? Maybe re-using this controller isn't the best approach. I suppose we can refactor later on as a follow-up.

Comment on lines +120 to +121
// All ingresses should have a deployment, so this one may not have been
// created yet. Retry after a reasonable amount of time.

This comment looks like copypasta.

canaryCertName := controller.RouterEffectiveDefaultCertificateSecretName(ingress, daemonset.Namespace)
canaryRef := metav1.OwnerReference{
	APIVersion: "apps/v1",
	Kind:       "Daemonset",

@Miciah Miciah Dec 5, 2025

The kind is miscapitalized here:

Suggested change
- Kind: "Daemonset",
+ Kind: "DaemonSet",

Does using "Daemonset" actually work? I don't know how forgiving the API server [or garbage collector] is.

if err := r.client.Get(ctx, defaultCertName, defaultCert); err != nil {
	errs = append(errs, fmt.Errorf("failed to get certificate for canary: %w", err))
}
canaryCert := defaultCert.DeepCopy()

@Miciah Miciah Dec 5, 2025

We really only care about defaultCert.Data and defaultCert.Type, right?

I suppose using DeepCopy will work as you stomp ObjectMeta and the API server doesn't send StringData; using DeepCopy just does some extra (unnecessary) work.

	errs = append(errs, fmt.Errorf("failed to get certificate for canary: %w", err))
}
canaryCert := defaultCert.DeepCopy()
canaryCertName := controller.RouterEffectiveDefaultCertificateSecretName(ingress, daemonset.Namespace)

This should work, but is there a reason to copy the name from the effective certificate? If you used a static name, you could get rid of canarySecretName in pkg/operator/controller/canary/daemonset.go and simplify the logic a bit.

One consequence of copying the name is that you can end up with multiple secrets in the canary namespace if IngressController.spec.defaultCertificate is updated.

For example, the TestUpdateDefaultIngressControllerSecret test updates the default certificate from the default "router-certs-default" secret to a "test-xyz" secret (where "xyz" is a randomly generated suffix) and then reverts the default certificate back to "router-certs-default". As a consequence the CI artifacts have both a "router-certs-default" secret and a "test-xyz" secret in the canary namespace.

If the canary daemonset were ever deleted, the owner reference would cause these secrets to be cleaned up, but otherwise you can accumulate these secrets. This isn't a major problem, but it does create some unnecessary cruft.
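The accumulation described above can be reproduced with a small simulation. This is a sketch under stated assumptions: the canary namespace is modeled as a set of secret names, each reconcile only creates or overwrites a secret (nothing is deleted), and `canary-serving-cert` is a hypothetical static name chosen for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// CopyName names the canary secret after the default certificate secret.
func CopyName(defaultCertName string) string { return defaultCertName }

// StaticName ignores the source name and always uses one fixed name.
// "canary-serving-cert" is a hypothetical name used for illustration.
func StaticName(string) string { return "canary-serving-cert" }

// Simulate replays a sequence of default certificate secret names, one
// reconcile per entry, and returns the sorted names of the secrets left
// in the simulated canary namespace afterwards.
func Simulate(defaultCertNames []string, nameFor func(string) string) []string {
	canaryNamespace := map[string]bool{}
	for _, name := range defaultCertNames {
		canaryNamespace[nameFor(name)] = true // create or overwrite, never delete
	}
	names := make([]string, 0, len(canaryNamespace))
	for name := range canaryNamespace {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Mirrors the test scenario: switch to "test-xyz", then switch back.
	sequence := []string{"router-certs-default", "test-xyz", "router-certs-default"}
	fmt.Println("copied name:", Simulate(sequence, CopyName))
	fmt.Println("static name:", Simulate(sequence, StaticName))
}
```

With a copied name, every distinct default certificate name leaves a secret behind; with a static name, the namespace holds a single secret that is overwritten in place.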

Well... I wonder whether the issue referenced at #1155 (comment) will prevent garbage collection from working properly?

@Miciah
Contributor

Miciah commented Dec 5, 2025

Looking at the ingress-operator logs in the e2e-aws-operator artifacts, I see some repeated errors:

2025-12-04T22:34:39.140Z	ERROR	operator.init	controller/controller.go:300	Reconciler error	{"controller": "certificate_controller", "object": {"name":"default","namespace":"openshift-ingress-operator"}, "namespace": "openshift-ingress-operator", "name": "default", "reconcileID": "e0744554-1c56-458f-bf75-071d2d90c374", "error": "failed to ensure certificate for canary: secrets \"router-certs-default\" already exists", "errorCauses": [{"error": "failed to ensure certificate for canary: secrets \"router-certs-default\" already exists"}]}
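The "already exists" error is what an unconditional Create produces once the copy is present from an earlier reconcile. One common way to make such a step idempotent is to fall back to an update when creation reports a conflict. The sketch below assumes a hypothetical in-memory `Store` in place of the real client; in real code the equivalent check would typically be `IsAlreadyExists` from `k8s.io/apimachinery/pkg/api/errors`. It is not the fix that was eventually adopted, which this thread does not show.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAlreadyExists mimics the API server's "already exists" condition.
var ErrAlreadyExists = errors.New("already exists")

// Secret is a simplified, hypothetical stand-in for a Kubernetes secret.
type Secret struct {
	Name string
	Data map[string]string
}

// Store is a hypothetical in-memory stand-in for the API client.
type Store struct{ secrets map[string]Secret }

// NewStore returns an empty store.
func NewStore() *Store { return &Store{secrets: map[string]Secret{}} }

// IsAlreadyExists reports whether err wraps ErrAlreadyExists.
func IsAlreadyExists(err error) bool { return errors.Is(err, ErrAlreadyExists) }

// Create fails if a secret with the same name is already stored.
func (s *Store) Create(sec Secret) error {
	if _, ok := s.secrets[sec.Name]; ok {
		return fmt.Errorf("secrets %q %w", sec.Name, ErrAlreadyExists)
	}
	s.secrets[sec.Name] = sec
	return nil
}

// Update replaces an existing secret and fails if it is missing.
func (s *Store) Update(sec Secret) error {
	if _, ok := s.secrets[sec.Name]; !ok {
		return fmt.Errorf("secrets %q not found", sec.Name)
	}
	s.secrets[sec.Name] = sec
	return nil
}

// Ensure creates the secret, or updates it if it already exists, so that
// repeated reconciles converge instead of failing.
func Ensure(s *Store, sec Secret) error {
	err := s.Create(sec)
	if IsAlreadyExists(err) {
		return s.Update(sec)
	}
	return err
}

func main() {
	store := NewStore()
	first := Secret{Name: "router-certs-default", Data: map[string]string{"tls.crt": "old"}}
	second := Secret{Name: "router-certs-default", Data: map[string]string{"tls.crt": "new"}}
	fmt.Println("first ensure:", Ensure(store, first))
	fmt.Println("plain create:", store.Create(second))
	fmt.Println("second ensure:", Ensure(store, second))
	fmt.Println("stored cert:", store.secrets["router-certs-default"].Data["tls.crt"])
}
```

The update branch also covers the case where the default certificate's contents change, since the copy is overwritten on the next reconcile.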

@rfredette
Contributor Author

Closing in favor of #1334

@rfredette rfredette closed this Jan 6, 2026
@openshift-ci-robot
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-9037. The bug has been updated to no longer refer to the pull request using the external bug tracker.

This pull request references Jira Issue OCPBUGS-64565. The bug has been updated to no longer refer to the pull request using the external bug tracker.

Labels

jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

6 participants