Skip to content

OCPBUGS-9037: Change Canary to use passthrough route#978

Merged
openshift-merge-bot[bot] merged 2 commits intoopenshift:masterfrom
rfredette:ocpbugs-9037-canary-passthrough
Jun 13, 2024
Merged

OCPBUGS-9037: Change Canary to use passthrough route#978
openshift-merge-bot[bot] merged 2 commits intoopenshift:masterfrom
rfredette:ocpbugs-9037-canary-passthrough

Conversation

@rfredette
Copy link
Contributor

@rfredette rfredette commented Sep 19, 2023

When the default ingress controller is configured to use mTLS, connecting to edge or reencrypt routes requires the client to have a valid certificate/key pair. There is no way for a user to provide a client certificate or key to the ingress operator, so canary checks using edge encryption fail. This PR changes the canary route to be passthrough and has the canary host handle the TLS handshake, allowing it to function even if mTLS is otherwise required.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 19, 2023
@openshift-ci openshift-ci bot requested review from Miciah and alebedev87 September 19, 2023 14:33
@rfredette rfredette force-pushed the ocpbugs-9037-canary-passthrough branch from bc30df7 to 5af9bc9 Compare September 25, 2023 19:01
@rfredette rfredette force-pushed the ocpbugs-9037-canary-passthrough branch from 3aec956 to 591e372 Compare October 3, 2023 16:58
@rfredette
Copy link
Contributor Author

Looking through the logs from the latest e2e-aws-ovn run, I only see a single canary health check failure, which isn't enough to trigger the operator to change its CanaryChecksSucceeding status condition, so I think it's unlikely that that failure or the other e2e-*-ovn are related to these changes. I've rebased in case there were relevant changes my branch is missing, but mostly I think it just needs a retest.

@rfredette
Copy link
Contributor Author

On further investigation, there is a canary related failure in e2e-*-ovn, for test [sig-arch] Managed cluster [It] should expose cluster services outside the cluster [apigroup:route.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]. It expects the canary to return an HTTP code 200, which it seems to from this part of the test output:

Oct 3 18:29:54.955: INFO: stdout: "{\"test\":0,\"rc\":0,\"curl\":{\"code\":200},\"error\":\"\",\"body\":\"SGVhbHRoY2hlY2sgcmVxdWVzdGVkCg==\",\"headers\":\"HTTP/2 200 \\r\\nx-request-port: 8080\\r\\ncontent-type: text/plain; charset=utf-8\\r\\ncontent-length: 22\\r\\ndate: Tue, 03 Oct 2023 18:29:54 GMT\\r\\n\\r\\n\"}\n"

Despite getting what should be a valid response, the test fails.

@rfredette rfredette changed the title WIP: Change Canary to use passthrough route OCPBUGS-9037: Change Canary to use passthrough route Oct 6, 2023
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 6, 2023
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 6, 2023
@openshift-ci-robot
Copy link
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-9037, which is invalid:

  • expected the bug to target the "4.15.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-9037, which is invalid:

  • expected the bug to target the "4.15.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rfredette
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 6, 2023
@openshift-ci-robot
Copy link
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-9037, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from lihongan October 6, 2023 20:06
@rfredette
Copy link
Contributor Author

the ovn failures require the fix in openshift/origin#28301 or something similar, but e2e-hypershift's failure seems unrelated.
/test e2e-hypershift

@rfredette
Copy link
Contributor Author

The origin test fix has merged
/retest

@frobware
Copy link
Contributor

/retest

1 similar comment
@rfredette
Copy link
Contributor Author

/retest

@rfredette rfredette force-pushed the ocpbugs-9037-canary-passthrough branch from 591e372 to c350a7e Compare October 19, 2023 19:57
@frobware
Copy link
Contributor

frobware commented Oct 20, 2023

Given the change from re-encrypt to passthrough why do we need to do the things you highlight in commit c350a7e? What breaks, given the switch to passhthrough, if you don't do the additional reconciliation logic?

@frobware
Copy link
Contributor

/assign @frobware

@rfredette rfredette force-pushed the ocpbugs-9037-canary-passthrough branch from 20e8af7 to 991a3a7 Compare June 11, 2024 20:34
@rfredette
Copy link
Contributor Author

test failures look unrelated.
/retest

@frobware
Copy link
Contributor

/retest

Copy link
Contributor

@frobware frobware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested by watching for port rotations and with canaryCheckFrequency=8s and canaryCheckCycleCount=1 based on the conditional check you have to wait 16s for the port to rotate. Is that intentional or, based on my parameters, should it rotate every 8s?

@frobware
Copy link
Contributor

TestAllowedSourceRangesStatus - could be fixed by #1087.

@rfredette rfredette force-pushed the ocpbugs-9037-canary-passthrough branch from 991a3a7 to d29478a Compare June 13, 2024 14:39
@frobware
Copy link
Contributor

/lgtm

@frobware
Copy link
Contributor

My LGTM didn't seem to stick, trying again...

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 13, 2024
case intstr.Int:
if newRoute.Spec.Port.TargetPort.IntVal == portNum {
isValidPort = true
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This break doesn't do what you probably think it does—it breaks out of the switch, not out of the for loop.

// If newRoute's port is one of the canary service ports, the change may just be the ingress operator
// rotating canary ports. If the *only* change between oldRoute and newRoute is the port change, skip
// reconciling.
oldRoute.Spec.Port = newRoute.Spec.Port
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the assignment is so that the cmp.Equal effectively ignores a difference in Spec.Port, right? Using something like cmp.Equal(oldRoute.Spec, newRoute.Spec, cmpopts.IgnoreFields(routev1.RouteSpec{}, "Port")) would be a little more obvious.

@gcs278
Copy link
Contributor

gcs278 commented Jun 13, 2024

#1087 has merged.
/test e2e-azure-operator

rfredette and others added 2 commits June 13, 2024 12:34
TestClientTLS, TestMTLSWithCRLs, and TestRouterCompressionOperation
queried the canary route to test various router config options. However,
with the canary being switched to use a passthrough route, many router
config options no longer apply. This commit changes each test to deploy
their own backend and route.

TestClientTLS and TestMTLSWithCRLs were both changed to use the echo pod
backend, which echoes the contents of the request back to the client.

TestRouterCompressionOperation requires that the response have a
specific content-type header, which the echo pod doesn't provide. It now
deploys a httpd backend, which does include the content-type header, and
has been updated to enable compression on the appropriate content-type.

This is part of the fix for OCPBUGS-9037
Edge terminated TLS is subject to the ingress controller's client TLS
(mTLS) requirements. When mTLS is required, the client is also required
to provide a TLS certificate and key, but there is currently no way to
provide a client certificate or key to the ingress operator, causing
canary health checks to fail when mTLS is required.

To allow the canary healthchecks to work when mTLS is required, this
commit changes the canary to serve TLS and changes the canary route to
use passthrough encryption, and adds a test to verify that enabling mTLS
doesn't cause canary checks to fail.

The canary service has been updated to have the
service.beta.openshift.io/serving-cert-secret-name annotation, which
prompts the auto-generation of a secret with a certificate and key for
the canary to use for TLS. It also now serves on port 8443 instead of
8080. Previously, reconciliation of the canary service was limited to
making sure it existed, but now the operator checks for changes to the
service's annotations and ports, and updates them to the desired values
if necessary.

In order to make the serving certificate available to the canary, the
canary daemonset has been updated to mount the relevant secret, and set
the environment variables TLS_CERT and TLS_KEY to the full path to the
serving certificate and key, respectively. In order to make sure this
secret is available for the canary process, the reconciliation logic has
also been updated to keep the environment variables, volumes, and volume
mounts at their desired values.

The operator now watches the canary daemonset and service, and
reconciles canary resources when either changes.

TestCanaryWithMTLS enables mTLS on the default ingress controller, then
verifies that the CanaryChecksSucceeding condition remains true.

This is part of the fix for OCPBUGS-9037

Co-authored-by: Andrew McDermott <frobware@users.noreply.github.com>
@rfredette rfredette force-pushed the ocpbugs-9037-canary-passthrough branch from d29478a to 478b9bc Compare June 13, 2024 16:35
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 13, 2024
@Miciah
Copy link
Contributor

Miciah commented Jun 13, 2024

Thanks!
/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 13, 2024
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 13, 2024
@rfredette
Copy link
Contributor Author

e2e-aws-operator failed during startup:

{  openshift cluster install failed with infrastructure setup}

/test e2e-aws-operator

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD e30577e and 2 for PR HEAD 478b9bc in total

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 13, 2024

@rfredette: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 4d654d3 into openshift:master Jun 13, 2024
@openshift-ci-robot
Copy link
Contributor

@rfredette: Jira Issue OCPBUGS-9037: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-9037 has been moved to the MODIFIED state.

Details

In response to this:

When the default ingress controller is configured to use mTLS, connecting to edge or reencrypt routes requires the client to have a valid certificate/key pair. There is no way for a user to provide a client certificate or key to the ingress operator, so canary checks using edge encryption fail. This PR changes the canary route to be passthrough and has the canary host handle the TLS handshake, allowing it to function even if mTLS is otherwise required.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-cluster-ingress-operator-container-v4.17.0-202406132111.p0.g4d654d3.assembly.stream.el9 for distgit ose-cluster-ingress-operator.
All builds following this will include this PR.

@openshift-merge-robot
Copy link
Contributor

Fix included in accepted release 4.22.0-0.nightly-2026-01-28-225830

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants