-
Notifications
You must be signed in to change notification settings - Fork 136
OCPBUGS-26498: Upgrade Validation plugin for SHA1 certs #555
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-26498: Upgrade Validation plugin for SHA1 certs #555
Conversation
|
@gcs278: This pull request references Jira Issue OCPBUGS-26498, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
c6a9ad2 to
5eb0acb
Compare
b88197d to
a378d74
Compare
|
@gcs278: This pull request references Jira Issue OCPBUGS-26498, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test verify |
|
/test e2e-upgrade |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work! I made some comments, but the one thing I want to clarify is whether UpgradeRouteValidation should be checking tlsConfig.CACertificate or tlsConfig.DestinationCACertificate in addition to tlsConfig.Certificate (see #555 (comment)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the changes in this PR causing scanners to report issues with these files, or is this change logically separate from the SHA1 validation change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the PR commit Regenerate certs for unit tests is causes the scanner to report repo leaks due to certs changing in the test files.
Regenerate certs for unit tests is required because we need to differentiate between SHA1 and SHA256 certs in our testing for the Upgrade Validation plugin.
| name: "Edge termination cert key mismatch", | ||
| route: &routev1.Route{ | ||
| Spec: routev1.RouteSpec{ | ||
| Host: "www.example.com", | ||
| TLS: &routev1.TLSConfig{ | ||
| Termination: routev1.TLSTerminationEdge, | ||
| Certificate: testExpiredCAUnknownCertificate, | ||
| Certificate: testCAUnknownCertificate, | ||
| Key: testPrivateKey, | ||
| CACertificate: testCACertificate, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case seems to have both a mismatch between server certificate and key and a mismatch between server certificate and CA certificate. The test case fails with the expected error (namely the mismatch between server certificate and key), so it passes as is. However, there are a few oddities here:
- The test-case name only mentions the one mismatch.
- The test case expects only one error—perhaps for the same reason that "Edge termination with expired cert, ignoring CA mismatch" passes, in which case it might be helpful to add a comment saying as much.
- I suspect the test case would pass when it shouldn't if the validation erroneously checked for a match between the server certificate and CA certificate and not between the server certificate and key.
- There is already a test, "Edge termination mismatched key and cert", which seems to cover the case of a mismatch between server certificate and key:
router/pkg/router/routeapihelpers/validation_test.go
Lines 1170 to 1183 in 20865c3
{ name: "Edge termination cert with unknown CA", route: &routev1.Route{ Spec: routev1.RouteSpec{ Host: "www.example.com", TLS: &routev1.TLSConfig{ Termination: routev1.TLSTerminationEdge, Certificate: testCAUnknownCertificate, Key: testCAUnknownCertificateKey, }, }, }, expectedErrors: 0, },
Is this a duplicate test case, or should it be named something like "Edge termination mismatched key and cert with unrelated CA"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test-case name only mentions the one mismatch.
Updated the name to "Edge termination mismatched key and cert with unrelated CA".
The test case expects only one error—perhaps for the same reason that "Edge termination with expired cert, ignoring CA mismatch" passes, in which case it might be helpful to add a comment saying as much.
Sorry, this is where I get confused. I thought you were taking about previously named "Edge termination cert key mismatch", which returns 2 expectedErrors.
Is this a duplicate test case, or should it be named something like "Edge termination mismatched key and cert with unrelated CA"?
I don't think it's a duplicate. It returns a different number of errors, so I've taken your suggested name change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test case expects only one error—perhaps for the same reason that "Edge termination with expired cert, ignoring CA mismatch" passes, in which case it might be helpful to add a comment saying as much.
Sorry, this is where I get confused. I thought you were taking about previously named
"Edge termination cert key mismatch", which returns 2 expectedErrors.
Sorry, I must have been mixing up test cases or misreading the diff.
SHA1 certs will be unsupported in the future. Clean up unit tests to organize clearly between SHA1 and SHA256 certs. The unit tests had SHA1 certs that weren't tested separately. Also, the unit tests claimed to be using an expired certificate, but it was NOT expired. Document all certs by adding openssl comands for the regenerated certs to help future developers recreate certs if needed.
Bumping openshift/api requires bumping golang to 1.21 otherwise the ci/prow/verify job will fail on go tidy.
Bump openshift/api to get new route UnservableInFutureVersions status condition type and update K8S packages to 0.29.1.
When an route ingress already exists, but an expected condition isn't found, recordIngressCondition would previously not change/update anything. This wasn't an issue for when only had Admitted because the route ingress was updated synchronously with the Admitted condition. However, adding another route status condition exposes the scenario in which a route ingress will exist, but the desired condition won't. This also adds a unit test for recordIngressCondition which explicitly tests this case along with other edge cases.
Previously, recordIngressCondition sometimes clears the original return value conditions depending on whether the condition was found and changed. This doesn't seem consistent and makes it hard to unit test. The original return value only affects logging. Logging the original conditions is advantageous because modifying conditions can also lead to contention, and having visibility into the modified conditions is beneficial. Therefore, it seems reasonable to remove condition clearing as it benefits logging and it's inconsistent. Update unit testing to now verify original return value from recordIngressCondition.
Add the Upgrade Validation plugin which provides a framework for adding a status to a route to indicate future upgrade issues. For this initial implementation, if a route has a SHA1 certificate, the upgrade validation plugin will mark the route with a UnservableInFutureVersions status. If the route no longer has a SHA1 certificate, remove the status instead of setting UnservableInFutureVersions=false. This is specific to OpenShift 4.15 and is expected to be removed from 4.16 once backported to avoid confusion. It's specific to 4.15 because 4.15 accepts SHA1 certs, but 4.16 does not. Therefore, validating the upgrade should be only done in 4.15. Update the RejectionRecorder interface to be the RouteStatusRecorder to support more generic status-related use cases as we have more than just Admitted status now. Add UPGRADE_VALIDATION env variable, but default to true, so that we can disable in the future. Follow existing pattern similar to existing EXTENDED_VALIDATION env variable. Key the writerlease logic off of route UID AND condition type. This is now essential because if only route UID, then the lease logic will only update one condition as it doesn't differentiate from the work items for the different conditions. Adjust findMostRecentIngress to use the latest condition instead of just Admitted as we can no longer assume that Admitted is the only condition. Add a type for writelease WorkKey to ensure strings aren't used anymore as the previous logic was only using route.UID. Add a type for ContentionTracker key to ensure it is different than writelease WorkKey. Update ingressEqual function to detect contention if any condition is not equal. Previously, it hardcoded the "Admitted" status. Add a unit test, TestRouterContentionUpdateCondition to test these changes.
d5ea141 to
c22b003
Compare
gcs278
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Miciah Thanks for the detailed review. Changes are here: https://github.com/openshift/router/compare/d5ea141e7ff30459aeff301e46f09bd7715c839c..c22b003ff4b0140aaa0fecefdd699a1b74717415
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the PR commit Regenerate certs for unit tests is causes the scanner to report repo leaks due to certs changing in the test files.
Regenerate certs for unit tests is required because we need to differentiate between SHA1 and SHA256 certs in our testing for the Upgrade Validation plugin.
| if diff := cmp.Diff(baselineIngressState, original, cmpOpts...); len(diff) > 0 { | ||
| t.Errorf("expected original to match baseline ingress from route.Status (-want +got):\n%s", diff) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I believe removing the original.Conditions = nil in recordIngressCondition is a net positive .
- First, it had no impact other than logging.
originalwas only used for logging - Besides being confusing to modify a variable called
original, removing the conditions removes half the benefit of logging here.- I found myself trying to debug a contention issue, and the fact it doesn't log the conditions makes that much harder.
- Adding the new
UnservableInFutureVersionscondition and condition removal as contention points makes this even more desirable (more potential contention will be between conditions)
- It only clears conditions on the
originalreturn value when the ingress has changed...that feels confusing and inconsistent.
pkg/router/controller/status.go
Outdated
| // preserve a deep copy of the original ingress | ||
| original := existing.DeepCopy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a pessimization. Doing a deep copy has significant time and memory impact.
Yes, it's slower, I agree.
Is the intention only to make writing the unit test easier, or is it the intention also to improve logging or something else?
Well neither, I wrote it as a "future proofing" or "avoiding fragility" fix.
Reviewing how shallow copies work in Go again, I see now that we technically don't need the deepcopy, as the RouteIngress object is made up of values, not pointers, except for []conditions, which I now understand each item in a slice is a pointer, but the slice header is a value (which includes slice length).
Since the logic below only ever appends to the shallow copy slice of existing.Conditions (different slice header), that means that original will be unaffected by the current logic below.
I'll switch back to a shallow copy for now since it's provides the same result.
|
|
||
| return handleRouteStatusUpdate(context.TODO(), action, oc, route, latest, tracker) | ||
| }) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I like that. I had the idea to do a callback, but got stuck following through with it, but this was I wanted.
Maybe better left as a follow-up as it would be nontrivial to check for correctness.
Could you elaborate? It appear "correct" for our two use-cases (performIngressConditionUpdate, performIngressConditionRemoval), but are you saying we would need to explore "correctness" for other (future) use cases?
I ask because I prefer this logic, and don't exactly see how any more "non-trivial" it is than what I currently have.
| name: "Edge termination cert key mismatch", | ||
| route: &routev1.Route{ | ||
| Spec: routev1.RouteSpec{ | ||
| Host: "www.example.com", | ||
| TLS: &routev1.TLSConfig{ | ||
| Termination: routev1.TLSTerminationEdge, | ||
| Certificate: testExpiredCAUnknownCertificate, | ||
| Certificate: testCAUnknownCertificate, | ||
| Key: testPrivateKey, | ||
| CACertificate: testCACertificate, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test-case name only mentions the one mismatch.
Updated the name to "Edge termination mismatched key and cert with unrelated CA".
The test case expects only one error—perhaps for the same reason that "Edge termination with expired cert, ignoring CA mismatch" passes, in which case it might be helpful to add a comment saying as much.
Sorry, this is where I get confused. I thought you were taking about previously named "Edge termination cert key mismatch", which returns 2 expectedErrors.
Is this a duplicate test case, or should it be named something like "Edge termination mismatched key and cert with unrelated CA"?
I don't think it's a duplicate. It returns a different number of errors, so I've taken your suggested name change.
| name: "Edge termination expired cert unknown CA", | ||
| name: "Edge termination cert with unknown CA", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sorry this is painful to review. Thanks for writing this out.
- "Edge termination expired cert unknown CA" is deleted and replaced by these two tests:
- "Edge termination expired cert with unknown CA", which uses an actually expired certificate so it does what the name says.
- "Edge termination cert with unknown CA", which does what "Edge termination expired cert unknown CA" actually did (despite the latter's misleading name).
Yes, that seems right.
"Edge termination expired cert key mismatch" is renamed to "Edge termination cert key mismatch" to reflect what it actually tests. We already have "Expired cert mismatched key", which actually covers what the name implies (though unlike "Edge termination expired cert key mismatch", "Expired cert mismatched key" doesn't specify CACertificate).
Yes, this one just had an incorrect name.
- "Edge termination expired cert mismatched CA" is deleted and replaced by these two tests:
- "Edge termination with expired cert, ignoring CA mismatch", which uses an actually expired certificate so it does what the name says. (Using an actually expired certificate also means that expectedErrors changes from 0 to 1.)
- "Edge termination cert mismatched CA", which does what "Edge termination expired cert mismatched CA" actually did (despite the misleading name).
Correct, the testExpiredCAUnknownCertificate it was using was NOT expired. Super confusing.
"Expired cert" is renamed to "expired cert edge" and fixed to use an actually expired certificate; what it actually tested is already tested by "Edge termination OK with certs".
Correct.
The rest of the changes are changes to individual tests, the test runner logic, or net-new tests.
I think so.
|
Thanks! |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Miciah The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Thanks @Miciah! For the sake of time, I'm going to save #555 (comment) as a follow up. I do plan a follow up change in a hopeful fix for https://issues.redhat.com/browse/OCPBUGS-1689 as this PR has laid the groundwork from removing conditions. I'll wait for to hear from @frobware to remove the hold. |
|
Unrelated pod IP failure: |
|
@gcs278: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
/hold cancel |
|
@gcs278: Jira Issue OCPBUGS-26498: Some pull requests linked via external trackers have merged: The following pull requests linked via external trackers have not merged: These pull request must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with Jira Issue OCPBUGS-26498 has not been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[ART PR BUILD NOTIFIER] This PR has been included in build ose-haproxy-router-base-container-v4.16.0-202404220845.p0.g7ad7ec1.assembly.stream.el9 for distgit ose-haproxy-router-base. |
|
@gcs278: Jira Issue OCPBUGS-26498: Some pull requests linked via external trackers have merged: The following pull requests linked via external trackers have not merged: These pull request must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with Jira Issue OCPBUGS-26498 has not been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/cherry-pick 4.15 |
|
@gcs278: cannot checkout DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/cherry-pick release-4.15 |
This PR adds support for setting a new status condition type,
UnservableInFutureVersions, when a SHA1 cert is detected on a route. It accomplishes this by adding a new plugin, the upgrade validation plugin. This plugin is simple: when a route configuration that is unsupported in the future is detected, it adds aUnservableInFutureVersionsstatus condition to the route. When the problematic configuration is no longer detected, it clears/removes theUnservableInFutureVersionscondition.In order to support updating a new route status condition, the
routeRejectionRecorderand associated methods were refactored to be more generic. Thewriterleaselogic used to update route status had to be refactored because the old code made assumptions that only one type of status condition would ever be updated.It also updates certificates for existing unit tests, as some certificates were SHA1 and incorrectly labeled as expired.
This PR is intended to be introduced in 4.16, but to be backported to 4.15 as
UnservableInFutureVersionsonly makes sense in 4.15 because 4.16 will reject SHA1 routes. In PR #552, there will be a partial revert, as it will no longer addUnservableInFutureVersionsto SHA1 routes due to them now being rejection.Dependent on openshift/api#1722
Related Ingress Operator PR openshift/cluster-ingress-operator#1014