Skip to content

Conversation

@joelddiaz
Copy link
Contributor

@joelddiaz joelddiaz commented Dec 14, 2021

Migrate away from depending on the Azure Active Directory Graph API since it will be sunset in June 2022.

  • No more mint mode for Azure. So continue supporting passthrough and manual mode as the only supported modes on Azure.
  • The implementation is to always mark the Azure root secret as passthrough, and pivot a live running originally-mint cluster to passthrough mode.
  • Overwrite existing Secrets with the content found in kube-system/azure-credentials (we can assume this as removing the root creds was never supported in Azure).
  • Try to clean up previously minted App Registrations / Service Principals (while the Azure AD Graph API still works), but treat errors as non-fatal (just set a condition on the CredentialsRequest).
  • Update test cases.

xref: https://issues.redhat.com/browse/CCO-173

Joel Diaz added 2 commits December 9, 2021 14:29
go get github.com/Azure/azure-sdk-for-go
go mod vendor ; go mod tidy
commit
go get github.com/Azure/go-autorest/autorest
go get github.com/Azure/go-autorest/autorest/adal
go get github.com/Azure/go-autorest/autorest/azure/auth
go get github.com/Azure/go-autorest/autorest/to
go get github.com/Azure/go-autorest/autorest/azure/cli
go get github.com/Azure/go-autorest/autorest/validation

followed by go mod vendor ; go mod tidy
and committing the changes
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 14, 2021
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 14, 2021
@codecov
Copy link

codecov bot commented Dec 14, 2021

Codecov Report

Merging #433 (bf76773) into master (64bab70) will decrease coverage by 0.21%.
The diff coverage is 82.40%.

❗ Current head bf76773 differs from pull request most recent head 32b6237. Consider uploading reports for the commit 32b6237 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #433      +/-   ##
==========================================
- Coverage   46.39%   46.17%   -0.22%     
==========================================
  Files          94       91       -3     
  Lines        9688     9236     -452     
==========================================
- Hits         4495     4265     -230     
+ Misses       4612     4455     -157     
+ Partials      581      516      -65     
Impacted Files Coverage Δ
pkg/azure/clients.go 0.00% <ø> (ø)
pkg/azure/minter.go 35.71% <55.55%> (-12.85%) ⬇️
pkg/operator/secretannotator/azure/reconciler.go 35.52% <75.00%> (-1.78%) ⬇️
pkg/azure/actuator.go 56.83% <78.26%> (+11.78%) ⬆️
...redentialsrequest/credentialsrequest_controller.go 54.08% <100.00%> (+3.53%) ⬆️
pkg/azure/mock/client_generated.go 20.72% <0.00%> (-63.97%) ⬇️
... and 2 more

Comment on lines 345 to 357
func (a *Actuator) syncPassthrough(ctx context.Context, cr *minterv1.CredentialsRequest, cloudCredsSecret *corev1.Secret, logger log.FieldLogger) error {
syncErr := a.syncCredentialSecrets(ctx, cr, cloudCredsSecret, logger)

// Since we are live pivoting from Mint to Passthrough, try to clean up the old App Registration
cleanupErr := a.cleanupAfterPassthroughPivot(ctx, cr, cloudCredsSecret, logger)
if cleanupErr != nil {
logger.WithError(cleanupErr).Warn("unable to clean up previously minted App Regisration/Service Principal")
}
if syncErr == nil && cleanupErr != nil {
// A syncErr is more important to communicate about
// but do set a condition if the only issue was
// during cleanup.
return cleanupErr
}
return syncErr
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (a *Actuator) syncPassthrough(ctx context.Context, cr *minterv1.CredentialsRequest, cloudCredsSecret *corev1.Secret, logger log.FieldLogger) error {
syncErr := a.syncCredentialSecrets(ctx, cr, cloudCredsSecret, logger)
// Since we are live pivoting from Mint to Passthrough, try to clean up the old App Registration
cleanupErr := a.cleanupAfterPassthroughPivot(ctx, cr, cloudCredsSecret, logger)
if cleanupErr != nil {
logger.WithError(cleanupErr).Warn("unable to clean up previously minted App Regisration/Service Principal")
}
if syncErr == nil && cleanupErr != nil {
// A syncErr is more important to communicate about
// but do set a condition if the only issue was
// during cleanup.
return cleanupErr
}
return syncErr
}
func (a *Actuator) syncPassthrough(ctx context.Context, cr *minterv1.CredentialsRequest, cloudCredsSecret *corev1.Secret, logger log.FieldLogger) error {
syncErr := a.syncCredentialSecrets(ctx, cr, cloudCredsSecret, logger)
if syncErr != nil {
return syncErr
}
// Since we are live pivoting from Mint to Passthrough, try to clean up the old App Registration
cleanupErr := a.cleanupAfterPassthroughPivot(ctx, cr, cloudCredsSecret, logger)
if cleanupErr != nil {
logger.WithError(cleanupErr).Warn("unable to clean up previously minted App Regisration/Service Principal")
}
return cleanupErr
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually addresses the potential that we would fail to sync the secret, but still attempted a cleanup. That could lead to bad things. Good catch!

Comment on lines -210 to -232
// When a service principal is deleted, it's corresponding credentials becomes invalid.
// Pass-through credentials are not created through crafted service principal.
// When a request is deleted, there is no service principal to delete.
// Thus, corresponding secret still provides valid credentials.
// For that reason, existing secret object needs to be deleted as well to avoid
// credentials leaking.
//
// Also, there is no harm in deleting the secret in general. Every component consuming
// the secret will be forbidden to talk to Azure API once the service principal is destroyed.
existingSecret := &corev1.Secret{}
err = a.client.Get(ctx, client.ObjectKey{Namespace: cr.Spec.SecretRef.Namespace, Name: cr.Spec.SecretRef.Name}, existingSecret)
if err == nil {
logger.Infof("Deleting secret %v/%v", cr.Spec.SecretRef.Namespace, cr.Spec.SecretRef.Name)
if err := a.client.Delete(ctx, existingSecret); err != nil {
return fmt.Errorf("unable to delete secret %v/%v: %v", cr.Spec.SecretRef.Namespace, cr.Spec.SecretRef.Name, err)
}
} else if !kerrors.IsNotFound(err) {
return fmt.Errorf("unable to get secret %v/%v: %v", cr.Spec.SecretRef.Namespace, cr.Spec.SecretRef.Name, err)
}

if credentialsRootSecret.Annotations[constants.AnnotationKey] == constants.PassthroughAnnotation {
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we want to remove this code? I believe if we remove this, the target secret would not be removed when the credentials request is deleted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is already handled in the non-actuator-specific Delete() path https://github.com/openshift/cloud-credential-operator/blob/master/pkg/operator/credentialsrequest/credentialsrequest_controller.go#L436-L457
So this code was unnecessary to begin with.

Comment on lines 485 to +493
validate: func(t *testing.T, c client.Client) {
cr := getCredRequest(t, c)
s := &corev1.Secret{}
// secret should be deleted
assert.Error(t, c.Get(context.TODO(),
types.NamespacedName{Name: cr.Spec.SecretRef.Name, Namespace: cr.Spec.SecretRef.Namespace},
s),

targetSecret := getCredRequestTargetSecret(t, c, cr)

rootSecret := getRootSecret(t, c)

// The targetSecret should be a copy of the root secret.
assertSecretEquality(t, rootSecret, targetSecret)
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check for OrphanedCloudResource condition here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition is set by the credentials_request controller, so the best we can do with the test at the actuator-level is check for the right error returned by the actuator. (Although I should go ahead and add a test for the condition in the credentials_request controller. I'll work on that now).

@joelddiaz
Copy link
Contributor Author

@akhil-rane i fixed up the code where you suggested and added a whole new set of test cases at the credentialsrequest_controller level to cover the new OrphanedCloudResource condition. PTAL

@joelddiaz
Copy link
Contributor Author

/test e2e-azure

@joelddiaz
Copy link
Contributor Author

/test verify

1 similar comment
@joelddiaz
Copy link
Contributor Author

/test verify

@joelddiaz
Copy link
Contributor Author

/test e2e-azure-upgrade

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 16, 2021

@joelddiaz: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test coverage
  • /test e2e-aws
  • /test e2e-upgrade
  • /test images
  • /test unit
  • /test verify
  • /test verify-deps

The following commands are available to trigger optional jobs:

  • /test e2e-aws-manual-oidc
  • /test e2e-azure
  • /test e2e-gcp
  • /test e2e-gcp-manual-oidc
  • /test e2e-openstack
  • /test e2e-openstack-parallel

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-cloud-credential-operator-master-coverage
  • pull-ci-openshift-cloud-credential-operator-master-e2e-aws
  • pull-ci-openshift-cloud-credential-operator-master-e2e-upgrade
  • pull-ci-openshift-cloud-credential-operator-master-images
  • pull-ci-openshift-cloud-credential-operator-master-unit
  • pull-ci-openshift-cloud-credential-operator-master-verify
  • pull-ci-openshift-cloud-credential-operator-master-verify-deps
Details

In response to this:

/test e2e-azure-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@joelddiaz
Copy link
Contributor Author

/test e2e-azure-upgrade

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 16, 2021

@joelddiaz: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test coverage
  • /test e2e-aws
  • /test e2e-upgrade
  • /test images
  • /test unit
  • /test verify
  • /test verify-deps

The following commands are available to trigger optional jobs:

  • /test e2e-aws-manual-oidc
  • /test e2e-azure
  • /test e2e-gcp
  • /test e2e-gcp-manual-oidc
  • /test e2e-openstack
  • /test e2e-openstack-parallel

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-cloud-credential-operator-master-coverage
  • pull-ci-openshift-cloud-credential-operator-master-e2e-aws
  • pull-ci-openshift-cloud-credential-operator-master-e2e-upgrade
  • pull-ci-openshift-cloud-credential-operator-master-images
  • pull-ci-openshift-cloud-credential-operator-master-unit
  • pull-ci-openshift-cloud-credential-operator-master-verify
  • pull-ci-openshift-cloud-credential-operator-master-verify-deps
Details

In response to this:

/test e2e-azure-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@joelddiaz
Copy link
Contributor Author

/test verify

@akhil-rane
Copy link
Contributor

this looks good

@joelddiaz
Copy link
Contributor Author

@akhil-rane more changes to pull in an updated build-machinery-go package (that was failing the 'make verify' step).
Let me know when this looks okay, and I'll squash to make the PR look more reasonable.

@joelddiaz
Copy link
Contributor Author

/test e2e-azure-upgrade

@akhil-rane
Copy link
Contributor

yeah we can squash here, all good

Joel Diaz added 5 commits December 16, 2021 14:57
failing on installing yaml-patch binary otherwise
Stop trying to detect whether the creds in kube-system/azure-credentials
are good enough for Minting new credentials. We now will only support
Manual mode (where the annotator does nothing) and Passthrough mode
where we will blindly annotate the Secret as 'passthrough'.
Update the Azure actuator to only support passthrough mode. Attempt to
clean up previously created App Registrations / Service Principals, but
treat failures to clean up as non-critical.

In the event that we fail to clean up, set a new "OrphanedCloudResource"
condition to document that we were unable to clean up.

When successfully cleaning up, clear out the old AzureStatus fields.

Add test cases covering the new OrphanedCloudResources condition.
@joelddiaz
Copy link
Contributor Author

/hold waiting for #426

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 16, 2021
@lwan-wanglin
Copy link

@joelddiaz Here are the test cases for the card https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitems?query=trello%3ACCO%5C-173&sorting=id, PTAL! thanks. It's a little hard to reproduce the OrphanedCloudResource condition.

And here I have some questions:

  1. If we need to remove azure mint support from validation part on installer code ?https://github.com/openshift/installer/blob/02c23e64691b499d9cae9d534892e0424469b1b2/pkg/types/validation/installconfig.go#L610
  2. If we need to doc things somewhere like "make sure .spec.credentialsMode of cloudcredential CR is not "Mint" before upgrade your cluster to 4.10" , otherwise the upgrade will fail, cco logs will show the below error
time="2021-12-17T07:31:13Z" level=error msg="error checking whether credentials already exists: invalid mode" controller=credreq cr=openshift-cloud-credential-operator/openshift-cloud-network-config-controller secret=openshift-cloud-network-config-controller/cloud-credentials
time="2021-12-17T07:31:17Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/azure-disk-csi-driver-operator

@joelddiaz
Copy link
Contributor Author

/test e2e-azure

@joelddiaz
Copy link
Contributor Author

@joelddiaz Here are the test cases for the card https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitems?query=trello%3ACCO%5C-173&sorting=id, PTAL! thanks. It's a little hard to reproduce the OrphanedCloudResource condition.

Yes, this could be tricky. Perhaps you install an old version of OpenShift with Mint mode, disable CCO (either by marking it unmanaged in CVO and scaling down CVO, or move CCO to Manual mode), then swap the credentials in kube-system/azure-credentials with ones that can't do Mint mode, then upgrade to the new OpenShift/CCO, re-enable CCO, and it should migrate to Passthrough but fail to clean up the previously minted App Registrations.

And here I have some questions:

1. If we need to remove azure mint support from validation part on installer code ?https://github.com/openshift/installer/blob/02c23e64691b499d9cae9d534892e0424469b1b2/pkg/types/validation/installconfig.go#L610

Yes, good point. I'll add a Jira under the installer team.

2. If we need to doc things somewhere like "make sure .spec.credentialsMode of cloudcredential CR is not "Mint" before upgrade your cluster to 4.10" , otherwise the upgrade will fail, cco logs will show the below error

time="2021-12-17T07:31:13Z" level=error msg="error checking whether credentials already exists: invalid mode" controller=credreq cr=openshift-cloud-credential-operator/openshift-cloud-network-config-controller secret=openshift-cloud-network-config-controller/cloud-credentials
time="2021-12-17T07:31:17Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/azure-disk-csi-driver-operator

Yes, the docs will need some careful updates. cc @jeana-redhat

@joelddiaz
Copy link
Contributor Author

@joelddiaz Here are the test cases for the card https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitems?query=trello%3ACCO%5C-173&sorting=id, PTAL! thanks. It's a little hard to reproduce the OrphanedCloudResource condition.

Oh, and the test cases look good!

@joelddiaz
Copy link
Contributor Author

/test e2e-azure

@sferich888
Copy link

/label px-approved

@openshift-ci openshift-ci bot added the px-approved Signifies that Product Support has signed off on this PR label Dec 17, 2021
@joelddiaz
Copy link
Contributor Author

@lwan-wanglin @jeana-redhat anything else needed before getting the qe/docs approve labels?
@akhil-rane anything else needed before an lgtm?

@joelddiaz
Copy link
Contributor Author

/test e2e-azure

@akhil-rane
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 17, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 17, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: akhil-rane, joelddiaz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [akhil-rane,joelddiaz]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@joelddiaz
Copy link
Contributor Author

/unhold
#246 has merged

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 17, 2021
@lwan-wanglin
Copy link

lwan-wanglin commented Dec 20, 2021

@joelddiaz Thanks for your given steps , I can reproduce deleting SP failure case, and I filed a new test case to cover it . I missed one check(verify SPs are deleted from Azure cloud) on previous test cases, now I update this step to the test cases, but I meet a situation, I find cco just remove role assignment, but not deleting applications, is it expected? if not , please let me know, I will file a bug to track.

For this PR, I think we can add qe-approved label.
/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Dec 20, 2021
@joelddiaz
Copy link
Contributor Author

I find cco just remove role assignment, but not deleting applications, is it expected? if not , please let me know, I will file a bug to track.

If you are seeing that the App Registration are not being deleted, that is not expected. Do open a BZ.

@jeana-redhat
Copy link

Will def be able to doc this, just need to know when each piece is landing. I know y'all will keep me updated on that :)
/label docs-approved

@openshift-ci openshift-ci bot added the docs-approved Signifies that Docs has signed off on this PR label Dec 20, 2021
@joelddiaz joelddiaz mentioned this pull request Dec 20, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 20, 2021

@joelddiaz: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 9906c6a into openshift:master Dec 20, 2021
@akhil-rane
Copy link
Contributor

/cherry-pick release-4.9

@openshift-cherrypick-robot

@akhil-rane: #433 failed to apply on top of branch "release-4.9":

Applying: update azure-sdk-for-go
.git/rebase-apply/patch:381: new blank line at EOF.
+
.git/rebase-apply/patch:1651: new blank line at EOF.
+
.git/rebase-apply/patch:4266: new blank line at EOF.
+
warning: 3 lines add whitespace errors.
Using index info to reconstruct a base tree...
M	go.mod
M	go.sum
M	vendor/modules.txt
Falling back to patching base and 3-way merge...
Auto-merging vendor/modules.txt
CONFLICT (content): Merge conflict in vendor/modules.txt
Removing vendor/github.com/Azure/azure-sdk-for-go/NOTICE
Removing vendor/github.com/Azure/azure-sdk-for-go/LICENSE
Auto-merging go.sum
CONFLICT (content): Merge conflict in go.sum
Auto-merging go.mod
CONFLICT (content): Merge conflict in go.mod
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 update azure-sdk-for-go
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherry-pick release-4.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. docs-approved Signifies that Docs has signed off on this PR lgtm Indicates that a PR is ready to be merged. px-approved Signifies that Product Support has signed off on this PR qe-approved Signifies that QE has signed off on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants