@sanchezl sanchezl commented May 13, 2024

On Hypershift, the image pull secrets were created, but not properly initialized due to the image_pull_secret_controller getting stuck waiting for the existence of the bound-service-account-signing-key secret in the openshift-kube-apiserver namespace.

Ideally, Hypershift would mount the service account signing key as a volume on the OCM pod, matching what we already do for the KCM pod. This PR provides an OCM-only fix to get us going until we change the way Hypershift runs OCM.

Since all we need is the hash of the service account signing public key, if OCM does not find the bound-service-account-signing-key secret on startup, it will create a throwaway API token and extract the hash of the service account signing public key from that token.
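As an illustration of the fallback described above, the following Go sketch reads a key ID out of a token's header without verifying the signature. It assumes the hash is exposed as the `kid` field of the JWT header, which the description does not state explicitly. `keyIDFromToken` and `fakeToken` are hypothetical names, and the token built here is an unsigned stand-in, not a real API token.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// keyIDFromToken returns the "kid" field from the header of a compact JWT.
// It only decodes the header; it does not verify the signature.
func keyIDFromToken(token string) (string, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return "", errors.New("token is not a three-part compact JWT")
	}
	headerJSON, err := base64.RawURLEncoding.DecodeString(parts[0])
	if err != nil {
		return "", fmt.Errorf("decoding header: %w", err)
	}
	var header struct {
		KeyID string `json:"kid"`
	}
	if err := json.Unmarshal(headerJSON, &header); err != nil {
		return "", fmt.Errorf("parsing header: %w", err)
	}
	if header.KeyID == "" {
		return "", errors.New("token header has no kid")
	}
	return header.KeyID, nil
}

// fakeToken builds an unsigned, illustrative token carrying the given kid.
// Hypothetical helper: a real token would come from the API server.
func fakeToken(kid string) string {
	header, _ := json.Marshal(map[string]string{"alg": "RS256", "kid": kid})
	enc := base64.RawURLEncoding.EncodeToString
	return enc(header) + "." + enc([]byte(`{}`)) + "." + enc([]byte("sig"))
}

func main() {
	kid, err := keyIDFromToken(fakeToken("example-key-id"))
	fmt.Println(kid, err)
}
```

Because only the header is inspected, the token never has to be used for authentication and can be discarded right away.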

Currently, on Hypershift, the service account signing public key is provided as a CLI option. OCM would need to be restarted if the service account signing public key changes.

@openshift-ci openshift-ci bot requested review from bparees and soltysh May 13, 2024 18:36
@sanchezl sanchezl changed the title from "revert revert" to "OCPBUGS-33600: revert revert" May 13, 2024
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 13, 2024
@openshift-ci-robot
Contributor

@sanchezl: This pull request references Jira Issue OCPBUGS-33600, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

  • internal registry image pull secret controllers
  • add image pull secret to mountable secrets
  • fallback to alternate method of obtaining token signing key hash

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sanchezl
Contributor Author

/payload-job periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn-conformance

@openshift-ci
Contributor

openshift-ci bot commented May 13, 2024

@sanchezl: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8ad79e00-117f-11ef-8080-22dfebaceee0-0

@sanchezl
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 13, 2024
@openshift-ci-robot
Contributor

@sanchezl: This pull request references Jira Issue OCPBUGS-33600, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh


@sanchezl
Contributor Author

/retest-required

1 similar comment
@sanchezl
Contributor Author

/retest-required

@sanchezl sanchezl changed the title from "OCPBUGS-33600: revert revert" to "OCPBUGS-33600: revert revert and fix for hypershift" May 14, 2024
@openshift-ci-robot
Contributor

@sanchezl: This pull request references Jira Issue OCPBUGS-33600, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.


Contributor

@soltysh soltysh left a comment


/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 14, 2024
@soltysh
Contributor

soltysh commented May 14, 2024

/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label May 14, 2024
Comment on lines 115 to 122
if err != nil && !errors.IsNotFound(err) {
	// signing key secret exists, skip
	return err
}
Contributor

Is it safe to make any assumptions one way or the other about the secret's existence based on this information?

Contributor Author

We shouldn't reach here unless the cache has synced. If the secret should exist, but doesn't at this point, I don't see the harm in performing the fallback here, and the regular control loop will eventually pick up the secret and override if needed.

Contributor

I think there are 3 cases:

  1. The secret exists (err == nil) and the fallback is performed. The regular control loop should eventually catch up.
  2. The secret does not exist (errors.IsNotFound(err)) and the fallback is performed. Even if the secret is later created, the regular control loop takes over.
  3. Some different error occurred, and the fallback is not performed. This seems like a problem on a cluster that will never have the signing key secret.
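One way to make the three cases explicit is to map the lookup error to a separate action for each. The Go sketch below shows the behaviour the thread converges on, not the merged code. `errNotFound` and `errTransient` are hypothetical stand-ins for a Kubernetes NotFound API error and any other failure, and `decide` and the `action` type are illustrative names.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the errors a secret lookup might return.
var (
	errNotFound  = errors.New("secret not found")
	errTransient = errors.New("connection refused")
)

// action names what the caller should do after looking up the secret.
type action string

const (
	leaveToSync action = "leave to regular sync"
	runFallback action = "run fallback"
	retryLater  action = "retry later"
)

// decide maps the lookup error to one of the three cases from the review.
func decide(getErr error) action {
	switch {
	case getErr == nil:
		// Case 1: the secret exists, so the regular control loop handles it.
		return leaveToSync
	case errors.Is(getErr, errNotFound):
		// Case 2: the secret is definitely absent, so the fallback is needed.
		return runFallback
	default:
		// Case 3: existence is unknown, so neither assumption is safe yet.
		return retryLater
	}
}

func main() {
	for _, err := range []error{nil, errNotFound, errTransient} {
		fmt.Printf("%v -> %s\n", err, decide(err))
	}
}
```

Keeping the unknown case separate avoids treating a transient lookup failure as proof that the secret exists or is missing.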

Contributor Author

I see it now. Fixed.

}

// create a throwaway API token
expirationSeconds := int64(10 * time.Minute / time.Second)
Contributor

Is this arbitrary? How short can the requested expiration time be?

Contributor Author

This is the minimum allowed.
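For reference, the expression under review evaluates to 600 seconds. The Go sketch below reproduces the arithmetic in isolation; `tokenLifetime` and `expirationSeconds` (as a function) are illustrative names and do not appear in the PR.

```go
package main

import (
	"fmt"
	"time"
)

// tokenLifetime is the lifetime discussed in the review thread.
const tokenLifetime = 10 * time.Minute

// expirationSeconds converts a duration to whole seconds, using the same
// integer division as the reviewed line.
func expirationSeconds(d time.Duration) int64 {
	return int64(d / time.Second)
}

func main() {
	fmt.Println(expirationSeconds(tokenLifetime)) // prints 600
}
```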

Comment on lines 153 to 155
if err := c.fallbackKeyIDObservation(ctx); err != nil {
	runtime.HandleError(err)
}
Contributor

I see in the description "OCM would need to be restarted if the service account signing public key changes," but what if this one-shot attempt fails with a transient error? It sounds like it would not recover on its own.

Does it make sense to do the fallback as part of the controller sync if and only if the signing key secret does not exist?

Contributor Author

The controller sync is triggered by that one secret, which wouldn't exist on Hypershift to trigger the sync, so I don't think I can add the fallback to the controller sync.

I can add a few retries on err.

@sanchezl sanchezl force-pushed the revert-revert branch 3 times, most recently from 6f6ef67 to 1b30c4e Compare May 14, 2024 17:34
Comment on lines 116 to 126
if err != nil && !errors.IsNotFound(err) {
	// signing key secret exists, skip and let sync handle
	return true, nil
}
if err != nil {
	runtime.HandleError(err)
	return false, nil
}
Contributor

Does this ever need to return true (and stop)?

Suggested change
- if err != nil && !errors.IsNotFound(err) {
- 	// signing key secret exists, skip and let sync handle
- 	return true, nil
- }
- if err != nil {
- 	runtime.HandleError(err)
- 	return false, nil
- }
+ if err == nil {
+ 	// signing key secret exists, skip and let sync handle
+ 	return false, nil
+ } else if !errors.IsNotFound(err) {
+ 	// error other than notfound, signing key secret may or may not exist
+ 	runtime.HandleError(err)
+ 	return false, nil
+ }

Contributor Author

If the secret does not exist, return true and stop the inner polling loop. runWorker() itself is running in an UntilContext loop.
If there is an error, the inner loop retries every 10 seconds for a minute.
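The retry behaviour described here can be sketched with the standard library alone. `pollUntil` below is a hypothetical stand-in for whichever polling helper the PR actually uses, and the demo shortens the 10-second interval and one-minute timeout to milliseconds so it finishes quickly. As in the snippet under review, the condition function returns true to stop and false to be retried.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// conditionFunc reports done=true to stop polling, or a non-nil error to abort.
type conditionFunc func(ctx context.Context) (done bool, err error)

// pollUntil calls cond immediately and then once per interval until cond
// reports done, cond returns an error, or the timeout elapses.
func pollUntil(ctx context.Context, interval, timeout time.Duration, cond conditionFunc) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := cond(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// retryDemo simulates a condition that fails transiently `failures` times
// before succeeding, and reports how many attempts were made.
func retryDemo(failures int) (attempts int, err error) {
	err = pollUntil(context.Background(), 10*time.Millisecond, 200*time.Millisecond,
		func(ctx context.Context) (bool, error) {
			attempts++
			// Returning (false, nil) signals "not done yet, try again".
			return attempts > failures, nil
		})
	return attempts, err
}

func main() {
	attempts, err := retryDemo(2)
	fmt.Println(attempts, err)
}
```

If the condition never succeeds, `pollUntil` returns the context's deadline error once the timeout passes, so a caller running inside an outer loop can log it and try again on the next iteration.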

@benluddy
Contributor

/lgtm
/hold

@sanchezl Do you want to re-run the hypershift payload job?

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 14, 2024
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 14, 2024
@openshift-ci
Contributor

openshift-ci bot commented May 14, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benluddy, sanchezl, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci bot commented May 14, 2024

@sanchezl: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name          Commit   Details  Required  Rerun command
ci/prow/security   5bcef36  link     false     /test security

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@benluddy
Contributor

/payload-job periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn-conformance

@openshift-ci
Contributor

openshift-ci bot commented May 14, 2024

@benluddy: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/058004d0-1234-11ef-9c79-b53ffec4c3e7-0

@sanchezl
Contributor Author

/payload 4.16 ci blocking

@openshift-ci
Contributor

openshift-ci bot commented May 14, 2024

@sanchezl: trigger 5 job(s) of type blocking for the ci release of OCP 4.16

  • periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-e2e-aws-sdn-serial
  • periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/39b901c0-1234-11ef-905a-e5452c9fb821-0

@sanchezl
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 15, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 7637f1a into openshift:master May 15, 2024
@openshift-ci-robot
Contributor

@sanchezl: Jira Issue OCPBUGS-33600: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-33600 has been moved to the MODIFIED state.


@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-openshift-controller-manager-container-v4.17.0-202405151441.p0.g7637f1a.assembly.stream.el9 for distgit ose-openshift-controller-manager.
All builds following this will include this PR.

@openshift-merge-robot
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-05-16-092402


Labels

  • acknowledge-critical-fixes-only: Indicates if the issuer of the label is OK with the policy.
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.


6 participants