Enable bound SA tokens #718

Merged
openshift-merge-robot merged 4 commits into openshift:master from marun:bound-sa-tokens on Jan 24, 2020

Contributor

marun commented Jan 16, 2020

Implements operator support for openshift/enhancements#150

TODO

  • Add key management controller and test coverage of same
  • Ensure keys are configured in the apiserver pods
  • Ensure configuration of issuer and audience
  • Unskip test coverage for the TokenRequest API
  • Ensure e2e is passing consistently

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 16, 2020
Contributor Author

marun commented Jan 16, 2020

/retest

serviceAccountIssuer: auth.openshift.io
apiAudiences:
- auth.openshift.io
serviceAccountSigningKeyFile: /etc/kubernetes/static-pod-certs/secrets/service-account-signing-key/service-account.key
Contributor

Isn't there a need to override this for bootstrapping? Or do we get this key from the render step that early?

Contributor Author

The intent is not to enable bound tokens in the bootstrapping phase, since as per a discussion on the enhancement there does not appear to be a need for that. Will it be necessary to set these values in code that can detect that bootstrapping is complete rather than here?

Contributor

I think we're ok not having bound tokens available from the bootstrap kubeapiserver. We asked in the enhancement.

Contributor

My question was more whether this path is a valid one during bootstrap or whether the process dies if not.

Contributor Author

Ah, ok. Will test tomorrow. Maybe the operator will have to detect that it is past the bootstrap phase and compose the config accordingly in code?

Contributor

There are two override yaml files, one for the bootstrap phase, one for after. Just set sensible values for each.

Contributor Author

Done.

c.queue.AddAfter(workQueueKey, readyInterval+10*time.Second)
}

certConfigMap, err := c.configMapClient.ConfigMaps(targetNamespace).Get(CertConfigMapName, metav1.GetOptions{})
Contributor

what happens if the secret is changed, but the config map is not?

Contributor Author

If it's the operator secret that is changing, that is not a problem because it is not used directly by the apiserver instances. In the case of the operand secret, notice that promotion occurs only after the configmap has been successfully updated.

Contributor Author

Hopefully the functional separation makes my intent clear?

// Giving time for apiserver instances to pick up the change in public keys before
// changing the private key minimizes the potential for one or more apiservers to
// issue tokens signed by the new private key that apiservers without the
// corresponding public key are unable to validate.
Contributor

so we lack back-pressure again here, right? If rolling update is blocked for some reason, we might get into trouble.

Contributor Author

Is this still a problem now that actual state is considered rather than just waiting for a random amount of time?

Contributor

Will read the new code. Looking at the deployed state should be enough.

// corresponding public key are unable to validate.
//
// TODO(marun) Find a more accurate indication that all apiservers are capable of
// validating tokens signed by the new private key.
Contributor

In the etcd encryption code, we wait until all API servers have settled on the same revision, and that there is no new pending revision. Maybe that approach works here too?

Contributor

+1 for having the CM revisioned.

Are the service account token private and public keys read dynamically, since we're just swapping them here without redeployment?

Contributor Author

Can you point me to the etcd code in question? And should there be a corresponding update to the kcmo token controller?

re: dynamic key reads - afaict it's enough to just update the resource. Any changes to the resources/config that influence the state of apiserver pods prompt a redeployment of a single pod and only if that redeployment is successful will the change be rolled out to all pods.

Contributor

For this code, you're even ok to avoid demanding a stable level, you just need each revision on nodes to include the cert for your key.

Contributor Author

Done, PTAL.

serviceAccountPublicKeyFiles:
- /etc/kubernetes/static-pod-resources/configmaps/sa-token-signing-certs
- /etc/kubernetes/static-pod-certs/configmaps/bound-sa-token-signing-certs
serviceAccountIssuer: auth.openshift.io
Contributor

this should be mentioned as a default value in openshift/api#569

Contributor Author

Ok, it wasn't clear to me that a value that wasn't set by default at the API level should be documented for the API type. Done.

Contributor Author

Done.

TokenReadyAnnotation = "kube-apiserver.openshift.io/ready-to-use"
readyInterval = 5 * time.Minute

CertConfigMapName = "bound-sa-token-signing-certs"
Contributor

nit: there will never be actual certs in this CM, will they?

Contributor Author

No, but I'm being consistent with the name of the controller-manager equivalent (sa-tokens-signing-certs) which has effectively the same content.

Contributor

interesting, ok

}
needKeypair := errors.IsNotFound(err) || len(signingSecret.Data[PrivateKeyKey]) == 0 || len(signingSecret.Data[PublicKeyKey]) == 0
if needKeypair {
newSecret, err := newSigningSecret()
Contributor

While ApplySecret has certain output, a human-friendly log line might be helpful here

Contributor Author

Done

Comment on lines +203 to +328
go wait.Until(func() {
ticker := time.NewTicker(time.Minute)
defer ticker.Stop()

for {
c.queue.Add(workQueueKey)
select {
case <-ticker.C:
case <-stopCh:
return
}
}

}, time.Minute, stopCh)
Contributor

Is this correct? That inner ticker looks quite unnecessary given that wait.Until has its own internal timer which does the same. This basically runs wait.Until and freezes it on its first loop?

Contributor Author

I think you're right - it should be sufficient to pass func() {c.queue.Add(workQueueKey)} as the argument to wait.Until. Updated.

Contributor Author

Done
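
Editor's sketch of the simplification agreed in this thread, using only the standard library. `until` is a hypothetical stand-in for `wait.Until` (it calls the function immediately and then once per period until the stop channel closes), and the counter stands in for `c.queue.Add(workQueueKey)`; neither is the code that was merged.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// until is a hypothetical, stdlib-only stand-in for wait.Until: it calls f
// immediately and then once per period until stopCh is closed.
func until(f func(), period time.Duration, stopCh <-chan struct{}) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-stopCh:
			return
		default:
		}
		f()
		select {
		case <-ticker.C:
		case <-stopCh:
			return
		}
	}
}

// runPeriodicEnqueue enqueues once per period with no inner ticker loop in
// the callback. It returns how many enqueue calls happened within runFor.
func runPeriodicEnqueue(period, runFor time.Duration) int64 {
	var enqueued int64
	stopCh := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		// The callback does one unit of work; the periodic loop owns the timing.
		until(func() { atomic.AddInt64(&enqueued, 1) }, period, stopCh)
	}()
	time.Sleep(runFor)
	close(stopCh)
	<-done
	return atomic.LoadInt64(&enqueued)
}

func main() {
	count := runPeriodicEnqueue(10*time.Millisecond, 55*time.Millisecond)
	fmt.Println("enqueue calls:", count)
}
```

With the inner `for`/`select` loop removed, the callback returns after each enqueue, so the outer loop's own timer is what drives the resync.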

serviceAccountPublicKeyFiles:
- /etc/kubernetes/static-pod-resources/configmaps/sa-token-signing-certs
- /etc/kubernetes/static-pod-certs/configmaps/bound-sa-token-signing-certs
serviceAccountIssuer: auth.openshift.io
Contributor

please pass as a flag, not as a struct value. Same with all these values.

Contributor

You're looking for apiServerArguments above. Please also add comments to help future me, who won't remember this as well.

Contributor Author

Done

}
needKeypair := errors.IsNotFound(err) || len(signingSecret.Data[PrivateKeyKey]) == 0 || len(signingSecret.Data[PublicKeyKey]) == 0
if needKeypair {
klog.Infof("Creating a new signing secret for bound service account tokens.")
Contributor

if it's important enough for an info message, it's important enough for an event. If it isn't important enough for an event, then it belongs at a lower level.

Contributor Author

Which would you prefer - an event or a lower-level log?

Contributor Author

Done

// did not have the latest public key would not be able to
// validate those new tokens.
TokenReadyAnnotation = "kube-apiserver.openshift.io/ready-to-use"
readyInterval = 5 * time.Minute
Contributor

You have the ability to inspect the revisions to figure out which ones contain the key in question. You can then inspect the levels on the nodes (the kubeapiservers.operator.openshift.io) to know if the nodes actually have the levels required. No need for time.

Contributor Author

Done
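
Editor's sketch of the state-based check suggested in this thread, as opposed to waiting for `readyInterval`. The `nodeStatus` type and the `revisionKeys` map are hypothetical simplifications: in the operator this information would come from the kubeapiservers.operator.openshift.io resource and the revisioned configmaps, whose exact shapes are not shown in this conversation.

```go
package main

import "fmt"

// nodeStatus is a hypothetical, simplified view of per-node operator status:
// the revision each node is currently running.
type nodeStatus struct {
	NodeName        string
	CurrentRevision int32
}

// publicKeySyncedToAllNodes reports whether every node runs a revision whose
// configmap snapshot contains publicKey. revisionKeys maps a revision number
// to the data of the (hypothetical) revisioned public key configmap.
func publicKeySyncedToAllNodes(nodes []nodeStatus, revisionKeys map[int32]map[string]string, publicKey string) bool {
	if len(nodes) == 0 {
		// Nothing is deployed yet, so treat promotion as not yet safe.
		return false
	}
	for _, node := range nodes {
		found := false
		for _, value := range revisionKeys[node.CurrentRevision] {
			if value == publicKey {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	nodes := []nodeStatus{
		{NodeName: "master-0", CurrentRevision: 3},
		{NodeName: "master-1", CurrentRevision: 2},
	}
	revisionKeys := map[int32]map[string]string{
		2: {"key-001": "old-public-key"},
		3: {"key-001": "old-public-key", "key-002": "new-public-key"},
	}
	// master-1 is still on revision 2, which lacks the new key.
	fmt.Println(publicKeySyncedToAllNodes(nodes, revisionKeys, "new-public-key"))
}
```

Because the decision depends only on observed state, a stalled rollout simply keeps the check false rather than letting a timer expire.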

// passed (see comment above the ready interval constant). Do not return
// immediately to ensure that the new public key can be set in the configmap
// in advance of promotion.
c.queue.AddAfter(workQueueKey, readyInterval+10*time.Second)
Contributor

if you make your decision based on levels of the configmap actually on the nodes and trigger based on updates to kubeapiserver.operator.openshift.io, you don't need to have this delay.

Contributor Author

Done

return ret
}

func (c *BoundSATokenSignerController) sync() error {
Contributor

please break this function into logical bits with no data passed between them. I see

  1. create a secret if needed. depends on nothing outside the secrets
  2. update configmap if needed. This can retrieve the current secret. A stale lister will always get another event notification, so staleness of a cache doesn't matter and this can then be contained.
  3. promoting a key. This is based on the current secret in the operator namespace, the current secret in the operand namespace, the revisions on the nodes, the content of the configmaps on those nodes. All of which can be cached.

If any step has an error, the other steps should still be run.

Contributor Author

Done
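
Editor's sketch of the structure requested in this thread: independent steps that share no data, each of which runs even if an earlier one fails. The `step` type, `syncAll`, `errDemo`, and the step names are hypothetical; the real controller methods and their signatures are not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// errDemo is a placeholder failure used only to exercise the sketch.
var errDemo = errors.New("configmap update failed")

// step is one self-contained unit of sync work. Steps share no data; each
// one re-reads whatever state it needs.
type step struct {
	name string
	run  func() error
}

// syncAll runs every step, even after a failure, hands each error to report
// unmodified, and returns true only when all steps succeeded.
func syncAll(steps []step, report func(name string, err error)) bool {
	ok := true
	for _, s := range steps {
		if err := s.run(); err != nil {
			report(s.name, err)
			ok = false
		}
	}
	return ok
}

func main() {
	steps := []step{
		{name: "ensure next operator signing secret", run: func() error { return nil }},
		{name: "ensure public key configmap", run: func() error { return errDemo }},
		{name: "ensure operand signing secret", run: func() error { return nil }},
	}
	ok := syncAll(steps, func(name string, err error) {
		fmt.Printf("%s: %v\n", name, err)
	})
	fmt.Println("all steps succeeded:", ok)
}
```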

@marun marun changed the title WIP Enable bound SA tokens Enable bound SA tokens Jan 21, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 21, 2020
Contributor

stlaz left a comment

looking better, found some issues

return nil
}

// ensureOperandSigningSecret ensures that the current signing key is
Contributor

missing a verb here

Contributor Author

Done

break
}
}
return hasValue
Contributor

return false

Contributor Author

Done

hasValue := false
for _, value := range configMap.Data {
if value == desiredValue {
hasValue = true
Contributor

return true

Contributor Author

Done
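
Editor's sketch of the early-return form suggested in these two comments. The helper name `configMapHasValue` comes from the review; taking a plain `map[string]string` instead of a ConfigMap object is an assumption made to keep the example self-contained.

```go
package main

import "fmt"

// configMapHasValue reports whether any key in data maps to desiredValue.
// data stands in for the Data field of a ConfigMap.
func configMapHasValue(data map[string]string, desiredValue string) bool {
	for _, value := range data {
		if value == desiredValue {
			return true // found it, no flag variable or break needed
		}
	}
	return false
}

func main() {
	data := map[string]string{"key-001": "public-key-a", "key-002": "public-key-b"}
	fmt.Println(configMapHasValue(data, "public-key-b"))
	fmt.Println(configMapHasValue(data, "public-key-c"))
}
```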

return false, err
}
found := false
for _, value := range configMap.Data {
Contributor

configMapHasValue ?

Contributor Author

Done

err := syncMethod()
if err != nil {
utilruntime.HandleError(err)
syncFailed = true
Contributor

usually we aggregate the errors in a slice and then return NewAggregate(errs).

Contributor Author

I'd prefer to log the error raw to simplify troubleshooting. afaict wrapping errors in any way (including NewAggregate) changes the file+line that is logged. Is that ok, or do you require that I use NewAggregate?
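
Editor's sketch of the aggregation pattern the reviewer describes, for comparison with logging each error as it occurs. The `aggregate` type and `newAggregate` below are hypothetical stdlib-only stand-ins written for illustration, not the library's NewAggregate; dropping nil entries and returning nil for an empty list are assumptions of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Placeholder errors used only to exercise the sketch.
var (
	errFirst  = errors.New("first")
	errSecond = errors.New("second")
)

// aggregate is a hypothetical stand-in for an aggregate error type: a list of
// errors that is itself an error.
type aggregate []error

// Error joins the individual messages into one bracketed string.
func (a aggregate) Error() string {
	msgs := make([]string, 0, len(a))
	for _, err := range a {
		msgs = append(msgs, err.Error())
	}
	return "[" + strings.Join(msgs, ", ") + "]"
}

// newAggregate drops nil entries and returns nil when nothing remains, so
// callers can return its result directly.
func newAggregate(errs []error) error {
	var kept aggregate
	for _, err := range errs {
		if err != nil {
			kept = append(kept, err)
		}
	}
	if len(kept) == 0 {
		return nil
	}
	return kept
}

func main() {
	var errs []error
	for _, syncMethod := range []func() error{
		func() error { return errFirst },
		func() error { return nil },
		func() error { return errSecond },
	} {
		// Collect instead of logging immediately; every method still runs.
		errs = append(errs, syncMethod())
	}
	fmt.Println(newAggregate(errs))
}
```

The trade-off raised in the reply still applies: once errors are collected and reported in one place, the log line points at the reporting site rather than at the step that failed.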

Contributor Author

marun commented Jan 21, 2020

Updated with fixup, PTAL

Contributor Author

marun commented Jan 22, 2020

/retest

Contributor

stlaz commented Jan 22, 2020

/retest

Contributor

stlaz commented Jan 22, 2020

/retest

return ret
}

func (c *BoundSATokenSignerController) sync() bool {
Contributor

the inverse is more standard: true = ok

Contributor Author

Done. All the more important given that the call site was assuming true = ok too.

// ensureOperatorSigningSecret ensures the existence of a secret in the operator
// namespace containing an RSA keypair used for signing and validating bound service
// account tokens.
func (c *BoundSATokenSignerController) ensureOperatorSigningSecret() error {
Contributor

ensureNextOperatorSigningSecret

Contributor Author

Done

}

// newSigningSecret creates a new secret populated with a new keypair.
func newSigningSecret() (*corev1.Secret, error) {
Contributor

newNextSigningSecret

Contributor Author

Done
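
Editor's sketch of what "a new secret populated with a new keypair" could hold, using only the standard library. The data key names, the PEM block types, and the 2048-bit size are assumptions for illustration; the PR's actual encoding choices are not shown in this conversation, and the function returns a plain map rather than a `corev1.Secret`.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"encoding/pem"
	"fmt"
)

// Hypothetical data keys for the secret contents.
const (
	privateKeyKey = "service-account.key"
	publicKeyKey  = "service-account.pub"
)

// newKeypairData generates an RSA keypair and returns the PEM-encoded private
// and public keys, keyed the way a secret's data map might be.
func newKeypairData(bits int) (map[string][]byte, error) {
	key, err := rsa.GenerateKey(rand.Reader, bits)
	if err != nil {
		return nil, err
	}
	privPEM := pem.EncodeToMemory(&pem.Block{
		Type:  "RSA PRIVATE KEY",
		Bytes: x509.MarshalPKCS1PrivateKey(key),
	})
	pubDER, err := x509.MarshalPKIXPublicKey(&key.PublicKey)
	if err != nil {
		return nil, err
	}
	pubPEM := pem.EncodeToMemory(&pem.Block{Type: "PUBLIC KEY", Bytes: pubDER})
	return map[string][]byte{privateKeyKey: privPEM, publicKeyKey: pubPEM}, nil
}

func main() {
	data, err := newKeypairData(2048)
	if err != nil {
		fmt.Println("keypair generation failed:", err)
		return
	}
	fmt.Println("private key bytes:", len(data[privateKeyKey]) > 0)
	fmt.Println("public key bytes:", len(data[publicKeyKey]) > 0)
}
```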

// minimize the potential for not being able to validate issued tokens.
nextKeyIndex := len(configMap.Data) + 1
nextKeyKey := ""
for len(nextKeyKey) == 0 {
Contributor

is the len comparison even needed with the break?

Contributor Author

Done
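
Editor's sketch of the loop without the redundant length check: start at `len(configMap.Data) + 1` and return as soon as an unused name is found. The key name format is hypothetical, and a plain map stands in for `configMap.Data`.

```go
package main

import "fmt"

// nextKeyName returns the first unused key name in data, starting the search
// at index len(data)+1. The name format is a hypothetical choice.
func nextKeyName(data map[string]string) string {
	for index := len(data) + 1; ; index++ {
		candidate := fmt.Sprintf("service-account-%03d.pub", index)
		if _, exists := data[candidate]; !exists {
			// Returning here replaces both the loop condition and the break.
			return candidate
		}
	}
}

func main() {
	data := map[string]string{"service-account-002.pub": "public-key-a"}
	fmt.Println(nextKeyName(data))
}
```

The loop always terminates, because a map with n entries cannot occupy every candidate name from n+1 through 2n+1.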

} else {
// Update the operand secret only if the current public key has been synced to
// all nodes.
syncRequired, err = c.publicKeySyncedToAllNodes(currPublicKey)
Contributor

s/syncRequired/syncAllowed/

Contributor Author

Done

- /etc/kubernetes/static-pod-resources/configmaps/sa-token-signing-certs
# The following path contains the public keys needed to verify bound sa
# tokens. This is only supported post-bootstrap.
- /etc/kubernetes/static-pod-resources/configmaps/bound-sa-token-signing-certs
Contributor

just copy these into the overrides file

Contributor Author

Which overrides file? afaict the only ones available are for bootstrap. Would you prefer that I add another post-bootstrap override file rather than adding this feature-specific one?

Contributor

Contributor Author

This file has been renamed to config-overrides.yaml as requested.

"config.yaml",
specialMergeRules,
defaultConfig,
boundSATokenConfig,
Contributor

didn't we have the overrides yaml here as well?

Contributor Author

afaict the overrides yaml are only for bootstrap:

https://github.com/openshift/cluster-kube-apiserver-operator/tree/master/bindata/bootkube/config

I don't know why there are 2 files for bootstrap. I do know that putting these options in either of the bootstrap overrides will break bootstrapping because the bound token keypair is only created post-bootstrap.

Contributor

Contributor Author

I've posted a new PR to document the requirement to put post-bootstrap overrides in the renamed file: #731

Contributor Author

marun commented Jan 23, 2020

@sttts Updated, PTAL

marun and others added 3 commits January 23, 2020 16:03
This controller is modeled after the one that manages key material for
legacy sa tokens in cluster-kube-controller-manager-operator, but is
simplified by not having to enable bound sa tokens during bootstrap.
Contributor Author

marun commented Jan 24, 2020

@sttts Updated, PTAL

Contributor Author

marun commented Jan 24, 2020

/test all

Contributor Author

marun commented Jan 24, 2020

@damemi I've added the "DO NOT MERGE Bump revision limit in TestRevisionLimit" commit to bump the apparently magic revision limit in TestRevisionLimit. This PR is likely increasing the number of revisions in the process of coordinating the keypair configuration required to enable bound sa tokens. I'd appreciate your input as to whether changing the magic number is an acceptable workaround or if more involved changes are required.

totalRevisionLimit := operatorSpec.SucceededRevisionLimit + operatorSpec.FailedRevisionLimit
if operatorSpec.SucceededRevisionLimit == 0 {
- totalRevisionLimit += 5
+ totalRevisionLimit += 10
Contributor

For a follow-up: @damemi you wrote this original test. Why was this 5? Magic number?

Contributor

I now get the +=5, I believe (it's the default value?). But why does the loop below fail the test immediately instead of waiting for the revision pruning controller to do its work?

The addition of bound token configuration - specifically keypair
management - appears to be adding to the number of revisions just
enough to break the test. The test needs to be updated to allow the
pruning controller time to work.
// Check total+1 to account for possibly a current new revision that just hasn't pruned off the oldest one yet.
if len(newRevisions) > int(totalRevisionLimit)+1 {
t.Errorf("more revisions (%v) than total allowed (%v): %+v", len(revisions), totalRevisionLimit, revisions)
// TODO(marun) If number of revisions has been exceeded, need to give time for the pruning controller to
Contributor

cc @damemi
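
Editor's sketch of the follow-up raised in this thread: poll for the revision count to drop under the limit instead of failing on the first check. `pollUntil`, the simulated pruning, and the numbers are hypothetical; this is not the change that was made to TestRevisionLimit.

```go
package main

import (
	"fmt"
	"time"
)

// pollUntil calls cond every interval until it returns true or the timeout
// elapses. It reports whether cond succeeded in time.
func pollUntil(interval, timeout time.Duration, cond func() bool) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	revisions := 9           // hypothetical current revision count
	totalRevisionLimit := 6  // hypothetical limit
	withinLimit := pollUntil(time.Millisecond, time.Second, func() bool {
		// Simulate the pruning controller removing one revision per check.
		if revisions > totalRevisionLimit+1 {
			revisions--
		}
		return revisions <= totalRevisionLimit+1
	})
	fmt.Println("within limit after waiting:", withinLimit)
}
```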

Contributor

sttts commented Jan 24, 2020

/lgtm
/approve

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 24, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marun, sttts

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 24, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 303d32d into openshift:master Jan 24, 2020
deads2k added a commit to deads2k/cluster-kube-apiserver-operator that referenced this pull request Jan 27, 2020
deads2k added a commit to deads2k/cluster-kube-apiserver-operator that referenced this pull request Jan 27, 2020
@marun marun deleted the bound-sa-tokens branch February 4, 2020 14:33
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

7 participants