Enable bound SA tokens #718

openshift-merge-robot merged 4 commits into openshift:master from marun:bound-sa-tokens
Conversation
/retest
```yaml
serviceAccountIssuer: auth.openshift.io
apiAudiences:
- auth.openshift.io
serviceAccountSigningKeyFile: /etc/kubernetes/static-pod-certs/secrets/service-account-signing-key/service-account.key
```
Isn't there a need to override this for bootstrapping? Or do we get this key from the render step that early?
The intent is not to enable bound tokens in the bootstrapping phase, since, per a discussion on the enhancement, there does not appear to be a need for that. Will it be necessary to set these values in code that can detect that bootstrapping is complete, rather than here?
I think we're ok not having bound tokens available from the bootstrap kubeapiserver. we asked in the enhancement.
My question was more whether this path is a valid one during bootstrap or whether the process dies if not.
Ah, ok. Will test tomorrow. Maybe the operator will have to detect that it is past the bootstrap phase and compose the config accordingly in code?
There are two override yaml files, one for the bootstrap phase, one for after. Just set sensible values for each.
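The two-file split described above could look roughly like this; the values are taken from the config shown earlier in this PR, while the file naming and split are only a sketch:

```yaml
# Post-bootstrap override (sketch): enable bound token issuance.
serviceAccountIssuer: auth.openshift.io
apiAudiences:
- auth.openshift.io
serviceAccountSigningKeyFile: /etc/kubernetes/static-pod-certs/secrets/service-account-signing-key/service-account.key
```

The bootstrap-phase override would simply omit these keys, since bound tokens are not enabled during bootstrap.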
```go
	c.queue.AddAfter(workQueueKey, readyInterval+10*time.Second)
}

certConfigMap, err := c.configMapClient.ConfigMaps(targetNamespace).Get(CertConfigMapName, metav1.GetOptions{})
```
What happens if the secret is changed, but the configmap is not?
If it's the operator secret that is changing, that's not a problem, because it is not used directly by the apiserver instances. In the case of the operand secret, notice that promotion occurs only after the configmap has been successfully updated.
Hopefully the functional separation makes my intent clear?
```go
// Giving time for apiserver instances to pick up the change in public keys before
// changing the private key minimizes the potential for one or more apiservers to
// issue tokens signed by the new private key that apiservers without the
// corresponding public key are unable to validate.
```
So we lack back-pressure again here, right? If the rolling update is blocked for some reason, we might get into trouble.
Is this still a problem now that actual state is considered rather than just waiting for a random amount of time?
Will read the new code. Looking at the deployed state should be enough.
```go
// corresponding public key are unable to validate.
//
// TODO(marun) Find a more accurate indication that all apiservers are capable of
// validating tokens signed by the new private key.
```
In the etcd encryption code, we wait until all API servers have settled on the same revision, and that there is no new pending revision. Maybe that approach works here too?
+1 for having the CM revisioned.
Are the service account token private and public keys read dynamically, since we're just swapping them here without a redeployment?
Can you point me to the etcd code in question? And should there be a corresponding update to the kcmo token controller?
re: dynamic key reads - afaict it's enough to just update the resource. Any changes to the resources/config that influence the state of apiserver pods prompt a redeployment of a single pod and only if that redeployment is successful will the change be rolled out to all pods.
For this code you're even ok to avoid demanding a stable level; you just need each revision on the nodes to include the cert for your key.
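The per-node revision check suggested above can be sketched with stand-in types; `nodeStatus` is a simplified stand-in for the per-node status tracked in kubeapiservers.operator.openshift.io, and the set of revisions containing the key is assumed to be computed elsewhere by inspecting the revisioned configmaps:

```go
package main

import "fmt"

// nodeStatus is a simplified stand-in for the operator's per-node status.
type nodeStatus struct {
	CurrentRevision int32
}

// allRevisionsIncludeKey reports whether every node is running a revision
// whose configmaps include the new public key.
func allRevisionsIncludeKey(nodes []nodeStatus, revisionsWithKey map[int32]bool) bool {
	for _, n := range nodes {
		if !revisionsWithKey[n.CurrentRevision] {
			return false
		}
	}
	return true
}

func main() {
	nodes := []nodeStatus{{CurrentRevision: 7}, {CurrentRevision: 8}}
	fmt.Println(allRevisionsIncludeKey(nodes, map[int32]bool{7: true, 8: true})) // true
	fmt.Println(allRevisionsIncludeKey(nodes, map[int32]bool{8: true}))          // false
}
```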
```yaml
serviceAccountPublicKeyFiles:
- /etc/kubernetes/static-pod-resources/configmaps/sa-token-signing-certs
- /etc/kubernetes/static-pod-certs/configmaps/bound-sa-token-signing-certs
serviceAccountIssuer: auth.openshift.io
```
This should be mentioned as a default value in openshift/api#569.
Ok, it wasn't clear to me that a value that wasn't set by default at the API level should be documented for the API type. Done.
```go
TokenReadyAnnotation = "kube-apiserver.openshift.io/ready-to-use"
readyInterval        = 5 * time.Minute

CertConfigMapName = "bound-sa-token-signing-certs"
```
nit: there will never be actual certs in this CM, will there?
No, but I'm being consistent with the name of the controller-manager equivalent (sa-tokens-signing-certs) which has effectively the same content.
```go
}
needKeypair := errors.IsNotFound(err) || len(signingSecret.Data[PrivateKeyKey]) == 0 || len(signingSecret.Data[PublicKeyKey]) == 0
if needKeypair {
	newSecret, err := newSigningSecret()
```
While ApplySecret has certain output, a human-friendly log line might be helpful here
```go
go wait.Until(func() {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()

	for {
		c.queue.Add(workQueueKey)
		select {
		case <-ticker.C:
		case <-stopCh:
			return
		}
	}
}, time.Minute, stopCh)
```
Is this correct? That inner ticker looks quite unnecessary given that wait.Until has its own internal timer which does the same; this basically runs wait.Until and freezes it on its first loop.
I think you're right - it should be sufficient to call func() { c.queue.Add(workQueueKey) } as the argument to wait.Until. Updated.
```yaml
serviceAccountPublicKeyFiles:
- /etc/kubernetes/static-pod-resources/configmaps/sa-token-signing-certs
- /etc/kubernetes/static-pod-certs/configmaps/bound-sa-token-signing-certs
serviceAccountIssuer: auth.openshift.io
```
please pass as a flag, not as a struct value. Same with all these values.
You're looking for apiServerArguments above; I'll add comments to help future me, who won't remember this as well.
```go
}
needKeypair := errors.IsNotFound(err) || len(signingSecret.Data[PrivateKeyKey]) == 0 || len(signingSecret.Data[PublicKeyKey]) == 0
if needKeypair {
	klog.Infof("Creating a new signing secret for bound service account tokens.")
```
if it's important enough for an info message, it's important enough for an event. If it isn't important enough for an event, then it belongs at a lower level.
Which would you prefer - an event, or a lower-level log?
```go
// did not have the latest public key would not be able to
// validate those new tokens.
TokenReadyAnnotation = "kube-apiserver.openshift.io/ready-to-use"
readyInterval        = 5 * time.Minute
```
You have the ability to inspect the revisions to figure out which ones contains the key in question. You can then inspect the levels on the nodes (the kubeapiservers.operator.openshift.io) to know if the nodes actually have the levels required. No need for time.
```go
// passed (see comment above the ready interval constant). Do not return
// immediately to ensure that the new public key can be set in the configmap
// in advance of promotion.
c.queue.AddAfter(workQueueKey, readyInterval+10*time.Second)
```
if you make your decision based on levels of the configmap actually on the nodes and trigger based on updates to kubeapiserver.operator.openshift.io, you don't need to have this delay.
```go
	return ret
}

func (c *BoundSATokenSignerController) sync() error {
```
Please break this function into logical bits with no data passed between them. I see:
- create a secret if needed; this depends on nothing outside the secrets
- update the configmap if needed; this can retrieve the current secret. A stale lister will always get another event notification, so staleness of a cache doesn't matter, and this can then be contained.
- promoting a key; this is based on the current secret in the operator namespace, the current secret in the operand namespace, the revisions on the nodes, and the content of the configmaps on those nodes, all of which can be cached.

If any step has an error, the other steps should still be run.
stlaz left a comment:

looking better, found some issues
```go
	return nil
}

// ensureOperandSigningSecret ensures that the current signing key is
```
```go
hasValue := false
for _, value := range configMap.Data {
	if value == desiredValue {
		hasValue = true
		break
	}
}
return hasValue
```

```go
	return false, err
}
found := false
for _, value := range configMap.Data {
```
```go
err := syncMethod()
if err != nil {
	utilruntime.HandleError(err)
	syncFailed = true
```
Usually we aggregate the errors in a slice and then return NewAggregate(errs).
I'd prefer to log the error raw to simplify troubleshooting. afaict wrapping errors in any way (including NewAggregate) changes the file+line that is logged. Is that ok, or do you require that I use NewAggregate?
Updated with fixup, PTAL

/retest
```go
	return ret
}

func (c *BoundSATokenSignerController) sync() bool {
```
the inverse is more standard: true = ok
Done. All the more important given that the call site was assuming true = ok too.
```go
// ensureOperatorSigningSecret ensures the existence of a secret in the operator
// namespace containing an RSA keypair used for signing and validating bound service
// account tokens.
func (c *BoundSATokenSignerController) ensureOperatorSigningSecret() error {
```
ensureNextOperatorSigningSecret
```go
}

// newSigningSecret creates a new secret populated with a new keypair.
func newSigningSecret() (*corev1.Secret, error) {
```
```go
// minimize the potential for not being able to validate issued tokens.
nextKeyIndex := len(configMap.Data) + 1
nextKeyKey := ""
for len(nextKeyKey) == 0 {
```
Is the len comparison even needed with the break?
```go
} else {
	// Update the operand secret only if the current public key has been synced to
	// all nodes.
	syncRequired, err = c.publicKeySyncedToAllNodes(currPublicKey)
```
```yaml
- /etc/kubernetes/static-pod-resources/configmaps/sa-token-signing-certs
# The following path contains the public keys needed to verify bound sa
# tokens. This is only supported post-bootstrap.
- /etc/kubernetes/static-pod-resources/configmaps/bound-sa-token-signing-certs
```
just copy these into the overrides file
Which overrides file? afaict the only ones available are for bootstrap. Would you prefer that I add another post-bootstrap override file rather than adding this feature-specific one?
This file has been renamed to config-overrides.yaml as requested.
```go
"config.yaml",
specialMergeRules,
defaultConfig,
boundSATokenConfig,
```
didn't we have the overrides yaml here as well?
afaict the overrides yaml files are only for bootstrap. I don't know why there are 2 files for bootstrap; I do know that putting these options in either of the bootstrap overrides will break bootstrapping, because the bound token keypair is only created post-bootstrap.
I've posted a new PR to document the requirement to put post-bootstrap overrides in the renamed file: #731
@sttts Updated, PTAL
This controller is modeled after the one that manages key material for legacy sa tokens in cluster-kube-controller-manager-operator, but is simplified by not having to enable bound sa tokens during bootstrap.
@sttts Updated, PTAL
/test all
@damemi I've added the
test/e2e/operator_test.go
```diff
 totalRevisionLimit := operatorSpec.SucceededRevisionLimit + operatorSpec.FailedRevisionLimit
 if operatorSpec.SucceededRevisionLimit == 0 {
-	totalRevisionLimit += 5
+	totalRevisionLimit += 10
```
For a follow-up: @damemi you wrote this original test. Why was this 5? Magic number?
I now get the +=5, I believe (it's the default value?). But why does the loop below error the test immediately instead of waiting for the revision pruning controller to do its work?
The addition of bound token configuration - specifically keypair management - appears to be adding to the number of revisions just enough to break the test. The test needs to be updated to allow the pruning controller time to work.
```go
// Check total+1 to account for possibly a current new revision that just hasn't pruned off the oldest one yet.
if len(newRevisions) > int(totalRevisionLimit)+1 {
	t.Errorf("more revisions (%v) than total allowed (%v): %+v", len(revisions), totalRevisionLimit, revisions)
	// TODO(marun) If number of revisions has been exceeded, need to give time for the pruning controller to
```
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marun, sttts
/retest Please review the full test history for this PR and help us cut down flakes.
Implements operator support for openshift/enhancements#150
TODO:
- TokenRequestAPI