K8s server and agent nodeattestor plugin #557
Conversation
Force-pushed from 3c385ed to fabc04a
doc/plugin_agent_nodeattestor_k8s.md
Outdated
to access the Kubernetes API Server and submit a Certificate Signing Request (CSR).
The credentials consist of a private key and a certificate stored on disks.
It also needs a root certificate to validate the TLS certificate presented
ny the Kubernetes API Server.
by the Kubernetes...
Fixed
doc/plugin_agent_nodeattestor_k8s.md
Outdated
In order to retrieve the identity document, the plugin needs user credentials
to access the Kubernetes API Server and submit a Certificate Signing Request (CSR).
The credentials consist of a private key and a certificate stored on disks.
stored on disk.
Fixed
func (p *K8sPlugin) setConfig(c *K8sConfig) {
	p.m.Lock()
	defer p.m.Unlock()
	p.c = c
this should reset p.kubeClient so that it can be recreated with a potentially changed config
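A minimal sketch of that reset, reusing the field names from the snippet above (p.kubeClient assumed to be the cached client):

```
func (p *K8sPlugin) setConfig(c *K8sConfig) {
	p.m.Lock()
	defer p.m.Unlock()
	p.c = c
	// Drop the cached client so the next attestation rebuilds it
	// with the (potentially changed) configuration.
	p.kubeClient = nil
}
```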
done
return nil, errors.New("k8s: not configured")
}

if p.kubeClient == nil {
This is racy in the face of more than one attestation happening at a time. I would suggest moving the client initialization into the Configure call. That does two things: 1) it fails the Configure call if bad paths are provided, and 2) it lets you provide the client as part of the configuration, which is retrieved under a lock (i.e. getConfig).
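For illustration, a rough sketch of that shape, using a plain kubeconfig-based clientset; the types and helper wiring here are simplified stand-ins, not the plugin's actual interfaces:

```
import (
	"fmt"
	"sync"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Illustrative types only; the real plugin has its own config struct and plugin RPCs.
type k8sConfig struct {
	kubeClient kubernetes.Interface
}

type k8sPlugin struct {
	m sync.Mutex
	c *k8sConfig
}

// Building the client during Configure surfaces bad paths immediately and lets
// attestation retrieve a ready-made client under the lock via getConfig.
func (p *k8sPlugin) Configure(kubeconfigPath string) error {
	restConfig, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		return fmt.Errorf("k8s: unable to load kubeconfig: %v", err)
	}
	client, err := kubernetes.NewForConfig(restConfig)
	if err != nil {
		return fmt.Errorf("k8s: unable to create client: %v", err)
	}
	p.setConfig(&k8sConfig{kubeClient: client})
	return nil
}

func (p *k8sPlugin) setConfig(c *k8sConfig) {
	p.m.Lock()
	defer p.m.Unlock()
	p.c = c
}

func (p *k8sPlugin) getConfig() *k8sConfig {
	p.m.Lock()
	defer p.m.Unlock()
	return p.c
}
```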
done
Thanks for this work @enricoschiattarella!! Left a few comments, curious to hear your thoughts on them
doc/plugin_agent_nodeattestor_k8s.md
Outdated
and is in the form:

```
spiffe://<trust domain>/spire/agent/k8s/system:node:<host name>
```
The `:` character is defined as reserved in RFC 3986. SPIFFE IDs must conform with this RFC. It states that if reserved characters are used outside of their special meaning(s), then they must be percent-encoded.
I recommend simply avoiding the use of `:` altogether.
Right. I liked the idea of carrying-over opaquely from Kubernetes RBAC encoding but I 100% agree that this is looking for trouble. If it looks ok, I will change it to "/spire/agent/k8s/system/node/hostname".
The reason to maintain the "system/node" part is that if we ever use Kubernetes to attest anything else using the certs api, there will be no conflicts.
doc/plugin_agent_nodeattestor_k8s.md
Outdated
| `k8s_private_key_path` | The path to the private key on disk (PEM encoded PKCS1 or PKCS8) used to authenticate the agent to the Kubernetes API Server | |
| `k8s_certificate_path` | The path to the certificate bundle on disk. Used to authenticate the agent to the Kubernetes API Server | |
| `k8s_ca_certificate_path` | The root certificate used to validate the certificate presented by the Kubernetes API Server | |
| `kubeconfig_path` | Optional. The path to the kubeconfig file containing Kubernetes cluster access information | |
Why is this optional, and what do I get if I specify it?
There is a "standard" sequence to determine the location kubeconfig, if not provided explicitly.
First try env variable KUBECONFIG, then try home directory, then return error.
It is the sequence followed by kubectl.
I will add an explanation.
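For reference, the lookup order described above amounts to something like this (the helper name is illustrative):

```
import (
	"errors"
	"os"
	"path"
)

// resolveKubeconfigPath shows the kubectl-style lookup order:
// explicit config, then $KUBECONFIG, then $HOME/.kube/config.
func resolveKubeconfigPath(configuredPath string) (string, error) {
	if configuredPath != "" {
		return configuredPath, nil
	}
	if envPath := os.Getenv("KUBECONFIG"); envPath != "" {
		return envPath, nil
	}
	if home := os.Getenv("HOME"); home != "" {
		return path.Join(home, ".kube", "config"), nil
	}
	return "", errors.New("k8s: unable to locate kubeconfig")
}
```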
func getAgentName() string {
	name, err := os.Hostname()
	if err != nil {
		name = "unknown"
There is an assumption in SPIRE that all agents can be uniquely addressed... as such, I think it would be best to raise an error here instead of setting "unknown", since it seems that this name is ultimately used in the agent's SPIFFE ID? It also brings up questions about hosts that might have the same hostname. Curious to hear your thoughts on this in the context of a Kubernetes cluster.
More info here #558
In a Kubernetes environment we can rely on each node (and pod) having a unique IP address because essentially NAT is prohibited, so if the trust domain coincides with the Kubernetes cluster, we could use the IP address.
https://kubernetes.io/docs/concepts/cluster-administration/networking/
If the trust domain includes multiple Kubernetes clusters, then I think the best we can do is to include a fingerprint of the CA certificate associated to the kubernetes cluster.
Something like "spire/agent/k8s/cluster/XXXX/system:node:".
As a last resort, we can completely forget about the hostname and use a fingerprint of the cert, same as the x509pop attestation does.
I'm not sure what the current thinking is on the question you referred to.
Agent naming uniqueness is a nice property...
Human-readable SVIDs are nice too...
I will be happy to implement whatever you guys believe is the best overall solution.
We discussed this topic on the SIG-Kubernetes call today. We concluded that it is probably OK for the resulting ID to be less human friendly than it could be. It is likely that a SPIRE server will manage multiple kubernetes clusters, so including the thumbprint of the CA seems like a good idea. The proposed path including thumbprint and node name is 👍.
Server side node attestors can optionally kick back selectors which can be used to group agents together. In this case, it would be awesome if we provide a selector for CA thumbprint, e.g. `k8s:ca_thumbprint:XXXXX`. Doing so would allow users to create SPIFFE IDs which address entire clusters, allowing them to register workloads to run in one cluster or another.
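To make the thumbprint idea concrete, a sketch of how the agent ID and selector could be derived; the SHA-256 choice and the colon-free path layout (per the earlier RFC 3986 comment) are assumptions, not settled decisions:

```
import (
	"crypto/sha256"
	"crypto/x509"
	"fmt"
	"net/url"
)

// agentID scopes the node name to a cluster, identified by a fingerprint of
// the cluster CA certificate, so multiple clusters in one trust domain cannot collide.
func agentID(trustDomain, nodeName string, caCert *x509.Certificate) string {
	thumbprint := fmt.Sprintf("%x", sha256.Sum256(caCert.Raw))
	u := url.URL{
		Scheme: "spiffe",
		Host:   trustDomain,
		Path:   fmt.Sprintf("/spire/agent/k8s/cluster/%s/system/node/%s", thumbprint, nodeName),
	}
	return u.String()
}

// caThumbprintSelector is the cluster-wide selector mentioned above.
func caThumbprintSelector(caCert *x509.Certificate) string {
	return fmt.Sprintf("k8s:ca_thumbprint:%x", sha256.Sum256(caCert.Raw))
}
```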
// Still no luck, try default (home)
home := os.Getenv("HOME")
if home != "" {
	kubeConfigFilePath = path.Join(home, ".kube", "config")
It would be good if we can document this search behavior
Will do
return nil, fmt.Errorf("Error creating private key: %v", err)
}

certsIntf := kubeClient.CertificatesV1beta1().CertificateSigningRequests()
I noticed the `beta1` suffix here - how stable is this API? Do we expect it to be the same a year or more from now?
The feature that depends on this API is going GA in the next release, so in the long term the mechanics of node attestation should be stable.
The API itself will not graduate yet, though, so minor changes are possible.
This is the issue for tracking and discussion:
}

certsIntf := kubeClient.CertificatesV1beta1().CertificateSigningRequests()
cert, err := csr.RequestNodeCertificate(certsIntf, key, types.NodeName(getAgentName()))
Does the CSR signing process in kubernetes do anything to assert the correct value in NodeName? In other words, could I get kubernetes to sign a CSR with the name of a different node here?
Right now it does not. It just checks that CN conforms to the expected format and key usage is what is expected. There are plans to strengthen this and support proof-of-ownership, TPMs, etc.
Options for improving over the baseline are:
- Disable automatic CSR approval, rely on Kubernetes admin to verify the content of the CSR manually before approving it
- Create a custom CSR signer that will perform extra checks/challenges
- K8s attestor in SPIRE server checks the IP address of the agent connection and compares it against what is in the Kubernetes certificate
I think all options are doable but IMO, unless this is a showstopper, the best course of action is to live with the limitation for the time being and leverage the improvements of the Kubernetes node attestation mechanism as they become available.
We discussed this topic on the SIG-Kubernetes call today. The biggest drawback here is that we are essentially asserting the ability of the caller to get a CSR approved, and not asserting anything about the actual identity of the caller. In certain situations, this can result in a non-agent workload being able to obtain the identities of all other workloads in the cluster.
To mitigate this, we should try to assert that the caller is in fact an agent. We discussed using a service account token to do this, configuring the server node attestor with the namespace and service account that the agent runs as. This token can either 1) be used as credentials to submit the CSR, and verified by the server, or 2) passed directly to the server.
Taking this approach gives stronger assurance that the caller we are attesting is an agent, however it still allows for a dishonest agent (or anything that can run as the named service account) to claim that it is a different node/agent. To mitigate this, we could use the kubelet identity to submit the CSR rather than the service account. The server attestor can then verify that the kubelet identity which submitted the CSR matches the SPIFFE ID we are about to issue. IMO, this is the ideal approach, and provides the strongest level of attestation of all the options. @enricoschiattarella has mentioned that obtaining the kubelet identity may be non-trivial, but perhaps it warrants some further research.
To avoid the non-agents-could-obtain-arbitrary-identities problem, we should either add the service account token logic, or use the kubelet identity directly for CSR submission, with a preference for the latter. If we choose the service account path, it probably warrants a direct call out in the plugin documentation (perhaps in the form of a Security Considerations section?). SPIRE is known for its ability to strongly attest and issue identity, but some sacrifices are made in SA token approach (namely, that we are not actually asserting the identity of a node, rather we are asserting that the caller lives in a particular service account).
Some related notes:
- Using SA token for CSR submission instead of sending it directly to the server avoids the need to pass a SA token to a system "outside" of the kubernetes control plane
- @mattmoyer informed us that the identity used to submit the CSR is included in the signed certificate - this could be directly extracted by the server attestor for validation, as opposed to having to hit the api server to interrogate the identity of the submitter
- Node identity may be embedded in SA tokens in the future; it is unclear whether the node identity will 1) be included in resulting signed certificates, or 2) be discoverable through interrogation of the API server (looking at the CSR approval machinery). If neither 1 nor 2, we would end up having to pass the token directly to SPIRE anyway.
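For a sense of what the service-account-token route looks like on the agent side, a minimal sketch of reading the token; the default in-cluster mount path is standard Kubernetes, but the helper name and the configurable token path wiring are illustrative:

```
import (
	"fmt"
	"io/ioutil"
	"strings"
)

// defaultTokenPath is where Kubernetes mounts the service account token in a pod.
const defaultTokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"

// loadServiceAccountToken reads the token the agent would hand to the server
// as attestation data.
func loadServiceAccountToken(tokenPath string) (string, error) {
	if tokenPath == "" {
		tokenPath = defaultTokenPath
	}
	data, err := ioutil.ReadFile(tokenPath)
	if err != nil {
		return "", fmt.Errorf("unable to read token from %v: %v", tokenPath, err)
	}
	return strings.TrimSpace(string(data)), nil
}
```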
func (p *K8sPlugin) loadConfigData() (*configData, error) {
	config := p.getConfig()
	if config == nil {
		return nil, errors.New("k8s: not configured")
nit: might be a good idea to be more specific with these errors, like `k8s node attestor: not configured` or something
will do
return newError("not configured")
}

if dataType := req.AttestationData.Type; dataType != pluginName {
I think we need to ensure that `req.AttestationData` is not nil?
will add the check
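For illustration, the guard could be a fragment along these lines, using the newError helper seen above:

```
if req.AttestationData == nil {
	// reject requests that omit attestation data entirely
	return newError("request missing attestation data")
}
```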
}

attestationData := new(x509pop.AttestationData)
if err := json.Unmarshal(req.AttestationData.Data, attestationData); err != nil {
Same comment here - what if `req.AttestationData` is nil?
will fix
}

// receive and validate the challenge response
responseReq, err := stream.Recv()
Hmmm... I am curious if this can be abused by a client that never sends the challenge response... or triggered by clients which crash before being able to do so? Perhaps this `Recv` should be guarded with a select block on context cancellation?
@azdagron does the streaming shim have a default timeout on the context here? Is this a problem elsewhere?
The `Recv()` call should return with an error in the case of context cancellation. You bring up a good point though. We should probably define timeouts for all plugin related calls. I don't think there is anything specific to do in the plugin code.
Force-pushed from ecc354e to 8b9b152
After discussion, we came to consensus that for now using the service account token was good enough and likely to get better if more identity components end up included in the service account token (like node name, for example). I've rebased this on top of current master and pushed a commit that transitions the code to use the service account token instead.
The `k8s_sat` plugin attests nodes running in inside of Kubernetes. The server
validates the signed service account token provided by the agent. It extracts
the service account name and namespace from the token claims. The server uses a
one-time UUID provided by the agent for generate a SPIFFE ID with the form:
"by the agent to generate"
👍
The SPIFFE ID with the form:

```
spiffe://<trust domain>/spire/agent/k8s_sat/<UUID>
```
should cluster name be included here?
Will do.
In addition, this plugin generates the following selectors of type `k8s_sat` for each node:

| Value | Example | Description |
The notion that selectors are typed is something that is not necessarily exposed to users. For instance, the registration CLI takes `-selector k8s_sat:cluster-name:MyCluster`. In the interest of simplicity, I'd lean towards s/Value/Selector/ and include the `k8s_sat` type prefix.
👍
token, err := loadTokenFromFile(config.tokenPath)
if err != nil {
	return satError.New("unable to get token value: %v", err)
nit: perhaps `"unable to read token from %v: %v", config.tokenPath, err` or something similar will provide more information?
👍
func verifyTokenSignature(clusters []*clusterConfig, token *jwt.JSONWebToken) (cluster *clusterConfig, claims *k8s.SATClaims, err error) {
	var lastErr error
	for _, cluster := range clusters {
if we make users configure the cluster name on the agent attestor, maybe we could avoid this bit
👍 good call
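A sketch of what that could look like if the attestation data carries the cluster name; the map keyed by cluster name and the serviceAccountKey field are assumptions about how the config might be reshaped:

```
// verifyTokenSignature looks up exactly one cluster config by name instead of
// trying each configured cluster in turn.
func verifyTokenSignature(clusters map[string]*clusterConfig, clusterName string, token *jwt.JSONWebToken) (*k8s.SATClaims, error) {
	cluster, ok := clusters[clusterName]
	if !ok {
		return nil, fmt.Errorf("k8s_sat: not configured for cluster %q", clusterName)
	}
	claims := new(k8s.SATClaims)
	if err := token.Claims(cluster.serviceAccountKey, claims); err != nil {
		return nil, fmt.Errorf("k8s_sat: unable to verify token: %v", err)
	}
	return claims, nil
}
```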
p.config = config
}

func verifyTokenSignature(clusters []*clusterConfig, token *jwt.JSONWebToken) (cluster *clusterConfig, claims *k8s.SATClaims, err error) {
Do we need to check that `alg` is correctly set somewhere in here?
verifiers in square's go-jose library make sure that the algorithm in the token lines up with the verification key that is passed. we're good here.
Selectors: []*common.Selector{
	makeSelector("cluster:name", cluster.name),
	makeSelector("service-account:namespace", claims.Namespace),
	makeSelector("service-account:name", claims.ServiceAccountName),
any chance we could simplify this by dropping the suffixes?
k8s_sat:cluster:foo
k8s_sat:service-account:bar
k8s_sat:namespace:baz
👍
BaseSPIFFEID: k8s.AgentID(config.trustDomain, attestationData.UUID),
Selectors: []*common.Selector{
	makeSelector("cluster:name", cluster.name),
	makeSelector("service-account:namespace", claims.Namespace),
the plugin name (and hence the beginning of all selectors) is `k8s_sat`, which makes me lean towards snake case for service account too: `k8s_sat:service_account`
👍
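Combining both naming suggestions, the emitted selectors would end up roughly like this (fragment, assuming makeSelector keeps its current signature and the k8s_sat type prefix is added elsewhere):

```
Selectors: []*common.Selector{
	makeSelector("cluster", cluster.name),
	makeSelector("namespace", claims.Namespace),
	makeSelector("service_account", claims.ServiceAccountName),
},
```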
Signed-off-by: Enrico Schiattarella <[email protected]>
Signed-off-by: Andrew Harding <[email protected]>
Signed-off-by: Andrew Harding <[email protected]>
Force-pushed from dcc81fd to f72dd3a
Signed-off-by: Andrew Harding <[email protected]>
🚴 🌾
Pull Request check list
Affected functionality
Description of change
Provides a new node attestation plugin for server and agent aimed at Kubernetes environments.
The agent plugin uses the Kubernetes Certificates API to submit a CSR and have it signed by the Kubernetes API Server, which signs it with its CA and returns the certificate to the agent.
The agent presents the certificate to the server as an identity document.
The server verifies that the agent actually possesses the corresponding private key and returns an SVID if validation succeeds.
NOTE: The change in glide.yaml (moving the package github.com/stretchr/testify from the testImport to the import section) was done to work around a glide issue. Without that change, glide would report an error: "Failed to generate lock file: Generating lock produced conflicting versions of github.com/stretchr/testify. import (~1.1.3), testImport (~1.1.0)"
There are a couple of open glide issues related to this:
Masterminds/glide#597
Masterminds/glide#564
Another way to work around it would be to not use glide for k8s.io/client-go dependency but do go get ... from Makefile/build script