spire-agent: allow automatic re-bootstrap #4624

Open
sorindumitru opened this issue Oct 31, 2023 · 4 comments · May be fixed by #5892
Labels
help wanted Issues with this label are ready to start work but are in need of someone to do it priority/backlog Issue is approved and in the backlog

Comments

@sorindumitru
Collaborator

If an agent is disconnected from spire-server for a long period of time, it may end up in a situation where the bundle it saved on disk is out of date. The agent is then unable to connect to spire-server because the certificate the server presents cannot be validated using the outdated bundle the agent has:

{"error":"create attestation client: failed to dial dns:///spire-server:14511: context deadline exceeded: connection error: desc = \"transport: authentication handshake failed: x509svid: could not verify leaf certificate: x509: certificate signed by unknown authority (possibly because of \\\"x509: ECDSA verification failure\\\" while trying to verify candidate authority certificate \\\"spireat-server\\\")\"","file":"run.go:207","func":"github.com/spiffe/spire/cmd/spire-agent/cli/run.(*Command).Run","level":"error","msg":"Agent crashed","time":"2023-10-31T06:42:52-04:00"}

Even if spire-agent has access to an up-to-date bundle from other sources (e.g. a config map that spire-server maintains and that was used for the initial bootstrap), it will not use it as long as it has a bundle saved. The operator has to intervene and delete the saved bundle to get the agent to reconnect. I think this was a deliberate design choice, but maybe it doesn't always make sense.

In a lot of cases the initial bootstrap is automated, e.g. the config map example above, so it likely wouldn't make much of a difference if we allowed the agent to re-bootstrap. It would definitely be more convenient for the operator, since in some environments it's difficult to get the access required to remove the saved bundle.

This could be configurable and logs and metrics would need to be emitted to make it possible for operators to be notified of the re-bootstrap.
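A rough sketch of what a configurable, observable re-bootstrap decision could look like. Everything named here is an assumption made for illustration: `Config.AllowRebootstrap`, `ErrUntrustedServer`, `Metrics`, and `nextBundleSource` are hypothetical and are not existing SPIRE settings or APIs.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Config holds a hypothetical agent setting; SPIRE does not define it.
type Config struct {
	AllowRebootstrap bool
}

// ErrUntrustedServer stands in for the "certificate signed by unknown
// authority" handshake failure shown in the log above.
var ErrUntrustedServer = errors.New("server certificate not trusted by cached bundle")

// Metrics is a minimal stand-in for a real metrics sink.
type Metrics struct {
	Rebootstraps int
}

// BundleSource names which bundle the agent should trust on its next attempt.
type BundleSource string

const (
	CachedBundle    BundleSource = "cached"
	BootstrapBundle BundleSource = "bootstrap"
)

// nextBundleSource keeps the cached bundle unless the dial failed because the
// server could not be verified AND the operator opted in to re-bootstrapping.
// Every re-bootstrap is counted and logged so operators can be notified.
func nextBundleSource(cfg Config, dialErr error, m *Metrics) (BundleSource, error) {
	if dialErr == nil {
		return CachedBundle, nil
	}
	if !errors.Is(dialErr, ErrUntrustedServer) || !cfg.AllowRebootstrap {
		return "", dialErr
	}
	m.Rebootstraps++
	log.Printf("cached bundle rejected the server certificate; re-bootstrapping: %v", dialErr)
	return BootstrapBundle, nil
}

func main() {
	m := &Metrics{}
	dialErr := fmt.Errorf("create attestation client: %w", ErrUntrustedServer)
	src, err := nextBundleSource(Config{AllowRebootstrap: true}, dialErr, m)
	fmt.Println(src, err, m.Rebootstraps)
}
```

With `AllowRebootstrap` left at its zero value the function returns the original error, which matches today's behavior where the operator has to step in.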

  • Version: any
  • Platform: any
  • Subsystem: agent
@evan2645 evan2645 added the triage/in-progress Issue triage is in progress label Oct 31, 2023
@MarcosDY
Collaborator

MarcosDY commented Nov 1, 2023

This can happen for 2 main reasons:
1- Agent is down for some time and didn't get SVID updates
2- Server is forced to create a new bundle, and agent lost communication.

In both cases we can get into a state where our stored Agent SVID is no longer valid, and we must force a new attestation.
Using the current Agent SVID will result in

could not verify leaf certificate: x509: certificate signed by unknown authority

To solve 1, when loading the SVID from disk we can check whether the bundle is expired and, if so, use the configured bundle to force an attestation.

That is not going to solve 2, because it is possible that the bundle is not expired...
A possible solution is to try to communicate with the server presenting our SVID and check whether that succeeds; if it fails, try to attest the agent to get a new SVID.

@sorindumitru
Collaborator Author

2 is what I was thinking of as a fix for this too. We could maybe make use of the sequence number in the bundle to determine that, but maybe that's not the best idea for some cases (e.g. database failures where you'd rebuild the trust domain from scratch).
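To make the sequence-number idea and its weak spot concrete, here is a small sketch. `BundleMeta` and `preferBootstrap` are hypothetical names used only for illustration, and the sequence values are made up.

```go
package main

import "fmt"

// BundleMeta is a hypothetical summary of a trust bundle: where it came from
// and the sequence number it carries.
type BundleMeta struct {
	Source         string
	SequenceNumber uint64
}

// preferBootstrap reports whether the bootstrap bundle looks newer than the
// cached one, judging only by sequence number.
func preferBootstrap(cached, bootstrap BundleMeta) bool {
	return bootstrap.SequenceNumber > cached.SequenceNumber
}

func main() {
	cached := BundleMeta{Source: "disk", SequenceNumber: 40}

	// Normal rotation: the server kept publishing, so the config map is ahead.
	fmt.Println(preferBootstrap(cached, BundleMeta{Source: "configmap", SequenceNumber: 42}))

	// Trust domain rebuilt from scratch: the counter restarted, so the check
	// keeps the cached bundle even though it is the stale one.
	fmt.Println(preferBootstrap(cached, BundleMeta{Source: "configmap", SequenceNumber: 1}))
}
```

The second call is the database-failure case: a comparison on sequence number alone cannot tell "older bundle" apart from "new trust domain whose counter started over".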

@evan2645 evan2645 added help wanted Issues with this label are ready to start work but are in need of someone to do it priority/backlog Issue is approved and in the backlog and removed triage/in-progress Issue triage is in progress labels Dec 7, 2023
@benlongo

benlongo commented Feb 8, 2024

Does this also happen if the trust_bundle_url is https://...? In that case it is safe to refetch the trust bundle arbitrarily

@kfox1111
Contributor

kfox1111 commented Jan 9, 2025

This is one of the last major hurdles I have left to deal with I think...

There are some use cases I have, where it is totally safe to rebootstrap.

Two come to mind immediately:

  1. For the helm chart, agents get their initial trust bundles from a configmap. The spire-server will reupload as needed. If an agent is ever too far out of sync, it can just rebootstrap with that trust bundle. It's up to date.
  2. When used in conjunction with x509pop server plugin support for servers trust bundle #5572, the mechanism to get up to date x509 certs also provides an up to date bootstrap trust bundle.

Do we just need a config flag allowing rebootstrapping in case of an expired trust bundle, or does the logic/config need to be more complex than that?
