Certificate renewal bot#10099
Conversation
This adds a new `tbot` tool to continuously renew a set of certificates after registering with a Teleport cluster using a similar process to standard node joining. This makes some modifications to user certificate generation to allow for certificates that can be renewed beyond their original TTL, and exposes new gRPC endpoints: * `CreateBotJoinToken` creates a join token for a bot user * `GenerateInitialRenewableUserCerts` exchanges a token for a set of certificates with a new `renewable` flag set A new `tctl` command, `tctl bots add`, creates a bot user and calls `CreateBotJoinToken` to issue a token. A bot instance can then be started using a provided command.
* Use role requests to split renewable certs from end-user certs * Add bot configuration file * Use `teleport.dev/bot` label * Remove `impersonator` flag on initial bot certs * Remove unnecessary `renew` package * Misc other cleanup
This adds additional restrictions on when a certificate's `renewable` flag is carried over to a new certificate. In particular, it now also denies the flag when either role requests are present, or the `disallowReissue` flag has been previously set. In practice `disallow-reissue` would have prevented any undesired behavior but this improves consistency and resolves a TODO.
* Fully flesh out config template rendering * Fix rendering for SSH configuration templates * Added `String()` impls for destination types * Improve certificate renewal logging; show more detail * Properly fall back to default (all) roles * Add mode hints for files * Add/update copyright headers
* Add `CreateBot`, `DeleteBot`, and `GetBotUsers` gRPC endpoints * Replace `tctl bot (add|rm|ls)` implementations with gRPC calls * Define a few new constants, `DefaultBotJoinTTL`, `BotLabel`, `BotGenerationLabel`
* Fixed a few nil pointer derefs when using config from CLI args * Properly create destination if `--destination-dir` flag is used * Remove improper default on CLI flag * `DestinationConfig` is now a list of pointers
|
I'm trying to avoid adding more onto this PR directly since it's already gigantic, so I'll keep a list of addon PRs here:
|
nklaassen
left a comment
There was a problem hiding this comment.
did a first pass on this
Fixes the majority of smaller issues caught by reviewers, thanks all!
Issuing initial renewable certificate ended up requiring a lot of hacks to skip checks that prevented anonymous bots from getting certs even though we'd verified their identity elsewhere (via token). This reverts all those hacks and splits initial bot cert logic into a dedicated `generateInitialRenewableUserCerts()` function which should make the whole process much easier to follow.
Users should instead use the CreateBot/DeleteBot endpoints.
|
|
||
| // createBotRole creates a role from a bot template with the given parameters. | ||
| func createBotRole(ctx context.Context, s *Server, botName string, resourceName string, roleRequests []string) error { | ||
| return s.UpsertRole(ctx, &types.RoleV4{ |
There was a problem hiding this comment.
Can you use NewRole which will call CheckAndSetDefaults
There was a problem hiding this comment.
Fixed, looks like Role is missing SetMetadata() so setting our labels is a bit ugly, oh well.
There was a problem hiding this comment.
you could implement SetMetadata(), it's probably just not there because there was no caller
There was a problem hiding this comment.
I've gone ahead and implemented SetMetadata()
Also, add notes about the supported YAML shapes.
|
I think this is about as ready as it's going to get for review, though it's a bit of a monster so apologies in advance to the reviewers. A rough review/merge plan that hopefully makes sense:
|
|
|
||
| .PHONY: $(BUILDDIR)/tbot | ||
| $(BUILDDIR)/tbot: | ||
| GOOS=$(OS) GOARCH=$(ARCH) $(CGOFLAG) go build -tags "$(PAM_TAG) $(FIPS_TAG)" -o $(BUILDDIR)/tbot $(BUILDFLAGS) ./tool/tbot |
There was a problem hiding this comment.
Do you get different behavior if you omit the PAM/FIPS tags? If not, can we do without them?
There was a problem hiding this comment.
I was about to say "we can just drop them" but I suppose it's possible the FIPS tag is honored in some of the teleport libraries we use. PAM is almost certainly unnecessary but I included it since tsh does for some reason as well.
There was a problem hiding this comment.
I've removed the PAM tag. I'm wary of touching the FIPS tag since I'm not sure where it's read (no +build seems to check for it). My gut feeling is that if tsh cares, we should too.
| // DeleteBot deletes a bot user. | ||
| rpc DeleteBot(DeleteBotRequest) returns (google.protobuf.Empty); | ||
| // GetBotUsers gets all users with bot labels. | ||
| rpc GetBotUsers(GetBotUsersRequest) returns (stream types.UserV2); |
There was a problem hiding this comment.
Curious, why does this leverage streaming?
There was a problem hiding this comment.
GetUsers() does and I just cloned its declaration
|
|
||
| func (s *Server) deleteBot(ctx context.Context, botName string) error { | ||
| // TODO: | ||
| // remove any locks for the bot's impersonator role? |
There was a problem hiding this comment.
Do we have a separate issue where we're tracking locking of bots?
There was a problem hiding this comment.
I've mostly resolved bot locking in #10098, I think this is the only remaining open question. I'm inclined to not delete any locks automatically since that would be new (and maybe unexpected) behavior - I don't think we delete locks when e.g. removing a user or role today.
If you agree I can just remove the TODO.
There was a problem hiding this comment.
I've replaced the TODO here with a note explaining why locks aren't currently deleted. Let me know if you'd prefer to see a behavior change instead!
| // roles listed in the request and doesn't attempt to verify that the | ||
| // current user has permissions for those embedded roles. We assume that |
There was a problem hiding this comment.
Couldn't this result in a privilege escalation? So you can create a bot that would issues certs with roles you don't have access to?
There was a problem hiding this comment.
Yes, it can. From what I can, tell, though, VerbCreate on Role is effectively root-level in Teleport today for exactly this reason, so at least I don't think we're introducing a new path to privilege escalation.
We do check for VerbCreate on both User and Role here so I think we have our bases covered but definitely let me know if I've missed something, this is a pretty scary bit of code IMO.
There was a problem hiding this comment.
@timothyb89 I'm curious if you have considered making a bot a "resource" like any other object in the cluster. Then it could just be managed by usual tctl create/get/rm commands, we could have a types.KindBot which would grant users explicit permissions to manage them, and would be easy to reason about "bot" as a "resource" in general. Probably a bit too late to ask design questions at this point but I'm curious if this has ever been considered.
There was a problem hiding this comment.
That was part of Zac's original design, actually, and I ended up removing it.
Ultimately we needed to reuse so much of Teleport's traditional auth infrastructure (necessitating a backing User and Role) that the entire Bot resource existed solely to be created and removed, alongside the other resources it still needed. Removing the resource was a surprisingly large negative diff and didn't end up breaking any functionality because it just wasn't used.
I wouldn't be surprised to see it come back eventually if we eventually add more of bot-specific metadata. This whole "stuffing bot data into User labels" strategy is probably not great in the long term .
Co-authored-by: Zac Bergquist <zmb3@users.noreply.github.com>
* Add renewable certificate generation checks This adds a new validation check for renewable certificates that maintains a renewal counter as both a certificate extension and a user label. This counter is used to ensure only a single certificate lineage can exist: for example, if a renewable certificate is stolen, only one copy of the certificate can be renewed as the generation counter will not match When renewing a certificate, first the generation counter presented by the user (via their TLS identity) is compared to a value stored with the associated user (in a new `teleport.dev/bot-generation` label field). If they aren't equal, the renewal attempt fails. Otherwise, the generation counter is incremented by 1, stored to the database using a `CompareAndSwap()` to ensure atomicity, and set on the generated certificate for use in future renewals. * Add unit tests for the generation counter This adds new unit tests to exercise the generation counter checks. Additionally, it fixes two other renewable cert tests that were failing. * Remove certRequestGeneration() function * Emit audit event when cert generations don't match * Fully implement `tctl bots lock` * Show bot name in `tctl bots ls` * Lock bots when a cert generation mismatch is found * Make CompareFailed respones from validateGenerationLabel() more actionable * Update lib/services/local/users.go Co-authored-by: Nic Klaassen <nic@goteleport.com> * Backend changes for tbot IoT and AWS joining (#10360) * backend changes * add token permission check * pass ctx from caller Co-authored-by: Roman Tkachenko <roman@goteleport.com> * fix comment typo Co-authored-by: Roman Tkachenko <roman@goteleport.com> * use UserMetadata instead of Identity in RenewableCertificateGenerationMismatch event * Client changes for tbot IoT joining (#10397) * client changes * delete replaced APIs * delete unused tbot/auth.go * add license header * don't unecessarily fetch host CA * log fixes * s/tunnelling/tunneling/ Co-authored-by: Zac Bergquist <zmb3@users.noreply.github.com> * auth server addresses may be proxies Co-authored-by: Zac Bergquist <zmb3@users.noreply.github.com> * comment typo fix Co-authored-by: Zac Bergquist <zmb3@users.noreply.github.com> * move *Server methods out of auth_with_roles.go (#10416) Co-authored-by: Tim Buckley <tim@goteleport.com> Co-authored-by: Zac Bergquist <zmb3@users.noreply.github.com> Co-authored-by: Tim Buckley <tim@goteleport.com> Co-authored-by: Roman Tkachenko <roman@goteleport.com> Co-authored-by: Tim Buckley <tim@goteleport.com> Co-authored-by: Zac Bergquist <zmb3@users.noreply.github.com> Co-authored-by: Nic Klaassen <nic@goteleport.com> Co-authored-by: Roman Tkachenko <roman@goteleport.com> Co-authored-by: Zac Bergquist <zmb3@users.noreply.github.com>
Add `Role.SetMetadata()`, simplify more `trace.WrapWithMessage()` calls, clear some TODOs and lints, and address other misc feedback items.
| } | ||
|
|
||
| // createBot creates a new certificate renewal bot from a bot request. | ||
| func (s *Server) createBot(ctx context.Context, req *proto.CreateBotRequest) (*proto.CreateBotResponse, error) { |
There was a problem hiding this comment.
Should this function also validate that all roles specified in the CreateBotRequest are valid and exist in the system? Ignore if it's already validated somewhere and I just missed it.
There was a problem hiding this comment.
It doesn't, I'll add a quick check
| // If this fails it's likely to be some miscellaneous competing | ||
| // write. The request should be tried again - if it's malicious, | ||
| // someone will get a generation mismatch and trigger a lock. | ||
| return trace.WrapWithMessage(err, "Database comparison failed, try the request again") |
There was a problem hiding this comment.
Would trace.CompareFailed be a more appropriate error to return?
| if err != nil { | ||
| return trace.Wrap(err) | ||
| } | ||
| if err := s.UpsertLock(context.Background(), lock); err != nil { |
There was a problem hiding this comment.
| if err := s.UpsertLock(context.Background(), lock); err != nil { | |
| if err := s.UpsertLock(ctx, lock); err != nil { |
| // If this fails it's likely to be some miscellaneous competing | ||
| // write. The request should be tried again - if it's malicious, | ||
| // someone will get a generation mismatch and trigger a lock. | ||
| return trace.WrapWithMessage(err, "Database comparison failed, try the request again") |
| default: | ||
| // delete bot join tokens so they can't be re-used | ||
| if err := a.DeleteToken(ctx, provisionToken.GetName()); err != nil { | ||
| log.WithError(err).Warnf("could not delete bot provision token %q after generating certs", |
There was a problem hiding this comment.
Nit:
| log.WithError(err).Warnf("could not delete bot provision token %q after generating certs", | |
| log.WithError(err).Warnf("Could not delete bot provision token %q after generating certs.", |
| // bots use this endpoint but get a user cert | ||
| // botResourceName must be set, enforced in CheckAndSetDefaults | ||
| botResourceName := provisionToken.GetBotName() | ||
| expires := a.GetClock().Now().Add(defaults.DefaultRenewableCertTTL) |
There was a problem hiding this comment.
Why does this generate the cert with default cert TTL instead of the one the bot is configured with?
There was a problem hiding this comment.
It's somewhat arbitrary. We don't pass along a desired TTL for the initial certs via the token request here which complicates things. That said, in practice bots immediately attempt to renew certs after joining, and do request a user-configurable TTL then. So for however long these initial certs last, in practice they are disposed of within seconds.
That being said, the DefaultRenewableCertTTL is way too long, I'll set it to 1 hour.
There was a problem hiding this comment.
Yeah even if it's disposed of quickly, there's always a chance that a potential attacker can snatch it so shorting the default TTL is probably a good idea.
| if err := services.ValidateUser(new); err != nil { | ||
| return trace.Wrap(err) | ||
| } | ||
| newValue, err := services.MarshalUser(new.WithoutSecrets().(types.User)) |
There was a problem hiding this comment.
Do a safe type assertion to avoid panics (just in case).
| ID: new.GetResourceID(), | ||
| } | ||
|
|
||
| existingValue, err := services.MarshalUser(existing.WithoutSecrets().(types.User)) |
There was a problem hiding this comment.
Do a safe type assertion to avoid panics (just in case).
Ensure all requestable roles exist when creating a bot, adjust the default renewable cert TTL down to 1 hour, and check types during `CompareAndSwapUser()`
This adds a new
tbottool to continuously renew a set of certificates after registering with a Teleport cluster using a similar process to standard node joining.This makes some modifications to user certificate generation to allow for certificates that can be renewed beyond their original TTL, and exposes new gRPC endpoints:
CreateBotcreates a bot user, role, and a join tokenDeleteBotremoves an existing bot user and its resourcesGetBotUsersfetches all Users with bot fields setGenerateInitialRenewableUserCertsexchanges a token for a set of certificates with a newrenewableflag setA new
tctlcommand,tctl bots add, creates a bot user and callsCreateBotJoinTokento issue a token. A bot instance can then be started using a provided command.Refer to the RFD for more details: #7986
Security Considerations
CreateBotandDeleteBotgRPC calls create roles which allow role impersonation (i.e. they allow arbitrary privilege escalation). The endpoints do check that the calling user has relevant RBAC permissions to create/delete Roles and Users. We assume that the "create Role" RBAC privilege is effectively root access in Teleport today.Outstanding TODOs
Thorough unit testingMostly done in Add renewable certificate generation checks #10098tbot initto configure file ACLs (client side; separate WIP at Implementtbot initsubcommand and ACL management #10289)Validate / resolve outstanding TODOs related to bot locking (don't expect much exciting content here - bots are just users after all)Done in Add renewable certificate generation checks #10098