Machine ID: Introduce BotService#35441

Merged
strideynet merged 23 commits into master from strideynet/machineid-service
Dec 21, 2023

Contributor

@strideynet strideynet commented Dec 6, 2023

Closes #33808
Closes #10477

Quite a few deprecations in this one. Deprecated RPCs and perms now emit a warning log message which should encourage most folks to migrate prior to v16.0.0.

Fairly large diff - in part due to about 1800 lines of tests that expand the coverage of bot related code significantly. Some of these tests are for the legacy/deprecated endpoints that weren't covered previously but I wanted to make sure they behaved properly in their deprecated state.

changelog: SEE PR

Machine ID Bots can now be managed as YAML with the typical tctl commands, e.g.:

  • tctl get bot foo
  • tctl get bots
  • tctl create -f bot.yaml

kind: bot
version: v1
metadata:
  name: test
spec:
  roles:
  - editor
  traits:
  - name: logins
    values:
    - root

The ability to specifically control access to managing Bots has been added to Teleport's RBAC engine. Users managing Bots should now be explicitly granted permissions to manage the bot resource, e.g.:

spec:
  allow:
    rules:
    - resources:
      - bot
      verbs:
      - list
      - create
      - read
      - update
      - delete
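
As a rough illustration of how a rule like the one above is matched, here is a minimal Go sketch. The `Rule` type and the `Allows` function are hypothetical and are not Teleport's RBAC implementation; they only show the idea of pairing a resource kind with a set of permitted verbs.

```go
package main

import "fmt"

// Rule is a hypothetical, simplified stand-in for an RBAC allow rule:
// a set of resource kinds and the verbs permitted on them.
type Rule struct {
	Resources []string
	Verbs     []string
}

// contains reports whether want appears in list.
func contains(list []string, want string) bool {
	for _, v := range list {
		if v == want {
			return true
		}
	}
	return false
}

// Allows reports whether any rule grants verb on resource.
func Allows(rules []Rule, resource, verb string) bool {
	for _, r := range rules {
		if contains(r.Resources, resource) && contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

func main() {
	// Mirrors the YAML rule above: full access to the bot resource only.
	rules := []Rule{{
		Resources: []string{"bot"},
		Verbs:     []string{"list", "create", "read", "update", "delete"},
	}}
	fmt.Println(Allows(rules, "bot", "create"))
	fmt.Println(Allows(rules, "user", "create"))
}
```

Under this model, a role that only lists user and role in its rules would not match a request for the bot resource, which is the behaviour the description says will apply from 16.0.0.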

The previous required permissions, a combination of access to user and role, will no longer grant access to Bot RPCs in Teleport 16.0.0.

Additionally, three new audit events are now emitted:

  • bot.create
  • bot.update
  • bot.delete

For those using the Teleport API, the following RPCs are deprecated and will stop functioning in Teleport 16.0.0:

  • proto.AuthService.CreateBot - switch to machineid.v1.BotService.CreateBot
  • proto.AuthService.DeleteBot - switch to machineid.v1.BotService.DeleteBot
  • proto.AuthService.GetBotUsers - switch to machineid.v1.BotService.ListBots
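
For anyone auditing API usage ahead of 16.0.0, the mapping above can be captured as a small lookup table. This is only a sketch: the RPC names come from the list above, while the `Replacement` helper is hypothetical and not part of the Teleport API.

```go
package main

import "fmt"

// replacements maps each deprecated RPC named in the PR description to the
// RPC it should be replaced with.
var replacements = map[string]string{
	"proto.AuthService.CreateBot":   "machineid.v1.BotService.CreateBot",
	"proto.AuthService.DeleteBot":   "machineid.v1.BotService.DeleteBot",
	"proto.AuthService.GetBotUsers": "machineid.v1.BotService.ListBots",
}

// Replacement returns the successor RPC and whether rpc is deprecated.
func Replacement(rpc string) (string, bool) {
	next, ok := replacements[rpc]
	return next, ok
}

func main() {
	if next, ok := Replacement("proto.AuthService.GetBotUsers"); ok {
		fmt.Printf("deprecated: switch to %s\n", next)
	}
}
```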

@strideynet strideynet force-pushed the strideynet/machineid-service branch 2 times, most recently from 480dd85 to eea5e1d Compare December 11, 2023 22:20
@strideynet strideynet force-pushed the strideynet/machineid-service branch from c6b2863 to 1cbbe8c Compare December 12, 2023 10:59
@strideynet strideynet marked this pull request as ready for review December 12, 2023 11:11
@github-actions github-actions Bot added the audit-log, size/xl, tctl, and ui labels Dec 12, 2023
@github-actions
Contributor

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

@public-teleport-github-review-bot

@strideynet - this PR will require admin approval to merge due to its size. Consider breaking it up into a series of smaller changes.

@strideynet strideynet changed the title from "Introduce BotService" to "Machine ID: Introduce BotService" Dec 12, 2023
Comment thread tool/tctl/common/bots_command.go
Comment thread lib/auth/init.go Outdated
Comment thread lib/auth/init.go Outdated
Comment thread lib/auth/init.go Outdated
Comment thread lib/auth/init.go Outdated
Contributor

@codingllama codingllama left a comment

Drive-by review, just regarding that one comment.

Member

@ravicious ravicious left a comment

Gave it just a cursory look as I'm not the best person to review this, but I see that you already requested a review from Tim.

Comment thread api/client/client.go Outdated
// CreateBot creates a new bot user.
rpc CreateBot(CreateBotRequest) returns (CreateBotResponse);
//
// Deprecated: Use [teleport.machineid.v1.BotService] instead.

Member

How can I ensure that a doc link like this is valid?

This comment ends up in api/client/proto/authservice.pb.go. I was under the impression that if machineid.v1 is not directly imported in that file, then a doc link must use the full import path, which I assume would be github.com/gravitational/teleport/…. Isaiah pointed it out in one of my PRs but I've never tried to actually verify if my doc links are correct.

https://tip.golang.org/doc/comment#doclinks

I know it doesn't matter much as most of the time those "doc links" will not end up in any docs and are just for internal reference. Still, it's been bugging me since it was pointed out to me so I figured maybe you know the answer.

Contributor Author

Ah - I suppose it's a little odd here since my intention was to "comment on the proto" and therefore here I've referred to the proto service rather than the generated code.

Collaborator

You can:

$ go install golang.org/x/pkgsite/cmd/pkgsite@latest
$ pkgsite .

And then navigate to http://localhost:8080/github.com/gravitational/teleport#section-directories to see the docs rendered locally.
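
On the doc link question itself: per the Go doc comment documentation linked above, the package part of a link is either a full import path or the assumed name of a package the file already imports. The sketch below illustrates both forms using only the standard library; `Shout` is a made-up function, and nothing here is taken from the Teleport code.

```go
// Package main demonstrates the two doc link forms.
package main

import (
	"fmt"
	"strings"
)

// Shout upper-cases s.
//
// [strings.ToUpper] resolves by package name because this file imports
// "strings". [encoding/json.Marshal] is written with its full import path
// because "encoding/json" is not imported in this file.
//
// Deprecated: Use [strings.ToUpper] instead.
func Shout(s string) string {
	return strings.ToUpper(s)
}

func main() {
	fmt.Println(Shout("hello"))
}
```

Rendering the package with pkgsite as described above is one way to check whether each bracketed name turns into a link.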

Comment thread lib/auth/init.go Outdated
Comment thread lib/auth/machineid/machineidv1/bot_service.go
	return bot, nil
}

func (bs *BotService) UpdateBot(

Contributor

Throughout the codebase we have these UpdateX functions which semantically mean "if this resource already exists, overwrite it completely (a-la Upsert), otherwise error". AFAICT it's very unusual to need this semantic:
If you're calling UpdateX it almost always means that you've just called GetX, changed a field or two, and are now updating the object. This makes the check for "if this resource already exists" superfluous -- you already know it exists because you just got it with GetX.

Note that HTTP doesn't have any such semantic:

  • CREATE: CreateX

  • GET: GetX

  • DELETE: DeleteX

  • PUT: UpsertX

  • PATCH: no equivalent except an odd UpdateX like UpdateRemoteCluster

  • POST: according to the RFC:

    The POST method requests that the target resource process the
    representation enclosed in the request according to the resource's
    own specific semantics.

    Arguably our UpdateX is this miscellaneous type, however as I pointed out previously it's an odd semantic to implement.

Given this line of reasoning, I think we should aim to avoid adding these UpdateX by default. Usually an Upsert seems to do the job.

One loose end I have on this topic is: how do Upserts behave with the new optimistic locking behavior? If a resource has a Revision, will it be checked and rejected if it doesn't match what's in the database on an Upsert? If yes, then Upsert already implements the current Update semantic (and Updates can be replaced with Upserts). If that's not how it works, and Upserts just ignore Revision, then the current Updates could be seen as means of distributed synchronization to prevent a case like

  1. My code calls GetX
  2. Some other part of the system calls DeleteX on that object
  3. My code updates a field and calls UpdateX on that object

However I don't think this was an intentional design pattern.

cc @rosstimothy @espadolini

Contributor Author

@strideynet strideynet Dec 14, 2023

Fwiw: my implementation here is following the recently proposed RFD 153 on the design of resource APIs. So it may be worth following up with some of these concerns there - #34103 .

> This makes the check for "if this resource already exists" superfluous

I don't think this is true - we're operating in a distributed system. I can think of a few scenarios where writing to a resource which has been modified or deleted since you read it risks harming consistency and whilst we don't have "real" locking this seems like the most sensible alternative.

> Given this line of reasoning, I think we should aim to avoid adding these UpdateX by default. Usually an Upsert seems to do the job.

I believe, if anything, the long-term goal is to do the opposite of what you suggest and eliminate upsert in favour of create and update. Right now, several things rely on the ability to upsert (e.g. tctl create -f, the Terraform provider, the Kubernetes operator). Once we reach a point where all of those no longer rely on this behaviour, I think we can almost globally deprecate upsert RPCs.

> One loose end I have on this topic is: how do Upserts behave with the new optimistic locking behavior?

IIRC the behaviour of upsert RPCs has not been changed. Optimistic locking is entirely ignored.

> However I don't think this was an intentional design pattern.

It was my understanding that this was the entire motivation behind adding optimistic locking to the update RPC.

In addition to this, update RPCs are encouraged to support update masks where appropriate. At least in this case, the UpdateBot RPC implements this to allow selective modification of the roles or traits associated with the Bot without knowledge of the state of other fields.
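
To make the update mask idea concrete, here is a minimal Go sketch of a masked update. It is not the UpdateBot implementation: the `Bot` struct is a trimmed-down stand-in (traits are simplified to a map), the mask is a plain slice of strings rather than a protobuf field mask, and the path names `spec.roles` and `spec.traits` are assumptions.

```go
package main

import "fmt"

// Bot is a hypothetical, trimmed-down bot spec holding the two fields
// mentioned in the discussion.
type Bot struct {
	Roles  []string
	Traits map[string][]string
}

// ApplyMask copies only the fields named in paths from update onto a copy of
// existing. Fields not named in paths keep their existing values. The path
// names are assumptions made for this sketch.
func ApplyMask(existing, update Bot, paths []string) (Bot, error) {
	out := existing
	for _, p := range paths {
		switch p {
		case "spec.roles":
			out.Roles = update.Roles
		case "spec.traits":
			out.Traits = update.Traits
		default:
			return Bot{}, fmt.Errorf("unsupported mask path %q", p)
		}
	}
	return out, nil
}

func main() {
	existing := Bot{
		Roles:  []string{"editor"},
		Traits: map[string][]string{"logins": {"root"}},
	}
	// The caller only knows the new roles and says so via the mask.
	update := Bot{Roles: []string{"access"}}
	got, err := ApplyMask(existing, update, []string{"spec.roles"})
	fmt.Println(got.Roles, got.Traits, err)
}
```

The point of the mask is that the caller can replace the roles without first reading the traits, and without accidentally clearing them.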

Contributor

@espadolini espadolini Dec 14, 2023

Upsert is by definition unconditional, and it's required (at least as a primitive operation, it doesn't have to be exposed through grpc necessarily) for the cache to work.

The old Update primitive backend operation has no right to exist, and it has the semantics that @ibeckermayer mentioned. ConditionalUpdate is the one that you want to use, and you use it exactly like that. You get the resource, you make some decisions and calculate the new resource, you ConditionalUpdate the resource with a constraint on the revision you got, and your update succeeds if and only if nothing has written the underlying backend item since you read it.

Whether the revision should be exposed to the client over grpc is a choice that should be made depending on the resource type and the semantics we want to give to each operation; it might make sense sometimes to expose an "update" function that lets the caller update some fields with no regard to what was there in the first place, but even in that case the auth server should 100% use the conditional update operation internally.
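
The read, compute, conditional-write cycle described in this comment can be sketched with a small in-memory store. This is an illustration under stated assumptions, not Teleport's backend: `Store`, `Item`, the integer revisions, and both error values are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	ErrNotFound        = errors.New("item not found")
	ErrRevisionChanged = errors.New("revision changed since read")
)

// Item is a stored value plus the revision assigned at its last write.
type Item struct {
	Value    string
	Revision int
}

// Store is a hypothetical in-memory backend used only for this sketch.
type Store struct {
	mu    sync.Mutex
	items map[string]Item
	next  int
}

func NewStore() *Store { return &Store{items: map[string]Item{}} }

// write stores value under key with a fresh revision. Callers must hold mu.
func (s *Store) write(key, value string) Item {
	s.next++
	it := Item{Value: value, Revision: s.next}
	s.items[key] = it
	return it
}

// Upsert writes unconditionally, ignoring existence and revision.
func (s *Store) Upsert(key, value string) Item {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.write(key, value)
}

// Get returns the current item for key.
func (s *Store) Get(key string) (Item, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	it, ok := s.items[key]
	if !ok {
		return Item{}, ErrNotFound
	}
	return it, nil
}

// ConditionalUpdate writes only if the stored revision still equals rev,
// that is, only if nothing has written the item since the caller read it.
func (s *Store) ConditionalUpdate(key, value string, rev int) (Item, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, ok := s.items[key]
	if !ok {
		return Item{}, ErrNotFound
	}
	if cur.Revision != rev {
		return Item{}, ErrRevisionChanged
	}
	return s.write(key, value), nil
}

func main() {
	s := NewStore()
	s.Upsert("bot/test", "roles=editor")
	read, _ := s.Get("bot/test")
	s.Upsert("bot/test", "roles=access") // a concurrent writer gets in first
	_, err := s.ConditionalUpdate("bot/test", "roles=auditor", read.Revision)
	fmt.Println(err) // the stale revision causes the write to be rejected
}
```

In this sketch, a delete between the read and the write surfaces as `ErrNotFound`, which covers the GetX, DeleteX, UpdateX sequence raised earlier in the thread.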

Contributor

(and yeah, "i just read the item successfully so the item exists" is 100% bogus and it's a pattern that we should start removing from Teleport whenever possible)

Contributor

@ibeckermayer ibeckermayer Dec 16, 2023

@strideynet

> It was my understanding that this was the entire motivation behind adding optimistic locking to the update RPC.

I was referring to the state of things pre-optimistic-locking.

> I believe, if anything, the long-term goal is to do the opposite of what you suggest and eliminate upsert in favour of create and update. Right now, several things rely on the ability to upsert (e.g. tctl create -f, the Terraform provider, the Kubernetes operator). Once we reach a point where all of those no longer rely on this behaviour, I think we can almost globally deprecate upsert RPCs.

Seems like an odd goal... Is create -f not the user explicitly asking for an upsert? I can imagine implementing it with a get + update, but I'm not currently seeing what the point of that would be.

@espadolini

> Upsert is by definition unconditional

Granted, dumb question.

> The old Update primitive backend operation has no right to exist, and it has the semantics that @ibeckermayer mentioned. ConditionalUpdate is the one that you want to use, and you use it exactly like that. You get the resource, you make some decisions and calculate the new resource, you ConditionalUpdate the resource with a constraint on the revision you got, and your update succeeds if and only if nothing has written the underlying backend item since you read it.

Is this not essentially what Update has become, presuming it now uses the optimistic locking mechanism?

I can see several plausible update-like semantics:

  1. update the entire object irrespective of revision (error if object dne)
  2. update the entire object on a specific revision (error if revision/object dne)
  3. patch a few fields irrespective of revision (error if object dne)
  4. patch a few fields on a specific revision (error if revision/object dne)
  5. write an entire object irrespective of revision/existence (upsert)

analysis:

  1. is the status quo
  2. is presumably the primary reason we implemented optimistic locking
  3. we have in a few places (HTTP patch semantic)
  4. seems practically useless -- if we have the object's revision, that typically means we have the object, so we may as well just use 2
  5. seems useful to me, although there seems to be some discussion of getting rid of it

Feel free to enlighten me on where I might be confused about how our optimistic locking system works (highly plausible), what we need it to enforce (I'm mostly ignorant except in very general terms) or what I'm missing about upsert.

Contributor

Thanks for pointing out that PR @strideynet, this whole discussion is more suited to that RFD so let's continue it over there: #34103 (comment)

Contributor Author

@ibeckermayer do you think this blocks this PR from moving forward? I'm a little concerned about how long this discussion will run, and I have some time pressure to get this merged as it's blocking other work.

Contributor

@ibeckermayer ibeckermayer Dec 18, 2023

Not necessarily, I can review the rest and we can just consider this potentially-pending-a-future-PR.

Since you did go through the trouble of adding a field mask, it's worth following this line of inquiry through to its conclusion.

Comment thread tool/tctl/common/bots_command.go
Comment thread tool/tctl/common/collection.go
Contributor

@timothyb89 timothyb89 left a comment

Just a few comments after playing with the branch for a while. The implementation turned out pretty nice! The separate service and RFD 153 resources make this much tidier than I remember.

I didn't notice any compatibility issues but haven't yet tested the Terraform provider integration (which probably could use an update anyway, now that we have sane bot objects).

Comment thread tool/tctl/common/bots_command.go Outdated
Comment thread lib/auth/machineid/machineidv1/bot_service.go
Comment thread lib/auth/machineid/machineidv1/bot_service.go
Comment thread tool/tctl/common/bots_command.go
Comment thread tool/tctl/common/bots_command.go
Comment thread api/proto/teleport/machineid/v1/bot.proto
Co-authored-by: Tim Buckley <tim@goteleport.com>
@strideynet
Contributor Author

FYI @timothyb89 - I've addressed your comments if you'd like to re-review.

Contributor

@timothyb89 timothyb89 left a comment

Looks good, I'm excited to finally see this!

Comment thread tool/tctl/common/bots_command.go Outdated
Comment thread tool/tctl/common/bots_command.go
Comment thread tool/tctl/common/bots_command.go
@strideynet
Contributor Author

strideynet commented Dec 19, 2023

Flaky Test Detector is failing due to test length - it looks like the TestAdminActionMFA test takes about 7.67 seconds to run on my mac, which at 100 repetitions passes the 10 minute threshold. As far as I can tell, the changes I've made are only adding about 200-400ms to this test - I'm not sure there's another way I can approach this without adding that overhead and the test is already pretty slow. CC @Joerger
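
The arithmetic behind that failure is easy to check: 100 runs at 7.67 seconds each come to 767 seconds, which is a little under 13 minutes. The sketch below assumes the repetitions run back to back and that every run takes the same time; `TotalRuntime` is a hypothetical helper and not part of the detector.

```go
package main

import (
	"fmt"
	"time"
)

// TotalRuntime returns the wall-clock time for running a test runs times
// back to back, assuming every run takes perRun.
func TotalRuntime(perRun time.Duration, runs int) time.Duration {
	return perRun * time.Duration(runs)
}

func main() {
	total := TotalRuntime(7670*time.Millisecond, 100)
	limit := 10 * time.Minute
	fmt.Println(total, total > limit)
}
```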

@Joerger
Contributor

Joerger commented Dec 19, 2023

@strideynet This test will be skipped by the flaky test detector momentarily - #35882

@strideynet
Contributor Author

Merged master in to pick up joerger's fix for the flaky-test failure

Comment thread api/types/constants.go Outdated
Comment thread api/types/user.go
// IsBot returns true if the user is a bot.
func (u UserV2) IsBot() bool {
-	_, ok := u.GetMetadata().Labels[BotGenerationLabel]
+	_, ok := u.GetMetadata().Labels[BotLabel]

Collaborator

Does creating the new bot resource still create a user and role in the backend, or are those replaced by the bot resource?

Contributor Author

The Bot resource is still just transformed into a User/Role - eventually we'll remove this but that'll involve delving into the core of Teleport's RBAC 😓

This change here is just to make this a little less fragile. I'm not quite sure why we used the Generation label here - it stays at "0" for bots using delegated joining and could be removed in future. The BotLabel is a much better indicator of a user/role being linked to a bot.
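
A minimal sketch of the presence check being discussed, outside of Teleport's types: the label keys below are placeholders invented for illustration (the diff refers to the constants `BotLabel` and `BotGenerationLabel`, whose real values are not shown here), and `isBot` works on a plain map rather than on `UserV2`.

```go
package main

import "fmt"

// Placeholder label keys for illustration only; they are not the values of
// the real BotLabel and BotGenerationLabel constants.
const (
	botLabel           = "example/bot"
	botGenerationLabel = "example/bot-generation"
)

// isBot mirrors the shape of the check in the diff: a user counts as a bot
// when the bot label key is present, regardless of its value.
func isBot(labels map[string]string) bool {
	_, ok := labels[botLabel]
	return ok
}

func main() {
	fmt.Println(isBot(map[string]string{botLabel: "test"}))
	// A generation label alone no longer marks a user as a bot.
	fmt.Println(isBot(map[string]string{botGenerationLabel: "0"}))
	// Reading from a nil map is safe in Go and reports the key as absent.
	fmt.Println(isBot(nil))
}
```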

Comment thread lib/auth/clt.go Outdated
Comment thread lib/auth/machineid/machineidv1/bot_service.go Outdated
// BotResourceName returns the default name for resources associated with the
// given named bot.
func BotResourceName(botName string) string {
	return "bot-" + strings.ReplaceAll(botName, " ", "-")

Collaborator

Do we need to remove / as well?

Contributor Author

At the moment, this has just moved the previous implementation of BotResourceName from lib/auth into lib/auth/machineid/machineidv1 - I do agree that maybe we ought to replace / but I feel it's out of scope for this PR and I don't want to introduce additional risk by changing this.
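
For reference, the behaviour in question can be reproduced with the one-line body shown in the snippet under review. The function below copies that body; the bot names passed to it are made-up examples.

```go
package main

import (
	"fmt"
	"strings"
)

// BotResourceName returns the default name for resources associated with the
// given named bot. The body is copied from the snippet under review.
func BotResourceName(botName string) string {
	return "bot-" + strings.ReplaceAll(botName, " ", "-")
}

func main() {
	// Spaces are replaced with hyphens.
	fmt.Println(BotResourceName("ci runner"))
	// A slash passes through unchanged, which is the case the review asks about.
	fmt.Println(BotResourceName("team/ci"))
}
```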

strideynet and others added 6 commits December 21, 2023 10:40
Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>
Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>
@strideynet strideynet added this pull request to the merge queue Dec 21, 2023
Merged via the queue into master with commit e3191b0 Dec 21, 2023
@strideynet strideynet deleted the strideynet/machineid-service branch December 21, 2023 12:18
@public-teleport-github-review-bot

@strideynet See the table below for backport results.

Branch Result
branch/v14 Failed

Labels

audit-log, machine-id, size/xl, tctl, ui

Development

Successfully merging this pull request may close these issues.

Machine ID: Introduce Bot resource and RPCs
Comprehensive audit log events for the certificate renewal bot

8 participants