Add Trigger, Rollback, ForceDone autoupdate RPCs #52931
Conversation
```go
tries := 0
const maxTries = 3

for {
```
Curious about the motivation for repeatedly retrying on conflict here. Is there something we could watch instead?
(Not blocking, just want to understand the architecture better.)
The point of "optimistic" locking is that in the happy path where there's no contention, you get a single read and a single write. The downside is that under contention you'll have to retry once or twice, and it's a little nicer to do a couple of retries in the auth, where we're closer to the backend, rather than retrying from the client. There's no way to know in advance whether a write will succeed, or to wait for it to succeed; the only point at which that decision can be made is the conditional update itself.
I understand the intent behind optimistic locking, but it's unusual to me to see it applied with an arbitrary retry count directly on the backend, vs., e.g., retrying a transaction. I'm assuming we don't have a transaction-like concept given the multitude of backends, and the optimistic lock is our only tool? (Just want to understand this better, not suggesting a change)
This is, in fact, a transaction (-ish): get some things, make some decisions based on them, figure out what to write and what conditions must hold for the write to go through without retrying the whole transaction, then attempt to apply the write. The underlying DynamoDB RPC used in more complicated situations involving more than one item (which we surface as `(backend.Backend).AtomicWrite`) is literally called dynamodb:TransactWriteItems; for simpler scenarios where the condition is on the revision of the same item being written, it's just a conditional write (exposed as `(backend.Backend).ConditionalUpdate`).
* Add Trigger, Rollback, ForceDone autoupdate RPCs
* Add all_started_groups bool + switch to group set
* fix error type
* Add rollout mutation functions (#52930)
* Add Trigger, Rollback, ForceDone autoupdate RPCs (#52931)
  * Add all_started_groups bool + switch to group set
  * fix error type
* Align semver libs (#52795)
  * Convert autoupdate version handling to coreos/go-semver
  * get the right version in installer endpoint + get rid of x/mod/semver
  * depguard x/mod/semver
  * Add nolint rules for existing x/mod/semver usages
  * Add depguard explanation
* Add autoupdate trigger/mark-done/rollback commands (#52933)
* Add updater info in Hello (#53911)
* Introduce autoupdate_agent_report proto types (#54175)
  * Fix tests + remove delete all RPC
* Move updater info proto from authclient to types (#54236)
* Report updater info in Hello (#53938)
  * Add UUID to Hello
  * Fix after rebase
* Send goodbye even when doing soft-reload (#54176)
  * Save and replay Goodbye on connect
  * Add SoftReload flag to Goodbye
  * SendGoodbye -> SetAndSendGoodbye
* Display update group in `tctl inventory` (#54324)
* Add autoupdate manual rollout audit events (#52934)
  * Add autoupdate trigger/mark-done/rollback audit events
  * Remove useless resource metadata and add groups to audit event
  * Add events to web UI
* Add autoupdate_agent_report backend service (#54333)
  * Saner resource validation
* Add agent rollout cache + service + client (#54772)
  * fix after rebase + add event in tests
  * fix autoupdate agent report event streaming
  * Fix backport: slog -> logrus
* Generate autoupdate agent report periodically (#54865)
  * fix proto field lookup + address feedback
  * fix tests + add license
* Add omission info in autoupdate report (#55001)
* Add agent counters to autoupdate_agent_rollout proto (#55096)
  * int64 -> uint64
* Add reports to client and rewrite mockClient using testify (#55097)
  * When adding ListAutoUpdateAgentReports() to the Client interface, the mock client turned out not to support List endpoints; instead of expanding the custom mock system, the mock client was rewritten with the standard testify/mock library.
  * checkIfEmpty -> checkIfCallsWereDone
* Make halt-on-error autoupdate strategy use agent reports (#55116)
  * Make report helpers reusable for time-based strategy
* Set the agent count when reconciling time-based rollouts (#55152)
* Fix flaky `TestServer_generateAgentVersionReport` (#56015)
* [v18] Add autoupdate agent report commands (#56495)
* autoupdate canary support: proto messages (#56259)
* autoupdate canary support: inventory and auth primitives (#56261)
* autoupdate canary support: tctl (#56473)
  * `tctl autoupdate agents status` now displays groups in the canary state properly
  * add `--force` flag to `tctl autoupdate agents start-update`
* autoupdate canary support: modulate proxy response (#56468)
  * The Teleport Proxy service's find and ping endpoints now fetch the updater ID from the request parameters and look up whether the requestor is a canary; if it is, the requestor is told to update.
* autoupdate canary support: rollout controller (#56467)
  * Adds canary support to the autoupdate_agent_rollout controller when the strategy is "halt-on-error"
  * Fix backport: add inventory clock + deal with edoardo breaking everything
  * Fix tests after backport
  * lint authproto -> clientproto
* Fix autoupdate canary sampling for the catch-all group
* Tune the canary logic (#56926)
  * Users can now specify how many canaries they want; instead of looking at the current group size, we rely on user input
  * max canary 10 -> 5 (max message size not done yet)
  * fix a bug causing the start date to be reset when doing canary -> active
* Reliably detect update.yaml after soft reloads
* always send group in agent hello (#55071)
* Fix detection on initial install
* Always persist new configuration
* fix tests relying on go 1.24
* fix crd snapshot tests + fix linter issue

Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
PR 2/4 adding manual rollout control as specified in RFD 184. This PR adds the manual rollout RPCs; audit events and tctl commands will be added in a follow-up PR.