Make halt-on-error autoupdate strategy use agent reports by hugoShaka · Pull Request #55116 · gravitational/teleport

hugoShaka · 2025-05-23T20:49:37Z

This PR makes the halt-on-error strategy list agent reports, count agents, report the counts in the rollout, and progress groups based on those counts. Backward compatibility with previous versions is maintained by falling back to the previous "wait for 1 hour" behaviour if `initial_count` is not set for the group.
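The fallback described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: `groupStatus`, `canProgress`, and the 90% threshold are hypothetical names and values; only the `initial_count` field and the one-hour wait come from the description.

```go
package main

import (
	"fmt"
	"time"
)

// groupStatus is a hypothetical, simplified stand-in for a rollout group's status.
type groupStatus struct {
	InitialCount  uint64 // agents counted when the group started; 0 means "not set"
	UpToDateCount uint64 // agents currently reporting the target version
}

// legacyWait is the previous "wait for 1 hour" behaviour.
const legacyWait = time.Hour

// minUpToDatePercent is an assumed threshold, not a value taken from the PR.
const minUpToDatePercent = 90

// canProgress reports whether the rollout may move past this group.
// elapsed is the time since the group started updating.
func canProgress(g groupStatus, elapsed time.Duration) bool {
	if g.InitialCount == 0 {
		// No counts recorded (for example, the rollout was written by an
		// older version): fall back to the time-based rule.
		return elapsed >= legacyWait
	}
	// Integer arithmetic avoids floating-point rounding at the threshold.
	return g.UpToDateCount*100 >= g.InitialCount*minUpToDatePercent
}

func main() {
	legacy := groupStatus{} // initial_count not set
	fmt.Println(canProgress(legacy, 30*time.Minute)) // false
	fmt.Println(canProgress(legacy, time.Hour))      // true

	counted := groupStatus{InitialCount: 10, UpToDateCount: 9}
	fmt.Println(canProgress(counted, time.Minute)) // true
}
```

In this sketch a group with counts never falls back to the timer, so a group with too few updated agents stays halted regardless of how long it has been active.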

Part of: RFD 184

Goal (internal): https://github.com/gravitational/cloud/issues/11856

lib/auth/autoupdate/autoupdatev1/service.go

lib/autoupdate/rollout/strategy_haltonerror.go

espadolini · 2025-05-26T18:32:38Z

lib/autoupdate/rollout/transitions_test.go

 					AutoupdateMode: autoupdate.AgentsUpdateModeEnabled,
 				},
-				Status: nil,
+				Status: proto.Clone(status).(*autoupdatev1pb.AutoUpdateAgentRolloutStatus),


This exists, btw.

Suggested change

-				Status: proto.Clone(status).(*autoupdatev1pb.AutoUpdateAgentRolloutStatus),
+				Status: proto.CloneOf(status),

tigrato · 2025-05-27T14:20:11Z

lib/auth/auth.go

 			Key:           autoUpdateAgentReportKey,
-			Duration:      time.Minute,
-			FirstDuration: retryutils.FullJitter(time.Minute),
+			Duration:      constants.AutoUpdateAgentReportPeriod,


Should we ensure that 1m is never reached as the max duration? Ticker delay + network + backend delays can make agents be marked as stalled.

> ticker delay + network + backend delays can make agents be marked as stalled

Absolutely, that's a great point.

If a report is stale and we see a temporary agent drop, it's not going to affect the rollout; at worst it might delay it by 30 seconds. The opposite, double-counting agents as they reconnect to a different auth, would be more problematic I think, as it would lead to a rollout progressing while not all agents are updated.

Maybe I'm wrong and the drops will prove to be an issue, but I'd rather end up with a slow rollout than an overly hasty one. If we observe issues in cloud we will fix the logic accordingly.
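The trade-off in this thread can be sketched as follows. This is an illustration, not the code from this PR: `report`, `countFreshAgents`, and the one-minute `reportPeriod` are hypothetical. The one-minute value mirrors the `time.Minute` in the removed lines; the actual value of `constants.AutoUpdateAgentReportPeriod` is not shown here.

```go
package main

import (
	"fmt"
	"time"
)

// reportPeriod is assumed to be one minute for this sketch.
const reportPeriod = time.Minute

// report is a hypothetical, simplified agent report written by one auth server.
type report struct {
	AuthID string
	Age    time.Duration // time elapsed since the report was written
	Agents uint64        // agents connected to that auth server
}

// countFreshAgents sums the agents from reports that are at most maxAge old.
// A strict cutoff errs toward undercounting when a report arrives late.
// A more lenient cutoff would tolerate delays, but would widen the window in
// which an agent reconnecting to a different auth server is counted twice.
func countFreshAgents(reports []report, maxAge time.Duration) uint64 {
	var total uint64
	for _, r := range reports {
		if r.Age > maxAge {
			continue // stale: skip it and accept a temporary drop
		}
		total += r.Agents
	}
	return total
}

func main() {
	reports := []report{
		{AuthID: "auth-1", Age: 10 * time.Second, Agents: 40},
		{AuthID: "auth-2", Age: 45 * time.Second, Agents: 25},
		{AuthID: "auth-3", Age: 75 * time.Second, Agents: 30}, // delayed report
	}
	fmt.Println(countFreshAgents(reports, reportPeriod)) // 65
}
```

With the strict cutoff, the delayed report from `auth-3` is ignored until a fresh one replaces it, which slows the rollout rather than letting it progress on an inflated count.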

* Make half-on-error autoupdate strategy use agent reports
* Make report helpers reusable for time-based strategy
* address edoardo's feedback

* Add rollout mutation functions (#52930)
* Add Trigger, Rollback, ForceDone autoupdate RPCs (#52931)
* Add Trigger, Rollback, ForceDone autoupdate RPCs
* Add all_started_groups bool + switch to group set
* fix error type
* Align semver libs (#52795)
* Convert autoupdate version handling to coreos/go-semver
* get the right version in installer endpoint + get rid of x/mod/semver
* depguard x/mod/semver
* Add nolint rules for existing x/mod/semver usages
* Add depguard explanation
* Add autoupdate trigger/mark-done/rollback commands (#52933)
* Add updater info in Hello (#53911)
* Introduce autoupdate_agent_report proto types (#54175)
* Introduce autoupdate_agent_report proto types
* Fix tests + remove delete all RPC
* Move updater info proto from authclient to types (#54236)
* Report updater info in Hello (#53938)
* Report updater info in Hello
* Add UUID to Hello
* lint
* Fix after rebase
  Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>
* Send goodbye even when doing soft-reload (#54176)
* Send goodbye even when doing soft-reload
* Save and replay Goodbye on connect
* Add SoftReload flag to Goodbye
* SendGoodbye -> SetAndSendGoodbye
* Display update group in `tctl inventory` (#54324)
* Add autoupdate manual rollout audit events (#52934)
* Add autoupdate trigger/merk-done/rollback audit events
* Remove useless resource metadata and add groups to audit event
* Add events to web UI
* Add autoupdate_agent_report backend service (#54333)
* Add autoupdate_agent_report backend service
* Saner resource validation
* Add agent rollout cache + service + client (#54772)
* Add agent rollout cache + service + client
* fix after rebase
* add event in tests
* fix autoupdateagenmtreport event streaming
* lint
* Fix backport: slog -> logrus
* Generate autoupdate agent report periodically (#54865)
* Generate autoupdate agent report periodically
* address edoardo's feedback
* Apply suggestions from code review
  Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
* fix proto field lookup + address feedback
* fix tests + add license
  Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
* Add omission info in autoupdate report (#55001)
* Add agent counters to autoupdate_agent_rollout proto (#55096)
* Add agent counters to autoupdate_agent_rollout proto
* int64 -> uint64
* Add reports to client and rewrite mockClient using testify (#55097)
* Add reports to client and rewrite mockClient using testify
  When adding the ListAutoUpdateAgentReports() function to the Client interface I realized that the mock client was not supporting List endpoints. Instead of expanding the custom mock system, I rewrote the mock client to use the standard testify/mock library.
* checkIfEmpty -> checkIfCallsWereDone
* Make halt-on-error autoupdate strategy use agent reports (#55116)
* Make half-on-error autoupdate strategy use agent reports
* Make report helpers reusable for time-based strategy
* address edoardo's feedback
* Set the agent count when reconciling time-based rollouts (#55152)
* Set the agent count when reconciling time-based rollouts
* Apply suggestions from code review
  Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>
  Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>
* Fix flaky `TestServer_generateAgentVersionReport` (#56015)
* [v18] Add autoupdate agent report commands (#56495)
* Add autoupdate agent report commands
* Address feedback
* autoupdate canary support: proto messages (#56259)
* autoupdate canary support: inventory and auth primitives (#56261)
* autoupdate canary support: tctl (#56473)
* autoupdate canary support: tctl support
  This commit makes `tctl autoupdate agents status` display groups in the canary state properly.
* add `--force` flag to `tctl autoupdate agents start-update`
* autoupdate canary support: modulate proxy response (#56468)
  This commit makes the Teleport Proxy service find and ping endpoints fetch the updater ID from the request parameters and look up whether the requestor is a canary. If it is, the requestor will be told to update.
* autoupdate canary support: rollout controller (#56467)
* autoupdate canary support: rollout controller
  This commit adds canary support to the autoupdate_agent_rollout controller when the strategy is "halt-on-error".
* Apply suggestions from code review
* Fix backport: add inventory clock + deal with edoardo breaking everything
* Fix tests after backport
* fixup! Fix tests after backport
* fixup! fixup! Fix tests after backport
* lint authproto -> clientproto
* Fix autoupdate canary sampling for the catch-all group
* Tune the canary logic (#56926)
  - Users can now specify how many canaries they want
  - Instead of looking at the current group size, we rely on user input
  - max canary 10 -> 5 (I have not done the max message size yet)
  - fix a bug causing the start date to be reset when doing canary -> active
* Reliably detect update.yaml after soft reloads
* always send group in agent hello (#55071)
* Fix detection on initial install
* fix log
* Always persist new configuration
* cleanup
* fix tests
* fix tests relying on go 1.24
* fix crd snapshot tests + fix linter issue
  Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>
  Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

hugoShaka requested review from espadolini, sclevine and vapopov May 23, 2025 20:49

hugoShaka added the no-changelog Indicates that a PR does not require a changelog entry label May 23, 2025

hugoShaka marked this pull request as ready for review May 23, 2025 22:08

hugoShaka changed the title ~~Make half-on-error autoupdate strategy use agent reports~~ Make halt-on-error autoupdate strategy use agent reports May 23, 2025

github-actions bot added the size/md label May 23, 2025

github-actions bot requested review from eriktate and tigrato May 23, 2025 22:08

espadolini approved these changes May 26, 2025


lib/auth/autoupdate/autoupdatev1/service.go (outdated, resolved)

lib/autoupdate/rollout/strategy_haltonerror.go (outdated, resolved)

espadolini approved these changes May 26, 2025


hugoShaka mentioned this pull request May 26, 2025

Set the agent count when reconciling time-based rollouts #55152

Merged

tigrato approved these changes May 27, 2025


public-teleport-github-review-bot bot removed request for eriktate, sclevine and vapopov May 27, 2025 14:22

hugoShaka added 3 commits May 27, 2025 14:17

Make half-on-error autoupdate strategy use agent reports

480897d

Make report helpers reusable for time-based strategy

42cdfff

address edoardo's feedback

f6836d2

hugoShaka force-pushed the hugo/progress-halt-on-error-based-on-reports branch from 3696e2c to f6836d2 Compare May 27, 2025 18:17

hugoShaka enabled auto-merge May 27, 2025 18:18

hugoShaka added this pull request to the merge queue May 27, 2025

sclevine approved these changes May 27, 2025


Merged via the queue into master with commit b7191e9 May 27, 2025
41 checks passed

hugoShaka deleted the hugo/progress-halt-on-error-based-on-reports branch May 27, 2025 18:59

hugoShaka added a commit that referenced this pull request Jul 18, 2025

Make halt-on-error autoupdate strategy use agent reports (#55116)

d72ff45

* Make half-on-error autoupdate strategy use agent reports
* Make report helpers reusable for time-based strategy
* address edoardo's feedback

hugoShaka added a commit that referenced this pull request Jul 24, 2025

Make halt-on-error autoupdate strategy use agent reports (#55116)

bd0b8b4

* Make half-on-error autoupdate strategy use agent reports
* Make report helpers reusable for time-based strategy
* address edoardo's feedback

hugoShaka added a commit that referenced this pull request Jul 25, 2025

Make halt-on-error autoupdate strategy use agent reports (#55116)

f8ee0a7

* Make half-on-error autoupdate strategy use agent reports
* Make report helpers reusable for time-based strategy
* address edoardo's feedback

hugoShaka merged 3 commits into master from hugo/progress-halt-on-error-based-on-reports


4 participants
