
Tune the canary logic #56926

Merged
hugoShaka merged 1 commit into master from hugo/allow-users-to-control-canaries
Jul 18, 2025

@hugoShaka
Contributor

Part of RFD 184: allow users to configure canaries

This PR introduces the following changes, as discussed with @sclevine:

  • Users can now specify how many canaries they want
  • To decide whether to run a canary update, we now rely on user input instead of looking at the current group size
  • The maximum number of canaries drops from 10 to 5 (I have not done the max message size yet)
  • Fix a bug causing the start date to be reset when a group goes from canary to active

Goal (internal): https://github.com/gravitational/cloud/issues/13207

@hugoShaka hugoShaka requested a review from sclevine July 18, 2025 00:07
@hugoShaka hugoShaka added the no-changelog Indicates that a PR does not require a changelog entry label Jul 18, 2025
@github-actions github-actions bot requested review from doggydogworld and r0mant July 18, 2025 00:07

// canary_count is the number of canary agents that will be updated before the whole group is updated.
// When set to 0, the group does not enter the canary phase. This number is capped at 5.
// This number must always be lower than the total number of agents in the group, else the rollout will be stuck.
int64 canary_count = 6;
Collaborator


Nit: Just curious if there's a specific reason this is int64? Seems like this will always be a small-ish number.

Contributor Author


int32 and int64 are the same size over the wire (both use varints), so I tend to use int64 by default. The main difference is that protojson encodes int64 values as JSON strings, while int32 values are encoded as JSON numbers. I'll switch to int32.
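
To make the wire-size point concrete, here is a minimal sketch using only the Go standard library. It assumes a non-negative value, which is the case for a canary count; `varintLen` is a hypothetical helper, and `binary.PutUvarint` is used as a stand-in because it produces the same base-128 varint bytes that protobuf uses for int32 and int64 fields.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// varintLen returns how many bytes a non-negative value occupies when
// encoded as a base-128 varint. The declared field width (int32 or int64)
// does not change this; only the magnitude of the value does.
func varintLen(v uint64) int {
	buf := make([]byte, binary.MaxVarintLen64)
	return binary.PutUvarint(buf, v)
}

func main() {
	fmt.Println(varintLen(5))   // 1
	fmt.Println(varintLen(300)) // 2
}
```

A count capped at 5 therefore takes a single payload byte with either field type.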

Comment thread api/types/autoupdate/config.go Outdated
@public-teleport-github-review-bot public-teleport-github-review-bot bot removed the request for review from doggydogworld July 18, 2025 03:17
@hugoShaka hugoShaka force-pushed the hugo/allow-users-to-control-canaries branch from 21ccdb6 to d59d604 Compare July 18, 2025 16:00
@hugoShaka hugoShaka enabled auto-merge July 18, 2025 16:01
- Users can now specify how many canaries they want
- Instead of looking at the current group size, we rely on user input
- max canary 10 -> 5 (I have not done the max message size yet)
- fix a bug causing the start date to be reset when doing canary ->
  active
@hugoShaka hugoShaka force-pushed the hugo/allow-users-to-control-canaries branch from d59d604 to 00885db Compare July 18, 2025 16:08
@hugoShaka hugoShaka added this pull request to the merge queue Jul 18, 2025
Merged via the queue into master with commit 82ba6ac Jul 18, 2025
44 checks passed
@hugoShaka hugoShaka deleted the hugo/allow-users-to-control-canaries branch July 18, 2025 16:50
hugoShaka added a commit that referenced this pull request Jul 18, 2025
- Users can now specify how many canaries they want
- Instead of looking at the current group size, we rely on user input
- max canary 10 -> 5 (I have not done the max message size yet)
- fix a bug causing the start date to be reset when doing canary ->
  active
hugoShaka added a commit that referenced this pull request Jul 23, 2025
hugoShaka added a commit that referenced this pull request Jul 24, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jul 25, 2025
* autoupdate canary support: proto messages (#56259)

* autoupdate canary support: inventory and auth primitives (#56261)

* autoupdate canary support: tctl (#56473)

* autoupdate canary support: tctl support

This commit makes `tctl autoupdate agents status` display groups in the
canary state properly.

* add `--force` flag to `tctl autoupdate agents start-update`

* autoupdate canary support: modulate proxy response (#56468)

This commit makes the Teleport Proxy service find and ping endpoints
fetch the updater ID from the request parameters and look up whether the
requester is a canary. If it is, the requester will be told to update.

* autoupdate canary support: rollout controller (#56467)

* autoupdate canary support: rollout controller

This commit adds canary support to the autoupdate_agent_rollout
controller when the strategy is "halt-on-error".

* Apply suggestions from code review

* Tune the canary logic (#56926)

- Users can now specify how many canaries they want
- Instead of looking at the current group size, we rely on user input
- max canary 10 -> 5 (I have not done the max message size yet)
- fix a bug causing the start date to be reset when doing canary ->
  active

* Fix autoupdate canary sampling for the catch-all group

* Reliably detect update.yaml after soft reloads

* Fix detection on initial install

* fix log

* Fix tests after rebase

* Always persist new configuration

* cleanup

* fix tests

---------

Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>
hugoShaka added a commit that referenced this pull request Jul 25, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jul 28, 2025
* Add rollout mutation functions (#52930)

* Add Trigger, Rollback, ForceDone autoupdate RPCs (#52931)

* Add Trigger, Rollback, ForceDone autoupdate RPCs

* Add all_started_groups bool + switch to group set

* fix error type

* Align semver libs (#52795)

* Convert autoupdate version handling to coreos/go-semver

* get the right version in installer endpoint + get rid of x/mod/semver

* depguard x/mod/semver

* Add nolint rules for existing x/mod/semver usages

* Add depguard explanation

* Add autoupdate trigger/mark-done/rollback commands (#52933)

* Add updater info in Hello (#53911)

* Introduce autoupdate_agent_report proto types (#54175)

* Introduce autoupdate_agent_report proto types

* Fix tests + remove delete all RPC

* Move updater info proto from authclient to types (#54236)

* Report updater info in Hello (#53938)

* Report updater info in Hello

* Add UUID to Hello

* lint

* Fix after rebase

---------

Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>

* Send goodbye even when doing soft-reload (#54176)

* Send goodbye even when doing soft-reload

* Save and replay Goodbye on connect

* Add SoftReload flag to Goodbye

* SendGoodbye -> SetAndSendGoodbye

* Display update group in `tctl inventory` (#54324)

* Add autoupdate manual rollout audit events (#52934)

* Add autoupdate trigger/mark-done/rollback audit events

* Remove useless resource metadata and add groups to audit event

* Add events to web UI

* Add autoupdate_agent_report backend service (#54333)

* Add autoupdate_agent_report backend service

* Saner resource validation

* Add agent rollout cache + service + client (#54772)

* Add agent rollout cache + service + client

* fix after rebase

* add event in tests

* fix autoupdateagentreport event streaming

* lint

* Fix backport: slog -> logrus

* Generate autoupdate agent report periodically (#54865)

* Generate autoupdate agent report periodically

* address edoardo's feedback

* Apply suggestions from code review

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

* fix proto field lookup + address feedback

* fix tests + add license

---------

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

* Add omission info in autoupdate report (#55001)

* Add agent counters to autoupdate_agent_rollout proto (#55096)

* Add agent counters to autoupdate_agent_rollout proto

* int64 -> uint64

* Add reports to client and rewrite mockClient using testify (#55097)

* Add reports to client and rewrite mockClient using testify

When adding the ListAutoUpdateAgentReports() function to the Client interface
I realized that the mock client was not supporting List endpoints.
Instead of expanding the custom mock system, I rewrote the mock client to use
the standard testify/mock library.

* checkIfEmpty -> checkIfCallsWereDone

* Make halt-on-error autoupdate strategy use agent reports (#55116)

* Make halt-on-error autoupdate strategy use agent reports

* Make report helpers reusable for time-based strategy

* address edoardo's feedback

* Set the agent count when reconciling time-based rollouts (#55152)

* Set the agent count when reconciling time-based rollouts

* Apply suggestions from code review

Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>

---------

Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>

* Fix flaky `TestServer_generateAgentVersionReport` (#56015)

* [v18] Add autoupdate agent report commands (#56495)

* Add autoupdate agent report commands

* Address feedback

* autoupdate canary support: proto messages (#56259)

* autoupdate canary support: inventory and auth primitives (#56261)

* autoupdate canary support: tctl (#56473)

* autoupdate canary support: tctl support

This commit makes `tctl autoupdate agents status` display groups in the
canary state properly.

* add `--force` flag to `tctl autoupdate agents start-update`

* autoupdate canary support: modulate proxy response (#56468)

This commit makes the Teleport Proxy service find and ping endpoints
fetch the updater ID from the request parameters and look up whether the
requester is a canary. If it is, the requester will be told to update.

* autoupdate canary support: rollout controller (#56467)

* autoupdate canary support: rollout controller

This commit adds canary support to the autoupdate_agent_rollout
controller when the strategy is "halt-on-error".

* Apply suggestions from code review

* Fix backport: add inventory clock + deal with edoardo breaking everything

* Fix tests after backport

* fixup! Fix tests after backport

* fixup! fixup! Fix tests after backport

* lint authproto -> clientproto

* Fix autoupdate canary sampling for the catch-all group

* Tune the canary logic (#56926)

- Users can now specify how many canaries they want
- Instead of looking at the current group size, we rely on user input
- max canary 10 -> 5 (I have not done the max message size yet)
- fix a bug causing the start date to be reset when doing canary ->
  active

* Reliably detect update.yaml after soft reloads

* always send group in agent hello (#55071)

* Fix detection on initial install

* fix log

* Always persist new configuration

* cleanup

* fix tests

* fix tests relying on go 1.24

* fix crd snapshot tests + fix linter issue

---------

Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

Labels

no-changelog (Indicates that a PR does not require a changelog entry), size/sm
