Use a non-global metrics registry in Teleport #50913

Merged
hugoShaka merged 3 commits into master from hugo/teleport-use-non-global-metrics-registry
Jan 10, 2025

Conversation

Contributor

@hugoShaka hugoShaka commented Jan 9, 2025

This PR adds a new non-global per-process metrics registry in Teleport.

Using the global registry and global metrics causes conflicts in tests as we are starting multiple Teleport processes and/or other non-teleport processes (tbot, the operator, ...).

Having a new per-process metrics registry will allow Teleport services to register metrics scoped to their Teleport process. This will reduce the conflicts happening in tests.

To ensure backward compatibility, the Teleport metrics server serves both the process-scoped registry and the global registry.

Required for the autoupdate controller metrics PR.

@hugoShaka
Contributor Author

I didn't want to add a metrics RFD, but it would be good to start using the process registry instead of the global one for the next features we build/metrics we add.

@hugoShaka hugoShaka added the no-changelog, backport/branch/v15, and backport/branch/v17 labels Jan 9, 2025
Contributor

@codingllama codingllama left a comment

Looks good to me, but I'll let the experts approve first.

// and the global registry (used by some Teleport services and many dependencies).
gatherers := prometheus.Gatherers{
prometheus.DefaultGatherer,
process.metricsRegistry,
Contributor

If conflicting metrics are registered I assume they'll be dropped, but unaffected metrics will keep working. Do you know if that's correct?

Contributor Author

@hugoShaka hugoShaka Jan 9, 2025

> If conflicting metrics are registered I assume they'll be dropped

Currently, registration conflicts in the global registry can cause:

  • hard failure / error returned
  • panics
  • silent failure (metric does not get registered and we don't know about it)

Adding a local registry will not change the failure modes in case of conflict in the same registry. However, we are adding a new failure mode: metrics conflicting between the local and global registry. In this case, ~~the global will prevail (I did this for backward compatibility reasons as everything is using the global registry today)~~ the local registry will take precedence.

As we start using the local registry more, we might create such hard-to-detect conflicts. The situation is not strictly worse than today (we already have some racy metric registration with silent failure going on 😬). To ensure no conflicts happen, we can prefix new metrics by wrapping the registry when passing it to the service.

I think we would benefit from a metrics guideline: setting the Teleport component in the metric subsystem would reduce the probability of conflict.
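
The cross-registry failure mode and the prefixing idea can be illustrated with a toy model (standard library only). Everything here is an assumption made for illustration: `Registry`, `WithPrefix`, and `Gather` are not the Prometheus API, and the rule that the first registry listed wins is just what this sketch implements. In real code, client_golang's `prometheus.WrapRegistererWithPrefix` provides this kind of wrapping.

```go
package main

import (
	"fmt"
	"sort"
)

// Registerer is the minimal interface a service needs to declare a metric.
type Registerer interface {
	Register(name string, value float64) error
}

// Registry is a toy registry mapping metric names to a current value.
type Registry struct {
	metrics map[string]float64
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{metrics: make(map[string]float64)}
}

// Register rejects a name that is already present in this registry.
func (r *Registry) Register(name string, value float64) error {
	if _, ok := r.metrics[name]; ok {
		return fmt.Errorf("metric %q already registered", name)
	}
	r.metrics[name] = value
	return nil
}

// prefixed forwards registrations to next with a prefix added to the name.
type prefixed struct {
	prefix string
	next   Registerer
}

func (p prefixed) Register(name string, value float64) error {
	return p.next.Register(p.prefix+name, value)
}

// WithPrefix wraps reg so that every metric registered through the wrapper
// is namespaced.
func WithPrefix(prefix string, reg Registerer) Registerer {
	return prefixed{prefix: prefix, next: reg}
}

// Gather merges registries in order. On a name clash the first registry
// listed wins and the clashing name is reported. That precedence rule is an
// assumption of this sketch.
func Gather(regs ...*Registry) (map[string]float64, []string) {
	merged := make(map[string]float64)
	var conflicts []string
	for _, r := range regs {
		for name, v := range r.metrics {
			if _, seen := merged[name]; seen {
				conflicts = append(conflicts, name)
				continue
			}
			merged[name] = v
		}
	}
	sort.Strings(conflicts)
	return merged, conflicts
}

func main() {
	local, global := NewRegistry(), NewRegistry()

	// The same name in both registries: each registration succeeds on its
	// own, and the clash only appears when the two are gathered together.
	_ = global.Register("requests_total", 1)
	_ = local.Register("requests_total", 2)

	// A service handed a prefixed registerer cannot collide with the
	// unprefixed name. "myservice_" is a hypothetical prefix.
	_ = WithPrefix("myservice_", local).Register("requests_total", 3)

	merged, conflicts := Gather(local, global)
	fmt.Println("merged:", merged)
	fmt.Println("conflicts:", conflicts)
}
```

Registering through the wrapper turns `requests_total` into `myservice_requests_total`, so only the unprefixed name clashes when both registries are gathered.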

Contributor

Thanks, very informative.

@hugoShaka hugoShaka force-pushed the hugo/teleport-use-non-global-metrics-registry branch from 853ac49 to 7be0540 Compare January 9, 2025 20:28
@hugoShaka hugoShaka force-pushed the hugo/teleport-use-non-global-metrics-registry branch from d153fda to fa31b4a Compare January 9, 2025 22:46
Comment on lines +3447 to +3448
// As we move more things to the local registry, especially in other tools like tbot, we will have less
// conflicts in tests.
Contributor

Is it advantageous to move all of our current global metrics to the local registry? If so what kind of migration strategy should we have to eliminate global metrics?

Contributor Author

@hugoShaka hugoShaka Jan 10, 2025

> Is it advantageous to move all of our current global metrics to the local registry?

I think so because we will:

  • stop having conflicts between tbot, teleport and other programs when running in the same test
  • start having accurate metrics when running multiple teleport components together (e.g. in tests, or embedded tbot)
  • stop picking up random metrics declared by dependencies we have

> If so what kind of migration strategy should we have to eliminate global metrics?

I haven't thought about this yet, but by supporting both we can take our time with the transition. I'd like to get a few new metrics using the local registry before choosing a recommended pattern. Once we know how we want metrics to be declared and collected, we can write a short metrics RFD and start passing the local registry to the different services.

Migrating might not be trivial because we rely heavily on package-scoped metrics and global registries. We will need to:

  • propagate the registerer from the main process to every service registering metrics
  • start putting metrics in structs instead of a package-scoped var
  • get rid of the sync.Once and other hacks currently in place to avoid double registration

I think tbot is a very good starting point because of its limited scope, the conflicts caused by embeddedtbot, and the conflicts it causes in integration tests.
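
A rough sketch of the three migration steps listed above, again with toy standard-library types rather than the Prometheus client. `Counter`, `Registry`, `Service`, `serviceMetrics`, and the metric name are all hypothetical; the sketch only shows the shape: the registerer is passed into the constructor, the metrics live in a struct owned by the service, and no `sync.Once` is needed because each instance registers against its own registry.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Counter is a toy monotonically increasing metric.
type Counter struct {
	n int64
}

// Inc adds one to the counter.
func (c *Counter) Inc() { atomic.AddInt64(&c.n, 1) }

// Value returns the current count.
func (c *Counter) Value() int64 { return atomic.LoadInt64(&c.n) }

// Registerer is what a service needs in order to declare its metrics.
type Registerer interface {
	Register(name string, c *Counter) error
}

// Registry is a toy process-scoped registry (not safe for concurrent
// registration, which is enough for this sketch).
type Registry struct {
	counters map[string]*Counter
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{counters: make(map[string]*Counter)}
}

// Register rejects a name that is already present in this registry.
func (r *Registry) Register(name string, c *Counter) error {
	if _, ok := r.counters[name]; ok {
		return fmt.Errorf("metric %q already registered", name)
	}
	r.counters[name] = c
	return nil
}

// serviceMetrics holds the metrics of one service instance, replacing
// package-scoped variables.
type serviceMetrics struct {
	requests *Counter
}

// newServiceMetrics creates the metrics and registers them with reg.
func newServiceMetrics(reg Registerer) (*serviceMetrics, error) {
	m := &serviceMetrics{requests: &Counter{}}
	if err := reg.Register("example_service_requests_total", m.requests); err != nil {
		return nil, err
	}
	return m, nil
}

// Service is a hypothetical service that owns its metrics.
type Service struct {
	metrics *serviceMetrics
}

// NewService receives the registerer from whoever starts the service.
func NewService(reg Registerer) (*Service, error) {
	m, err := newServiceMetrics(reg)
	if err != nil {
		return nil, err
	}
	return &Service{metrics: m}, nil
}

// HandleRequest counts one request on this instance only.
func (s *Service) HandleRequest() { s.metrics.requests.Inc() }

func main() {
	// Two instances in the same binary, each with its own registry.
	a, err := NewService(NewRegistry())
	if err != nil {
		fmt.Println("register:", err)
		return
	}
	b, err := NewService(NewRegistry())
	if err != nil {
		fmt.Println("register:", err)
		return
	}
	a.HandleRequest()
	a.HandleRequest()
	b.HandleRequest()
	fmt.Println("a:", a.metrics.requests.Value(), "b:", b.metrics.requests.Value())
}
```

Because each instance counts into its own struct, two components running in the same test report separate values instead of sharing one package-level counter.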

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
@hugoShaka hugoShaka enabled auto-merge January 10, 2025 15:38
@hugoShaka hugoShaka added this pull request to the merge queue Jan 10, 2025
Merged via the queue into master with commit 5b5bab9 Jan 10, 2025
@hugoShaka hugoShaka deleted the hugo/teleport-use-non-global-metrics-registry branch January 10, 2025 16:03
@public-teleport-github-review-bot

@hugoShaka See the table below for backport results.

| Branch | Result |
| --- | --- |
| branch/v15 | Failed |
| branch/v16 | Failed |
| branch/v17 | Failed |

hugoShaka added a commit that referenced this pull request Jan 17, 2025
* Support a non-global registry in Teleport

* lint

* Update lib/service/service.go

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
hugoShaka added a commit that referenced this pull request Jan 21, 2025
* Support a non-global registry in Teleport

* lint

* Update lib/service/service.go

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
github-merge-queue bot pushed a commit that referenced this pull request Jan 21, 2025
* Use a non-global metrics registry in Teleport (#50913)

* Support a non-global registry in Teleport

* lint

* Update lib/service/service.go

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

* Serve metrics from the local registry in the diagnostic service (#51031)

* Use local metrics registry in the diagnostic service

* Test metrics are served by the diag service

* Init local registry at runtime instead of config (#51074)

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
github-merge-queue bot pushed a commit that referenced this pull request Jan 21, 2025
* Use a non-global metrics registry in Teleport (#50913)

* Support a non-global registry in Teleport

* lint

* Update lib/service/service.go

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

* Serve metrics from the local registry in the diagnostic service (#51031)

* Use local metrics registry in the diagnostic service

* Test metrics are served by the diag service

* Init local registry at runtime instead of config (#51074)

* lint

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
carloscastrojumo pushed a commit to carloscastrojumo/teleport that referenced this pull request Feb 19, 2025
* Support a non-global registry in Teleport

* lint

* Update lib/service/service.go

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
hugoShaka added a commit that referenced this pull request May 1, 2025
* Use a non-global metrics registry in Teleport (#50913)

* Support a non-global registry in Teleport

* lint

* Update lib/service/service.go

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

* Serve metrics from the local registry in the diagnostic service (#51031)

* Use local metrics registry in the diagnostic service

* Test metrics are served by the diag service

* Init local registry at runtime instead of config (#51074)

* lint

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
github-merge-queue bot pushed a commit that referenced this pull request May 12, 2025
* bump e

* Make UnknownResource proto-friendly (#54047)

* [v15] In-process metrics registry (#51204)

* Use a non-global metrics registry in Teleport (#50913)

* Support a non-global registry in Teleport

* lint

* Update lib/service/service.go

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

* Serve metrics from the local registry in the diagnostic service (#51031)

* Use local metrics registry in the diagnostic service

* Test metrics are served by the diag service

* Init local registry at runtime instead of config (#51074)

* lint

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

* Fix metrics registry after rebase

* set cmc to mon-thu (#53767)

* Fix CMC weekdays bug (#54076) (#54116)

* [v15] Backport manage updates v2 (#53286)

* [v15] RFD 184: managed updates v2, server-side logic, client, and package

* Add autoupdate agent protos (#47666)

* Add autoupdate agent protos

* fix tests

* Add create/update/delete RPCs + add missing event proto

* Update api/proto/teleport/autoupdate/v1/autoupdate.proto

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

* address timr's feedback + fix tests

* buf lint

* buf lint pt.2

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

* fix agent autoupdate protos (#47830)

* Add autoupdate agent type validations (#47831)

* Add autoupdate agent validations

* Add AutoUpdateAgentRollout constants

* Fix autoupdate API licenses

Teleport's `api/` and `integrations/` should be Apache-licensed.

Only the main teleport process should be licensed under AGPLv3.

* address feedback

* Add AutoUpdateAgentRollout service and cache (#47833)

* Fix defaults on incomplete AU config or version resources (#47872)

* Fix panic on incomplete AU config or version resources

* lint

* address tiago's feedback

* [v17] enforce conditional updates on AutoUpdate* + rename typos (#48390)

* enforce conditional updates on AutoUpdate* + rename typos

* fix tests

* [v17] implement autoupdate_agent_rollout reconciler (#48944)

* implement autoupdate_agent_rollout reconciler

* address edoardo's feedback

* address edoardo's feedback pt.2

* fixup! address edoardo's feedback

* lint

* [v17] RFD 184: automatic updates, server-side logic (#52275)

* Implement immediate schedule support for automatic updates (#47920)

* Implement immediate schedule support

* expose edition, fips, and ensure ping endpoint answers

* fix after rebase

* fix cache tests

* introduce webclient.ReusableClient (#49296)

* Move autoupdate code in proxy to make more sense (#49484)

* Move autoupdate code in proxy to make more sense

* lint + godoc

* Start `autoupdate_agent_rollout` controller in auth service (#49101)

* run autoupdate_agent_rollout controller

* Recover from panics inside the controller

* Address tim's feedback

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

* kube-agent-updater: add RFD-184 trigger and version getter (#49297)

* add proxy version getter and maintenance trigger

* add failover trigger and versionGetter

* lint

* Apply suggestions from code review

Co-authored-by: Marco Dinis <marco.dinis@goteleport.com>

* address marco's feedback

* licensing

---------

Co-authored-by: Marco Dinis <marco.dinis@goteleport.com>

* Rename lib/kubernetestoken to lib/kube/token (#49554)

* Rename lib/kubernetestoken to lib/kube/token

* Lint

* Make the proxy read from autoupdate_agent_rollout (#49380)

* Add autoupdate_agent_rollout support

* fix ping proxy tests

* address creack's feedback

* Address sclevine's feedback

Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>

* fix panic in tests

---------

Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>

* Fix flaky TestAutoUpdateAgentShouldUpdate (#49883)

* Fix flaky TestAutoUpdateAgentShouldUpdate

* Update lib/web/apiserver_ping_test.go

* Update lib/web/autoupdate_common_test.go

* autoupdate: reconcile rollout status and add strategy interface (#49735)

* autoupdate: reconcile rollout status and add strategy interface

* fix missing constants + add license

* lint

* fix proto field id

* Fix flaky TestAgentRolloutController (#49886)

* Fix flaky TestAgentRolloutController

* switch to real clock + increase Eventually timeout

* Make reconciliation period a parameter + add TELEPORT_UNSTABLE env var

* Update lib/service/service_test.go

Co-authored-by: Alan Parra <alan.parra@goteleport.com>

* Apply suggestions from code review

Co-authored-by: Alan Parra <alan.parra@goteleport.com>

* Remove env var

* lint

---------

Co-authored-by: Alan Parra <alan.parra@goteleport.com>

* Compute global rollout state (#49945)

* Compute global rollout state

* Simplify + missing wrong proto message description

* lint

* simplify

* for edoardo

* fix compute status test

* autoupdate: implement time-based strategy (#49736)

This commit implements the time-based rollout strategy described in
RFD 184. The autoupdate_agent_rollout controller will make the groups
active based on their start days, start hour, and maintenance duration.
Once the maintenance window is over, the group becomes DONE.
In the DONE state, new agents will install the target version but
existing agents will no longer be told to actively update.

* Use CMC as default config when set (#50039)

* autoupdate: Use CMC as default config when set

Part of: [RFD-184](#47126)

This commit implements backward compatibility when CMC is specified.
After this PR, if the user has no `autoupdate_config` resource but a
`cluster_maintenance_config` resource from RFD 109, we will use the CMC
to generate the config (update hour and update days) and craft the
`autoupdate_agent_rollout`.

* Update lib/autoupdate/rollout/client_test.go

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

* address feedback

* lint

---------

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

* Change autoupdate proto messages (#50234)

* Change autoupdate proto messages

This commit makes 3 changes:
- reflect the maintenance duration on the rollout in a new spec field
- add a rollout start time field in its status
- change wait_days into wait_hours

* int64 -> int32 for consistency with other fields

* Add autoupdate_config and autoupdate_agent_rollout validation (#50181)

This commit removes the restrictions of the autoupdate_agent_rollout and autoupdate_config schedules but adds groups validation.

It also adds some optional server-side validation that should not be enforced at the resource level.

* autoupdate: implement halt-on-error strategy (#49737)

* autoupdate: implement halt-on-error strategy

* rewrite wait_days logic into wait_hours

* Apply suggestions from code review

Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>

---------

Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>

* add tctl create/get/edit support for autoupdate_agent_rollout (#50393)

* add tctl create/get/edit support for autoupdate_agent_rollout

* fix bad copy paste

* set rollout start date and don't start updating if rollout just changed (#50365)

This commit does two changes:
- the controller now sets the rollout start time when resetting the
  rollout
- the controller will not start a group if the rollout changed during
  the maintenance window (checks if the rollout start time is in the
  window)

* Reduce clock usage + add time and period override in rollout controller (#50634)

* Enable strategies in the autoupdate rollout controller (#50635)

* autoupdate rollout: honour the maintenance window duration (#50745)

* autoupdate rollout: honour the maintenance window duration

* Update lib/autoupdate/rollout/reconciler.go

Co-authored-by: Bartosz Leper <bartosz.leper@goteleport.com>

* Address feedback

* Update lib/autoupdate/rollout/strategy.go

---------

Co-authored-by: Bartosz Leper <bartosz.leper@goteleport.com>

* Fix proto resource 153 marshalling for autoupdate_* resources (#50688)

* Fix proto resource 153 marshalling

* Update tool/tctl/common/collection_test.go

Co-authored-by: Alan Parra <alan.parra@goteleport.com>

* Update tool/tctl/common/collection_test.go

Co-authored-by: Alan Parra <alan.parra@goteleport.com>

* Address feedback

- Change from Resource153AdapterV2 to ProtoResource153Adapter
- fix test failures and unmarshal proto resources properly
- add a failing round-trip proto 153 test case
- bonus: fix the resource create table test, which did not support
  running a single row

* Apply suggestions from code review

Co-authored-by: Alan Parra <alan.parra@goteleport.com>

* lint

---------

Co-authored-by: Alan Parra <alan.parra@goteleport.com>

* Add autoupdate controller metrics (#50807)

* Add autoupdate controller metrics

* Do not panic in case of error conflict

* kube-agent-update: Use the RFD-184 webapi proxy update protocol by default when possible (#50464)

* kube-agent-update: Use the RFD-184 webapi proxy update protocol by default when possible

* Update integrations/kube-agent-updater/cmd/teleport-kube-agent-updater/main.go

Co-authored-by: Tiago Silva <tiago.silva@goteleport.com>

* log update group

---------

Co-authored-by: Tiago Silva <tiago.silva@goteleport.com>

* Add 'tctl autoupdate agents status' (#51079)

* Ensure proxy version getter adds the leading 'v' (#51687)

* Always create debug socket and expose health endpoints (#51616)

* Always create debug socket and expose health endpoints

* Consolidate the diagnostic multiplexers in a single function

* Fix tests

* Apply suggestions from code review

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

---------

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

* Fix autoupdate rollout controller metrics (#51803)

* kube-agent-updater pre-release builds trust the staging repo + insecure validator private repo fix (#51815)

* Fix insecure resolver in private repos + trust pre-release builds

* fixup! Fix insecure resolver in private repos + trust pre-release builds

* Use new autoupdate APIs in discovery service (#51758)

* Remove name parameter from proxy version getter

* Use autoupdate_agent_rollout as a source of version in scripts and integrations

* Fix tests

* Gracefully handle absence of a proxy in kube discovery service

* Update lib/srv/discovery/kube_integration_watcher.go

Co-authored-by: Tiago Silva <tiago.silva@goteleport.com>

* Address marco's feedback

* Address marco's feedback pt.2

* Gracefully handle if we can't get autoupdate version

* fixup! Update lib/srv/discovery/kube_integration_watcher.go

---------

Co-authored-by: Tiago Silva <tiago.silva@goteleport.com>

* Autoupdate changelog entry in v17.3

* Fix tests after rebase, pt.1

* Update front preset fixtures since the preset role changed

* Add install script using teleport-update and oneoff.sh (#52155)

* Refactor node-join script to take safer options and reuse install option logic (#52196)

* Add install script using teleport-update and oneoff.sh

* Refactor node-join script to take safer options and reuse install option logic

* GoDoc + make functions private

* Address edoardo's feedback

* Allow prerelease Teleport to install official artifacts (#52444)

* Accept to install CE when running an AGPL build for backward compat

* Bump e to fix build (oneoff args change)

* Make node install scripts install Teleport via teleport-update (#52226)

* Make the node install script use teleport-update

* Apply suggestions from code review

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

* Fix curl args + address bash exec comments

---------

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

* Use install.sh in discovery's default installer (#52368)

* Use install.sh in discovery's default installer

* fixup! Use install.sh in discovery's default installer

* Address marco's feedback

* Update lib/auth/grpcserver.go

Co-authored-by: Marco Dinis <marco.dinis@goteleport.com>

* Update lib/srv/server/installer/defaultinstallers.go

* apply edoard's feedback + write script to file

* Execute the downloaded shell script

* Add snapshot tests

* fixup! Add snapshot tests

---------

Co-authored-by: Marco Dinis <marco.dinis@goteleport.com>

* Fix error after rebase

* Fix test after rebase

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
Co-authored-by: Marco Dinis <marco.dinis@goteleport.com>
Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>
Co-authored-by: Alan Parra <alan.parra@goteleport.com>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
Co-authored-by: Bartosz Leper <bartosz.leper@goteleport.com>
Co-authored-by: Tiago Silva <tiago.silva@goteleport.com>

* [v17] Modulate install script when managed updates v2 are off (#52609)

* Modulate install script when managed updates v2 are off

* fixup! Modulate install script when managed updates v2 are off

* Address Stephen's feedback

* Set the autoupdate singleton names (#52751)

* Add autoupdate events to web UI (#52748) (#52838)

* Add autoupdate events to web UI

* lint

* Fix backport to include the label fix

* lint

* fix tests

* Add teleport-update binary scaffolding and disable command (#46418)

* Add main.go

* wip

* group flag

* wip

* wip

* mvp

* wip

* separate files

* cleanup

* jitter

* scaffold only

* remove teleport changes

* remove teleport changes - group

* test

* test lock

* remove edition

* feedback

* clarify default data dir

* cleanup

* move version to status

* consistent naming for update.yaml

* improve lock test

* explain lint

* use shared locking logic

* fix test

* Move disable logic to lib

* feedback

* switch to default transport

* [teleport-update] Add enable command (#47565)

* Add enable scaffold

* add installer

* refactor

* add enable tests

* clean up download logic

* Finish installer tests

* cleanup

* fix flags

* fix errors

* logging

* cleanup

* fix test

* Fix download size logic

* remove agent prefixes

* namespace package

* rename file

* feedback

* fips and ent support

* hide force version

* feedback

* feedback 2

* fix test

* move enterprise/fips to webapi

* Fix interface

* RFD 0184: Automatic Updates for Teleport Agents (#47126)

* Create 0169-auto-updates-linux-agents.md

* Fix github handle

* Fix Github handle

* Clarify jitter flag

* Remove time question

* Update rfd/0169-auto-updates-linux-agents.md

Co-authored-by: Russell Jones <russjones@users.noreply.github.com>

* Update rfd/0169-auto-updates-linux-agents.md

Co-authored-by: Russell Jones <russjones@users.noreply.github.com>

* Update rfd/0169-auto-updates-linux-agents.md

Co-authored-by: Russell Jones <russjones@users.noreply.github.com>

* Update 0169-auto-updates-linux-agents.md

* Update 0169-auto-updates-linux-agents.md

* Update 0169-auto-updates-linux-agents.md

* add editions

* Installers and docs

* Update 0169-auto-updates-linux-agents.md

* Update 0169-auto-updates-linux-agents.md

* Update 0169-auto-updates-linux-agents.md

* Update 0169-auto-updates-linux-agents.md

* Update 0169-auto-updates-linux-agents.md

* Downgrades

* Feedback

* Update 0169-auto-updates-linux-agents.md

* Remove last working copy of teleport

* add step to ensure free disk space

* Typos

* Update 0169-auto-updates-linux-agents.md

* Update 0169-auto-updates-linux-agents.md

* feedback

* Update 0169-auto-updates-linux-agents.md

* Update 0169-auto-updates-linux-agents.md

* apt purge

* Only enable auto-upgrades if successful

* reentrant lock

* reset

* Update 0169-auto-updates-linux-agents.md

* add note on backups

* Update 0169-auto-updates-linux-agents.md

* Update 0169-auto-updates-linux-agents.md

* Clarify restore/rollback process and validations

* Added section on logging

* Add schedules

* immediate schedule + note on cycles and chains

* more details, more tctl commands

* Update 0169-auto-updates-linux-agents.md

* scalability

* df

* content-length

* cache init

* binary

* more rollout mechanism changes

* scalability

* more scalability

* use 100kib pages for plan

* Add RPCs, tweak API design

* clarify wording

* wording

* Update rfd/0169-auto-updates-linux-agents.md

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

* Update rfd/0169-auto-updates-linux-agents.md

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

* linting

* Move all RPCs into autoupdate/v1

* Move groups to MVP

* note about checksum

* typos, consistency

* clarify binary is teleport-update, package is teleport-ent-updater

* switch from df to unix.Statfs

* security feedback + naming adjustments

* tweak rollout paging

* tweak rollout paging again

* feedback

* adjust update.yaml to match implementation feedback

* wip - new model

* canaries

* canary 2

* describe state, transitions, and proxy response

* rpcs

* finish rpcs

* minor tweaks

* Add user stories

* Put new requirements at the top + edit UX + add TODOs

* Edition work

* cleanup + swap phases 1 and 2

* Move protobuf

* Add installation scenarios

* cleanup + move backpressure formulas

* more cleanup

* rename to unused number

* fix title

* more cleanup

* correct inconsistencies

* fix more inconsistencies

* missing proxy flag

* typo

* Add CLI reference

* feedback

* alerts note

* typos

* Update rfd/0184-agent-auto-updates.md

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

* clarify canary logic

* Update rfd/0184-agent-auto-updates.md

Co-authored-by: Hugo Shaka <hugo.hervieux@goteleport.com>

* Support for multiple installations / tarball

* Address reviewer's feedback

- Rephrase the UX section to not assume prior canary knowledge
- Explicit how the canaries are picked, the limitations, and potential
  improvements
- replace node with instance to avoid confusion between ssh nodes and
  generic teleport agent instances
- Explicit how the previous updater interacts with the new one
- More explicit names for command line args

* agent_plan -> agent_rollout + reuse autoupdate_config

* align tool version

* Move package system dir

* add time-based strategy

* rename previous-must-succeed -> halt-on-failure

---------

Co-authored-by: Russell Jones <russjones@users.noreply.github.com>
Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
Co-authored-by: hugoShaka <hugo.hervieux@goteleport.com>

* [v17] RFD 184: Agent Automatic Updates, teleport-update (#52372)

* [teleport-update] Add linking into /usr/local (#47879)

* clean up download logic

* Finish installer tests

* fips and ent support

* feedback

* move enterprise/fips to webapi

* wip

* wip2

* add cleanup

* fix extract

* wip

* fix tests

* remove safety

* cleanup

* cleanup extract

* cleanup

* cleanup

* fix bugs

* cleanup

* [teleport-update] Use new webapi fields to find version (#47961)

* Adapt teleport-update to new webapi endpoints

* feedback

* [teleport-update] Add support for reloading the agent & reverting symlinks on failed reload (#47929)

* wip

* cleanup

* comments

* test wip

* test link revert

* tests

* cleanup

* cleanup more

* comments

* comments

* errors

* comments

* linting

* fix bugs

* fix typo

* cleanup

* cleanup

* fix revert

* lint

* feedback

* fix

* fix test

* clarify comment

* use afterfunc

* [teleport-update] Add update subcommand (#48244)

* Add update subcommand

* fix

* lint

* add command

* warn on known edition

* warn on unknown edition for update

* [teleport-update] Add link subcommand (#48712)

* wip

* refactor

* docs

* updater

* add link command

* test LinkPackage

* cleanup

* fix enterprise paths

* fix systemd linking

* typo

* comment

* comments

* typo

* feedback

* adjust systemd service locations

* cleanup tests, adjust service link path

* [teleport-update] PID-based failure detection and rollback (#49175)

* Extract from other PR

* comments

* string

* [teleport-update] Add systemd setup (#49174)

* service and timer

* comments

* feedback

* feedback

* [teleport-update] Add unlink-package command (#49250)

* unlink

* test

* lock type

* comments

* cleanup

* Update lib/autoupdate/agent/installer.go

Co-authored-by: Hugo Shaka <hugo.hervieux@goteleport.com>

---------

Co-authored-by: Hugo Shaka <hugo.hervieux@goteleport.com>

* [teleport-update] Add support for version pinning (#49307)

* pinning

* cleanup

* unskip

* cleanup

* unpin

* typo

* [teleport-update] status subcommand (#49308)

* status

* cleanup

* comments

* cleanup output by removing optional fields

* rebase fix

* [teleport-update] Uninstall subcommand (#49341)

* Uninstall

* tests

* comment

* Short-circuit link package on pinned

* log

* move error

* Update lib/autoupdate/agent/process.go

Co-authored-by: Hugo Shaka <hugo.hervieux@goteleport.com>

* Update lib/autoupdate/agent/process.go

Co-authored-by: Hugo Shaka <hugo.hervieux@goteleport.com>

* Update lib/autoupdate/agent/process.go

Co-authored-by: Hugo Shaka <hugo.hervieux@goteleport.com>

* Update lib/autoupdate/agent/process.go

Co-authored-by: Hugo Shaka <hugo.hervieux@goteleport.com>

* fix

---------

Co-authored-by: Hugo Shaka <hugo.hervieux@goteleport.com>

* [teleport-update] Protect against disk space leaks (#49309)

* cleanup unused

* cleanup

* cleanup

* [teleport-update] Show warning instead of return error for link/unlink (#49334)

* Add warning instead of return error for link/unlink

* Add test for sync call with ErrNotSupported

* Change warning message

* [teleport-update] Isolated installation suffix (#49364)

* namespacing

* words

* cli

* fix

* err

* use structured logs consistently

* comments

* bugs

* test

* switch to new paths

* test

* adjust

* reserved

* cleanup

* cleanup

* docs

* fix uninstall

* test

* simplify init

* cleanup

* namespace -> install-suffix

* log

* [teleport-update] Fix usage of trace (#49388)

* fix trace

* rebase

* [teleport-update] Support for Enterprise/FIPS migration (#49451)

* store ent/fips data

cleanup

formatting

revert updater rename

cleanup

Update lib/autoupdate/agent/config.go

Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

feedback

* feedback

* feedback

* lint

* [teleport-update] Display download progress and stats (#49805)

* download progress

* typo

* sub -> since

* time -> duration

* [teleport-update] update --now (#49807)

* update --now

* testdata

* [teleport-update] Adjust download progress log output (#49845)

* adjust logger

* fix

* fix

* Extended binary validations (#49748)

* [teleport-update] needrestart and systemd drop-in (#49806)

* wip

* Add more config

* nit

* feedback

* Fix duplicate teleport-update short command (#50304)

* [teleport-update] Version reporting and deprecated upgrader management (#50266)

* wip

* telemetry

* abs

* fix

* tests

* Disable deprecated timer

* keep schedule on non-suffixed

* Update maintenance.go

* Update lib/autoupdate/agent/setup.go

* update warnings

* feedback pt 1

* feedback pt 2

* headers

* [teleport-update] Remove warning when running Teleport on platforms without systemd (#51465)

* improve detection logic on non-systemd platforms

* adjust

* remove OS check

* [teleport-update] common MakeURL with ability to override BaseURL (#51383)

* Add templates for client tools auto-update download url

* Change to base url setting by env

MakeURL moved to common function to be general for both, agent and client tools

* Reuse MakeURL moved to common package

* Fix linter warning

* Add common env variable to override base url

* Remove template from interface

* Make template exported
Change a stale comment

* Remove unused code

* [teleport-update] Adjustments for SELinux (#51474)

* selinux fixes

* extra checks

* lint

* lint

* cleanup

* better cleanup

* fix rebase

* [teleport-update] Add --overwrite flag to replace tarball installations (#51579)

* add --overwrite flag

* extra warning

* [teleport-update] Only use CDN for community / enterprise editions (#51726)

* Only use CDN for community / enterprise

* wording

* [teleport-update] Warn instead of erroring when disabling the deprecated updater (#51759)

* Warn instead of erroring when disabling old updater

* Update lib/service/service.go

* Update lib/service/service.go

* [teleport-update] Adjust non-critical SELinux contexts (#51793)

* correct selinux contexts

* Update lib/autoupdate/agent/installer.go

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

* Update lib/autoupdate/agent/installer.go

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

* [teleport-update] Add proper healthcheck for agents (#51613)

* Add socket readiness monitor

* cleanup

* add 404 check

* check

* better cleanup

* fix bug

* typo

* fix 404

* improve logging

* cleanup

* disable socket redirect

* avoid race condition with socket removal

* verify PID

* cleanup

* Update lib/autoupdate/agent/process.go

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

* feedback

* fix subtle race condition

* debugging

---------

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

* [teleport-update] Allow teleport-update uninstall to succeed with non-packaged installs (#51576)

* Treat missing source bin dir same as missing binaries

* prevent linking package outside /usr/local/bin

* Apply suggestions from code review

* [teleport-update] use new updater to reload and verify Teleport (#51734)

* wip

* finish implementation

* fix tests

* test setup

* remove stale data

* bug

* spelling

* pass log format and debug through

* feedback

* [teleport-update] Read proxy from teleport.yaml to improve UX (#51633)

* derive proxy from config

* fix parsing

* cleanup

* require force for uninstall (#51973)

* [teleport-update] add insecure flag for testing (#52019)

* insecure flag

* fmt

* [teleport-update] skip updater setup when systemd is missing (#52022)

* skip updater installation when systemd is missing

* test

* wording

* [teleport-update] Ensure stable interface between versions of teleport-update (#52152)

* refactor data dir

* finish refactor

* fix path

* cleanup

* more tests

* lint

* prevent notice failure without systemd

* feedback

* url

* revert log level change (#52416)

---------

Co-authored-by: Hugo Shaka <hugo.hervieux@goteleport.com>
Co-authored-by: Vadym Popov <vadym.popov@goteleport.com>
Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

* [v17] [teleport-update] Fix usage of default $PATH dir, overrides, and hanging (#52608)

* Fix usage of default path

* fix other overrides

* fix hang on start

* [v17] [teleport-update] Set umask 0022 for teleport-update to avoid errors on enable (#52755)

* Set umask 0022 for teleport-update

* init -> main

* refactor

* move const

* add flag

* missed not

* fix inequality

* remove flag

* dead code

* docs

* docs 2

* feedback

* [v17] [teleport-update] Support for CentOS 7 (#53017)

* support systemd down to 219

* comments

* Apply suggestions from code review

Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

* Missed check on additional use of IsPresent

* adjustments from testing various versions of centos7

* Typo

* Use dedicated error for version incompat

---------

Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

* [v17] [teleport-update] Improve clarity of error logs and address UX edge cases (#53048)

* Usability fixes

* cancel jitter

* root + fix logs

* check extra case

* cleanup

* extra warning

* tests

* feedback

* add newlines

* adjust message

* consistent error type

* update UI snapshots

* [v17] Backport packaging restructuring and teleport-update build (#52361)

* [teleport-update] Add Makefile build target (#48531)

* Add build target for teleport-update

* Set CGO_ENABLED=0 for building teleport-update

* [teleport-update] Add teleport-update to build and archive (#48839)

* Add teleport-update to build and archive

* Add teleport-update to install scripts

* Add build flags without buildmode pie

* Add helper message for install.sh script

* Exclude teleport-update from darwin platform

* Add teleport-update to rpm and deb packages

* Remove teleport-update from deb, rpm packages
Add comment for the buildflags

* [teleport-update] Move teleport binaries to new path {deb,rpm} (#49110)

* Move teleport binaries to new path

* Use link/unlink command to manage links
Move teleport.service to new path

* Move teleport binaries under standard path for distroless
Cleanup

* Fix wrong move path

* Create missing directory

* Rename link/unlink commands

* Exclude teleport-update from docker image
Systemd reload now managed by teleport-update
Make safe unlink not to block package removal

* Add teleport-update to AMI image build

* Fix RPM build, fpm automatically manage scripts

* Fix AMI build, add missing teleport.service

* Move binaries to /opt/teleport/system

* Add check to installation script when we copy files from tarball (#50368)

* bump e

* Fix RPM linking logic (#52704)

* Use quoting style supported by pre-2015 systemd (#53179) (#53196)

* [teleport-update] Additional log message and UX cleanup (#53180) (#53197)

* More teleport-update UX cleanup

* cleanup overwrite error

* cleanup

* more cleanup

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
Co-authored-by: Marco Dinis <marco.dinis@goteleport.com>
Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>
Co-authored-by: Alan Parra <alan.parra@goteleport.com>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
Co-authored-by: Bartosz Leper <bartosz.leper@goteleport.com>
Co-authored-by: Tiago Silva <tiago.silva@goteleport.com>
Co-authored-by: Russell Jones <russjones@users.noreply.github.com>
Co-authored-by: Vadym Popov <vadym.popov@goteleport.com>
Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

* Fix proto resource 153 marshalling for autoupdate_* resources (#50688)

* Fix proto resource 153 marshalling

* Update tool/tctl/common/collection_test.go

Co-authored-by: Alan Parra <alan.parra@goteleport.com>

* Update tool/tctl/common/collection_test.go

Co-authored-by: Alan Parra <alan.parra@goteleport.com>

* Address feedback

- Change from Resource153AdapterV2 to ProtoResource153Adapter
- fix test failures and unmarshal proto resources properly
- add a failing round-trip proto 153 test case
- bonus: fix the table test resource create that did not support
  running a single row

* Apply suggestions from code review

Co-authored-by: Alan Parra <alan.parra@goteleport.com>

* lint

---------

Co-authored-by: Alan Parra <alan.parra@goteleport.com>

* Craft a teleport-update based installer compatible with v15 code

* fix frontend lint issue (not introduced by us?)

* remove AGPL tests in updater as v15 oss is not AGPL

* [teleport-update] Add local updater metadata (#53602) (#53829)

* add metadata

* newline

* cleanup status error

* refactor status error

* fix print

* order

* fix test on linux

* Truncate time to ms

* add host param to request

* jitter locally

* rename host to id

* rename func var

* [v16] [teleport-update] Stop writing updater ID from teleport-update (#54012)

* new strategy: use deterministic boot-persistent id

* add error

* check id length

* unexport machine id

* Set group to 'default' if unset + avoid setting default group in config (#54049)

* [v16] [teleport-update] Change strategy for disabling teleport-upgrade timer (#54086)

* Change strategy for disabling old upgrader

* logging

* remove file

* remove const

* cleanup

* comment about namespaced installs

* re-remove AGPL-related tests as there's no AGPL Teleport v15

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
Co-authored-by: Marco Dinis <marco.dinis@goteleport.com>
Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>
Co-authored-by: Alan Parra <alan.parra@goteleport.com>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
Co-authored-by: Bartosz Leper <bartosz.leper@goteleport.com>
Co-authored-by: Tiago Silva <tiago.silva@goteleport.com>
Co-authored-by: Russell Jones <russjones@users.noreply.github.com>
Co-authored-by: Vadym Popov <vadym.popov@goteleport.com>
Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

* backport testutils

* Use vanilla slog logger in teleport update

* go mod tidy

* remove usage of new go features

* run yarn prettier

* fix autoupdate agent rollout tests pt.1

* add 'autoupdate' to cspell

* add good ol' tt := tt in parallel test loop

* remove missing debug service in test

* gci + remove dead code

* fix bad conflict resolution in script

* make oneoff sh-compliant again

* fix oneoff tests

* tt := tt

* lint

* [teleport-update] Run FIPS teleport with --fips flag (#54529)

* fix fips bug with systemd service

* fix bugs, add testing flags

* fix tests

---------

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
Co-authored-by: Stephen Levine <stephen.levine@goteleport.com>
Co-authored-by: Marco Dinis <marco.dinis@goteleport.com>
Co-authored-by: Alan Parra <alan.parra@goteleport.com>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
Co-authored-by: Bartosz Leper <bartosz.leper@goteleport.com>
Co-authored-by: Tiago Silva <tiago.silva@goteleport.com>
Co-authored-by: Russell Jones <russjones@users.noreply.github.com>
Co-authored-by: Vadym Popov <vadym.popov@goteleport.com>
Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

Labels

backport/branch/v17 no-changelog Indicates that a PR does not require a changelog entry size/sm
