Skip to content

Fix major version check for stateless environment#52837

Merged
vapopov merged 20 commits intomasterfrom
vapopov/major-version-check-v2
May 5, 2025
Merged

Fix major version check for stateless environment#52837
vapopov merged 20 commits intomasterfrom
vapopov/major-version-check-v2

Conversation

@vapopov
Copy link
Copy Markdown
Contributor

@vapopov vapopov commented Mar 6, 2025

In this PR local process storage replaced with the backend storage with proper resource serialization. For current self-hosted clusters preserved last known version must be migrated to backend storage.

Related: #49848
Changelog: Fixed major version check for stateless environment

@vapopov
Copy link
Copy Markdown
Contributor Author

vapopov commented Mar 6, 2025

We would probably need to check inventory store to identify version sets for the auth servers, for instance if we have in cluster auth1:v17.0.0, auth2:v18.0.0 - auth1 should be restricted to be downgraded to v16, and auth2 to be upgraded to v19. Not clear if we able to retrieve full information from inventory stream in auth initialization step. tbd

@vapopov vapopov force-pushed the vapopov/major-version-check-v2 branch from 08b234a to 7f24cf5 Compare March 6, 2025 10:57
@vapopov vapopov requested review from espadolini and hugoShaka March 11, 2025 01:51
@vapopov
Copy link
Copy Markdown
Contributor Author

vapopov commented Mar 11, 2025

Also was trying in parallel PR to reuse current PresenceService which stores the Teleport version in auth_server spec (to not create additional resource), but LocalSupervisor seems like keep updating this info at every start even if auth service is failed to init.

@vapopov vapopov marked this pull request as ready for review March 11, 2025 06:45
@github-actions github-actions Bot requested review from kiosion and probakowski March 11, 2025 06:45
Copy link
Copy Markdown
Contributor

@hugoShaka hugoShaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, during the startup:

  • the auth infos are listed on startup to discover the lowest and highest major versions
  • teleport writes its own authinfo resource with a TTL of 24h

I see several issues with the current approach:

  • the auth info write happens on startup, if no auth got restarted in the last 24 hours, all authinfo resources are expired and the check becomes useless as it cannot figure the previous auth version
  • it is not possible to perform sequentially two major updates the same day

Note: there's a variant of the first issue where no auth were running in the last 24 hours, this often happens when people forget to clear their dynamo backend and a new cluster reuses the old backend. So making the authInfo writes periodic might not completely solve the issue.

I don't have another solution to suggest yet but I would expect the version check to have the following properties:

  • consistently works even if the cluster was downscale for a week
  • supports every officially supported version transitions (e.g. updating from v15 to v16 and v16 to v17 the same day, as long as no 2 auths are running concurrently)

@vapopov
Copy link
Copy Markdown
Contributor Author

vapopov commented Mar 12, 2025

@hugoShaka I was mostly worried case when after downscale you wont be able to update other auth instances because of existing records of downscaled instances, this is why expiry is set. We might just make it simple as possible and always check per host id instead of range of versions (but this don't prevent the case when we have v17 and v16 instances running in cluster to downgrade v16 one to v15, there might be the issues with roles registration in storage for instance, like deprecating or adding new role in new major version).

Another option I thought about is to use inventory storage and check the version for only online nodes (this one requires time to propagate hello messages from inventory stream, so we need to delay auth initialization). Correct me if I wrong.

@hugoShaka
Copy link
Copy Markdown
Contributor

We might just make it simple as possible and always check per host id instead of range of versions

This will not work on stateless deployments as hostIDs are not persisted.

@hugoShaka
Copy link
Copy Markdown
Contributor

hugoShaka commented Mar 12, 2025

I wonder if a single non-expiring cluster-wide BackendInfo resource would work. Something like:

  • auth starts
  • auth gets BackendInfo
    • if BackendInfo version is too low we crash
    • if BackendInfo is lower than us, we update backend info
    • if BackendInfo is higher than us, we don't change anything
  • auth continues to start and runs its migrations
  • (optionally) auth updates BackendInfo again to note that it finished applied its migrations

With optimistic locking we should be protected against races. The resource would represent pretty closely the backend state.

Another variant would be to create one BackendInfo per Teleport version, on startup we would list all backend infos and look at the highest. This would be closer to a regular database migration mechanism and the resource would represent the history of the backend migrations, which might be useful.

cc @espadolini @rosstimothy and @fspmarshall

@espadolini
Copy link
Copy Markdown
Contributor

You can't do conditional operations on key ranges (and even fetching key ranges is sort of wobbly and slightly inconsistent, time-wise), so the logic becomes a lot more awkward if we have to make decisions based on a range of items.

I'm +1 for recording "the version" of the cluster in a single item and conditionally updating it at auth start, with the understanding that this only acts as a lint and not as a hard compatibility guarantee (the only way to truly do that would be to make every read and write an atomic operation that also checks the version item, AFAICT). We could reuse --skip-version-check (which only currently works on proxies and peripheral agents) to bypass the check to allow for downgrading, too.

@kiosion kiosion removed their request for review April 8, 2025 23:49
Copy link
Copy Markdown
Contributor

@hugoShaka hugoShaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks way better, thanks for taking the time to do all the changes again.

cc @espadolini for another review 🙏

Comment thread lib/services/local/authinfo.go Outdated
Comment thread lib/services/local/authinfo.go Outdated
Comment thread lib/services/local/authinfo.go Outdated
Comment on lines +106 to +108
if _, err := s.backed.Put(ctx, item); err != nil {
return trace.Wrap(err)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unconditional put makes racing possible, someone might have updated the version in the meantime. This operation should use conditional updates based on revision. I think we have all the machinery available in the generic service struct.

Copy link
Copy Markdown
Contributor Author

@vapopov vapopov Apr 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm changing this one, but not sure if it is good to fail auth init because when we do parallel scale of auth instances and spawning same auth versions in cluster, they have to conditionally write to the same key and theoretically if they launched and the same second one of the instances must fail to write and restart

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every time you're doing a read-modify-write you should handle the possibility of having to retry - it would be really weird to have to retry more than a handful of times here, so maybe give it a completely unscientific 5 attempts before giving up?

@hugoShaka hugoShaka self-requested a review April 9, 2025 15:39
Comment thread lib/services/local/authinfo.go Outdated
Comment thread lib/services/local/authinfo.go Outdated
Comment thread api/types/authinfo/info.go Outdated
Comment thread lib/auth/version.go Outdated
Comment thread lib/services/local/authinfo.go Outdated
Comment thread api/types/authinfo/info.go Outdated
Comment thread lib/services/local/authinfo.go Outdated
Comment thread lib/services/local/authinfo.go Outdated
Comment thread api/proto/teleport/authinfo/v1/authinfo.proto Outdated
vapopov added 2 commits April 15, 2025 17:23
Change to use conditional update and retry
Fix spelling
@vapopov vapopov force-pushed the vapopov/major-version-check-v2 branch from 4c117df to 9215b65 Compare April 16, 2025 00:35
@vapopov
Copy link
Copy Markdown
Contributor Author

vapopov commented Apr 17, 2025

@espadolini @hugoShaka could you please have another look?

Comment thread api/proto/teleport/authinfo/v1/authinfo.proto Outdated
Comment thread api/proto/teleport/authinfo/v1/authinfo.proto Outdated
Comment thread api/types/authinfo/info.go Outdated
Comment thread api/types/authinfo/info.go Outdated
Comment thread lib/services/local/authinfo.go Outdated
Comment thread lib/services/local/authinfo.go Outdated
Comment thread lib/services/local/authinfo.go Outdated
Comment thread lib/services/local/authinfo.go Outdated
Comment thread lib/services/local/authinfo.go Outdated
Comment thread lib/services/local/authinfo.go Outdated
Comment on lines +91 to +92
// WriteTeleportVersion writes the last known Teleport version to the backend.
func (s *AuthInfoService) WriteTeleportVersion(ctx context.Context, version semver.Version) (err error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is just going to write the version you can just use upsert, the point of conditional operations is that you read the current value, you make a decision based on the value and then you update the value as long as nothing else changed it. With this implementation, a cluster running 16.3.0 that happens to launch a 16.4.2 and a 17.3.3 auth at the same time might end up with a stored version of 16.4.2 if the write of the first auth ends up happening later than the write of the second.

This function should be called UpdateTeleportVersion or something along those lines, and it should take the revision of the previous auth_info resource.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense, I've moved retry logic to version check function, where we request auth_info resource with revision and retry if its already created or already updated by another process

@vapopov vapopov force-pushed the vapopov/major-version-check-v2 branch from 17f4eef to 3d20bfc Compare May 1, 2025 09:37
@vapopov vapopov force-pushed the vapopov/major-version-check-v2 branch from 3d20bfc to 140da4d Compare May 1, 2025 09:44
Add cleanup of the version item from process storage after successful migration
@vapopov vapopov force-pushed the vapopov/major-version-check-v2 branch from 7e37261 to 8956cef Compare May 2, 2025 01:35
@vapopov vapopov force-pushed the vapopov/major-version-check-v2 branch from f96ac71 to 3aa4da5 Compare May 2, 2025 04:29
@vapopov
Copy link
Copy Markdown
Contributor Author

vapopov commented May 2, 2025

@espadolini @hugoShaka hope I've addressed all your comments, if I miss something let me know

Comment thread lib/auth/version.go Outdated
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
@vapopov vapopov added this pull request to the merge queue May 5, 2025
Merged via the queue into master with commit a78cb0e May 5, 2025
43 checks passed
@vapopov vapopov deleted the vapopov/major-version-check-v2 branch May 5, 2025 23:12
@backport-bot-workflows
Copy link
Copy Markdown
Contributor

@vapopov See the table below for backport results.

Branch Result
branch/v15 Failed
branch/v16 Failed
branch/v17 Failed

vapopov added a commit that referenced this pull request May 8, 2025
* Added auth information resource with persisting teleport version

* Check the set of the auth info to compare with min/max versions in cluster

* Store only one entity for the cluster version

* Add CRUD endpoints
Change to use conditional update and retry
Fix spelling

* Add validation for version, sub kind, kind, name

* Rename to authinfo.go

* Assigning error after creating new resource

* Move retry logic to version check helper

* Call create/update depends on if resource is created already
Read local database once without retry

* Restrict major version downgrade

* Make `--skip-version-check` available for major upgrade check

* Fix linter warnings

* Add logs and make skip more safe in case of broke item in backend

* Remove deleting version item from process database, stateBackend doesn't support deletion (requires implementation for kube secrets storage)

* Rename AuthInfo to BackendInfo
Add cleanup of the version item from process storage after successful migration

* Make skip-version-check to upsert the version in backend

* Update lib/auth/version.go

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

---------

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
vapopov added a commit that referenced this pull request May 8, 2025
* Added auth information resource with persisting teleport version

* Check the set of the auth info to compare with min/max versions in cluster

* Store only one entity for the cluster version

* Add CRUD endpoints
Change to use conditional update and retry
Fix spelling

* Add validation for version, sub kind, kind, name

* Rename to authinfo.go

* Assigning error after creating new resource

* Move retry logic to version check helper

* Call create/update depends on if resource is created already
Read local database once without retry

* Restrict major version downgrade

* Make `--skip-version-check` available for major upgrade check

* Fix linter warnings

* Add logs and make skip more safe in case of broke item in backend

* Remove deleting version item from process database, stateBackend doesn't support deletion (requires implementation for kube secrets storage)

* Rename AuthInfo to BackendInfo
Add cleanup of the version item from process storage after successful migration

* Make skip-version-check to upsert the version in backend

* Update lib/auth/version.go

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

---------

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
vapopov added a commit that referenced this pull request May 8, 2025
* Added auth information resource with persisting teleport version

* Check the set of the auth info to compare with min/max versions in cluster

* Store only one entity for the cluster version

* Add CRUD endpoints
Change to use conditional update and retry
Fix spelling

* Add validation for version, sub kind, kind, name

* Rename to authinfo.go

* Assigning error after creating new resource

* Move retry logic to version check helper

* Call create/update depends on if resource is created already
Read local database once without retry

* Restrict major version downgrade

* Make `--skip-version-check` available for major upgrade check

* Fix linter warnings

* Add logs and make skip more safe in case of broke item in backend

* Remove deleting version item from process database, stateBackend doesn't support deletion (requires implementation for kube secrets storage)

* Rename AuthInfo to BackendInfo
Add cleanup of the version item from process storage after successful migration

* Make skip-version-check to upsert the version in backend

* Update lib/auth/version.go

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

---------

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
vapopov added a commit that referenced this pull request May 8, 2025
* Added auth information resource with persisting teleport version

* Check the set of the auth info to compare with min/max versions in cluster

* Store only one entity for the cluster version

* Add CRUD endpoints
Change to use conditional update and retry
Fix spelling

* Add validation for version, sub kind, kind, name

* Rename to authinfo.go

* Assigning error after creating new resource

* Move retry logic to version check helper

* Call create/update depends on if resource is created already
Read local database once without retry

* Restrict major version downgrade

* Make `--skip-version-check` available for major upgrade check

* Fix linter warnings

* Add logs and make skip more safe in case of broke item in backend

* Remove deleting version item from process database, stateBackend doesn't support deletion (requires implementation for kube secrets storage)

* Rename AuthInfo to BackendInfo
Add cleanup of the version item from process storage after successful migration

* Make skip-version-check to upsert the version in backend

* Update lib/auth/version.go

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

---------

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
vapopov added a commit that referenced this pull request May 8, 2025
* Added auth information resource with persisting teleport version

* Check the set of the auth info to compare with min/max versions in cluster

* Store only one entity for the cluster version

* Add CRUD endpoints
Change to use conditional update and retry
Fix spelling

* Add validation for version, sub kind, kind, name

* Rename to authinfo.go

* Assigning error after creating new resource

* Move retry logic to version check helper

* Call create/update depends on if resource is created already
Read local database once without retry

* Restrict major version downgrade

* Make `--skip-version-check` available for major upgrade check

* Fix linter warnings

* Add logs and make skip more safe in case of broke item in backend

* Remove deleting version item from process database, stateBackend doesn't support deletion (requires implementation for kube secrets storage)

* Rename AuthInfo to BackendInfo
Add cleanup of the version item from process storage after successful migration

* Make skip-version-check to upsert the version in backend

* Update lib/auth/version.go

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

---------

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
vapopov added a commit that referenced this pull request May 8, 2025
* Added auth information resource with persisting teleport version

* Check the set of the auth info to compare with min/max versions in cluster

* Store only one entity for the cluster version

* Add CRUD endpoints
Change to use conditional update and retry
Fix spelling

* Add validation for version, sub kind, kind, name

* Rename to authinfo.go

* Assigning error after creating new resource

* Move retry logic to version check helper

* Call create/update depends on if resource is created already
Read local database once without retry

* Restrict major version downgrade

* Make `--skip-version-check` available for major upgrade check

* Fix linter warnings

* Add logs and make skip more safe in case of broke item in backend

* Remove deleting version item from process database, stateBackend doesn't support deletion (requires implementation for kube secrets storage)

* Rename AuthInfo to BackendInfo
Add cleanup of the version item from process storage after successful migration

* Make skip-version-check to upsert the version in backend

* Update lib/auth/version.go

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>

---------

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
github-merge-queue Bot pushed a commit that referenced this pull request May 9, 2025
* Added auth information resource with persisting teleport version

* Check the set of the auth info to compare with min/max versions in cluster

* Store only one entity for the cluster version

* Add CRUD endpoints
Change to use conditional update and retry
Fix spelling

* Add validation for version, sub kind, kind, name

* Rename to authinfo.go

* Assigning error after creating new resource

* Move retry logic to version check helper

* Call create/update depends on if resource is created already
Read local database once without retry

* Restrict major version downgrade

* Make `--skip-version-check` available for major upgrade check

* Fix linter warnings

* Add logs and make skip more safe in case of broke item in backend

* Remove deleting version item from process database, stateBackend doesn't support deletion (requires implementation for kube secrets storage)

* Rename AuthInfo to BackendInfo
Add cleanup of the version item from process storage after successful migration

* Make skip-version-check to upsert the version in backend

* Update lib/auth/version.go



---------

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants