Fix major version check for stateless environment #52837
We would probably need to check inventory store to identify version sets for the auth servers, for instance if we have in cluster |
Force-pushed: 08b234a to 7f24cf5
|
Also was trying in parallel PR to reuse current |
hugoShaka left a comment
If I understand correctly, during the startup:
- the auth infos are listed on startup to discover the lowest and highest major versions
- teleport writes its own authinfo resource with a TTL of 24h
I see several issues with the current approach:
- the auth info write happens on startup; if no auth was restarted in the last 24 hours, all authinfo resources have expired and the check becomes useless because it cannot determine the previous auth version
- it is not possible to perform two major updates sequentially on the same day
Note: there's a variant of the first issue where no auths were running in the last 24 hours. This often happens when people forget to clear their dynamo backend and a new cluster reuses the old backend. So making the authInfo writes periodic might not completely solve the issue.
I don't have another solution to suggest yet, but I would expect the version check to have the following properties:
- consistently works even if the cluster was downscaled for a week
- supports every officially supported version transition (e.g. updating from v15 to v16 and v16 to v17 the same day, as long as no 2 auths are running concurrently)
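The properties above can be sketched as a pure comparison between the last version recorded in the backend and the version of the starting auth. This is only a minimal illustration, not the code from this PR: `checkTransition` and `majorOf` are hypothetical names, and the rule that an upgrade may advance by at most one major version is an assumption about what "officially supported" means here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorOf extracts the major component from a version string such as
// "16.4.2" or "v16.4.2". Hypothetical helper, stdlib only.
func majorOf(version string) (int, error) {
	head, _, _ := strings.Cut(strings.TrimPrefix(version, "v"), ".")
	major, err := strconv.Atoi(head)
	if err != nil {
		return 0, fmt.Errorf("invalid version %q: %w", version, err)
	}
	return major, nil
}

// checkTransition reports whether an auth running `local` may start against a
// backend whose last recorded version is `stored`.
// Assumption: upgrades may advance by at most one major version and major
// downgrades are rejected. An empty `stored` means a fresh backend.
func checkTransition(stored, local string) error {
	if stored == "" {
		return nil
	}
	s, err := majorOf(stored)
	if err != nil {
		return err
	}
	l, err := majorOf(local)
	if err != nil {
		return err
	}
	switch {
	case l > s+1:
		return fmt.Errorf("cannot upgrade from v%d to v%d: skips a major version", s, l)
	case l < s:
		return fmt.Errorf("cannot downgrade from v%d to v%d", s, l)
	}
	return nil
}

func main() {
	// Two sequential major upgrades on the same day: each step is checked
	// against the version recorded by the previous step, so no TTL is involved.
	fmt.Println(checkTransition("15.4.0", "16.0.0"))
	fmt.Println(checkTransition("16.0.0", "17.0.0"))
	// Skipping a major version is rejected.
	fmt.Println(checkTransition("15.4.0", "17.0.0"))
}
```

Because the check depends only on a stored value and not on which auths happened to be online recently, it behaves the same after a week of downscale as it does after a restart a minute ago.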
|
@hugoShaka I was mostly worried about the case where, after a downscale, you won't be able to update the other auth instances because of existing records from the downscaled instances; this is why the expiry is set. We might keep it as simple as possible and always check per host ID instead of a range of versions (but this doesn't prevent the case where v17 and v16 instances are running in the cluster and the v16 one is downgraded to v15; there might be issues with role registration in storage, for instance, such as a new major version deprecating or adding a role). Another option I thought about is to use inventory storage and check the version of online nodes only (this one requires time to propagate |
This will not work on stateless deployments as hostIDs are not persisted. |
|
I wonder if a single non-expiring cluster-wide BackendInfo resource would work. Something like:
With optimistic locking we should be protected against races. The resource would represent the backend state pretty closely. Another variant would be to create one BackendInfo per Teleport version; on startup we would list all backend infos and look at the highest. This would be closer to a regular database migration mechanism, and the resource would represent the history of the backend migrations, which might be useful. cc @espadolini @rosstimothy and @fspmarshall |
|
You can't do conditional operations on key ranges (and even fetching key ranges is sort of wobbly and slightly inconsistent, time-wise), so the logic becomes a lot more awkward if we have to make decisions based on a range of items. I'm +1 for recording "the version" of the cluster in a single item and conditionally updating it at auth start, with the understanding that this only acts as a lint and not as a hard compatibility guarantee (the only way to truly do that would be to make every read and write an atomic operation that also checks the version item, AFAICT). We could reuse |
hugoShaka left a comment
This looks way better, thanks for taking the time to do all the changes again.
cc @espadolini for another review 🙏
    if _, err := s.backed.Put(ctx, item); err != nil {
        return trace.Wrap(err)
    }
The unconditional put makes a race possible: someone might have updated the version in the meantime. This operation should use conditional updates based on revision. I think we have all the machinery available in the generic service struct.
I'm changing this one, but I'm not sure it is good to fail auth init: when we scale auth instances in parallel and spawn the same auth version across the cluster, they all have to conditionally write to the same key, and theoretically, if they launch in the same second, one of the instances must fail to write and restart.
Every time you're doing a read-modify-write you should handle the possibility of having to retry - it would be really weird to have to retry more than a handful of times here, so maybe give it a completely unscientific 5 attempts before giving up?
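The bounded retry suggested above might look roughly like the following. This is a sketch against an in-memory stand-in, not the Teleport backend API: `store`, `conditionalUpdate`, `updateWithRetry`, and `errRevisionMismatch` are all hypothetical names, and the integer revision is an assumption made to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errRevisionMismatch is returned when the item changed after it was read.
var errRevisionMismatch = errors.New("revision mismatch")

// item is a hypothetical stand-in for the single cluster-wide version record.
type item struct {
	value    string
	revision int
}

// store is an in-memory stand-in for a backend with conditional updates.
type store struct {
	mu   sync.Mutex
	item item
}

func (s *store) get() item {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.item
}

// conditionalUpdate writes value only if the stored revision still matches.
func (s *store) conditionalUpdate(value string, expectedRevision int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.item.revision != expectedRevision {
		return errRevisionMismatch
	}
	s.item = item{value: value, revision: expectedRevision + 1}
	return nil
}

// maxAttempts mirrors the "completely unscientific 5 attempts" from the review.
const maxAttempts = 5

// updateWithRetry runs a read-modify-write cycle: read the item, let decide
// pick the next value from the current one, then write conditionally.
// A revision mismatch restarts the cycle from the read.
func updateWithRetry(s *store, decide func(current string) (string, error)) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		cur := s.get()
		next, err := decide(cur.value)
		if err != nil {
			return err // the decision itself failed; retrying would not help
		}
		err = s.conditionalUpdate(next, cur.revision)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errRevisionMismatch) {
			return err
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, errRevisionMismatch)
}

func main() {
	s := &store{item: item{value: "16.3.0"}}
	err := updateWithRetry(s, func(current string) (string, error) {
		return "16.4.2", nil
	})
	fmt.Println(s.get().value, err)
}
```

With this shape, two auths of the same version starting in the same second do not need to restart: the loser of the conditional write re-reads, sees the value it wanted to write, and succeeds on the next attempt.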
Change to use conditional update and retry Fix spelling
Force-pushed: 4c117df to 9215b65
|
@espadolini @hugoShaka could you please have another look? |
    // WriteTeleportVersion writes the last known Teleport version to the backend.
    func (s *AuthInfoService) WriteTeleportVersion(ctx context.Context, version semver.Version) (err error) {
If this is just going to write the version, you can just use upsert. The point of conditional operations is that you read the current value, make a decision based on that value, and then update the value as long as nothing else has changed it. With this implementation, a cluster running 16.3.0 that happens to launch a 16.4.2 and a 17.3.3 auth at the same time might end up with a stored version of 16.4.2 if the write of the first auth ends up happening later than the write of the second.
This function should be called UpdateTeleportVersion or something along those lines, and it should take the revision of the previous auth_info resource.
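To make the scenario concrete, here is a deterministic replay of that interleaving, first with an unconditional upsert and then with an update that takes the revision the caller read. The types and function names (`versionItem`, `upsert`, `updateVersion`, `errStale`) are hypothetical and stand in for the real resource and service; the integer revision is likewise an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errStale signals that the item changed after the caller read it.
var errStale = errors.New("stored version changed since it was read")

// versionItem is a hypothetical stand-in for the stored version resource.
type versionItem struct {
	version  string
	revision int
}

// upsert overwrites unconditionally: the last writer wins.
func upsert(it *versionItem, version string) {
	it.version = version
	it.revision++
}

// updateVersion writes only if the revision the caller read is still current.
func updateVersion(it *versionItem, version string, revision int) error {
	if it.revision != revision {
		return errStale
	}
	it.version = version
	it.revision++
	return nil
}

func main() {
	// Both auths start against a cluster at 16.3.0; the 17.3.3 write lands
	// first and the 16.4.2 write lands second.
	a := &versionItem{version: "16.3.0"}
	upsert(a, "17.3.3")
	upsert(a, "16.4.2")
	fmt.Println("upsert:", a.version) // upsert: 16.4.2

	// Same interleaving, but both writers pass the revision they read.
	b := &versionItem{version: "16.3.0"}
	rev := b.revision
	_ = updateVersion(b, "17.3.3", rev)
	err := updateVersion(b, "16.4.2", rev)
	fmt.Println("conditional:", b.version, "err:", err)
}
```

In the conditional case the late 16.4.2 writer is forced to re-read, at which point it sees 17.3.3 and can make its decision against the real current state instead of silently regressing the stored version.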
Makes sense. I've moved the retry logic to the version check function, where we request the auth_info resource with its revision and retry if it has already been created or updated by another process.
Force-pushed: 17f4eef to 3d20bfc
…n't support deletion (requires implementation for kube secrets storage)
Force-pushed: 3d20bfc to 140da4d
Add cleanup of the version item from process storage after successful migration
Force-pushed: 7e37261 to 8956cef
Force-pushed: f96ac71 to 3aa4da5
|
@espadolini @hugoShaka I hope I've addressed all your comments; if I missed something, let me know.
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
* Added auth information resource with persisting teleport version
* Check the set of the auth info to compare with min/max versions in cluster
* Store only one entity for the cluster version
* Add CRUD endpoints
  Change to use conditional update and retry
  Fix spelling
* Add validation for version, sub kind, kind, name
* Rename to authinfo.go
* Assigning error after creating new resource
* Move retry logic to version check helper
* Call create/update depends on if resource is created already
  Read local database once without retry
* Restrict major version downgrade
* Make `--skip-version-check` available for major upgrade check
* Fix linter warnings
* Add logs and make skip more safe in case of broke item in backend
* Remove deleting version item from process database, stateBackend doesn't support deletion (requires implementation for kube secrets storage)
* Rename AuthInfo to BackendInfo
  Add cleanup of the version item from process storage after successful migration
* Make skip-version-check to upsert the version in backend
* Update lib/auth/version.go
  Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
---------
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
In this PR, local process storage is replaced with backend storage with proper resource serialization. For existing self-hosted clusters, the preserved last known version must be migrated to backend storage.
Related: #49848
Changelog: Fixed major version check for stateless environment