services: fix data integrity errors for Nomad native services #20590

tgross · 2024-05-15T14:55:37Z

This changeset fixes three potential data integrity issues between allocations and their Nomad native service registrations.

* When a node is marked down because it missed heartbeats, we remove Vault and Consul tokens (for the pre-Workload Identity workflows) after we've written the node update to Raft. This is unavoidably non-transactional because the Consul and Vault servers aren't in the same Raft cluster as Nomad itself. But we've unnecessarily mirrored this same behavior to deregister Nomad services. This makes it possible for the leader to successfully write the node update to Raft without removing services.

  To address this, move the delete into the same Raft transaction. One minor caveat with this approach is the upgrade path: if the leader is upgraded first and a node is marked down during this window, older followers will have stale information until they are also upgraded. This is unavoidable without requiring the leader to unconditionally make an extra Raft write for every down node until 2 LTS versions after Nomad 1.8.0. This temporary reduction in data integrity for stale reads seems like a reasonable tradeoff.
* When an allocation is marked client-terminal from the client in `UpdateAllocsFromClient`, we have an opportunity to ensure data integrity by deregistering services for that allocation.
* When an allocation is deleted during eval garbage collection, we have an opportunity to ensure data integrity by deregistering services for that allocation. This is a cheap no-op if the allocation has been previously marked client-terminal.
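The idea behind the three fixes can be illustrated with a minimal sketch. This is not Nomad's state store code: `StateStore`, `ServiceRegistration`, `Register`, `MarkNodeDown`, and `DeleteServicesByAlloc` are hypothetical names, and a mutex-guarded critical section stands in for a single Raft-applied transaction. It shows the node status change and the service deletion happening together, plus a later per-allocation delete that is a no-op when nothing is left to remove.

```go
package main

import (
	"fmt"
	"sync"
)

// ServiceRegistration is a hypothetical, simplified service record.
type ServiceRegistration struct {
	ID      string
	NodeID  string
	AllocID string
}

// StateStore is a hypothetical in-memory stand-in for replicated state.
type StateStore struct {
	mu       sync.Mutex
	nodes    map[string]string // node ID -> status
	services map[string]ServiceRegistration
}

func NewStateStore() *StateStore {
	return &StateStore{
		nodes:    map[string]string{},
		services: map[string]ServiceRegistration{},
	}
}

// Register records a service and marks its node ready.
func (s *StateStore) Register(svc ServiceRegistration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.nodes[svc.NodeID] = "ready"
	s.services[svc.ID] = svc
}

// MarkNodeDown updates the node status and deletes that node's services
// inside one critical section, so neither change is visible without the
// other. It returns the number of services removed.
func (s *StateStore) MarkNodeDown(nodeID string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.nodes[nodeID] = "down"
	removed := 0
	for id, svc := range s.services {
		if svc.NodeID == nodeID {
			delete(s.services, id) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

// DeleteServicesByAlloc removes any services left for an allocation.
// It removes nothing if they were already deleted earlier.
func (s *StateStore) DeleteServicesByAlloc(allocID string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	removed := 0
	for id, svc := range s.services {
		if svc.AllocID == allocID {
			delete(s.services, id)
			removed++
		}
	}
	return removed
}

func main() {
	store := NewStateStore()
	store.Register(ServiceRegistration{ID: "web-1", NodeID: "node-a", AllocID: "alloc-1"})
	fmt.Println("removed with node update:", store.MarkNodeDown("node-a"))
	fmt.Println("removed at alloc GC:", store.DeleteServicesByAlloc("alloc-1"))
}
```

In this sketch the second call finds nothing to delete, which mirrors the "cheap no-op" case described for garbage collection.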

This changeset does not address client-side retries for the originally reported issue, which will be done in a separate PR.

Ref: #16616

shoenig

LGTM!

jrasell

LGTM, thanks @tgross!

When the allocation is stopped, we deregister the service in the alloc runner's `PreKill` hook. This ensures we delete the service registration and wait for the shutdown delay before shutting down the tasks, so that workloads can drain their connections. However, the call to remove the workload only logs errors and never retries them.

Add a short retry loop to the `RemoveWorkload` method for Nomad services, so that transient errors give us an extra opportunity to deregister the service before the tasks are stopped, before we need to fall back to the data integrity improvements implemented in #20590.

Ref: #16616
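A short retry loop of this kind might look like the sketch below. It is not the code from the PR: `removeWithRetry` and `flakyRemover` are hypothetical helpers, and the attempt count and backoff are arbitrary values chosen for illustration. The `remove` callback stands in for whatever call deregisters the service.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// removeWithRetry calls remove up to attempts times, sleeping for backoff
// between failed attempts. It returns nil on the first success, or the
// last error wrapped with the attempt count.
func removeWithRetry(remove func() error, attempts int, backoff time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = remove(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(backoff) // no sleep after the final attempt
		}
	}
	return fmt.Errorf("deregistration failed after %d attempts: %w", attempts, err)
}

// flakyRemover returns a remove function that fails the first `failures`
// calls, along with a pointer to its call counter. Illustration only.
func flakyRemover(failures int) (func() error, *int) {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errors.New("transient RPC error")
		}
		return nil
	}, &calls
}

func main() {
	remove, calls := flakyRemover(2)
	err := removeWithRetry(remove, 3, 10*time.Millisecond)
	fmt.Println("error:", err, "calls:", *calls)
}
```

Keeping the loop short matters here because it runs before the tasks are stopped; a persistent failure is still only reported to the caller, and the server-side cleanup from #20590 remains the fallback.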

vercel bot deployed to Preview – nomad-storybook-and-ui May 15, 2024 14:58 View deployment

tgross added this to the 1.8.0 milestone May 15, 2024

tgross marked this pull request as ready for review May 15, 2024 15:33

tgross added the theme/service-discovery, theme/service-discovery/nomad, type/bug, backport/1.6.x, and backport/1.7.x labels May 15, 2024

tgross requested review from jrasell, shoenig and Juanadelacuesta May 15, 2024 15:33

tgross mentioned this pull request May 15, 2024

Services not unregistered #16616

Closed

tgross force-pushed the ensure-nomad-services-deregistered branch from 2fcae6f to 0ef189b Compare May 15, 2024 15:39

tgross force-pushed the ensure-nomad-services-deregistered branch from 0ef189b to f5761e9 Compare May 15, 2024 15:39

vercel bot deployed to Preview – nomad-storybook-and-ui May 15, 2024 15:44 View deployment

shoenig approved these changes May 15, 2024

jrasell approved these changes May 15, 2024

tgross merged commit 6d806a9 into main May 15, 2024
19 checks passed

tgross deleted the ensure-nomad-services-deregistered branch May 15, 2024 15:56

This was referenced May 15, 2024

Backport of services: fix data integrity errors for Nomad native services into release/1.6.x #20591

Merged

Backport of services: fix data integrity errors for Nomad native services into release/1.7.x #20592

Merged

tgross mentioned this pull request May 15, 2024

services: retry failed Nomad service deregistrations from client #20596

Merged

This was referenced May 16, 2024

Backport of services: retry failed Nomad service deregistrations from client into release/1.6.x #20606

Merged

Backport of services: retry failed Nomad service deregistrations from client into release/1.7.x #20607

Merged
