
Fix multiple duplicate deletion attempts #198

Merged — 4 commits merged into main from schneems/handle-race-condition on Mar 1, 2023

Conversation

schneems (Contributor)

When multiple processes are attempting to cull the old apps list to free up enough space (to get under the limit), each will try to delete the same app.

We already introduced a file-based lock for multiple processes on the same machine; however, the lock was not being used while clearing out old apps. That has been added.

This file locking doesn't fix the problem, but it limits it: instead of N machines each running M processes, you only have to account for N machines.
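For context, a lock like this is a small wrapper around `flock`. A minimal sketch, with a made-up lock path and method name rather than the gem's actual code:

```ruby
# Illustrative only: serialize the cleanup step across processes on one
# machine by holding an exclusive advisory lock on a shared file.
LOCK_PATH = "/tmp/hatchet_cleanup.lock" # hypothetical path, not the gem's

def with_cleanup_lock
  File.open(LOCK_PATH, File::RDWR | File::CREAT) do |f|
    f.flock(File::LOCK_EX) # blocks until this process holds the lock
    yield
  end                      # lock is released when the file handle closes
end

with_cleanup_lock do
  # delete apps that are over their TTL here
end
```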

The next fix is that when a duplicate deletion is detected, instead of having all machines continue on as fast as possible, the machines that "lost" the race will sleep to allow the winner to continue removing apps. When a loser wakes up, it will refresh its API list and try again. This happens in a loop until there are no more apps over TTL, or until it becomes the winner and other machines must sleep. This sleep logic uses the reaper backoff algorithm shared with the Platform API gem, detailed in my blog post https://www.schneems.com/2020/07/08/a-fast-car-needs-good-brakes-how-we-added-client-rate-throttling-to-the-platform-api-gem/. When it begins to sleep, the duration is very small; if there are lots of apps to be deleted and this triggers over and over, it will sleep longer and longer until it asymptotically reaches the max sleep value, which is also the max TTL value.
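A minimal sketch of that loop, assuming made-up helper names (`refresh_app_list`, `destroy`, `AlreadyDeletedError`) and a made-up starting interval and cap; the real algorithm lives in the gem and the platform-api client:

```ruby
# Illustrative only: losers sleep briefly, refresh the list, and retry;
# repeated conflicts push the sleep toward a cap equal to the max TTL.
MAX_SLEEP = 7 * 24 * 3600 # assumed cap in seconds (the max TTL)

def reap_over_ttl(ttl:)
  sleep_for = 1.0 # assumed starting interval
  loop do
    over_ttl = refresh_app_list.select { |app| app.age_in_seconds > ttl } # hypothetical helpers
    break if over_ttl.empty?

    begin
      destroy(over_ttl.first)                    # hypothetical deletion call
      sleep_for = 1.0                            # we "won"; reset the backoff
    rescue AlreadyDeletedError                   # another process deleted it first
      sleep(sleep_for)                           # let the winner keep working
      sleep_for = [sleep_for * 2, MAX_SLEEP].min # grow toward the cap
    end
  end
end
```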

The last change is that the `@apps` list (essentially a cache from the API) was not modified in this `destroy_older_apps` method, so unless `refresh_app_list` was called, every run of this method would start at the end and trigger N `AlreadyDeleted` errors as it's in a data race with itself. The new code removes an app from the list as it's being deleted to avoid this problem.
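Conceptually the cache fix boils down to something like this (illustrative names only; the gem's real `destroy_older_apps` differs):

```ruby
# Illustrative: drop each app from the cached list as it is processed so a
# later pass over the same cache doesn't re-attempt the same deletions.
def destroy_older_apps(ttl:)
  @apps.select { |app| app.age_in_seconds > ttl }.each do |app|
    destroy(app)
  rescue AlreadyDeletedError
    # another process won the race; still treat this app as handled
  ensure
    @apps.delete(app) # keep the cache in sync with what has been attempted
  end
end
```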

@schneems schneems force-pushed the schneems/handle-race-condition branch from 3f284a8 to 40d1101 Compare February 28, 2023 16:56
@schneems schneems marked this pull request as ready for review February 28, 2023 16:59
@edmorley edmorley left a comment (Member)


Thank you for fixing this!

This change does reduce the number of requests considerably; however, if I'm understanding correctly, in the case of, say, processes=60, the 59 processes that didn't get the lock will sleep until each one in turn can get an exclusive lock. This means 60 requests will still be made to the /apps endpoint just from that one CI job (even ignoring any other jobs in the stack-combinations matrix etc.), and all but the first will be redundant.

Is there a way to implement it such that any process that can't get an exclusive lock just gives up and moves straight on to creating apps? (At least when performing cleanup on startup; I suppose the "oh dear, hit max apps" case might still need to block/sleep.) That way there wouldn't be 60 requests to /apps, only one. (There would of course still be duplication between CI jobs on different machines, but that's fine / can't be avoided anyway.)
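For reference, `flock` does have a non-blocking mode (`LOCK_NB`), so the "give up and move on" idea could look roughly like this (an illustrative sketch with an assumed lock path and method name, not gem code):

```ruby
# Illustrative: attempt the lock without blocking; if another process already
# holds it, skip the startup cleanup entirely and go straight to creating apps.
File.open("/tmp/hatchet_cleanup.lock", File::RDWR | File::CREAT) do |f|
  if f.flock(File::LOCK_EX | File::LOCK_NB)
    destroy_older_apps # we won the lock, so do the cleanup
  else
    # another process is already cleaning up; don't wait, just continue
  end
end
```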

If that's not easy to implement (I'm not familiar with the `flock` API), then another option would be to remove the "on startup" cleanup entirely, and instead have end users add a single `hatchet destroy --older-than` invocation to their CI workflows prior to running the specs.

Thoughts?

@schneems schneems force-pushed the schneems/handle-race-condition branch 3 times, most recently from 71f1e4e to fef7d60 Compare February 28, 2023 19:23
Turns out there was a big flaw in my logic. That existing rescue will effectively abort execution every time instead of continuing iteration. 
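Illustratively, the fix is about where the rescue sits (hypothetical method and error names):

```ruby
# Before (illustrative): a single conflict aborts the remaining iteration.
begin
  over_ttl_apps.each { |app| destroy(app) }
rescue AlreadyDeletedError
  # landing here skips every app that hadn't been attempted yet
end

# After (illustrative): the conflict is handled per app and iteration continues.
over_ttl_apps.each do |app|
  destroy(app)
rescue AlreadyDeletedError
  next # this app was already gone; move on to the next one
end
```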

I'm moving it inside the loop and introducing an `on_conflict` strategy that can either be `:sleep_and_continue` or `:stop_if_under_limit`. When a conflict is detected it always sleeps, but only refreshes the API if the limit hasn't yet been met.

There are two main cases to consider (a sketch of both strategies follows the list):

- CLI (calling `hatchet destroy --older-than=7`): in this case you want to be extremely sure that ALL apps that are over TTL are deleted. For this case we use `:sleep_and_continue`.
- In the Hatchet app creation lifecycle: for every app we try to create, we first try to clean up old apps. The first time this triggers, it will hit the API and then cache the results. If there's an exception on app create, the app list is refreshed and then we try to delete any old apps, but on conflict we can stop if we know there's enough spare capacity to create an app (`:stop_if_under_limit`). After that, we sleep if we're not yet down to the limit and retry the whole thing.
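A rough sketch of how the two strategies could behave on a conflict (illustrative; apart from the two symbols, the names here are assumptions rather than the PR's actual code):

```ruby
# Illustrative: every conflict sleeps, but only some paths refresh the API and keep going.
def handle_conflict(on_conflict:, backoff:)
  sleep(backoff.next_interval) # hypothetical backoff object

  case on_conflict
  when :sleep_and_continue
    refresh_app_list # CLI case: keep going until everything over TTL is gone
    :continue
  when :stop_if_under_limit
    if under_app_limit? # hypothetical check against the account's app limit
      :stop             # enough capacity to create an app; no need to keep deleting
    else
      refresh_app_list
      :continue
    end
  else
    raise ArgumentError, "unknown on_conflict: #{on_conflict.inspect}"
  end
end
```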
@schneems schneems force-pushed the schneems/handle-race-condition branch from fef7d60 to a438aba Compare February 28, 2023 19:27
@schneems (Contributor Author)

@edmorley thanks for the look. I also spotted another bug, detailed in the commit. To handle the situation you've brought up, I introduced the idea of a configurable `on_conflict` strategy that can either result in an API refresh or not.

@edmorley edmorley left a comment (Member)


Thank you!

The original reason conflict strategies were added was to gate whether or not we hit the API again. Any time a conflict is hit, a sleep occurs, so in the `:sleep_and_continue` strategy the least important part is the sleep. I renamed it to put the behavior of refreshing the API front and center.

Also updated `bin/hatchet` to make it explicit instead of relying on defaults.
@schneems schneems enabled auto-merge (squash) March 1, 2023 16:40
@schneems schneems disabled auto-merge March 1, 2023 16:41
@schneems schneems enabled auto-merge (squash) March 1, 2023 16:41
@schneems schneems merged commit 3f13327 into main Mar 1, 2023
@schneems schneems deleted the schneems/handle-race-condition branch March 1, 2023 16:42