
Fix multiple duplicate deletion attempts #198

Merged — 4 commits merged into main from schneems/handle-race-condition on Mar 1, 2023

Conversation

schneems (Contributor)

When multiple processes are attempting to cull the old apps list to free up enough space (to get under the limit), each will try to delete the same app.

We already introduced a file-based lock for multiple processes on the same machine; however, the lock was not being used while clearing out old apps. That has been added.

This file locking doesn't fix the problem, but it limits it: instead of N machines each running M processes, you only have to account for N machines.
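For context, a lock like this is a small wrapper around `flock`. A minimal sketch, with a made-up lock path and method name rather than the gem's actual code:

```ruby
# Illustrative only: serialize the cleanup step across processes on one
# machine by holding an exclusive advisory lock on a shared file.
LOCK_PATH = "/tmp/hatchet_cleanup.lock" # hypothetical path, not the gem's

def with_cleanup_lock
  File.open(LOCK_PATH, File::RDWR | File::CREAT) do |f|
    f.flock(File::LOCK_EX) # blocks until this process holds the lock
    yield
  end                      # lock is released when the file handle closes
end

with_cleanup_lock do
  # delete apps that are over their TTL here
end
```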

The next fix is that when a duplicate deletion is detected, instead of having all machines continue on as fast as possible, the machines that "lost" the race will sleep to allow the winner to continue removing apps. When a loser wakes up, it will refresh its API list and try again. This happens in a loop until there are no more apps over TTL, or until it becomes the winner and other machines must sleep. This sleep logic uses the reaper backoff algorithm shared with the Platform API gem, detailed in my blog post https://www.schneems.com/2020/07/08/a-fast-car-needs-good-brakes-how-we-added-client-rate-throttling-to-the-platform-api-gem/. When it begins to sleep, the duration is very small; if there are lots of apps to be deleted and this triggers over and over, it will sleep longer and longer until it asymptotically reaches the max sleep value, which is also the max TTL value.
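A minimal sketch of that loop, assuming made-up helper names (`refresh_app_list`, `destroy`, `AlreadyDeletedError`) and a made-up starting interval and cap; the real algorithm lives in the gem and the platform-api client:

```ruby
# Illustrative only: losers sleep briefly, refresh the list, and retry;
# repeated conflicts push the sleep toward a cap equal to the max TTL.
MAX_SLEEP = 7 * 24 * 3600 # assumed cap in seconds (the max TTL)

def reap_over_ttl(ttl:)
  sleep_for = 1.0 # assumed starting interval
  loop do
    over_ttl = refresh_app_list.select { |app| app.age_in_seconds > ttl } # hypothetical helpers
    break if over_ttl.empty?

    begin
      destroy(over_ttl.first)                    # hypothetical deletion call
      sleep_for = 1.0                            # we "won"; reset the backoff
    rescue AlreadyDeletedError                   # another process deleted it first
      sleep(sleep_for)                           # let the winner keep working
      sleep_for = [sleep_for * 2, MAX_SLEEP].min # grow toward the cap
    end
  end
end
```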

The last change is that the `@apps` list (essentially a cache from the API) was not modified in this `destroy_older_apps` method, so unless `refresh_app_list` was called, every run of this method would start at the end and trigger N `AlreadyDeleted` errors as it's in a data race with itself. The new code removes an app from the list as it's being deleted to avoid this problem.
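Conceptually the cache fix boils down to something like this (illustrative names only; the gem's real `destroy_older_apps` differs):

```ruby
# Illustrative: drop each app from the cached list as it is processed so a
# later pass over the same cache doesn't re-attempt the same deletions.
def destroy_older_apps(ttl:)
  @apps.select { |app| app.age_in_seconds > ttl }.each do |app|
    destroy(app)
  rescue AlreadyDeletedError
    # another process won the race; still treat this app as handled
  ensure
    @apps.delete(app) # keep the cache in sync with what has been attempted
  end
end
```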

@schneems schneems force-pushed the schneems/handle-race-condition branch from 3f284a8 to 40d1101 Compare February 28, 2023 16:56
@schneems schneems marked this pull request as ready for review February 28, 2023 16:59
@edmorley edmorley left a comment (Member)


Thank you for fixing this!

This change does reduce the number of requests considerably; however, if I'm understanding correctly, in the case of, say, processes=60, the 59 processes that didn't get the lock will sleep until each one in turn can get an exclusive lock. This means 60 requests will still be made to the /apps endpoint just from that one CI job (even ignoring any other jobs in the stack-combinations matrix etc.), and all but the first will be redundant.

Is there a way to implement it such that any process that can't get an exclusive lock just gives up and moves straight on to creating apps? (At least when performing cleanup on startup; I suppose the "oh dear, hit max apps" case might still need to block/sleep.) That way there wouldn't be 60 requests to /apps, only one. (There would of course still be duplication between CI jobs on different machines, but that's fine / can't be avoided anyway.)
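For reference, `flock` does have a non-blocking mode (`LOCK_NB`), so the "give up and move on" idea could look roughly like this (an illustrative sketch with an assumed lock path and method name, not gem code):

```ruby
# Illustrative: attempt the lock without blocking; if another process already
# holds it, skip the startup cleanup entirely and go straight to creating apps.
File.open("/tmp/hatchet_cleanup.lock", File::RDWR | File::CREAT) do |f|
  if f.flock(File::LOCK_EX | File::LOCK_NB)
    destroy_older_apps # we won the lock, so do the cleanup
  else
    # another process is already cleaning up; don't wait, just continue
  end
end
```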

If that's not easy to implement (I'm not familiar with the `flock` API), then another option would be to remove the "on startup" cleanup entirely, and instead have end users add a single `hatchet destroy --older-than` invocation to their CI workflows prior to running the specs.

Thoughts?

@schneems schneems force-pushed the schneems/handle-race-condition branch 3 times, most recently from 71f1e4e to fef7d60 Compare February 28, 2023 19:23
Turns out there was a big flaw in my logic. That existing rescue will effectively abort execution every time instead of continuing iteration. 
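Illustratively, the fix is about where the rescue sits (hypothetical method and error names):

```ruby
# Before (illustrative): a single conflict aborts the remaining iteration.
begin
  over_ttl_apps.each { |app| destroy(app) }
rescue AlreadyDeletedError
  # landing here skips every app that hadn't been attempted yet
end

# After (illustrative): the conflict is handled per app and iteration continues.
over_ttl_apps.each do |app|
  destroy(app)
rescue AlreadyDeletedError
  next # this app was already gone; move on to the next one
end
```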

I'm moving it inside the loop and introducing an `on_conflict` strategy that can either be `:sleep_and_continue` or `:stop_if_under_limit`. When a conflict is detected it always sleeps, but only refreshes the API if the limit hasn't yet been met.

There are two main cases to consider (a sketch of both strategies follows the list):

- CLI (calling `hatchet destroy --older-than=7`): in this case you want to be extremely sure that ALL apps that are over TTL are deleted. For this case we use `:sleep_and_continue`.
- In the Hatchet app creation lifecycle: for every app we try to create, we first try to clean up old apps. The first time this triggers, it will hit the API and then cache the results. If there's an exception on app create, the app list is refreshed and then we try to delete any old apps, but on conflict we can stop if we know there's enough spare capacity to create an app (`:stop_if_under_limit`). After that, we sleep if we're not yet down to the limit and retry the whole thing.
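A rough sketch of how the two strategies could behave on a conflict (illustrative; apart from the two symbols, the names here are assumptions rather than the PR's actual code):

```ruby
# Illustrative: every conflict sleeps, but only some paths refresh the API and keep going.
def handle_conflict(on_conflict:, backoff:)
  sleep(backoff.next_interval) # hypothetical backoff object

  case on_conflict
  when :sleep_and_continue
    refresh_app_list # CLI case: keep going until everything over TTL is gone
    :continue
  when :stop_if_under_limit
    if under_app_limit? # hypothetical check against the account's app limit
      :stop             # enough capacity to create an app; no need to keep deleting
    else
      refresh_app_list
      :continue
    end
  else
    raise ArgumentError, "unknown on_conflict: #{on_conflict.inspect}"
  end
end
```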
@schneems schneems force-pushed the schneems/handle-race-condition branch from fef7d60 to a438aba Compare February 28, 2023 19:27
@schneems (Contributor Author)

@edmorley thanks for the look. I also spotted another bug, detailed in the commit. To handle the situation you've brought up, I introduced the idea of a configurable `on_conflict` strategy that can either result in an API refresh or not.

@edmorley edmorley left a comment (Member)


Thank you!

The original reason conflict strategies were added was to gate whether or not we hit the API again. Any time a conflict is hit, a sleep occurs, so in the `:sleep_and_continue` strategy the least important part is the sleep. I renamed it to put the behavior of refreshing the API front and center.

Also updated `bin/hatchet` to make it explicit instead of relying on defaults.
@schneems schneems enabled auto-merge (squash) March 1, 2023 16:40
@schneems schneems disabled auto-merge March 1, 2023 16:41
@schneems schneems enabled auto-merge (squash) March 1, 2023 16:41
@schneems schneems merged commit 3f13327 into main Mar 1, 2023
@schneems schneems deleted the schneems/handle-race-condition branch March 1, 2023 16:42