Fix multiple duplicate deletion attempts #198
Conversation
When multiple processes are attempting to cull the old apps list to free up enough space (to get under the limit), each will try to delete the same app.

We already introduced a file-based lock for multiple processes on the same machine. However, the lock was not being used while clearing out old apps. That has been added. This file locking doesn't fix the problem, but it limits it: instead of N machines running M processes, you only have to account for N machines.

The next fix is that when a duplicate deletion is detected, instead of having all machines continue on as fast as possible, the machines that "lost" the race will sleep to allow the winner to continue removing apps. When a loser wakes up, it will refresh its app list from the API and try again. This happens in a loop until there are no more apps over the TTL, or it becomes the winner and the other processes must sleep. The sleep logic uses the same reaper backoff algorithm shared with the Platform API gem, detailed in my blog post: https://www.schneems.com/2020/07/08/a-fast-car-needs-good-brakes-how-we-added-client-rate-throttling-to-the-platform-api-gem/. The first sleep is a very small number; if there are lots of apps to be deleted and this triggers over and over, the sleep grows longer and longer until it asymptotically reaches the max sleep value, which is also the max TTL value.

The last change is that the `@apps` list (essentially a cache from the API) was not modified in this `destroy_older_apps` method, so unless `refresh_app_list` was called before every run, the method would keep working from a stale list and trigger N `AlreadyDeleted` errors because it's in a data race with itself. The new code removes an app from the list as it's being deleted to avoid this problem.
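For readers who want the shape of this in code, here is a minimal sketch of the backoff-and-retry loop described above. The names (`AlreadyDeletedError`, `MAX_SLEEP`, `fetch_apps`, `destroy!`) are illustrative stand-ins, not Hatchet's actual API.

```ruby
# Minimal sketch of the backoff loop; names are stand-ins, not Hatchet's API.
AlreadyDeletedError = Class.new(StandardError) # stand-in for the real error class

MAX_SLEEP = 60 * 60 * 24 * 7 # cap sleeps at the max TTL (illustrative value, in seconds)

# `fetch_apps` is any callable that returns the current list of apps over the TTL
def destroy_older_apps(fetch_apps)
  apps = fetch_apps.call
  sleep_for = 1 # the first sleep is very small

  until apps.empty?
    app = apps.shift # drop the app from the cached list so we don't race ourselves
    begin
      app.destroy!
    rescue AlreadyDeletedError
      # Another process won the race: back off, then refresh the list and retry
      sleep(sleep_for)
      sleep_for = [sleep_for * 2, MAX_SLEEP].min # grow toward the cap
      apps = fetch_apps.call
    end
  end
end
```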
Force-pushed from 3f284a8 to 40d1101
Thank you for fixing this!

This change does reduce the number of requests considerably. However, if I'm understanding correctly, in the case of, say, processes=60, the 59 processes that didn't get the lock will sleep until each one in turn can get an exclusive lock. This means 60 requests will still be made to the `/apps` endpoint just from that one CI job (even ignoring any other jobs in the stack-combinations matrix etc.), and all but the first will be redundant.

Is there a way to implement it such that any process that can't get an exclusive lock just gives up and moves straight on to creating apps? (At least when performing cleanup on startup; I suppose the "oh dear, hit max apps" case might still need to block/sleep.) That way there wouldn't be 60 requests to `/apps`, only one. (There would of course still be duplication between CI jobs on different machines, but that's fine / can't be avoided anyway.)

If that's not easy to implement (I'm not familiar with the `flock` API), then another option would be to remove the "on startup" cleanup entirely, and instead have end users add a single `hatchet destroy --older-than` invocation to their CI workflows prior to running the specs.

Thoughts?
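For reference, a non-blocking variant is possible with Ruby's `File#flock` and `File::LOCK_NB`. The lock path and the skip-on-contention behaviour below are assumptions for illustration, not the gem's current code.

```ruby
# Sketch of "give up instead of waiting" using a non-blocking flock.
LOCK_PATH = "/tmp/hatchet-reaper.lock" # hypothetical lock file location

def with_exclusive_lock_or_skip
  File.open(LOCK_PATH, File::RDWR | File::CREAT) do |f|
    # LOCK_NB makes flock return false immediately rather than blocking
    if f.flock(File::LOCK_EX | File::LOCK_NB)
      yield
    else
      # Another process on this machine is already reaping; skip cleanup
      # and move straight on to creating apps.
      nil
    end
  end
end

# Usage (hypothetical):
#   with_exclusive_lock_or_skip { destroy_older_apps(fetch_apps) }
```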
Force-pushed from 71f1e4e to fef7d60
Turns out there was a big flaw in my logic: the existing rescue will effectively abort execution every time instead of continuing iteration. I'm moving it inside the loop and introducing an `on_conflict` strategy that can be either `:sleep_and_continue` or `:stop_if_under_limit`. When a conflict is detected it always sleeps, but it only refreshes the API if the limit hasn't yet been met. There are two main cases to consider:

- CLI (calling `hatchet destroy --older-than=7`): in this case you want to be extremely sure that ALL apps over the TTL are deleted, so we use `:sleep_and_continue`.
- The Hatchet app creation lifecycle: for every app we try to create, we first try to clean up old apps. The first time this triggers, it hits the API and caches the results. If there's an exception on app create, the app list is refreshed and we try to delete any old apps, but on conflict we can stop if we know there's enough spare capacity to create an app: `:stop_if_under_limit`. After that, sleep if we're not yet down to the limit and retry the whole thing.
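As a rough sketch of how the two strategies differ: the option names mirror the comment above, but the method body and the helpers (`refresh_app_list`, `under_limit?`) are assumptions, not the PR's actual code.

```ruby
# Sketch only: `refresh_app_list` and `under_limit?` are assumed helpers.
def handle_conflict(on_conflict:, sleep_for:)
  sleep(sleep_for) # a conflict always sleeps, regardless of strategy

  case on_conflict
  when :sleep_and_continue
    # CLI path (`hatchet destroy --older-than=7`): always refresh and keep
    # going so every app over the TTL eventually gets deleted.
    refresh_app_list
    :continue
  when :stop_if_under_limit
    # App-creation path: once there's enough spare capacity to create an
    # app, stop reaping; otherwise refresh and keep going.
    return :stop if under_limit?
    refresh_app_list
    :continue
  else
    raise ArgumentError, "unknown on_conflict strategy: #{on_conflict.inspect}"
  end
end
```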
Force-pushed from fef7d60 to a438aba
@edmorley thanks for the look. I also spotted another bug, detailed in the commit. To handle the situation you've brought up, I introduced the idea of a configurable `on_conflict` strategy.
Thank you!
The original reason conflict strategies were added was to gate whether or not we hit the API again. Every time a conflict is hit a sleep occurs, so in the `:sleep_and_continue` strategy the sleep is actually the least important part. I renamed it to put the behavior of refreshing the API front and center. I also updated `bin/hatchet` to pass the strategy explicitly instead of relying on defaults.