Remove regex from 404 duplicate deletion check #210
Merged
Conversation
schneems force-pushed the schneems/404-means-not-found branch from 6e39548 to 5b3b1fa on June 20, 2024 at 22:04
schneems force-pushed the schneems/404-means-not-found branch from 5b3b1fa to 430b888 on June 20, 2024 at 22:08
schneems changed the title from "Use json ID equality instead of regex" to "Remove regex from 404 duplicate deletion check" on Jun 20, 2024
edmorley approved these changes on Jun 21, 2024
Agree removing this check makes sense and reduces potential future churn from further API changes :-)
Context
Hatchet creates apps when you need them (up to a limit); when that limit is hit, it attempts to delete old apps.
One of the bottlenecks that Hatchet faces is API limits. Hatchet relies on rate throttling https://www.schneems.com/2020/07/08/a-fast-car-needs-good-brakes-how-we-added-client-rate-throttling-to-the-platform-api-gem/ to slow down and spread out API requests, but there's a point where too many requests and processes sleeping for too long can result in failures.
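The linked post describes the general approach; as a minimal sketch of the idea only (not the actual rate_throttle_client API that Hatchet uses): sleep a little before every request, back off when the API answers 429, and speed back up while requests succeed.

```
# Minimal sketch of client-side rate throttling; this is NOT the actual
# rate_throttle_client implementation, just an illustration of the idea.
require "net/http"
require "uri"

class ThrottleSketch
  def initialize
    @sleep_for = 1.0
  end

  # Runs the block, sleeping before each request and backing off on 429.
  def call
    loop do
      sleep(@sleep_for)
      response = yield
      if response.code == "429"                   # rate limited: wait longer next time
        @sleep_for *= 2
      else
        @sleep_for = [@sleep_for * 0.8, 0.1].max  # gradually speed back up
        return response
      end
    end
  end
end

throttle = ThrottleSketch.new
throttle.call { Net::HTTP.get_response(URI("https://api.heroku.com/apps")) }
```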
Hatchet is also distributed as multiple test runs on multiple machines. So we have to assume that if our process is trying to delete apps, other processes might be as well, and that creates race conditions. If two (or more) servers try to delete the same resource, then one will get a 404 response.
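For example (a sketch only; the platform-api calls and the exact exception class raised on a 404 are assumptions, not code from Hatchet):

```
# Two CI machines racing to delete the same app; whoever loses the race
# gets a 404 back from the API. Names and error class here are assumptions.
require "platform-api"

heroku   = PlatformAPI.connect_oauth(ENV.fetch("HEROKU_API_KEY"))
app_name = "hatchet-t-1234567890abcd" # hypothetical app name

begin
  heroku.app.delete(app_name)     # machine A wins; the app is gone
rescue Excon::Error::NotFound     # machine B loses; the app was already deleted
  puts "#{app_name} was already deleted by another process"
end
```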
The reaper is responsible for keeping a list of apps (hatchet/lib/hatchet/reaper.rb, line 176 in 8c80522) and also handling what to do when a "conflict" is found (hatchet/lib/hatchet/reaper.rb, line 133 in 8c80522).
The default strategy is `:refresh_api_and_continue`. It will sleep for a period of time, which allows any other machines that are also cleaning up to continue deleting apps. When it wakes up, it queries the API to see how many apps are left and either stops (if enough have already been deleted) or continues trying to remove apps. Effectively it's trying to limit API requests, both for duplicate delete requests and for listing apps requests.
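As described, the strategy boils down to: sleep, refresh the app list from the API, then either stop or keep reaping. A rough sketch of that shape (placeholder names and numbers; this is not the actual Hatchet::Reaper code):

```
# Sketch only: placeholder names/limits, not the real Hatchet::Reaper logic.
class ReaperSketch
  SLEEP_FOR = 30   # give other machines time to finish their own deletes
  APP_LIMIT = 80   # hypothetical "we have room again" threshold

  def initialize(api_client)
    @api = api_client
  end

  # Called when a delete hits a conflict (e.g. the app is already gone).
  def refresh_api_and_continue
    sleep(SLEEP_FOR)                  # back off instead of hammering the API
    apps = @api.app.list              # re-query how many apps are left
    return if apps.count < APP_LIMIT  # someone else already cleaned up enough
    reap_oldest(apps)                 # otherwise keep deleting old apps
  end

  private

  def reap_oldest(apps)
    oldest = apps.min_by { |app| app.fetch("created_at") }
    @api.app.delete(oldest.fetch("name"))
  end
end
```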
404
Previously I was trying to disambiguate between "404: the URL has a typo in it" (or some other issue, like a DNS problem with routing) versus "404: you clearly got to a valid API endpoint, but the resource couldn't be found". To guard against that I added a regex check against the message in the response body.
Problem
The API is in the process of updating its error messages, so this will no longer work.
Internal link https://salesforce-internal.slack.com/archives/C1RS6AUDR/p1718894438197959.
Change
Remove the regex. Instead, when we receive a 404 response, consider the app deleted. This will trigger the "sleep and refresh" logic.
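The shape of the change, illustratively (this is not the literal diff; the regex and method names are placeholders, and `response` stands for the 404 error response from the delete call):

```
# Before: a 404 only counted as "already deleted" when the body message
# also matched an expected phrase (placeholder regex shown here).
def already_deleted_before?(response)
  response.status == 404 && response.body.match?(/Couldn't find that app/i)
end

# After: any 404 on delete is treated as "the app is gone", which then
# triggers the :refresh_api_and_continue (sleep and refresh) logic.
def already_deleted_after?(response)
  response.status == 404
end
```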
Consequences: if the service is unreachable for some reason (such as a DNS issue, or if the URL changes), we will also get a 404 response and will keep trying to update apps in a loop. With our request throttling, each attempt would progressively take longer and longer, and eventually the tests would time out and fail.
We could check the `"id"` field of the JSON body, which should be `"not_found"` in this case, but it's also the same value for a bad URL, as the example below shows. So that mechanism would help with DNS issues, but not with other 404 problems. It also introduces the possibility of a JSON parsing failure, and it could break in the future if this ID is revised or changed.
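For reference, a request to a non-existent path returns the same id:

```
$ curl https://api.heroku.com/schemaz -H "Accept: application/vnd.heroku+json; version=3"
{
  "id": "not_found",
  "message": "The requested API endpoint was not found. Are you using the right HTTP verb (i.e. `GET` vs. `POST`), and did you specify your intended version with the `Accept` header?"
}
```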
I think retrying is safe (enough) here.