-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
App Reaper Redesign #97
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
schneems
force-pushed
the
schneems/dont-kill-active-apps
branch
2 times, most recently
from
July 30, 2020 17:46
3abb6dc
to
f733e44
Compare
schneems
force-pushed
the
schneems/dont-kill-active-apps
branch
from
July 30, 2020 17:55
f733e44
to
d7ff396
Compare
edmorley
approved these changes
Aug 3, 2020
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the solution chosen here! :-)
## Fix Global Heroku app Limits Issue: #90 When the reaper was originally introduced there was a limit of 100 apps per Heroku account. As of today my account has 224 apps, because it is connected to multiple organizations. This means when I run with my local account, every time `reaper.cycle` is called it deletes a hatchet app even though it doesn't need to. The "fix" for this is to remove the HEROKU_APP_LIMIT setting and instead trying to detect when we've been going over the limit. This was accomplished by forcing the reaper to reap once when an exception is raised in the creation of an app. (Note reaping does not guarantee deletion of an app if none are "finished" or old enough). It can also trigger the issue with "minimum time to live" listed below if a parallel run was attempted locally. ## Fix Minimum time to live Issue: #89 The problem looks like this, in a deploy during a test on test run on CI will end up deleting an app currently being deployed, this ends up generating an error like this: ``` remote: ! No such app as hatchet-t-522c335bea. ``` The most common reason this happens is that there are more hatchet apps attempting to be deployed than are configured for the limit. For example: - A test suite can concurrently run 20 tests and has a HATCHET_APP_LIMIT=80. Each test run on CI executes the same tests against 3 stacks meaning each CI run executes 60 app deploys concurrently - If two test runs are executing concurrently the required number of apps is 2x 60 which is 120 which is over the limit of 80. When this happens a reaper will likely destroy an app as it is in progress. - Keep in mind when merging tests to master, the suite may be re-run, or when pushing to two different branches. If multiple engineers are working on PRs at the same time this can also happen... This problem is especially bad given the distributed nature of our tests, and given that tests have built in retries. It ends up really straining the system and taking a long time to exit if the developer doesn't notice and manually kill one of the extra test runs. This problem also means that the developer needs to always know the maximum number of concurrent apps that are likely to be deployed at a time against a given API key. This is extremely difficult to do in a distributed setting as illustrated above. The solution proposed in #89 is to introduce a minimum "time to live" for apps. The idea is that all tests should be able to finish within this window. If we guarantee that all apps live for at least that long, then even if we accidentally go over our assigned limits, then the system will slow down instead of failing. Originally implemented in #84 it was later reverted. The behavior is not great since most tests exit very quickly around 1-3 minutes. Forcing ALL apps to live for a minimum of 7 minutes effectively meant that a single CI run could not take fewer than seven minutes (if it's close to using the HATCHET_APP_LIMIT on each CI run). When merged in, it effectively took a CI run on master to execute for about 15+ minutes which was not acceptable. ## Fix Minimum time to live + "finished" (maintenance=true) tracking The solution to the "time to live" problem of increased test time is to find a more granular way to track when an app can be safely destroyed. While we still want to have a minimum time to live for apps currently being tested, the API call that returns our list of app info also returns some metadata status including whether or not the app is in maintenance mode. The solution then means when an app is done (App#teardown! is called) then an API call will be made to set it to maintenance mode. This will communicate to the reaper it can be cleaned immediately and it does not have to wait for seven minutes. This creates a two queue system with a fast "finished" queue of apps that can be deleted right away, and a slower "unfinished" queue. The unfinished queue will only be checked after the finished queue is drained. When we try to delete an "unifinished" app we first check to see how long it's been alive. If it's older than the TTL then we can remove it right away. Otherwise the system will throttle itself gradually to give the system a chance to "finish" some apps. It will check again in 10s, 20s, 40s, 80s, ... until it hits the minimum time to live for the oldest "unfinished" app at which time it will then reset. ## Extra debugging info for the reaper All outputs from the reaper now include extra debugging information about the status of the system. Previously these values were being logged inconsistently making some of the above bugs harder to reason about.
Co-authored-by: Ed Morley <[email protected]>
Co-authored-by: Ed Morley <[email protected]>
Co-authored-by: Ed Morley <[email protected]>
Co-authored-by: Ed Morley <[email protected]>
schneems
force-pushed
the
schneems/dont-kill-active-apps
branch
2 times, most recently
from
August 3, 2020 17:12
7083f83
to
4412914
Compare
It didn't make sense to see this code: ``` reaper_throttle.call(min_sleep: 5) do |sleep_for| expect(sleep_for).to eq(2) end ``` Based on the naming of the argument, I would expect the minimum value to be 5, but instead it's the maximum value. So...let's rename it to `max`! Problem solved.
schneems
force-pushed
the
schneems/dont-kill-active-apps
branch
from
August 3, 2020 17:23
4412914
to
4540fac
Compare
schneems
added a commit
to heroku/heroku-buildpack-python
that referenced
this pull request
Aug 11, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107) - Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105) - Ignore 403 on app delete due to race condition (heroku/hatchet#101) - The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86) - Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94) - Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97) - Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97) - The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97) - App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97) - Calls to `git push heroku` are now rate throttled (heroku/hatchet#98) - Calls to `app.run` are now rate throttled (heroku/hatchet#99) - Deployment now raises and error when the release failed (heroku/hatchet#93)
schneems
added a commit
to heroku/heroku-buildpack-php
that referenced
this pull request
Aug 11, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107) - Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105) - Ignore 403 on app delete due to race condition (heroku/hatchet#101) - The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86) - Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94) - Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97) - Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97) - The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97) - App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97) - Calls to `git push heroku` are now rate throttled (heroku/hatchet#98) - Calls to `app.run` are now rate throttled (heroku/hatchet#99) - Deployment now raises and error when the release failed (heroku/hatchet#93)
edmorley
pushed a commit
to heroku/heroku-buildpack-python
that referenced
this pull request
Aug 11, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107) - Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105) - Ignore 403 on app delete due to race condition (heroku/hatchet#101) - The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86) - Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94) - Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97) - Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97) - The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97) - App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97) - Calls to `git push heroku` are now rate throttled (heroku/hatchet#98) - Calls to `app.run` are now rate throttled (heroku/hatchet#99) - Deployment now raises and error when the release failed (heroku/hatchet#93) [skip changelog]
schneems
added a commit
to heroku/heroku-buildpack-jvm-common
that referenced
this pull request
Aug 11, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107) - Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105) - Ignore 403 on app delete due to race condition (heroku/hatchet#101) - The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86) - Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94) - Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97) - Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97) - The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97) - App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97) - Calls to `git push heroku` are now rate throttled (heroku/hatchet#98) - Calls to `app.run` are now rate throttled (heroku/hatchet#99) - Deployment now raises and error when the release failed (heroku/hatchet#93)
schneems
added a commit
to heroku/heroku-buildpack-scala
that referenced
this pull request
Aug 11, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107) - Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105) - Ignore 403 on app delete due to race condition (heroku/hatchet#101) - The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86) - Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94) - Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97) - Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97) - The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97) - App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97) - Calls to `git push heroku` are now rate throttled (heroku/hatchet#98) - Calls to `app.run` are now rate throttled (heroku/hatchet#99) - Deployment now raises and error when the release failed (heroku/hatchet#93)
schneems
added a commit
to heroku/heroku-buildpack-clojure
that referenced
this pull request
Aug 11, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107) - Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105) - Ignore 403 on app delete due to race condition (heroku/hatchet#101) - The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86) - Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94) - Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97) - Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97) - The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97) - App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97) - Calls to `git push heroku` are now rate throttled (heroku/hatchet#98) - Calls to `app.run` are now rate throttled (heroku/hatchet#99) - Deployment now raises and error when the release failed (heroku/hatchet#93)
schneems
added a commit
to heroku/heroku-buildpack-java
that referenced
this pull request
Aug 11, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107) - Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105) - Ignore 403 on app delete due to race condition (heroku/hatchet#101) - The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86) - Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94) - Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97) - Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97) - The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97) - App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97) - Calls to `git push heroku` are now rate throttled (heroku/hatchet#98) - Calls to `app.run` are now rate throttled (heroku/hatchet#99) - Deployment now raises and error when the release failed (heroku/hatchet#93)
schneems
added a commit
to heroku/heroku-buildpack-nodejs
that referenced
this pull request
Aug 11, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107) - Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105) - Ignore 403 on app delete due to race condition (heroku/hatchet#101) - The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86) - Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94) - Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97) - Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97) - The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97) - App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97) - Calls to `git push heroku` are now rate throttled (heroku/hatchet#98) - Calls to `app.run` are now rate throttled (heroku/hatchet#99) - Deployment now raises and error when the release failed (heroku/hatchet#93)
dryan
pushed a commit
to dryan/heroku-buildpack-python
that referenced
this pull request
Nov 19, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107) - Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105) - Ignore 403 on app delete due to race condition (heroku/hatchet#101) - The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86) - Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94) - Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97) - Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97) - The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97) - App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97) - Calls to `git push heroku` are now rate throttled (heroku/hatchet#98) - Calls to `app.run` are now rate throttled (heroku/hatchet#99) - Deployment now raises and error when the release failed (heroku/hatchet#93) [skip changelog]
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fix Global Heroku app Limits
Issue: #90
When the reaper was originally introduced there was a limit of 100 apps per Heroku account. As of today my account has 224 apps, because it is connected to multiple organizations. This means when I run with my local account, every time
reaper.cycle
is called it deletes a hatchet app even though it doesn't need to.The "fix" for this is to remove the HEROKU_APP_LIMIT setting and instead trying to detect when we've been going over the limit.
This was accomplished by forcing the reaper to reap once when an exception is raised in the creation of an app. (Note reaping does not guarantee deletion of an app if none are "finished" or old enough). It can also trigger the issue with "minimum time to live" listed below if a parallel run was attempted locally.
Fix Minimum time to live
Issue: #89
The problem looks like this, in a deploy during a test on test run on CI will end up deleting an app currently being deployed, this ends up generating an error like this:
The most common reason this happens is that there are more hatchet apps attempting to be deployed than are configured for the limit. For example:
This problem is especially bad given the distributed nature of our tests, and given that tests have built in retries. It ends up really straining the system and taking a long time to exit if the developer doesn't notice and manually kill one of the extra test runs.
This problem also means that the developer needs to always know the maximum number of concurrent apps that are likely to be deployed at a time against a given API key. This is extremely difficult to do in a distributed setting as illustrated above.
The solution proposed in #89 is to introduce a minimum "time to live" for apps. The idea is that all tests should be able to finish within this window. If we guarantee that all apps live for at least that long, then even if we accidentally go over our assigned limits, then the system will slow down instead of failing.
Originally implemented in #84 it was later reverted. The behavior is not great since most tests exit very quickly around 1-3 minutes. Forcing ALL apps to live for a minimum of 7 minutes effectively meant that a single CI run could not take fewer than seven minutes (if it's close to using the HATCHET_APP_LIMIT on each CI run). When merged in, it effectively took a CI run on master to execute for about 15+ minutes which was not acceptable.
Fix Minimum time to live + "finished" (maintenance=true) tracking
The solution to the "time to live" problem of increased test time is to find a more granular way to track when an app can be safely destroyed. While we still want to have a minimum time to live for apps currently being tested, the API call that returns our list of app info also returns some metadata status including whether or not the app is in maintenance mode.
The solution then means when an app is done (App#teardown! is called) then an API call will be made to set it to maintenance mode. This will communicate to the reaper it can be cleaned immediately and it does not have to wait for seven minutes.
This creates a two queue system with a fast "finished" queue of apps that can be deleted right away, and a slower "unfinished" queue. The unfinished queue will only be checked after the finished queue is drained. When we try to delete an "unifinished" app we first check to see how long it's been alive. If it's older than the TTL then we can remove it right away. Otherwise the system will throttle itself gradually to give the system a chance to "finish" some apps. It will check again in 10s, 20s, 40s, 80s, ... until it hits the minimum time to live for the oldest "unfinished" app at which time it will then reset.
Extra debugging info for the reaper
All outputs from the reaper now include extra debugging information about the status of the system. Previously these values were being logged inconsistently making some of the above bugs harder to reason about.