Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

App Reaper Redesign #97

Merged
merged 6 commits into from
Aug 3, 2020
Merged

App Reaper Redesign #97

merged 6 commits into from
Aug 3, 2020

Conversation

schneems
Copy link
Contributor

@schneems schneems commented Jul 30, 2020

Fix Global Heroku app Limits

Issue: #90

When the reaper was originally introduced there was a limit of 100 apps per Heroku account. As of today my account has 224 apps, because it is connected to multiple organizations. This means when I run with my local account, every time reaper.cycle is called it deletes a hatchet app even though it doesn't need to.

The "fix" for this is to remove the HEROKU_APP_LIMIT setting and instead trying to detect when we've been going over the limit.

This was accomplished by forcing the reaper to reap once when an exception is raised in the creation of an app. (Note reaping does not guarantee deletion of an app if none are "finished" or old enough). It can also trigger the issue with "minimum time to live" listed below if a parallel run was attempted locally.

Fix Minimum time to live

Issue: #89

The problem looks like this, in a deploy during a test on test run on CI will end up deleting an app currently being deployed, this ends up generating an error like this:

remote: !  No such app as hatchet-t-522c335bea.

The most common reason this happens is that there are more hatchet apps attempting to be deployed than are configured for the limit. For example:

  • A test suite can concurrently run 20 tests and has a HATCHET_APP_LIMIT=80. Each test run on CI executes the same tests against 3 stacks meaning each CI run executes 60 app deploys concurrently
    • If two test runs are executing concurrently the required number of apps is 2x 60 which is 120 which is over the limit of 80. When this happens a reaper will likely destroy an app as it is in progress.
    • Keep in mind when merging tests to master, the suite may be re-run, or when pushing to two different branches. If multiple engineers are working on PRs at the same time this can also happen...

This problem is especially bad given the distributed nature of our tests, and given that tests have built in retries. It ends up really straining the system and taking a long time to exit if the developer doesn't notice and manually kill one of the extra test runs.

This problem also means that the developer needs to always know the maximum number of concurrent apps that are likely to be deployed at a time against a given API key. This is extremely difficult to do in a distributed setting as illustrated above.

The solution proposed in #89 is to introduce a minimum "time to live" for apps. The idea is that all tests should be able to finish within this window. If we guarantee that all apps live for at least that long, then even if we accidentally go over our assigned limits, then the system will slow down instead of failing.

Originally implemented in #84 it was later reverted. The behavior is not great since most tests exit very quickly around 1-3 minutes. Forcing ALL apps to live for a minimum of 7 minutes effectively meant that a single CI run could not take fewer than seven minutes (if it's close to using the HATCHET_APP_LIMIT on each CI run). When merged in, it effectively took a CI run on master to execute for about 15+ minutes which was not acceptable.

Fix Minimum time to live + "finished" (maintenance=true) tracking

The solution to the "time to live" problem of increased test time is to find a more granular way to track when an app can be safely destroyed. While we still want to have a minimum time to live for apps currently being tested, the API call that returns our list of app info also returns some metadata status including whether or not the app is in maintenance mode.

The solution then means when an app is done (App#teardown! is called) then an API call will be made to set it to maintenance mode. This will communicate to the reaper it can be cleaned immediately and it does not have to wait for seven minutes.

This creates a two queue system with a fast "finished" queue of apps that can be deleted right away, and a slower "unfinished" queue. The unfinished queue will only be checked after the finished queue is drained. When we try to delete an "unifinished" app we first check to see how long it's been alive. If it's older than the TTL then we can remove it right away. Otherwise the system will throttle itself gradually to give the system a chance to "finish" some apps. It will check again in 10s, 20s, 40s, 80s, ... until it hits the minimum time to live for the oldest "unfinished" app at which time it will then reset.

Extra debugging info for the reaper

All outputs from the reaper now include extra debugging information about the status of the system. Previously these values were being logged inconsistently making some of the above bugs harder to reason about.

@schneems schneems force-pushed the schneems/dont-kill-active-apps branch 2 times, most recently from 3abb6dc to f733e44 Compare July 30, 2020 17:46
@schneems schneems force-pushed the schneems/dont-kill-active-apps branch from f733e44 to d7ff396 Compare July 30, 2020 17:55
Copy link
Member

@edmorley edmorley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the solution chosen here! :-)

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
spec/unit/reaper_spec.rb Outdated Show resolved Hide resolved
schneems and others added 5 commits August 3, 2020 12:01
## Fix Global Heroku app Limits

Issue: #90

When the reaper was originally introduced there was a limit of 100 apps per Heroku account. As of today my account has 224 apps, because it is connected to multiple organizations. This means when I run with my local account, every time `reaper.cycle` is called it deletes a hatchet app even though it doesn't need to.

The "fix" for this is to remove the HEROKU_APP_LIMIT setting and instead trying to detect when we've been going over the limit.

This was accomplished by forcing the reaper to reap once when an exception is raised in the creation of an app. (Note reaping does not guarantee deletion of an app if none are "finished" or old enough). It can also trigger the issue with "minimum time to live" listed below if a parallel run was attempted locally.

## Fix Minimum time to live

Issue: #89

The problem looks like this, in a deploy during a test on test run on CI will end up deleting an app currently being deployed, this ends up generating an error like this:

```
remote: !  No such app as hatchet-t-522c335bea.
```

The most common reason this happens is that there are more hatchet apps attempting to be deployed than are configured for the limit. For example:

- A test suite can concurrently run 20 tests and has a HATCHET_APP_LIMIT=80. Each test run on CI executes the same tests against 3 stacks meaning each CI run executes 60 app deploys concurrently
  - If two test runs are executing concurrently the required number of apps is 2x 60 which is 120 which is over the limit of 80. When this happens a reaper will likely destroy an app as it is in progress.
  - Keep in mind when merging tests to master, the suite may be re-run, or when pushing to two different branches. If multiple engineers are working on PRs at the same time this can also happen...

This problem is especially bad given the distributed nature of our tests, and given that tests have built in retries. It ends up really straining the system and taking a long time to exit if the developer doesn't notice and manually kill one of the extra test runs.

This problem also means that the developer needs to always know the maximum number of concurrent apps that are likely to be deployed at a time against a given API key. This is extremely difficult to do in a distributed setting as illustrated above.

The solution proposed in #89 is to introduce a minimum "time to live" for apps. The idea is that all tests should be able to finish within this window. If we guarantee that all apps live for at least that long, then even if we accidentally go over our assigned limits, then the system will slow down instead of failing.

Originally implemented in #84 it was later reverted. The behavior is not great since most tests exit very quickly around 1-3 minutes. Forcing ALL apps to live for a minimum of 7 minutes effectively meant that a single CI run could not take fewer than seven minutes (if it's close to using the HATCHET_APP_LIMIT on each CI run). When merged in, it effectively took a CI run on master to execute for about 15+ minutes which was not acceptable.

## Fix Minimum time to live + "finished" (maintenance=true) tracking

The solution to the "time to live" problem of increased test time is to find a more granular way to track when an app can be safely destroyed. While we still want to have a minimum time to live for apps currently being tested, the API call that returns our list of app info also returns some metadata status including whether or not the app is in maintenance mode.

The solution then means when an app is done (App#teardown! is called) then an API call will be made to set it to maintenance mode. This will communicate to the reaper it can be cleaned immediately and it does not have to wait for seven minutes.

This creates a two queue system with a fast "finished" queue of apps that can be deleted right away, and a slower "unfinished" queue. The unfinished queue will only be checked after the finished queue is drained. When we try to delete an "unifinished" app we first check to see how long it's been alive. If it's older than the TTL then we can remove it right away. Otherwise the system will throttle itself gradually to give the system a chance to "finish" some apps. It will check again in 10s, 20s, 40s, 80s, ... until it hits the minimum time to live for the oldest "unfinished" app at which time it will then reset.

## Extra debugging info for the reaper

All outputs from the reaper now include extra debugging information about the status of the system. Previously these values were being logged inconsistently making some of the above bugs harder to reason about.
Co-authored-by: Ed Morley <[email protected]>
Co-authored-by: Ed Morley <[email protected]>
Co-authored-by: Ed Morley <[email protected]>
@schneems schneems force-pushed the schneems/dont-kill-active-apps branch 2 times, most recently from 7083f83 to 4412914 Compare August 3, 2020 17:12
It didn't make sense to see this code:

```
      reaper_throttle.call(min_sleep: 5) do |sleep_for|
        expect(sleep_for).to eq(2)
      end
```

Based on the naming of the argument, I would expect the minimum value to be 5, but instead it's the maximum value. So...let's rename it to `max`! Problem solved.
@schneems schneems force-pushed the schneems/dont-kill-active-apps branch from 4412914 to 4540fac Compare August 3, 2020 17:23
@schneems schneems merged commit 9754f1f into main Aug 3, 2020
@schneems schneems deleted the schneems/dont-kill-active-apps branch August 3, 2020 17:25
schneems added a commit to heroku/heroku-buildpack-python that referenced this pull request Aug 11, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107)
- Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105)
- Ignore  403 on app delete due to race condition (heroku/hatchet#101)
- The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86)
- Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94)
- Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97)
- Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97)
- The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97)
- App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97)
- Calls to `git push heroku` are now rate throttled (heroku/hatchet#98)
- Calls to `app.run` are now rate throttled (heroku/hatchet#99)
- Deployment now raises and error when the release failed (heroku/hatchet#93)
schneems added a commit to heroku/heroku-buildpack-php that referenced this pull request Aug 11, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107)
- Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105)
- Ignore  403 on app delete due to race condition (heroku/hatchet#101)
- The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86)
- Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94)
- Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97)
- Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97)
- The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97)
- App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97)
- Calls to `git push heroku` are now rate throttled (heroku/hatchet#98)
- Calls to `app.run` are now rate throttled (heroku/hatchet#99)
- Deployment now raises and error when the release failed (heroku/hatchet#93)
edmorley pushed a commit to heroku/heroku-buildpack-python that referenced this pull request Aug 11, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107)
- Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105)
- Ignore  403 on app delete due to race condition (heroku/hatchet#101)
- The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86)
- Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94)
- Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97)
- Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97)
- The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97)
- App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97)
- Calls to `git push heroku` are now rate throttled (heroku/hatchet#98)
- Calls to `app.run` are now rate throttled (heroku/hatchet#99)
- Deployment now raises and error when the release failed (heroku/hatchet#93)

[skip changelog]
schneems added a commit to heroku/heroku-buildpack-jvm-common that referenced this pull request Aug 11, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107)
- Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105)
- Ignore  403 on app delete due to race condition (heroku/hatchet#101)
- The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86)
- Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94)
- Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97)
- Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97)
- The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97)
- App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97)
- Calls to `git push heroku` are now rate throttled (heroku/hatchet#98)
- Calls to `app.run` are now rate throttled (heroku/hatchet#99)
- Deployment now raises and error when the release failed (heroku/hatchet#93)
schneems added a commit to heroku/heroku-buildpack-scala that referenced this pull request Aug 11, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107)
- Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105)
- Ignore  403 on app delete due to race condition (heroku/hatchet#101)
- The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86)
- Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94)
- Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97)
- Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97)
- The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97)
- App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97)
- Calls to `git push heroku` are now rate throttled (heroku/hatchet#98)
- Calls to `app.run` are now rate throttled (heroku/hatchet#99)
- Deployment now raises and error when the release failed (heroku/hatchet#93)
schneems added a commit to heroku/heroku-buildpack-clojure that referenced this pull request Aug 11, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107)
- Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105)
- Ignore  403 on app delete due to race condition (heroku/hatchet#101)
- The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86)
- Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94)
- Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97)
- Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97)
- The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97)
- App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97)
- Calls to `git push heroku` are now rate throttled (heroku/hatchet#98)
- Calls to `app.run` are now rate throttled (heroku/hatchet#99)
- Deployment now raises and error when the release failed (heroku/hatchet#93)
schneems added a commit to heroku/heroku-buildpack-java that referenced this pull request Aug 11, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107)
- Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105)
- Ignore  403 on app delete due to race condition (heroku/hatchet#101)
- The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86)
- Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94)
- Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97)
- Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97)
- The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97)
- App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97)
- Calls to `git push heroku` are now rate throttled (heroku/hatchet#98)
- Calls to `app.run` are now rate throttled (heroku/hatchet#99)
- Deployment now raises and error when the release failed (heroku/hatchet#93)
schneems added a commit to heroku/heroku-buildpack-nodejs that referenced this pull request Aug 11, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107)
- Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105)
- Ignore  403 on app delete due to race condition (heroku/hatchet#101)
- The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86)
- Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94)
- Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97)
- Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97)
- The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97)
- App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97)
- Calls to `git push heroku` are now rate throttled (heroku/hatchet#98)
- Calls to `app.run` are now rate throttled (heroku/hatchet#99)
- Deployment now raises and error when the release failed (heroku/hatchet#93)
dryan pushed a commit to dryan/heroku-buildpack-python that referenced this pull request Nov 19, 2020
- ActiveSupport's Object#blank? and Object#present? are no longer provided by default (heroku/hatchet#107)
- Remove deprecated support for passing a block to `App#run` (heroku/hatchet#105)
- Ignore  403 on app delete due to race condition (heroku/hatchet#101)
- The hatchet.lock file can now be locked to "main" in addition to "master" (heroku/hatchet#86)
- Allow concurrent one-off dyno runs with the `run_multi: true` flag on apps (heroku/hatchet#94)
- Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (heroku/hatchet#97)
- Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (heroku/hatchet#97)
- The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (heroku/hatchet#97)
- App#deploy without a block will no longer run `teardown!` automatically (heroku/hatchet#97)
- Calls to `git push heroku` are now rate throttled (heroku/hatchet#98)
- Calls to `app.run` are now rate throttled (heroku/hatchet#99)
- Deployment now raises and error when the release failed (heroku/hatchet#93)

[skip changelog]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants