Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix logic for reaping apps based on total app limits #90

Closed
schneems opened this issue Jul 9, 2020 · 3 comments
Closed

Fix logic for reaping apps based on total app limits #90

schneems opened this issue Jul 9, 2020 · 3 comments

Comments

@schneems
Copy link
Contributor

schneems commented Jul 9, 2020

When I originally wrote hatchet teams were not a thing and personal accounts were limited to 100 apps. So the default limit says "do not go over 100 apps" . However this logic is severely flawed since an account can have more than 100 apps returned from the API call via app.list of the platform-api client. For instance:

Warning: Reached Heroku app limit of 100. hatchet app count: 1, all app count: 224

So what happens in this scenario is that Hatchet thinks it only has room to create and use a single app. Due to the distributed nature of Hatchet (running in parallel across multiple processes) each process is trying to destroy and create the same app which ends up creating a destroy loop since none of them let it live long enough to finish a test. This causes some pretty catastrophic test failures even in this scenario the account above has capacity to consume/create more apps.

@schneems
Copy link
Contributor Author

schneems commented Jul 9, 2020

We could grep the error message:

{"id":"invalid_params","message":"You've reached your app limit of 100 apps. Delete some apps or open a ticket at https://help.heroku.com/new/increase-limits to have this limit increased."} 

But...i'm not sure if the platform-api gem will expose it. I'll have to dig in more

@schneems
Copy link
Contributor Author

schneems commented Jul 9, 2020

Apparently there's two different outputs we need to account for:

You've reached the limit of 100 apps

schneems added a commit that referenced this issue Aug 3, 2020
## Fix Global Heroku app Limits

Issue: #90

When the reaper was originally introduced there was a limit of 100 apps per Heroku account. As of today my account has 224 apps, because it is connected to multiple organizations. This means when I run with my local account, every time `reaper.cycle` is called it deletes a hatchet app even though it doesn't need to.

The "fix" for this is to remove the HEROKU_APP_LIMIT setting and instead trying to detect when we've been going over the limit.

This was accomplished by forcing the reaper to reap once when an exception is raised in the creation of an app. (Note reaping does not guarantee deletion of an app if none are "finished" or old enough). It can also trigger the issue with "minimum time to live" listed below if a parallel run was attempted locally.

## Fix Minimum time to live

Issue: #89

The problem looks like this, in a deploy during a test on test run on CI will end up deleting an app currently being deployed, this ends up generating an error like this:

```
remote: !  No such app as hatchet-t-522c335bea.
```

The most common reason this happens is that there are more hatchet apps attempting to be deployed than are configured for the limit. For example:

- A test suite can concurrently run 20 tests and has a HATCHET_APP_LIMIT=80. Each test run on CI executes the same tests against 3 stacks meaning each CI run executes 60 app deploys concurrently
  - If two test runs are executing concurrently the required number of apps is 2x 60 which is 120 which is over the limit of 80. When this happens a reaper will likely destroy an app as it is in progress.
  - Keep in mind when merging tests to master, the suite may be re-run, or when pushing to two different branches. If multiple engineers are working on PRs at the same time this can also happen...

This problem is especially bad given the distributed nature of our tests, and given that tests have built in retries. It ends up really straining the system and taking a long time to exit if the developer doesn't notice and manually kill one of the extra test runs.

This problem also means that the developer needs to always know the maximum number of concurrent apps that are likely to be deployed at a time against a given API key. This is extremely difficult to do in a distributed setting as illustrated above.

The solution proposed in #89 is to introduce a minimum "time to live" for apps. The idea is that all tests should be able to finish within this window. If we guarantee that all apps live for at least that long, then even if we accidentally go over our assigned limits, then the system will slow down instead of failing.

Originally implemented in #84 it was later reverted. The behavior is not great since most tests exit very quickly around 1-3 minutes. Forcing ALL apps to live for a minimum of 7 minutes effectively meant that a single CI run could not take fewer than seven minutes (if it's close to using the HATCHET_APP_LIMIT on each CI run). When merged in, it effectively took a CI run on master to execute for about 15+ minutes which was not acceptable.

## Fix Minimum time to live + "finished" (maintenance=true) tracking

The solution to the "time to live" problem of increased test time is to find a more granular way to track when an app can be safely destroyed. While we still want to have a minimum time to live for apps currently being tested, the API call that returns our list of app info also returns some metadata status including whether or not the app is in maintenance mode.

The solution then means when an app is done (App#teardown! is called) then an API call will be made to set it to maintenance mode. This will communicate to the reaper it can be cleaned immediately and it does not have to wait for seven minutes.

This creates a two queue system with a fast "finished" queue of apps that can be deleted right away, and a slower "unfinished" queue. The unfinished queue will only be checked after the finished queue is drained. When we try to delete an "unifinished" app we first check to see how long it's been alive. If it's older than the TTL then we can remove it right away. Otherwise the system will throttle itself gradually to give the system a chance to "finish" some apps. It will check again in 10s, 20s, 40s, 80s, ... until it hits the minimum time to live for the oldest "unfinished" app at which time it will then reset.

## Extra debugging info for the reaper

All outputs from the reaper now include extra debugging information about the status of the system. Previously these values were being logged inconsistently making some of the above bugs harder to reason about.
@edmorley
Copy link
Member

@schneems Can this be closed now that #97 has merged? :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants