Fix logic for reaping apps based on total app limits #90

schneems · 2020-07-09T16:27:05Z

When I originally wrote hatchet teams were not a thing and personal accounts were limited to 100 apps. So the default limit says "do not go over 100 apps" . However this logic is severely flawed since an account can have more than 100 apps returned from the API call via app.list of the platform-api client. For instance:

Warning: Reached Heroku app limit of 100. hatchet app count: 1, all app count: 224

So what happens in this scenario is that Hatchet thinks it only has room to create and use a single app. Due to the distributed nature of Hatchet (running in parallel across multiple processes) each process is trying to destroy and create the same app which ends up creating a destroy loop since none of them let it live long enough to finish a test. This causes some pretty catastrophic test failures even in this scenario the account above has capacity to consume/create more apps.

The text was updated successfully, but these errors were encountered:

schneems · 2020-07-09T16:34:36Z

We could grep the error message:

{"id":"invalid_params","message":"You've reached your app limit of 100 apps. Delete some apps or open a ticket at https://help.heroku.com/new/increase-limits to have this limit increased."}

But...i'm not sure if the platform-api gem will expose it. I'll have to dig in more

schneems · 2020-07-09T16:37:14Z

Apparently there's two different outputs we need to account for:

You've reached the limit of 100 apps

## Fix Global Heroku app Limits Issue: #90 When the reaper was originally introduced there was a limit of 100 apps per Heroku account. As of today my account has 224 apps, because it is connected to multiple organizations. This means when I run with my local account, every time `reaper.cycle` is called it deletes a hatchet app even though it doesn't need to. The "fix" for this is to remove the HEROKU_APP_LIMIT setting and instead trying to detect when we've been going over the limit. This was accomplished by forcing the reaper to reap once when an exception is raised in the creation of an app. (Note reaping does not guarantee deletion of an app if none are "finished" or old enough). It can also trigger the issue with "minimum time to live" listed below if a parallel run was attempted locally. ## Fix Minimum time to live Issue: #89 The problem looks like this, in a deploy during a test on test run on CI will end up deleting an app currently being deployed, this ends up generating an error like this: ``` remote: ! No such app as hatchet-t-522c335bea. ``` The most common reason this happens is that there are more hatchet apps attempting to be deployed than are configured for the limit. For example: - A test suite can concurrently run 20 tests and has a HATCHET_APP_LIMIT=80. Each test run on CI executes the same tests against 3 stacks meaning each CI run executes 60 app deploys concurrently - If two test runs are executing concurrently the required number of apps is 2x 60 which is 120 which is over the limit of 80. When this happens a reaper will likely destroy an app as it is in progress. - Keep in mind when merging tests to master, the suite may be re-run, or when pushing to two different branches. If multiple engineers are working on PRs at the same time this can also happen... This problem is especially bad given the distributed nature of our tests, and given that tests have built in retries. It ends up really straining the system and taking a long time to exit if the developer doesn't notice and manually kill one of the extra test runs. This problem also means that the developer needs to always know the maximum number of concurrent apps that are likely to be deployed at a time against a given API key. This is extremely difficult to do in a distributed setting as illustrated above. The solution proposed in #89 is to introduce a minimum "time to live" for apps. The idea is that all tests should be able to finish within this window. If we guarantee that all apps live for at least that long, then even if we accidentally go over our assigned limits, then the system will slow down instead of failing. Originally implemented in #84 it was later reverted. The behavior is not great since most tests exit very quickly around 1-3 minutes. Forcing ALL apps to live for a minimum of 7 minutes effectively meant that a single CI run could not take fewer than seven minutes (if it's close to using the HATCHET_APP_LIMIT on each CI run). When merged in, it effectively took a CI run on master to execute for about 15+ minutes which was not acceptable. ## Fix Minimum time to live + "finished" (maintenance=true) tracking The solution to the "time to live" problem of increased test time is to find a more granular way to track when an app can be safely destroyed. While we still want to have a minimum time to live for apps currently being tested, the API call that returns our list of app info also returns some metadata status including whether or not the app is in maintenance mode. The solution then means when an app is done (App#teardown! is called) then an API call will be made to set it to maintenance mode. This will communicate to the reaper it can be cleaned immediately and it does not have to wait for seven minutes. This creates a two queue system with a fast "finished" queue of apps that can be deleted right away, and a slower "unfinished" queue. The unfinished queue will only be checked after the finished queue is drained. When we try to delete an "unifinished" app we first check to see how long it's been alive. If it's older than the TTL then we can remove it right away. Otherwise the system will throttle itself gradually to give the system a chance to "finish" some apps. It will check again in 10s, 20s, 40s, 80s, ... until it hits the minimum time to live for the oldest "unfinished" app at which time it will then reset. ## Extra debugging info for the reaper All outputs from the reaper now include extra debugging information about the status of the system. Previously these values were being logged inconsistently making some of the above bugs harder to reason about.

edmorley · 2020-10-11T13:55:44Z

@schneems Can this be closed now that #97 has merged? :-)

schneems mentioned this issue Jul 30, 2020

App Reaper Redesign #97

Merged

schneems closed this as completed Oct 22, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix logic for reaping apps based on total app limits #90

Fix logic for reaping apps based on total app limits #90

schneems commented Jul 9, 2020 •

edited

Loading

schneems commented Jul 9, 2020

schneems commented Jul 9, 2020

edmorley commented Oct 11, 2020

Fix logic for reaping apps based on total app limits #90

Fix logic for reaping apps based on total app limits #90

Comments

schneems commented Jul 9, 2020 • edited Loading

schneems commented Jul 9, 2020

schneems commented Jul 9, 2020

edmorley commented Oct 11, 2020

schneems commented Jul 9, 2020 •

edited

Loading