App Reaper Redesign

## Fix Global Heroku app Limits Issue: #90 When the reaper was originally introduced there was a limit of 100 apps per Heroku account. As of today my account has 224 apps, because it is connected to multiple organizations. This means when I run with my local account, every time `reaper.cycle` is called it deletes a hatchet app even though it doesn't need to. The "fix" for this is to remove the HEROKU_APP_LIMIT setting and instead trying to detect when we've been going over the limit. This was accomplished by forcing the reaper to reap once when an exception is raised in the creation of an app. (Note reaping does not guarantee deletion of an app if none are "finished" or old enough). It can also trigger the issue with "minimum time to live" listed below if a parallel run was attempted locally. ## Fix Minimum time to live Issue: #89 The problem looks like this, in a deploy during a test on test run on CI will end up deleting an app currently being deployed, this ends up generating an error like this: ``` remote: ! No such app as hatchet-t-522c335bea. ``` The most common reason this happens is that there are more hatchet apps attempting to be deployed than are configured for the limit. For example: - A test suite can concurrently run 20 tests and has a HATCHET_APP_LIMIT=80. Each test run on CI executes the same tests against 3 stacks meaning each CI run executes 60 app deploys concurrently - If two test runs are executing concurrently the required number of apps is 2x 60 which is 120 which is over the limit of 80. When this happens a reaper will likely destroy an app as it is in progress. - Keep in mind when merging tests to master, the suite may be re-run, or when pushing to two different branches. If multiple engineers are working on PRs at the same time this can also happen... This problem is especially bad given the distributed nature of our tests, and given that tests have built in retries. It ends up really straining the system and taking a long time to exit if the developer doesn't notice and manually kill one of the extra test runs. This problem also means that the developer needs to always know the maximum number of concurrent apps that are likely to be deployed at a time against a given API key. This is extremely difficult to do in a distributed setting as illustrated above. The solution proposed in #89 is to introduce a minimum "time to live" for apps. The idea is that all tests should be able to finish within this window. If we guarantee that all apps live for at least that long, then even if we accidentally go over our assigned limits, then the system will slow down instead of failing. Originally implemented in #84 it was later reverted. The behavior is not great since most tests exit very quickly around 1-3 minutes. Forcing ALL apps to live for a minimum of 7 minutes effectively meant that a single CI run could not take fewer than seven minutes (if it's close to using the HATCHET_APP_LIMIT on each CI run). When merged in, it effectively took a CI run on master to execute for about 15+ minutes which was not acceptable. ## Fix Minimum time to live + "finished" (maintenance=true) tracking The solution to the "time to live" problem of increased test time is to find a more granular way to track when an app can be safely destroyed. While we still want to have a minimum time to live for apps currently being tested, the API call that returns our list of app info also returns some metadata status including whether or not the app is in maintenance mode. The solution then means when an app is done (App#teardown! is called) then an API call will be made to set it to maintenance mode. This will communicate to the reaper it can be cleaned immediately and it does not have to wait for seven minutes. This creates a two queue system with a fast "finished" queue of apps that can be deleted right away, and a slower "unfinished" queue. The unfinished queue will only be checked after the finished queue is drained. When we try to delete an "unifinished" app we first check to see how long it's been alive. If it's older than the TTL then we can remove it right away. Otherwise the system will throttle itself gradually to give the system a chance to "finish" some apps. It will check again in 10s, 20s, 40s, 80s, ... until it hits the minimum time to live for the oldest "unfinished" app at which time it will then reset. ## Extra debugging info for the reaper All outputs from the reaper now include extra debugging information about the status of the system. Previously these values were being logged inconsistently making some of the above bugs harder to reason about.
heroku · Aug 3, 2020 · 688076e · 688076e
1 parent eb0b1e9
commit 688076e
Show file tree

Hide file tree

Showing 8 changed files with 500 additions and 85 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,9 @@
 ## HEAD
 
+- Apps are now marked as being "finished" by enabling maintenance mode on them when `teardown!` is called. Finished apps can be reaped immediately (https://github.com/heroku/hatchet/pull/97)
+- Applications that are not marked as "finished" will be allowed to live for a HATCHET_ALIVE_TTL_MINUTES duration before they're deleted by the reaper to protect against deleting an app mid-deploy, default is seven minutes (https://github.com/heroku/hatchet/pull/97)
+- The HEROKU_APP_LIMIT env var no longer does anything, instead hatchet application reaping is manually executed if an app cannot be created (https://github.com/heroku/hatchet/pull/97)
+- App#deploy without a block will no longer run `teardown!` automatically (https://github.com/heroku/hatchet/pull/97)
 - Calls to `git push heroku` are now rate throttled (https://github.com/heroku/hatchet/pull/98)
 - Deployment now raises and error when the release failed (https://github.com/heroku/hatchet/pull/93)
 

diff --git a/README.md b/README.md
@@ -310,11 +310,12 @@ $ heroku run bash -a hatchet-t-bed73940a6
 And use that to debug. Hatchet deletes old apps on demand. You tell it what your limits are and it will stay within those limits:
 
 ```
-HEROKU_APP_LIMIT=100
 HATCHET_APP_LIMIT=20
 ```
 
-With these env vars, Hatchet will "reap" older hatchet apps when it sees there are 20 or more hatchet apps, or if you have 100 or more apps under your user account. For CI, it's recomment you increase the HATCHET_APP_LIMIT to 80-100. If these values are too low and you're running hatchet tests in parallel then one execution of the reaper from one test run might cause an app that is still being used for a test to be deleted.
+With these env vars, Hatchet will "reap" older hatchet apps when it sees there are 20 or more hatchet apps. For CI, it's recomment you increase the HATCHET_APP_LIMIT to 80-100. Hatchet will mark apps as safe for deletion once they've finished and the `teardown!` method has been called on them (it tracks this by enabling maintenance mode on apps). Hatchet only tracks its own apps. If your account has reached the maximum number of global Heroku apps, you'll need to manually remove some.
+
+If for some reason an app is not marked as being in maintenance mode it can be deleted, but only after it has been allowed to live for a period of time. This is configured by the `HATCHET_ALIVE_TTL_MINUTES` env var. For example if you set it for `7` then Hatchet will ensure any apps that are not marked as being in maintenace mode are allowed to live for at least seven minutes. This should give the app time to finish execution of the test so it is not deleted mid-deploy. When this deletion happens, you'll see a warning in your output. It could indicate you're not properly cleaning up and calling `teardown!` on some of your apps, or it could mean that you're attempting to execute more tests concurrently than your `HATCHET_APP_LIMIT` allows. This might happen if you have multiple CI runs executing at the same time.
 
 It's recommend you don't use your personal Heroku API key for running tests on a CI server since the hatchet apps count against your account maximum limits. Running tests using your account locally is fine for debugging one or two tests.
 
@@ -336,11 +337,10 @@ Hatchet::Runner.new("python_default").deploy do |app|
   expect(app.output).to match(/Installing pip/)
 
   # Redeploy with changed requirements file
-  run!(%Q{echo "" >> requirements.txt})
   run!(%Q{echo "pygments" >> requirements.txt})
-  run!(%Q{git add . ; git commit --allow-empty -m next})
+  app.commit!
 
-  app.push!
+  app.push! # <======= HERE
 
   expect(app.output).to match("Requirements file has been changed, clearing cached dependencies")
 end
@@ -350,7 +350,6 @@ end
 
 You can run an app against CI using the `run_ci` command (instead of `deploy`). You can re-run tests against the same app with the `run_again` command.
 
-
 ```ruby
 Hatchet::Runner.new("python_default").run_ci do |test_run|
   expect(test_run.output).to match("Downloading nose")
@@ -576,11 +575,28 @@ end
 
 - `app.in_directory_fork`: Runs the given block in a temp directory and inside of a forked process
 - `app.directory`: Returns the current temp directory the appp is in.
-- `app.deploy`: Your main method, takes a block to execute after the deploy is successful
+- `app.deploy`: Your main method, takes a block to execute after the deploy is successful. If no block is provided you must manually call `app.teardown! (see below for an example).
 - `app.output`: The output contents of the deploy
 - `app.platform_api`: Returns an instance of the [platform-api Heroku client](https://github.com/heroku/platform-api). If hatchet doesn't give you access to a part of Heroku that you need, you can likely do it with the platform-api client.
 - `app.push!`: Push code to your heroku app. Can be used inside of a `deploy` block to re-deploy.
 - `app.run_ci`: Runs Heroku CI against the app, returns a TestRun object in the block
+- `app.teardown!`: This method is called automatically when using `app.deploy` in block mode after the deploy block finishes. When called it will clean up resources, mark the app as being finished (by setting {maintenance" => true} on the app) so that the reaper knows it is safe to delete later. Here is an example of a test that creates and deploys an app manually, then later tears it down manually. If you deploy an application without calling `teardown!` then Hatchet will not know it is safe to delete and may keep it around for much longer than required for the test to finish.
+
+```ruby
+before(:all) do
+  @app = Hatchet::Runner.new("default_ruby")
+  @app.deploy
+end
+
+after(:all) do
+  @app.teardown! if @app
+end
+
+it "uses ruby" do
+  expect(@app.run("ruby -v")).to match("ruby")
+end
+
+```
 - `test_run.run_again`: Runs the app again in Heroku CI
 - `test_run.status`: Returns the status of the CI run (possible values are `:pending`, `:building`, `:creating`, `:succeeded`, `:failed`, `:errored`)
 - `test_run.output`: The output of a given test run
@@ -591,21 +607,21 @@ end
 HATCHET_BUILDPACK_BASE=https://github.com/heroku/heroku-buildpack-nodejs.git
 HATCHET_BUILDPACK_BRANCH=<branch name if you dont want hatchet to set it for you>
 HATCHET_RETRIES=2
-HEROKU_APP_LIMIT=100
 HATCHET_APP_LIMIT=(set to something low like 20 locally, set higher like 80-100 on CI)
 HEROKU_API_KEY=<redacted>
 HEROKU_API_USER=<[email protected]>
+HATCHET_ALIVE_TTL_MINUTES=7
 ```
 
 > The syntax to set an env var in Ruby is `ENV["HATCHET_RETRIES"] = "2"` all env vars are strings.
 
 - `HATCHET_BUILDPACK_BASE`: This is the URL where hatchet can find your buildpack. It must be public for Heroku to be able to use your buildpack.
 - `HATCHET_BUILDPACK_BRANCH`: By default Hatchet will use your current git branch name. If for some reason git is not available or you want to manually specify it like `ENV["HATCHET_BUILDPACK_BRANCH'] = ENV[`MY_CI_BRANCH`]` then you can.
 - `HATCHET_RETRIES` If the `ENV['HATCHET_RETRIES']` is set to a number, deploys are expected to work and automatically retry that number of times. Due to testing using a network and random failures, setting this value to `3` retries seems to work well. If an app cannot be deployed within its allotted number of retries, an error will be raised. The downside of a larger number is that your suite will keep running for much longer when there are legitimate failures.
-- `HEROKU_APP_LIMIT`: The maximum number of total apps hatchet will allow in the given account before running the reaper
 - `HATCHET_APP_LIMIT`: The maximum number of **hatchet** apps that hatchet will allow in the given account before running the reaper. For local execution keep this low as you don't want your account dominated by hatchet apps. For CI you want it to be much larger, 80-100 since it's not competiting with non-hatchet apps. Your test runner account needs to be a dedicated account.
 - `HEROKU_API_KEY`: The api key of your test account user. If you run locally without this set it will use your personal credentials.
 - `HEROKU_API_USER`: The email address of your user account. If you run locally without this set it will use your personal credentials.
+- `HATCHET_ALIVE_TTL_MINUTES`: The minimum time that hatchet appplications must be allowed to live on a given account if they're not marked by being in maintenance mode. For example if you set this value to 3 it guarantees that a Hatchet app will be allowed to live 3 minutes before Hatchet will try to delete it. Default is 7 minutes. Set to zero to disable.
 
 ## Basic
 
@@ -631,8 +647,7 @@ end
 
 - **spec/hatchet/buildpack_spec.rb**
 
-Rspec knows a file is a test file or not by the name. It looks for files that end in `_spec.rb` you can have as many as you want. I recommend putting them in a "spec/hatchet" sub-folder.
-
+Rspec knows a file is a test file or not by the name. It looks for files that end in `spec.rb` you can have as many as you want. I recommend putting them in a "spec/hatchet" sub-folder.
 
 - **File contents**
 
@@ -676,7 +691,6 @@ If you want to assert the opposite you can use `to_not`:
 expect(value).to_not eq(false)
 ```
 
-
 - **matcher syntax**
 
 In the above example the `eq` is called a "matcher". You're matching it against an object. In this case you're looking for equality `==`.
@@ -694,7 +708,6 @@ Rspec uses some "magic" to convert anything you pass to
 
 Since most values in hatchet are strings, the ones I use the most are:
 
-
 - Rspec matchers
   - include https://relishapp.com/rspec/rspec-expectations/v/3-2/docs/built-in-matchers/include-matcher#string-usage
   - match https://relishapp.com/rspec/rspec-expectations/v/3-2/docs/built-in-matchers/match-matcher
@@ -822,7 +835,6 @@ puts my_hash.inspect
 
 Blocks are a concept in Ruby for closure. Depending on how it's used it can be an anonomous method. It's always a method for passing around code. When you see `do |app|` that's the beginning of an implicit block. In addition to an implicit block you can create an explicit block using lambdas and procs. In hatchet, these are most likely to be used to update the app `before_deploy`. Here's an example of some syntax for creating various blocks.
 
-
 ```ruby
 before_deploy = -> { FileUtils.touch("foo.txt") } # This syntax is called a "stabby lambda"
 before_deploy = lambda { FileUtils.touch("foo.txt") } # This is a more verbose lambda
@@ -881,7 +893,6 @@ the command by going to the source code directory and running:
 
     $ ./bin/hatchet --help
 
-
 ## License
 
 MIT
diff --git a/lib/hatchet/app.rb b/lib/hatchet/app.rb
@@ -9,7 +9,7 @@ class App
     HATCHET_BUILDPACK_BRANCH = -> { ENV['HATCHET_BUILDPACK_BRANCH'] || ENV['HEROKU_TEST_RUN_BRANCH'] || Hatchet.git_branch }
     BUILDPACK_URL = "https://github.com/heroku/heroku-buildpack-ruby.git"
 
-    attr_reader :name, :stack, :directory, :repo_name, :app_config, :buildpacks
+    attr_reader :name, :stack, :directory, :repo_name, :app_config, :buildpacks, :reaper
 
     class FailedDeploy < StandardError; end
 
@@ -182,24 +182,26 @@ def not_debugging?
     alias :no_debug? :not_debugging?
 
     def deployed?
-      # !heroku.get_ps(name).body.detect {|ps| ps["process"].include?("web") }.nil?
       api_rate_limit.call.formation.list(name).detect {|ps| ps["type"] == "web"}
     end
 
     def create_app
       3.times.retry do
         begin
-          # heroku.post_app({ name: name, stack: stack }.delete_if {|k,v| v.nil? })
           hash = { name: name, stack: stack }
           hash.delete_if { |k,v| v.nil? }
-          api_rate_limit.call.app.create(hash)
+          heroku_api_create_app(hash)
         rescue => e
-          @reaper.cycle
+          @reaper.cycle(app_exception_message: e.message)
           raise e
         end
       end
     end
 
+    private def heroku_api_create_app(hash)
+      api_rate_limit.call.app.create(hash)
+    end
+
     def update_stack(stack_name)
       @stack = stack_name
       api_rate_limit.call.app.update(name, build_stack: @stack)
@@ -239,10 +241,9 @@ def push_without_retry!
 
     def teardown!
       return false unless @app_is_setup
-      if debugging?
-        puts "Debugging App:#{name}"
-        return false
-      end
+
+      @app_update_info = platform_api.app.update(name, { maintenance: true })
+
       @reaper.cycle
     end
 
@@ -285,18 +286,14 @@ def in_directory_fork(&block)
       end
     end
 
-    # creates a new app on heroku, "pushes" via anvil or git
-    # then yields to self so you can call self.run or
-    # self.deployed?
-    # Allow deploy failures on CI server by setting ENV['HATCHET_RETRIES']
     def deploy(&block)
       in_directory do
         self.setup!
         self.push_with_retry!
         block.call(self, api_rate_limit.call, output) if block_given?
       end
     ensure
-      self.teardown!
+      self.teardown! if block_given?
     end
 
     def push