Add info about our container setup #9

Open · wants to merge 3 commits into `main`
3 changes: 3 additions & 0 deletions knowledge_base/infrastructure.md
@@ -7,6 +7,9 @@ layout: default
{: .help}
> There is much more information about our infrastructure that still needs to be added

# Containerized Services
WCA services are containerized using Docker. You can find more about the setup [here](./infrastructure/containers.md).

# Caching and Database Optimizations

This is a list of things we are currently caching or using the Read replica for instead of using the primary database:
26 changes: 26 additions & 0 deletions knowledge_base/infrastructure/containers.md
@@ -0,0 +1,26 @@
---
title: Containerized Services
parent: Infrastructure
layout: default
---

# Images
> Don't manually push images to the ECR repository; this might cause the Rails image to get out of sync with the Sidekiq image.
> Instead, always use the GitHub workflows, which push both at the same time.
>
There are two Dockerfiles, one for the main website and one for the Sidekiq server; they differ marginally in dependencies and entrypoint.
The Dockerfiles are the same for staging and production, but we still tag the images separately so we can test changes on staging.
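As a rough sketch of what that looks like (the Dockerfile names and tags below are assumptions for illustration, not the actual workflow configuration; image builds and pushes should always go through the GitHub workflow):

```shell
# Illustrative only: both images are built from the same commit and
# tagged for a single environment, so they never drift apart.
docker build -f Dockerfile         -t wca-on-rails:staging         .
docker build -f Dockerfile.sidekiq -t wca-on-rails-sidekiq:staging .
```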
# Services
## Production
Production is divided into two services: the main Rails service and an auxiliary service running Sidekiq and phpMyAdmin.
This is done so we can scale production by scaling only the Rails container.
## Staging
Staging runs only a single service containing all three containers.
# Deployment
## Production
GitHub Releases are used to build new images for production.
To ensure we don't deploy broken code and to allow a smooth transition, we use blue-green deployment.
Whenever a new image is pushed to the `latest` tag in our `wca-on-rails` image repository, a pipeline is triggered which starts a new task in our production service.
When the new task comes up healthy and stays healthy for 5 minutes, traffic is diverted to it and the old task is terminated.
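To keep an eye on such a rollout, one option is to list the production service's recent events with the AWS CLI (a sketch, reusing the cluster and service names from the `aws ecs execute-command` examples in the deployment docs):

```shell
# Sketch: show the ten most recent service events during a rollout,
# e.g. new tasks starting, health checks passing, old tasks draining.
aws ecs describe-services \
  --cluster wca-on-rails \
  --services wca-on-rails-prod \
  --query 'services[0].events[:10].[createdAt,message]' \
  --output table
```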
## Staging
Staging is deployed manually and always uses the last image tagged with the `staging` tag in our `wca-on-rails` image repository. To deploy staging, comment `@the-wca-bot deploy staging` on the PR you want to deploy to staging.
54 changes: 0 additions & 54 deletions wst-processes/internal_processes.md
@@ -16,60 +16,6 @@ QUICK LINK: https://s3.console.aws.amazon.com/s3/object/wca-backups?region=us-we
3. Toggle "Show versions" to see a version history, and download the one you need
4. Bear in mind that you can just grep this file for basic data - you don't always need to load it into a SQL client


## Deprecated - tarsnap backups:

We backup the production database to tarsnap weekly.

The easiest way to get to them is by sshing to the production server and running:
```bash
tarsnap --list-archives | sort
```

>...
wca-backup-20170529_000922
wca-backup-20170605_001107
wca-backup-20170612_000909
wca-backup-20170619_000838
wca-backup-20170626_000934
wca-backup-20170703_001047
wca-backup-20170710_001104
wca-backup-20170717_001036
wca-backup-20170724_001003
wca-backup-20170731_000945


To get the contents of the database from June 12th, you can run the following command:

```bash
tarsnap -Oxf wca-backup-20170612_000909 secrets/wca_db/cubing.sql > /tmp/cubing.sql
```

Now that you have this file the safest is probably to use it locally (do not import it onto staging, or else the whole world will be able to access it via phpMyAdmin on staging!).

Before moving the file to another machine it's best to tar it so that it shrinks to a reasonable size:
```bash
/tmp @production> cd /tmp
/tmp @production> tar -czvf cubing.sql.tar.gz cubing.sql
```

Over on your laptop now, you can scp, extract the file and import it into a newly created database:

```bash
/tmp @yourlaptop> scp worldcubeassociation.org:/tmp/cubing.sql.tar.gz .
cubing.sql.tar.gz 100% 140MB 28.0MB/s 00:05
/tmp @yourlaptop> tar xvf cubing.sql.tar.gz
cubing.sql
/tmp @yourlaptop> mysql
mysql> create database wca_backup_20170612_000909;
Query OK, 1 row affected (0.01 sec)

mysql> use wca_backup_20170612_000909;
Database changed
mysql> source cubing.sql;
```


# Add/Remove Someone to/from WST

1. Add them to the Software Team on the WCA website: <https://www.worldcubeassociation.org/teams/2/edit>.
228 changes: 24 additions & 204 deletions wst-processes/managing_deployments.md
@@ -4,33 +4,24 @@ parent: WST Processes
layout: default
---

{: .warning}
> Large portions of this are based on the old Github Wiki, much of which may now be out of date.

{: .warning}
> ⚠️ **If you want to try out the commands/tasks described in this document (or any other) please
use the staging server for that, not the production one. You can easily distinguish between
these two by looking at the command line prompt (`~ @staging> ` vs `~ @production> `).**



# SSH to production

This process will work for any of our instances running on EC2 servers.
This process will automatically connect to the running production container.
Make sure you have the AWS CLI and the Session Manager Plugin installed.

1. Go to [running instances](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Instances:instanceState=running) and select the prod one
2. Click on "Connect" in top-right of the screen
3. Click "Connect" under the "Session Manager" tab (this is the tab which loads by default)
```shell
aws ecs execute-command --cluster wca-on-rails --task $(aws ecs list-tasks --cluster wca-on-rails --service-name wca-on-rails-prod --query 'taskArns[*]' --output text) --container rails-production --interactive --command '/bin/bash'
```

# Troubleshooting Production

The website is mostly Rails and Javascript, with a layer of nginx in front of them. There's nothing special about our nginx setup, so the internet is your friend if it looks like there's an nginx problem. It's very rare that anything goes wrong with nginx though.
The website is mostly Rails and Javascript.

Rails pages are the newer, Bootstrap-styled pages that have a navbar with sign-in and sign-out functionality.

## Debugging Rails

We run our Rails code on Puma. Rails logs are at `/home/cubing/worldcubeassociation.org/WcaOnRails/log/production.log`.
We run our Rails code on Unicorn. Rails logs are in CloudWatch, under the `wca-on-rails` cluster, the `wca-on-rails-prod` service and the `rails-production` container.
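If you prefer a terminal over the CloudWatch console, `aws logs tail` (AWS CLI v2) can stream those logs; the log group name below is an assumption, so list the groups first to find the real one:

```shell
# Find the log group for the Rails production container, then stream it.
aws logs describe-log-groups --query 'logGroups[].logGroupName' --output text
# '/ecs/wca-on-rails-prod' is a placeholder; substitute the group found above.
aws logs tail /ecs/wca-on-rails-prod --since 1h --follow
```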

[New Relic](https://one.newrelic.com/nr1-core/apm/overview/MTA2ODk1NHxBUE18QVBQTElDQVRJT058MTAzODUxNjM?account=1068954&duration=1800000&state=5b1b3f6a-f807-c41e-09cf-3c258213c127) can also be a useful tool for getting some insight into Rails - particularly for searching logs.

@@ -51,205 +42,40 @@ some example decent configs are

# Useful Commands to Know

How to run migrations on production or staging after deploying the latest code:

`~/worldcubeassociation.org/WcaOnRails @production> RAILS_ENV=production bin/rake db:migrate`

# General information

## Secrets
{: .help}
> WST recently transferred secrets management to Vault - this section needs to be updated.

- Production secrets are stored in an encrypted chef [data bag](https://docs.chef.io/data_bags.html) at `chef/data_bags/secrets/production.json`.
- Show secrets: `knife data bag show secrets production -c /etc/chef/solo.rb --secret-file secrets/my_secret_key`
- Edit secrets: `knife data bag edit secrets production -c /etc/chef/solo.rb --secret-file secrets/my_secret_key`

**Note:** in order to do this locally, you need to fetch the `secrets/my_secret_key` key from production, as well as fetch and edit the chef configuration file `/etc/chef/solo.rb` from production.

## Vagrant

{: .help}
> This section may well be years out of date - review by senior members needed.

The following sections assumes you are familiar with basic unix shell operations.

### Accessing the VM

When you put the VM up using `vagrant up noregs`, it will start the Rails application in a [screen](https://www.gnu.org/software/screen/) session, making the website available on [http://localhost:2331](http://localhost:2331).

You can ssh to the VM using `vagrant ssh noregs`.

### Accessing the screen session

Once you ssh-ed to the VM, you can resume the screen session using :
`screen -r`

From there you can use `^a "` (Ctrl-A + ") to list the windows in the session. The shell running the Rails application is the one named "run".

### Updating the database

If the last changes you fetched change the db schema, you will have to migrate the application.

For this you have to go to the application's root directory ("WcaOnRails") of the git repository.
It's a shared folder between the host and the VM, and it's mounted on `/vagrant/WcaOnRails` inside the VM.
(You can also just switch to the "dev" window in the screen session, which is shell in this directory)

From there you see what migrations are pending using :

```
RACK_ENV=development DATABASE_URL=mysql2://root:pentagon-pouncing-flared-trusted@localhost/cubing bundle exec rake db:migrate:status
```

And apply them using :

```
RACK_ENV=development DATABASE_URL=mysql2://root:pentagon-pouncing-flared-trusted@localhost/cubing bundle exec rake db:migrate
```


### Restarting the application

This is necessary if you changed the configuration files.
You can just stop the running application in the shell and go back in the history to get the command, for the record here it is :

```
RACK_ENV=development DATABASE_URL=mysql2://root:pentagon-pouncing-flared-trusted@localhost/cubing bundle exec rails server
```

### Running tests and filling the database

Wondering if your changes will pass the Travis build?

Go to the application directory and set up a test database:

```
RACK_ENV=test DATABASE_URL=mysql2://root:pentagon-pouncing-flared-trusted@localhost/cubing_test bundle exec rake db:reset
```

Then go to the application directory and run the testsuite yourself using :

```
RACK_ENV=test DATABASE_URL=mysql2://root:pentagon-pouncing-flared-trusted@localhost/cubing_test bundle exec rspec
```

### Looking up routes

If you modified the application routes, you probably want to check the existing routes and check that you didn't break anything.
From the application directory you can display the current routes using :

```
bundle exec rake routes
```

### Debugging

An easy way to inspect the application state is to use `byebug`.
You can basically put "byebug" somewhere in the code (preferably where you think it breaks :wink: ), and the application will stop on this point and open a `byebug` prompt in the application shell.
If you already know `gdb` the features and usage are similar, please take a look at the [guide](https://github.com/deivid-rodriguez/byebug/blob/master/GUIDE.md) for a detailed overview.

### Clearing Cache

The server uses filestore for caching and we'll need to periodically clear the cache.

Running `df -h` will show 100% usage on `/`. When this happens, we'll need to run the Rake command: `RACK_ENV=production bin/rake tmp:cache:clear`


Secrets are stored in HashiCorp Vault. Our ECS tasks authenticate to Vault using their IAM task roles.
These roles are mapped to a specific set of permissions.
You can read more about our Vault setup in the [Google Doc](https://docs.google.com/document/d/1ZszGawG70oZaTrXu5-gKxfS8HwwKG_0U9b-CQjKQ6RE/edit#heading=h.t6tsp7w7uomr).
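For reference, a Vault login with the AWS auth method followed by a secret read looks roughly like this from a shell inside a task; the role name and secret path are placeholders rather than our actual configuration:

```shell
# Sketch: authenticate with the task's IAM role, then read a secret.
# 'rails-production' and 'secret/wca/production' are placeholder names.
export VAULT_ADDR=https://vault.example.com
vault login -method=aws role=rails-production
vault kv get secret/wca/production
```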

# Operating the Website

Here you can find the most common operations necessary to keep the website running.

## Staging

## Checking the state of the app

{: .help}
> This is highly out of date, and requires a rewrite.

The website is essentially a system process responsible for handling incoming requests.
To see if the process(es) are running you can run the following command (list all processes then filter the relevant ones):

```shell
~ @staging> ps aux | grep unicorn
cubing 9115 0.0 4.2 531308 170704 ? Sl Dec06 0:07 unicorn master -D -c config/unicorn.rb
cubing 9161 0.0 4.5 599932 182536 ? Sl Dec06 0:24 unicorn worker[1] -D -c config/unicorn.rb
cubing 9165 0.0 4.5 599932 182368 ? Sl Dec06 0:24 unicorn worker[2] -D -c config/unicorn.rb
cubing 9169 0.0 6.6 666520 268332 ? Sl Dec06 0:24 unicorn worker[3] -D -c config/unicorn.rb
cubing 24181 0.0 4.2 599932 172732 ? Sl 15:07 0:00 unicorn worker[0] -D -c config/unicorn.rb
cubing 24607 0.0 0.0 10472 900 pts/4 S+ 15:18 0:00 grep unicorn
```

The second column is the process identifier (useful when you want to manage the process, e.g. terminate it),
whereas the last column is the actual command behind the process.

As you can see there are several processes, four workers and one master process supervising them
(so if any of them goes down, a new process is started).

To terminate the app you can use the `kill` command providing the process identifies (PID) of the master process.
In this case:
We have another container with a production-like setup that is used only for testing.
You can establish the connection in the same way; just use the staging service instead.

```shell
~ @staging> kill 9115
aws ecs execute-command --cluster wca-on-rails --task $(aws ecs list-tasks --cluster wca-on-rails --service-name wca-on-rails-staging --query 'taskArns[*]' --output text | awk -F/ '{print $NF}') --container rails-staging --interactive --command "/bin/bash"
```

Now you can verify that the processes are no longer there:

```shell
~ @staging> ps aux | grep unicorn
cubing 24633 0.0 0.0 10472 900 pts/4 S+ 15:19 0:00 grep unicorn
```
## Checking the state of the app

Usually that's what you see when the website is down.
We use health checks that automatically restart the server when it can't serve requests for a couple minutes.
You can check the metrics in the [health](https://us-west-2.console.aws.amazon.com/ecs/v2/clusters/wca-on-rails/services/wca-on-rails-prod/health?region=us-west-2) section of ECS.
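The same information is available from the command line; the sketch below reuses the cluster and service names from the `aws ecs execute-command` example above and simply asks ECS for the health status of the running production tasks:

```shell
# Sketch: list the production tasks, then print their status and health.
TASKS=$(aws ecs list-tasks --cluster wca-on-rails --service-name wca-on-rails-prod \
  --query 'taskArns[]' --output text)
aws ecs describe-tasks --cluster wca-on-rails --tasks $TASKS \
  --query 'tasks[].{task:taskArn,last:lastStatus,health:healthStatus}' --output table
```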

## (Re)starting the website

❗ *That's the most common way of fixing "the website is down" issues.*

To start the app (or restart if already running) you can run the deploy script like so:

```shell
~ @staging> worldcubeassociation.org/scripts/deploy.sh restart_app
```

Once this finishes you can verify it's up again:

```shell
~ @staging> ps aux | grep unicorn
cubing 24838 30.6 4.2 530784 170600 ? Sl 15:30 0:06 unicorn master -D -c config/unicorn.rb
cubing 24848 0.0 4.0 530784 164816 ? Sl 15:30 0:00 unicorn worker[0] -D -c config/unicorn.rb
cubing 24852 0.0 4.0 530784 164788 ? Sl 15:30 0:00 unicorn worker[1] -D -c config/unicorn.rb
cubing 24856 0.0 4.0 530784 164888 ? Sl 15:30 0:00 unicorn worker[2] -D -c config/unicorn.rb
cubing 24860 0.0 4.0 530784 164812 ? Sl 15:30 0:00 unicorn worker[3] -D -c config/unicorn.rb
cubing 24869 0.0 0.0 10472 900 pts/4 S+ 15:31 0:00 grep unicorn
```
The server is automatically restarted when the website is down.

## Talking to the database

{: .help}
> We have PhpMyAdmin available now - document how to use this instead.

You can connect to the MySQL database shell simply by running:

```shell
~ @staging> mysql cubing
```

Now you should see a new CLI prompt and be able to run SQL queries:

```shell
mysql> SELECT COUNT(*) FROM Competitions;
+----------+
| COUNT(*) |
+----------+
| 7202 |
+----------+
1 row in set (0.04 sec)
```

⚠️ **Feel free to play around with the database on the staging server,
however on the production server please use it only when necessary.
Even running some `SELECT` queries may hurt the application performance
if those queries are more complex.**
We use phpMyAdmin to talk to the database. If you have access to it, you can find [the link](https://www.worldcubeassociation.org/results/database/) to it in the main menu under your avatar.
You can find the username and generate a password in the [admin panel](https://www.worldcubeassociation.org/admin/generate_db_token).

### Looking at currently running queries

@@ -311,23 +137,17 @@ mysql> show processlist;
10 rows in set (0.00 sec)
```

You can also check what the longest-running queries are in the RDS console.
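If you would rather check from a database shell, a query against `information_schema.processlist` (a sketch; adjust the filter and limit to taste) lists the longest-running statements:

```shell
mysql> SELECT id, user, time, state, LEFT(info, 80) AS query
    ->   FROM information_schema.processlist
    ->  WHERE command <> 'Sleep'
    ->  ORDER BY time DESC
    ->  LIMIT 10;
```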

## Looking into the logs

{: .help}
> This can be done via NewRelic now.

There are several log files you may look into:

* `worldcubeassociation.org/WcaOnRails/log/production.log` - that's a very detailed log of the Rails application,
it includes SQL queries and details about requests that errored
* `/var/log/nginx/access.log` - this file includes one line per every incoming request,
so it's sometimes useful for monitoring the current traffic


## Changing our Google API KEY

You should go to the [developer console](https://console.developers.google.com/apis/credentials?project=wca-website&pli=1) (login with the wca.software google account which is in our credentials document), select the "WCA production key", and add the new server's IP (you should also remove the old server's IP).

* New Relic
* CloudWatch (see the search sketch below)
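For quick searches without opening either UI, CloudWatch logs can also be filtered from the CLI; the log group name here is a placeholder, with the same caveat as in the Debugging Rails section above:

```shell
# Sketch: find recent error lines in the Rails production log group.
aws logs filter-log-events \
  --log-group-name /ecs/wca-on-rails-prod \
  --filter-pattern "ERROR" \
  --max-items 50
```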

# Staging
> **Contributor comment:** Please add a description of how to connect to the staging console.

