Add info about our container setup #9

Open · wants to merge 3 commits into `main`
3 changes: 3 additions & 0 deletions knowledge_base/infrastructure.md
@@ -7,6 +7,9 @@ layout: default
{: .help}
> There is much more information about our infrastructure that still needs to be added

# Containerized Services
WCA services are containerized using Docker. You can find more about the setup [here](./infrastructure/containers.md).

# Caching and Database Optimizations

This is a list of things we are currently caching or using the Read replica for instead of using the primary database:
26 changes: 26 additions & 0 deletions knowledge_base/infrastructure/containers.md
@@ -0,0 +1,26 @@
---
title: Containerized Services
parent: Infrastructure
layout: default
---

# Images
> Don't manually push images to the ECR repository; this might cause the Rails image to get out of sync with the Sidekiq image.
> Instead, always use the GitHub workflows, which push both at the same time.
>
There are two Dockerfiles, one for the main website and one for the Sidekiq server; they differ marginally in dependencies and entrypoint.
The Dockerfiles are the same for staging and production, but we still tag the images separately so we can test changes on staging.
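As a rough sketch of what that looks like (the Dockerfile names and tags below are assumptions for illustration, not the actual workflow configuration; image builds and pushes should always go through the GitHub workflow):

```shell
# Illustrative only: both images are built from the same commit and
# tagged for a single environment, so they never drift apart.
docker build -f Dockerfile         -t wca-on-rails:staging         .
docker build -f Dockerfile.sidekiq -t wca-on-rails-sidekiq:staging .
```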
# Services
## Production
Production is divided into two services: the main Rails service and an auxiliary service running Sidekiq and phpMyAdmin.
This is done so we can scale production by scaling only the Rails container.
## Staging
Staging runs only a single service containing all three containers.
# Deployment
## Production
GitHub Releases are used to build new images for production.
To ensure we don't deploy broken code and to allow a smooth transition, we use blue-green deployment.
Whenever a new image is pushed to the `latest` tag in our `wca-on-rails` image repository, a pipeline is triggered which starts a new task in our production service.
When the new task comes up healthy and stays healthy for 5 minutes, traffic is diverted to it and the old task is terminated.
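To keep an eye on such a rollout, one option is to list the production service's recent events with the AWS CLI (a sketch, reusing the cluster and service names from the `aws ecs execute-command` examples in the deployment docs):

```shell
# Sketch: show the ten most recent service events during a rollout,
# e.g. new tasks starting, health checks passing, old tasks draining.
aws ecs describe-services \
  --cluster wca-on-rails \
  --services wca-on-rails-prod \
  --query 'services[0].events[:10].[createdAt,message]' \
  --output table
```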
## Staging
Staging is deployed manually and always uses the last image tagged with the `staging` tag in our `wca-on-rails` image repository. To deploy staging, comment `@the-wca-bot deploy staging` on the PR you want to deploy to staging.
54 changes: 0 additions & 54 deletions wst-processes/internal_processes.md
@@ -16,60 +16,6 @@ QUICK LINK: https://s3.console.aws.amazon.com/s3/object/wca-backups?region=us-we
3. Toggle "Show versions" to see a version history, and download the one you need
4. Bear in mind that you can just grep this file for basic data - you don't always need to load it into a SQL client


## Deprecated - tarsnap backups:

We backup the production database to tarsnap weekly.

The easiest way to get to them is by sshing to the production server and running:
```bash
tarsnap --list-archives | sort
```

>...
wca-backup-20170529_000922
wca-backup-20170605_001107
wca-backup-20170612_000909
wca-backup-20170619_000838
wca-backup-20170626_000934
wca-backup-20170703_001047
wca-backup-20170710_001104
wca-backup-20170717_001036
wca-backup-20170724_001003
wca-backup-20170731_000945


To get the contents of the database from June 12th, you can run the following command:

```bash
tarsnap -Oxf wca-backup-20170612_000909 secrets/wca_db/cubing.sql > /tmp/cubing.sql
```

Now that you have this file the safest is probably to use it locally (do not import it onto staging, or else the whole world will be able to access it via phpMyAdmin on staging!).

Before moving the file to another machine it's best to tar it so that it shrinks to a reasonable size:
```bash
/tmp @production> cd /tmp
/tmp @production> tar -czvf cubing.sql.tar.gz cubing.sql
```

Over on your laptop now, you can scp, extract the file and import it into a newly created database:

```bash
/tmp @yourlaptop> scp worldcubeassociation.org:/tmp/cubing.sql.tar.gz .
cubing.sql.tar.gz 100% 140MB 28.0MB/s 00:05
/tmp @yourlaptop> tar xvf cubing.sql.tar.gz
cubing.sql
/tmp @yourlaptop> mysql
mysql> create database wca_backup_20170612_000909;
Query OK, 1 row affected (0.01 sec)

mysql> use wca_backup_20170612_000909;
Database changed
mysql> source cubing.sql;
```


# Add/Remove Someone to/from WST

1. Add them to the Software Team on the WCA website: <https://www.worldcubeassociation.org/teams/2/edit>.
228 changes: 24 additions & 204 deletions wst-processes/managing_deployments.md
@@ -4,33 +4,24 @@ parent: WST Processes
layout: default
---

{: .warning}
> Large portions of this are based on the old Github Wiki, much of which may now be out of date.

{: .warning}
> ⚠️ **If you want to try out the commands/tasks described in this document (or any other) please
use the staging server for that, not the production one. You can easily distinguish between
these two by looking at the command line prompt (`~ @staging> ` vs `~ @production> `).**



# SSH to production

This process will work for any of our instances running on EC2 servers.
This process will automatically connect to the running production container.
Make sure you have the AWS CLI and the Session Manager Plugin installed.

1. Go to [running instances](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Instances:instanceState=running) and select the prod one
2. Click on "Connect" in top-right of the screen
3. Click "Connect" under the "Session Manager" tab (this is the tab which loads by default)
```shell
aws ecs execute-command --cluster wca-on-rails --task $(aws ecs list-tasks --cluster wca-on-rails --service-name wca-on-rails-prod --query 'taskArns[*]' --output text) --container rails-production --interactive --command '/bin/bash'
```

# Troubleshooting Production

The website is mostly Rails and Javascript, with a layer of nginx in front of them. There's nothing special about our nginx setup, so the internet is your friend if it looks like there's an nginx problem. It's very rare that anything goes wrong with nginx though.
The website is mostly Rails and Javascript.

Rails pages are the newer, Bootstrap-styled pages that have a navbar with sign-in and sign-out functionality.

## Debugging Rails

We run our Rails code on Puma. Rails logs are at `/home/cubing/worldcubeassociation.org/WcaOnRails/log/production.log`.
We run our Rails code on Unicorn. Rails logs are in CloudWatch, under the `wca-on-rails` cluster, the `wca-on-rails-prod` service and the `rails-production` container.
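If you prefer a terminal over the CloudWatch console, `aws logs tail` (AWS CLI v2) can stream those logs; the log group name below is an assumption, so list the groups first to find the real one:

```shell
# Find the log group for the Rails production container, then stream it.
aws logs describe-log-groups --query 'logGroups[].logGroupName' --output text
# '/ecs/wca-on-rails-prod' is a placeholder; substitute the group found above.
aws logs tail /ecs/wca-on-rails-prod --since 1h --follow
```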

[New Relic](https://one.newrelic.com/nr1-core/apm/overview/MTA2ODk1NHxBUE18QVBQTElDQVRJT058MTAzODUxNjM?account=1068954&duration=1800000&state=5b1b3f6a-f807-c41e-09cf-3c258213c127) can also be a useful tool for getting some insight into Rails - particularly for searching logs.

@@ -51,205 +42,40 @@ some example decent configs are

# Useful Commands to Know

How to run migrations on production or staging after deploying the latest code:

`~/worldcubeassociation.org/WcaOnRails @production> RAILS_ENV=production bin/rake db:migrate`

# General information

## Secrets
{: .help}
> WST recently transferred secrets management to Vault - this section needs to be updated.

- Production secrets are stored in an encrypted chef [data bag](https://docs.chef.io/data_bags.html) at `chef/data_bags/secrets/production.json`.
- Show secrets: `knife data bag show secrets production -c /etc/chef/solo.rb --secret-file secrets/my_secret_key`
- Edit secrets: `knife data bag edit secrets production -c /etc/chef/solo.rb --secret-file secrets/my_secret_key`

**Note:** in order to do this locally, you need to fetch the `secrets/my_secret_key` key from production, as well as fetch and edit the chef configuration file `/etc/chef/solo.rb` from production.

## Vagrant

{: .help}
> This section may well be years out of date - review by senior members needed.

The following sections assumes you are familiar with basic unix shell operations.

### Accessing the VM

When you put the VM up using `vagrant up noregs`, it will start the Rails application in a [screen](https://www.gnu.org/software/screen/) session, making the website available on [http://localhost:2331](http://localhost:2331).

You can ssh to the VM using `vagrant ssh noregs`.

### Accessing the screen session

Once you ssh-ed to the VM, you can resume the screen session using :
`screen -r`

From there you can use `^a "` (Ctrl-A + ") to list the windows in the session. The shell running the Rails application is the one named "run".

### Updating the database

If the last changes you fetched change the db schema, you will have to migrate the application.

For this you have to go to the application's root directory ("WcaOnRails") of the git repository.
It's a shared folder between the host and the VM, and it's mounted on `/vagrant/WcaOnRails` inside the VM.
(You can also just switch to the "dev" window in the screen session, which is shell in this directory)

From there you see what migrations are pending using :

```
RACK_ENV=development DATABASE_URL=mysql2://root:pentagon-pouncing-flared-trusted@localhost/cubing bundle exec rake db:migrate:status
```

And apply them using :

```
RACK_ENV=development DATABASE_URL=mysql2://root:pentagon-pouncing-flared-trusted@localhost/cubing bundle exec rake db:migrate
```


### Restarting the application

This is necessary if you changed the configuration files.
You can just stop the running application in the shell and go back in the history to get the command, for the record here it is :

```
RACK_ENV=development DATABASE_URL=mysql2://root:pentagon-pouncing-flared-trusted@localhost/cubing bundle exec rails server
```

### Running tests and filling the database

Wondering if your changes will pass the Travis build?

Go to the application directory and set up a test database:

```
RACK_ENV=test DATABASE_URL=mysql2://root:pentagon-pouncing-flared-trusted@localhost/cubing_test bundle exec rake db:reset
```

Then go to the application directory and run the testsuite yourself using :

```
RACK_ENV=test DATABASE_URL=mysql2://root:pentagon-pouncing-flared-trusted@localhost/cubing_test bundle exec rspec
```

### Looking up routes

If you modified the application routes, you probably want to check the existing routes and check that you didn't break anything.
From the application directory you can display the current routes using :

```
bundle exec rake routes
```

### Debugging

An easy way to inspect the application state is to use `byebug`.
You can basically put "byebug" somewhere in the code (preferably where you think it breaks :wink: ), and the application will stop on this point and open a `byebug` prompt in the application shell.
If you already know `gdb` the features and usage are similar, please take a look at the [guide](https://github.com/deivid-rodriguez/byebug/blob/master/GUIDE.md) for a detailed overview.

### Clearing Cache

The server uses filestore for caching and we'll need to periodically clear the cache.

Running `df -h` will show 100% usage on `/`. When this happens, we'll need to run the Rake command: `RACK_ENV=production bin/rake tmp:cache:clear`


Secrets are stored in HashiCorp Vault. Our ECS tasks authenticate to Vault using their IAM task roles.
These roles are mapped to a specific set of permissions.
You can read more about our Vault setup in the [Google Doc](https://docs.google.com/document/d/1ZszGawG70oZaTrXu5-gKxfS8HwwKG_0U9b-CQjKQ6RE/edit#heading=h.t6tsp7w7uomr).
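For reference, a Vault login with the AWS auth method followed by a secret read looks roughly like this from a shell inside a task; the role name and secret path are placeholders rather than our actual configuration:

```shell
# Sketch: authenticate with the task's IAM role, then read a secret.
# 'rails-production' and 'secret/wca/production' are placeholder names.
export VAULT_ADDR=https://vault.example.com
vault login -method=aws role=rails-production
vault kv get secret/wca/production
```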

# Operating the Website

Here you can find the most common operations necessary to keep the website running.

## Staging

## Checking the state of the app

{: .help}
> This is highly out of date, and requires a rewrite.

The website is essentially a system process responsible for handling incoming requests.
To see if the process(es) are running you can run the following command (list all processes then filter the relevant ones):

```shell
~ @staging> ps aux | grep unicorn
cubing 9115 0.0 4.2 531308 170704 ? Sl Dec06 0:07 unicorn master -D -c config/unicorn.rb
cubing 9161 0.0 4.5 599932 182536 ? Sl Dec06 0:24 unicorn worker[1] -D -c config/unicorn.rb
cubing 9165 0.0 4.5 599932 182368 ? Sl Dec06 0:24 unicorn worker[2] -D -c config/unicorn.rb
cubing 9169 0.0 6.6 666520 268332 ? Sl Dec06 0:24 unicorn worker[3] -D -c config/unicorn.rb
cubing 24181 0.0 4.2 599932 172732 ? Sl 15:07 0:00 unicorn worker[0] -D -c config/unicorn.rb
cubing 24607 0.0 0.0 10472 900 pts/4 S+ 15:18 0:00 grep unicorn
```

The second column is the process identifier (useful when you want to manage the process, e.g. terminate it),
whereas the last column is the actual command behind the process.

As you can see there are several processes, four workers and one master process supervising them
(so if any of them goes down, a new process is started).

To terminate the app you can use the `kill` command providing the process identifies (PID) of the master process.
In this case:
We have another container with a production-like setup that is used only for testing.
You can establish the connection in the same way; just use the staging service instead.

```shell
~ @staging> kill 9115
aws ecs execute-command --cluster wca-on-rails --task $(aws ecs list-tasks --cluster wca-on-rails --service-name wca-on-rails-staging --query 'taskArns[*]' --output text | awk -F/ '{print $NF}') --container rails-staging --interactive --command "/bin/bash"
```

Now you can verify that the processes are no longer there:

```shell
~ @staging> ps aux | grep unicorn
cubing 24633 0.0 0.0 10472 900 pts/4 S+ 15:19 0:00 grep unicorn
```
## Checking the state of the app

Usually that's what you see when the website is down.
We use health checks that automatically restart the server when it can't serve requests for a couple minutes.
You can check the metrics in the [health](https://us-west-2.console.aws.amazon.com/ecs/v2/clusters/wca-on-rails/services/wca-on-rails-prod/health?region=us-west-2) section of ECS.
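The same information is available from the command line; the sketch below reuses the cluster and service names from the `aws ecs execute-command` example above and simply asks ECS for the health status of the running production tasks:

```shell
# Sketch: list the production tasks, then print their status and health.
TASKS=$(aws ecs list-tasks --cluster wca-on-rails --service-name wca-on-rails-prod \
  --query 'taskArns[]' --output text)
aws ecs describe-tasks --cluster wca-on-rails --tasks $TASKS \
  --query 'tasks[].{task:taskArn,last:lastStatus,health:healthStatus}' --output table
```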

## (Re)starting the website

❗ *That's the most common way of fixing "the website is down" issues.*

To start the app (or restart if already running) you can run the deploy script like so:

```shell
~ @staging> worldcubeassociation.org/scripts/deploy.sh restart_app
```

Once this finishes you can verify it's up again:

```shell
~ @staging> ps aux | grep unicorn
cubing 24838 30.6 4.2 530784 170600 ? Sl 15:30 0:06 unicorn master -D -c config/unicorn.rb
cubing 24848 0.0 4.0 530784 164816 ? Sl 15:30 0:00 unicorn worker[0] -D -c config/unicorn.rb
cubing 24852 0.0 4.0 530784 164788 ? Sl 15:30 0:00 unicorn worker[1] -D -c config/unicorn.rb
cubing 24856 0.0 4.0 530784 164888 ? Sl 15:30 0:00 unicorn worker[2] -D -c config/unicorn.rb
cubing 24860 0.0 4.0 530784 164812 ? Sl 15:30 0:00 unicorn worker[3] -D -c config/unicorn.rb
cubing 24869 0.0 0.0 10472 900 pts/4 S+ 15:31 0:00 grep unicorn
```
The server is automatically restarted when the website is down.

## Talking to the database

{: .help}
> We have PhpMyAdmin available now - document how to use this instead.

You can connect to the MySQL database shell simply by running:

```shell
~ @staging> mysql cubing
```

Now you should see a new CLI prompt and be able to run SQL queries:

```shell
mysql> SELECT COUNT(*) FROM Competitions;
+----------+
| COUNT(*) |
+----------+
| 7202 |
+----------+
1 row in set (0.04 sec)
```

⚠️ **Feel free to play around with the database on the staging server,
however on the production server please use it only when necessary.
Even running some `SELECT` queries may hurt the application performance
if those queries are more complex.**
We use phpMyAdmin to talk to the database. If you have access to it, you can find [the link](https://www.worldcubeassociation.org/results/database/) to it in the main menu under your avatar.
You can find the username and generate a password in the [admin panel](https://www.worldcubeassociation.org/admin/generate_db_token).

### Looking at currently running queries

@@ -311,23 +137,17 @@ mysql> show processlist;
10 rows in set (0.00 sec)
```

You can also check what the longest-running queries are in the RDS console.
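If you would rather check from a database shell, a query against `information_schema.processlist` (a sketch; adjust the filter and limit to taste) lists the longest-running statements:

```shell
mysql> SELECT id, user, time, state, LEFT(info, 80) AS query
    ->   FROM information_schema.processlist
    ->  WHERE command <> 'Sleep'
    ->  ORDER BY time DESC
    ->  LIMIT 10;
```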

## Looking into the logs

{: .help}
> This can be done via NewRelic now.

There are several log files you may look into:

* `worldcubeassociation.org/WcaOnRails/log/production.log` - that's a very detailed log of the Rails application,
it includes SQL queries and details about requests that errored
* `/var/log/nginx/access.log` - this file includes one line per every incoming request,
so it's sometimes useful for monitoring the current traffic


## Changing our Google API KEY

You should go to the [developer console](https://console.developers.google.com/apis/credentials?project=wca-website&pli=1) (login with the wca.software google account which is in our credentials document), select the "WCA production key", and add the new server's IP (you should also remove the old server's IP).

* New Relic
* CloudWatch (see the search sketch below)
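For quick searches without opening either UI, CloudWatch logs can also be filtered from the CLI; the log group name here is a placeholder, with the same caveat as in the Debugging Rails section above:

```shell
# Sketch: find recent error lines in the Rails production log group.
aws logs filter-log-events \
  --log-group-name /ecs/wca-on-rails-prod \
  --filter-pattern "ERROR" \
  --max-items 50
```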

# Staging
> **Contributor comment:** Please add a description of how to connect to the staging console.

