DevOps and Running the Application

Bidyashish edited this page Jan 13, 2025 · 120 revisions

DevOps

This document explains the DevOps setup and utilities for the AEST/SIMS project.

N.B.: ROOT means the repository root directory.

Prerequisites

  1. OpenShift namespace
  2. OpenShift Client (OC CLI)
  3. Keycloak realm
  4. Docker (for local development only)
  5. Make cmd (for local development only - windows users)
  • 5.1 Install Chocolatey first in order to install 'make'. In a CMD terminal execute:
@"%SystemRoot%\System32\WindowsPowerShell\v1.0\powershell.exe" -NoProfile -InputFormat None -ExecutionPolicy Bypass -Command "[System.Net.ServicePointManager]::SecurityProtocol = 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))" && SET "PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin"

or refer to: https://docs.chocolatey.org/en-us/choco/setup

  • 5.2 Install make using the command:
choco install make
  6. Test cmd (you can comment it out when using as well - windows users)

Local Development

  1. Clone the repo to the local machine: git clone https://github.com/bcgov/SIMS
  2. Create a .env file in the repository root dir. For reference, check /config/env-example on Microsoft Teams files. Then run all make commands from the /sources dir (see below).
  3. To build the application: make local-build
  4. To run all web+api+DB: make local
  5. To stop all application stack: make stop
  6. To clean all applications including storage: make local-clean
  7. To run database only: make postgres
  8. To run api with database: make local-api
  9. Shell into local api container: make api
  10. Run api test on Docker: make test-api
  11. To run local redis: make local-redis or make redis (local redis is required to run the application)
  12. To run queue-consumers in local docker: make queue-consumers
  13. To run forms in local docker: make forms
  14. To run clamav in local docker: make clamav
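
The commands above can be chained for a first-time local start-up. The following is a minimal sketch, not part of the repository: run, local_up, SIMS_ROOT and DRY_RUN are hypothetical names introduced for illustration, and the order (build, then redis, then the full stack) is an assumption based on the note that local redis is required to run the application.

```shell
#!/usr/bin/env bash
# Sketch only: wraps make targets listed in this section.
# SIMS_ROOT (path to the cloned repo) and DRY_RUN are hypothetical variables.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Build the images, start redis, then start web+api+DB.
# make -C runs each target from the sources dir, as the steps above require.
local_up() {
  local sources_dir="${SIMS_ROOT:-.}/sources"
  run make -C "$sources_dir" local-build &&
    run make -C "$sources_dir" local-redis &&
    run make -C "$sources_dir" local
}
```

Calling DRY_RUN=1 local_up prints the three make invocations without running them, which is a cheap way to check the paths before starting the stack.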

For Camunda

  1. For Camunda, run make camunda from the sources dir
  2. Run npm i from packages/backend/workflows
  3. To deploy workflows: make deploy-camunda-definitions from SIMS/sources, or npm run deploy from the packages/backend/workflows folder

For Web (If not using make local)

  1. Run npm i from the packages/web folder
  2. To run web: npm run serve from the packages/web folder

For Backend (API + Workers + Others) (If not using make local)

  1. Run npm i from packages/backend
  2. Run npm run start:[dev][debug] [workers][api][other]

OpenShift

OpenShift is the cloud-native deployment platform that runs all our application stacks. The OpenShift CLI (oc) is required to run any OpenShift operation from a local machine or the OpenShift web console.

OpenShift Login

  • Developers need an account on the OpenShift 4 cluster managed by BC Gov.

  • Copy the temporary token from the web console and run: oc login --token=#Token --server=https://api.silver.devops.gov.bc.ca:6443

  • After login, verify all the attached namespaces: oc projects

  • Select any project: oc project #ProjectName
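
The login steps above can be sketched as one helper. oc_login_and_select is a hypothetical function name; the token and project name are supplied by the caller, and the server URL is the one given in this section.

```shell
# Sketch only: log in with a temporary token, list the attached
# namespaces, and select one. oc_login_and_select is a hypothetical helper.
oc_login_and_select() {
  local token="$1" project="$2"
  local server="https://api.silver.devops.gov.bc.ca:6443"
  oc login --token="$token" --server="$server" &&
    oc projects &&
    oc project "$project"
}
```

For example, oc_login_and_select "$OC_TOKEN" "$NAMESPACE", where both variables are placeholders for values taken from the web console.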

OpenShift Infrastructure

  • Application images are built in a single namespace (the tools namespace).

  • Images are promoted to the different environments using Deployment Config.

  • All application secrets and configs are kept in OpenShift Secrets and config maps. These values are injected into the target application through the deployment config.

  • BC Government OpenShift DevOps Security Considerations

OpenShift Template files

Under ROOT/devops/openshift/, all the OpenShift related template files are stored.

  • api-deploy.yml: API deployment config template.
  • db-migrations-job.yml: DB migrations job template.
  • docker-build.yml: Generic builder template.
  • forms-build.yml: Formio builder template.
  • forms-deploy.yml: Formio deployment config template.
  • init-secrets.yml: Secrets template for the initial environment setup on OpenShift.
  • networkpolicy.yml: Network policy template for OpenShift.
  • queue-consumers-deploy.yml: Queue Consumers deployment config template.
  • security-init.yml: Network and security policies template to enable any namespace for application dev.
  • web-deploy.yml: Web app deployment config template.
  • workers-deploy.yml: Workers deployment config template.

Database

Under ROOT/devops/openshift/database/, all the database related template files are stored.

  • mongo-ha-param.yml: Parameter file to run mongo template file mongo-ha.yml.
  • mongo-ha.yml: HA Mongo StatefulSet deployment config template.
  • redis-ha-deploy.yml: Redis StatefulSet deployment config template.
  • redis-secrets.yml: Redis secrets template.

OpenShift Setup

We have created a set of make helper commands. Use the following steps to set up any namespace.

  • Set up your env variables in the ROOT/.env file or in ROOT/devops/Makefile. A sample env file is available under ROOT/configs/env-example. The essential env variables are:

    1. NAMESPACE
    2. BUILD_NAMESPACE
    3. HOST_PREFIX (optional)
    4. BUILD_REF (optional, git branch/tag to use for building images)
    5. BUILD_ID (optional, default is 1)
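
Since NAMESPACE and BUILD_NAMESPACE are the only variables in the list above not marked optional, a quick check before running the make helpers can catch a missing value early. This is a minimal sketch; check_required_env is a hypothetical helper and is not part of the repository.

```shell
# Sketch only: fail early if an essential variable is unset or empty.
# check_required_env is a hypothetical helper name.
check_required_env() {
  local missing=0 name value
  for name in NAMESPACE BUILD_NAMESPACE; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

The function returns 0 when both variables are set and 1 otherwise, so it can guard a make call, for example check_required_env && make init-oc NAMESPACE="$NAMESPACE".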

Initial Setup

  • Login and select namespace

  • Set up the OpenShift network and security policies, the ClusterRole addition for the image puller, and the github-action rolebinding: make init-oc NAMESPACE=$namespace

ClamAV setup

  • Run the Github action ClamAV - Install/Upgrade/Remove to create the ClamAV server in OpenShift.
    • image
    • Select the Environment in which to create the server.
    • Input the action: install, upgrade, or uninstall.
    • Input the ClamAV Image Tag for the version of the ClamAV server in the workflow and click 'Run workflow'.

Crunchy setup

  • Run the Github action Crunchy DB - Install/Upgrade to create the Crunchy server in OpenShift.
    • image
    • Select the Environment in which to create the server.
    • Input the action install.
    • Click 'Run workflow'.

SFTP Setup

  • Add to the existing env variables in the ROOT/.env file or in ROOT/devops/Makefile. A sample env file is available under ROOT/configs/env-example. The essential env variables are:

    1. INIT_ZONE_B_SFTP_SERVER=
    2. INIT_ZONE_B_SFTP_SERVER_PORT=
    3. INIT_ZONE_B_SFTP_USER_NAME=
    4. INIT_ZONE_B_SFTP_PRIVATE_KEY_PASSPHRASE=
  • Add the private key for the zone B sftp server to the file ROOT/devops/openshift/zone-b-private-key.cer

  • Set up the Zone B SFTP secrets: make init-zone-b-sftp-secret NAMESPACE=$namespace

Secrets Setup

  • Add the appropriate OpenShift secrets in Github Secrets -> Environments.

  • Run the Github action Env Setup - Deploy SIMS Secrets to Openshift to create all the secrets in OpenShift.

    • image
    • Select the Environment in which to create the secrets.
    • Input the tag as Build Ref in the workflow and click 'Run workflow'.

FORMIO Setup

Build and Deploy
  • Populate the mongo-ha-param.yml with the required values for mongo db creation.

  • Create Mongo DB: make oc-deploy-ha-mongo NAMESPACE=$namespace

Note: For a fresh install, we may need to Build Forms in the tools namespace and then deploy. When deploying into a new environment where the build is already available, Deploy Forms is enough.

Build Forms
  • Run the Github action Env Setup - Build Forms Server to build formio (the Forms server) in the tools namespace of OpenShift.
    • image
    • The minimum version of the formio server to be deployed is v2.5.3. Please refer to the formio tag url for any updates needed.
    • Input the tag as Build Ref in the workflow and click 'Run workflow'.
Deploy Forms
  • Fetch the mongo-url secret from the mongodb-ha-creds created by the previous oc-deploy-ha-mongo make command and update GITHUB secrets -> Environment -> MONGODB_URI

  • Run the Github action Env Setup - Deploy Forms Server to deploy formio (the Forms server) in the tools namespace of OpenShift.

    • image
    • Select the Environment in which to build and deploy the Forms server and its related secrets, service and routes.
    • The minimum version of the formio server to be deployed is v2.5.3. Please refer to the formio tag url for any updates needed.
    • Input the tag as Build Ref in the workflow and click 'Run workflow'.
Formio Definition Deployment
  • Fetch the secrets from the {HOST-PREFIX}-forms secret created by the previous Github action and update the following under GITHUB secrets -> Environment:

    • FORMIO_ROOT_EMAIL : FORMS_SA_USER_NAME
    • FORMIO_ROOT_PASSWORD : FORMS_SA_PASSWORD
    • FORMS_URL : FORMS_URL
    • FORMS_SECRET_NAME : {HOST-PREFIX}-forms
  • Run the Github action Release - Deploy Form.io resources to deploy the forms resources to the formio server.

    • image
    • Select the Environment in which to build and deploy the Forms server and its related secrets, service and routes.
    • Input the tag as Build Ref in the workflow and click 'Run workflow'.

Redis Setup through Make

  • Set up Redis secrets:

    • make init-redis NAMESPACE=$namespace
  • Deploy Redis with 6 replicas:

    • make deploy-redis NAMESPACE=$namespace
  • Initialize the Redis Cluster

    • Make sure that all the redis pods are up and running before initializing the cluster:
      • make init-redis-cluster NAMESPACE=$namespace REDIS_PORT=$redis_port
    • When prompted, type 'yes'
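
The Redis steps above can be sketched as a single helper that waits for the pods before initializing the cluster. Two things here are assumptions: the StatefulSet is taken to be named redis (the pod name redis-0 used later on this page suggests this), and setup_redis is a hypothetical wrapper that is not part of the repository. The make targets are the ones listed above, and the script should be run from the directory holding the Makefile that defines them.

```shell
# Sketch only: run the three make targets in order, blocking until the
# Redis StatefulSet rollout is complete before forming the cluster.
# Assumption: the StatefulSet is named "redis".
setup_redis() {
  local namespace="$1" redis_port="$2"
  make init-redis NAMESPACE="$namespace" &&
    make deploy-redis NAMESPACE="$namespace" &&
    oc rollout status statefulset/redis -n "$namespace" &&
    make init-redis-cluster NAMESPACE="$namespace" REDIS_PORT="$redis_port"
}
```

The final step still prompts for confirmation, so the helper is meant for interactive use.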

Redis Setup through Github Actions

image

image

Deploy API, Web, Workers, Queue-consumers

  • Run the Github action Release - Deploy to deploy the API, Web, Workers and Queue-consumers in the namespace.
    • image
    • Input the tag as Git Ref in the workflow.
    • Select the Environment in which to deploy the API, Web, Workers and Queue-consumers.
    • Click 'Run workflow'.

Crunchy Backup and Restore

  • Note: Crunchy runs automatic backup jobs, so the database can be restored to a particular timestamp. A make file has been created for ease of execution.
  • Run the Github action Crunchy DB - Install/Upgrade to upgrade the crunchy helm chart in the OpenShift server.
    • image
    • Select the Environment in which to restore crunchy.
    • Input the action upgrade.
    • Check the checkbox Enable restore.
    • Input the timestamp the DB has to be restored to, in 'YYYY-MM-DD HH:MM:SS' format.
    • Click 'Run workflow'.
  • To run the helm restore, the postgres operator requires a security command to be run from your local machine.
    • Connect to the OpenShift server locally using the oc commands and the oc token.
    • Get proper approvals before the Restore command is executed.
    • Run the command kubectl annotate -n <namespace> postgrescluster simsdb --overwrite postgres-operator.crunchydata.com/pgbackrest-restore="$(date)" . Replace <namespace> with the appropriate environment namespace to start the restore.
  • Disable Restore
    • Update crunchy in OpenShift with the restore disabled by running the helm upgrade Github action Crunchy DB - Install/Upgrade again, as shown below.
    • image
    • This disables the restore in the helm chart, so even if the helm restore command from the previous steps is run again locally, the restore will not happen.
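
Because the restore command must only be run after approvals, it can help to wrap it behind an explicit guard. The sketch below reuses the kubectl annotate command from the steps above unchanged; trigger_crunchy_restore and the RESTORE_APPROVED variable are hypothetical names introduced for illustration.

```shell
# Sketch only: refuse to annotate unless the caller explicitly confirms
# that approvals are in place. RESTORE_APPROVED is a hypothetical variable.
trigger_crunchy_restore() {
  local namespace="$1"
  if [ "${RESTORE_APPROVED:-no}" != "yes" ]; then
    echo "restore not approved; set RESTORE_APPROVED=yes once approvals are in place" >&2
    return 1
  fi
  # Same annotate command as documented above, with the namespace filled in.
  kubectl annotate -n "$namespace" postgrescluster simsdb --overwrite \
    postgres-operator.crunchydata.com/pgbackrest-restore="$(date)"
}
```

For example, RESTORE_APPROVED=yes trigger_crunchy_restore "$NAMESPACE" starts the restore, while a call without the variable exits with status 1 and changes nothing.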

Crunchy read-only-user setup

Steps to Perform in Master Node of Postgres

  1. After the Helm upgrade, wait around 20 minutes for the helm chart to deploy completely.
  2. Run connect-[ENV]-db-superuser for each environment from ~/sources/makefile, e.g. make connect-dev-db-superuser MASTER_POD=Pod_id
  3. For the superuser credentials, look in the OpenShift secrets. Secret name: simsdb-pguser-postgres. Secret key names: user and password.
  4. Once connected to Database as superuser, run the following commands.
GRANT USAGE ON SCHEMA information_schema TO "read-only-user";
GRANT USAGE ON SCHEMA sims TO "read-only-user";
GRANT SELECT ON ALL TABLES IN SCHEMA sims TO "read-only-user";
ALTER DEFAULT PRIVILEGES IN SCHEMA sims GRANT SELECT ON TABLES TO "read-only-user";

Troubleshooting and Additional Context

Some Additional Commands

  • Delete the resources associated with the Mongo database (PVCs are not deleted): oc-db-backup-delete-mongodb

Redis cluster failing to recover

Partial cluster failure recovery

Follow the steps below when the redis cluster is restarted, or its pods are brought down and started again, and the cluster does not recover gracefully in the OpenShift environment, or when a Redis node does not connect to another node and join the cluster.

  • STEP 1: Bring down the slave pods: scale the redis pods from 6 to 3 in OpenShift.

  • STEP 2: Run the Github action Env Setup - Redis recovery in Openshift, then check that the masters have completed cluster meet successfully.

  • STEP 3: Scale the redis pods back to 6 replicas. The queue-consumers should then recover automatically.
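
The scaling in steps 1 and 3 can also be done from the CLI instead of the web console. This is a minimal sketch, assuming the Redis StatefulSet is named redis; scale_redis is a hypothetical helper name.

```shell
# Sketch only: scale the Redis StatefulSet to the requested replica count.
# Assumption: the StatefulSet is named "redis".
scale_redis() {
  local namespace="$1" replicas="$2"
  oc scale statefulset/redis --replicas="$replicas" -n "$namespace"
}
```

A recovery run would call scale_redis "$NAMESPACE" 3, run the Github action from step 2, and then call scale_redis "$NAMESPACE" 6.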

Master and slave Connect Failure

Use this procedure when the Redis cluster is not able to connect the slaves or the masters, as shown below. It applies to most scenarios where redis is failing due to slave-master or master-master connection issues.

  • To check the state below, go to one of the redis pods, 'redis-0' in this case, and open 'Terminal'.

  • Run the command redis-cli to enter the redis command line terminal, where you can run cluster commands.

  • Run cluster nodes to see the IPs and the slave status.

  • image

  • Check the IP of one of the master redis pods by going into the details of that pod, as below. This is typically redis-1 or redis-2. Note: sometimes those might not be the masters; check the logs to verify that at least 3 masters are up before proceeding.

  • image

  • image

  • Clearly, the IPs in the cluster nodes output and the IPs of the pods are not the same. This means the redis cluster is trying to connect to a master which has a different IP.

  • Steps to recover the master pods

  • Bring down the slave pods: scale the redis pods from 6 to 3 in OpenShift. (Note: sometimes the remaining pods might not be the masters; check the logs to verify that at least 3 masters are up and, if needed, scale the pods down to 4 instead.)
  • image
  • Run the Github action Env Setup - Redis recovery in Openshift, then check that the masters have completed cluster meet successfully.
  • Scale the redis pods back to 6 replicas. This should update the nodes.conf file in the redis cluster, and when you run the cluster nodes command again in redis-0, you should see that the masters are connected but the slaves are failing.
  • image
  • image
  • This is because the masters are up and connected in the cluster, but there are no slaves to serve their slots. This can be verified by checking one of the master redis pods as below.
  • image
  • Also, the logs of the slave redis pods show that they are trying to connect to the master but are not successful, as below.
  • image
  • image
  • Steps to recover the slave pods
  • Delete the slave pods manually as below.
  • image
  • This should automatically update the nodes.conf so that the new slaves connect to the cluster, as below. For some slave pods, the address is updated.
  • image
  • image
  • Now, when we run the same cluster nodes command in one of the masters, all the slaves and masters are connected successfully and the IPs match those of the pods.
  • image
  • image
  • image
  • image
  • Now the queue-consumers pods will recover automatically.
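
The IP comparison described in the steps above can be sketched with a small filter over the cluster nodes output. In that output, the second field has the form ip:port@cport and the third field holds the node flags (for example master, slave, fail). cluster_node_ips is a hypothetical helper name.

```shell
# Sketch only: read `cluster nodes` output on stdin and print
# "<ip> <flags>" for every node, one per line.
cluster_node_ips() {
  awk '{ split($2, addr, ":"); print addr[1], $3 }'
}
```

For example, oc exec redis-0 -n "$NAMESPACE" -- redis-cli cluster nodes | cluster_node_ips lists the IPs the cluster believes its nodes have, and oc get pods -o wide -n "$NAMESPACE" shows the actual pod IPs to compare against. The pod name redis-0 is taken from the steps above; the namespace variable is a placeholder.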

image

Total cluster failure recovery

Use this only when nothing above works. Important: this is the last resort. We will not have any backups of the db, as it also clears the PVC.

When the redis nodes are unable to connect to the cluster, we have to delete and redeploy the redis instances and create the cluster again to get redis up and running normally.

Note: This is just a temporary solution as we do not have a permanent solution in place to recover the redis cluster. This process will result in deleting all the redis data as we have to delete the stateful set.

Follow the given set of instructions to deploy the redis instance and create cluster.

  • STEP 1: Run the following command to delete the stateful set and other redis dependencies: make delete-redis NAMESPACE=$namespace

    Or go to Env Setup - Delete Redis in Openshift, select a branch or tag, select the environment and click on "Run Workflow" like the image below: image

  • STEP 2: Follow the instructions from the section above Redis Setup through Make or Redis Setup through Github Actions to setup redis.

  • STEP 3: Restart queue-consumers and api.

  • STEP 4: If this activity is performed in the DEV environment (!!ONLY DEV ENVIRONMENT!!), please pause the schedulers used for file integrations.

Pod disruption budget and Pod replicas

  • All pods are created with 2 replicas and scale up to a maximum of 10 replicas when load increases.
  • A pod disruption budget is set for all the deployment config pods (except the db backup container) with a maxUnavailable value of 1.
  • For the databases, mongo has a maxUnavailable value of 1, but redis has a maxUnavailable value of 2.
  • As per this configuration, when nodes are drained or maintenance is happening, only one node will be drained at a time and the application will stay live.
  • Sample PDB for our API is shown below:
image
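
For readers without access to the screenshot, a budget with the same maxUnavailable value can be rendered from the CLI. This is an illustrative sketch, not the project's actual PDB: print_sample_pdb is a hypothetical helper, and the name and label selector passed to it are placeholders. The --dry-run=client -o yaml flags print the manifest without creating anything on the cluster.

```shell
# Sketch only: print a PodDisruptionBudget manifest with maxUnavailable=1
# for the given name and label selector, without applying it.
print_sample_pdb() {
  local name="$1" selector="$2"
  kubectl create poddisruptionbudget "$name" \
    --selector="$selector" \
    --max-unavailable=1 \
    --dry-run=client -o yaml
}
```

For example, print_sample_pdb api-pdb app=api, where both arguments are hypothetical values.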