Upgrade Redis version and increase Memory #4253

Closed
5 tasks done
ninosamson opened this issue Jan 16, 2025 · 2 comments

@ninosamson
Collaborator

ninosamson commented Jan 16, 2025

The current Redis Cluster has been causing intermittent issues that require manual intervention. Work is needed to ensure the cluster is correctly configured, can tolerate failed nodes, and can recover without intervention. The resources allocated to the cluster may also be too low and must be adjusted.

  • Upgrade Redis version. Review whether an upgrade to the latest version can be applied. Otherwise, we can upgrade to the most recent and backward-compatible version and plan a major upgrade later.
  • Check the current resource configuration for Redis Cluster execution. This document can be a starting point.
  • Review the current cluster structure (3 masters, 3 slaves) to revalidate whether it is still the best approach. It seems that losing one node makes the entire cluster unstable and requires manual intervention.
  • Try to find references in other BC Gov projects and the BC Gov community.
  • If Change Redis Persistence to RDB #2459 isn't completed it can be done as part of this effort, time allowing.
@ninosamson ninosamson added the Business Items under Business Consideration label Jan 16, 2025
@CarlyCotton
Collaborator

Likely a timeboxed activity as analysis is required. Should be done sooner rather than later, but there may not be time for the 2.4 release.

@CarlyCotton CarlyCotton added this to the Full-Time "Asset" milestone Jan 16, 2025
@CarlyCotton CarlyCotton added Dev & Architecture Development and Architecture and removed Business Items under Business Consideration labels Jan 16, 2025
@andrewsignori-aot andrewsignori-aot removed the Dev & Architecture Development and Architecture label Jan 16, 2025
@ninosamson
Collaborator Author

Tagging to 2.5 as we pulled this into sprint late to get a head start.

github-merge-queue bot pushed a commit that referenced this issue Jan 29, 2025
Team, the files below are copied from the bitnami charts
(https://github.com/bitnami/charts/tree/main/bitnami/redis-cluster),
so please skip reviewing them.

![image](https://github.com/user-attachments/assets/20eba934-660b-45ab-bcd4-feb629cb72cf)
They can also be found in these folders:
> base files in devops/helm/redis-cluster
> devops/helm/redis-cluster/templates

A Makefile is created for easy installation of the helm charts.

The values in values.yaml are overridden by the values from
values-{NAMESPACE}.yaml. No recommended values are given in the linked
https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
document, but as per the examples, the redis cluster starts at the minimum
configuration, Nano, whose values are defined in
https://github.com/bitnami/charts/blob/main/bitnami/common/templates/_resources.tpl#L15.

I will create the next PR with the actual values for the namespaces in
values-{NAMESPACE}.yaml.
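
As a rough illustration of the override behaviour described above, the sketch below mimics how Helm layers a later values file over an earlier one: nested maps are merged key by key, while scalars and lists are replaced. The keys and values are hypothetical and are not taken from the actual values.yaml or values-{NAMESPACE}.yaml, and the function is a simplification (for example, it does not model removing a key with a null override).

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Return a copy of base with override layered on top of it."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested maps are merged recursively, key by key.
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale, not combined.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents standing in for values.yaml.
base_values = {
    "persistence": {"enabled": True, "size": "8Gi"},
    "resourcesPreset": "nano",
}
# Hypothetical contents standing in for values-{NAMESPACE}.yaml.
namespace_values = {"persistence": {"size": "1Gi"}}

effective = merge_values(base_values, namespace_values)
print(effective)
```

Only the keys present in the namespace file change; everything else keeps the default from the base file.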

An initial commit of the file is created to enable the GitHub Action.

NOTE: Additional file changes
sources/camunda-docker-compose-core.yml - missing network type
sources/packages/forms/Dockerfile.dev - changed to enable development
environment login
github-merge-queue bot pushed a commit that referenced this issue Feb 5, 2025
As part of the removal of the existing redis files, these files should be
removed in a new pull request once the new redis-cluster is deployed
successfully.

<img width="320" alt="image"
src="https://github.com/user-attachments/assets/db64333f-82c7-41f4-a189-f7a27809584b"
/>

<img width="312" alt="image"
src="https://github.com/user-attachments/assets/c270cfc0-c1f5-4a45-b8a8-90489db9e689"
/>

- PVC is updated to a 1GB size, in sync with the existing PROD size.
- A service account is required when creating the cluster, as it gives the
Redis pods the necessary RBAC (Role-Based Access Control) permissions to
interact with the other objects created during installation, such as the
secrets, configmaps and PVCs.
- The existing Makefile commands for the old redis in the devops folder are
removed, as we use a helm installation for the new redis, which is available
in the devops/helm/redis-cluster folder.
- Redis creds will have a generated 32-character alphanumeric password, like
the previously generated ones.
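
For reference, a 32-character alphanumeric password like the one mentioned in the last bullet could be generated as in the minimal sketch below. This is not the mechanism used by the chart or the Makefile, which is not shown here; the function name is hypothetical.

```python
import secrets
import string

# Letters and digits only, matching the "alphanumeric" requirement.
ALPHABET = string.ascii_letters + string.digits


def generate_password(length: int = 32) -> str:
    """Build a random alphanumeric password from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


password = generate_password()
print(len(password))
```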

**AOF vs RDB**
As part of the analysis, finding the right persistence mechanism for our
project was crucial. Based on the official documentation at
https://redis.io/docs/latest/operate/oss_and_stack/management/persistence/,
here are some of the answers.

**Common functionalities of AOF and RDB and how they are used during
disaster recovery**
- We have enabled a PVC for our DB, so both the AOF and RDB files are saved
to it.
- Even if we uninstall the helm chart, the PVCs stay. When the chart is
installed again, with a different version or after disaster recovery, the
existing PVC is reattached automatically by the current helm configuration
and there is no loss of data.

**AOF**
Is kind of a write operation to the disk in a file appending everytime,
usually it will have serious of files which does base file, incremental
update file and manifest file. This can be found by running the below
command and answers as below in the redis-cli in any of the
redis-cluster pods.
```
$ cat /opt/bitnami/redis/etc/redis.conf | grep appendonly
appendonly yes
# For example, if appendfilename is set to appendonly.aof, the following file
# - appendonly.aof.1.base.rdb as a base file.
# - appendonly.aof.1.incr.aof, appendonly.aof.2.incr.aof as incremental files.
# - appendonly.aof.manifest as a manifest file.
appendfilename "appendonly.aof"
appenddirname "appendonlydir"
```
These files are present in the /bitnami/redis/data folder:
appendonly.aof.1.base.rdb is the base file, appendonly.aof.1.incr.aof and
appendonly.aof.2.incr.aof are the incremental files, and
appendonly.aof.manifest is the manifest file, which holds the metadata of
the AOF files.
A second incremental file such as appendonly.aof.2.incr.aof appears when
the AOF is rewritten and the base file is replaced: the child process
creates a new base AOF file while the parent logs updates in a new
incremental AOF file; once rewriting completes, Redis atomically updates
the manifest and cleans up old files to ensure a consistent dataset. This
is a feature of Redis 7+, and as we are using 7.4.2-debian-12-r0, it is
available.
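
To make the role of the manifest concrete, here is a small sketch that reads manifest text and lists the base and incremental files it tracks. It assumes the Redis 7 manifest layout of one `file <name> seq <n> type <b|i|h>` entry per line; the sample content is illustrative and was not copied from the running pods.

```python
from dataclasses import dataclass

# Single-letter type codes assumed from the Redis 7 manifest format.
KINDS = {"b": "base", "i": "incremental", "h": "history"}


@dataclass(frozen=True)
class AofEntry:
    filename: str
    seq: int
    kind: str


def parse_manifest(text: str) -> list:
    """Parse manifest lines made of whitespace-separated key/value pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        tokens = line.split()
        fields = dict(zip(tokens[0::2], tokens[1::2]))
        kind = KINDS.get(fields["type"], fields["type"])
        entries.append(AofEntry(fields["file"], int(fields["seq"]), kind))
    return entries


# Illustrative manifest content using the file names discussed above.
sample = """\
file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i
file appendonly.aof.2.incr.aof seq 2 type i
"""

entries = parse_manifest(sample)
for entry in entries:
    print(entry.kind, entry.filename)
```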

PROS and CONS:
The main downside of AOF is that, because the file size is very large due
to the incremental updates, it takes more time to recover. In return, the
loss of data in case of disaster is at most one second, which is achieved
using the configuration below.
<img width="248" alt="image"
src="https://github.com/user-attachments/assets/b22b2ada-f175-46d9-a656-1b68f9619272"
/>

**RDB**
RDB is a file that holds a snapshot of the current dataset, more like a
backup strategy that runs at the configured intervals. It is a single
file, and its name can be found by running the command below from a shell
in the redis-cluster pods.
```
$ cat /opt/bitnami/redis/etc/redis.conf | grep dbfilename
# and 'dbfilename') and that aren't usually modified during runtime
dbfilename dump.rdb
# above using the 'dbfilename' configuration directive.
```
The file is present in /bitnami/redis/data, and dump.rdb contains the
snapshot of the dataset. Snapshots are configured through the save
configuration, as below.
<img width="549" alt="image"
src="https://github.com/user-attachments/assets/13aa1f8a-1e09-4964-8349-50d059b84b46"
/>
| **Time Interval (seconds)** | **Minimum Number of Changes** |
|----------------------------|------------------------------|
| 900 seconds (15 minutes)   | 1 change                     |
| 300 seconds (5 minutes)    | 10 changes                   |
| 60 seconds (1 minute)      | 10,000 changes               |

PROS and CONS:
RDB can recover the data quickly, as it does not have to run through
multiple files and the file size is relatively small compared to AOF. The
downside is the interval at which changes are saved: with the current
configuration, a snapshot is taken after 5 minutes if there are at least
10 changes, and after 15 minutes if there is at least one change. So if
there are only 9 data changes, saving them to disk takes 15 minutes, and
if there is a disaster during this time, those 9 data changes are lost.
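
The save-point arithmetic above can be sketched as a small function. It is a simplified model of the table (it only answers how long after the last snapshot a given number of changes would be saved) and not Redis's actual implementation; the names are hypothetical.

```python
# (interval in seconds, minimum number of changes), as in the table above.
SAVE_POINTS = [(900, 1), (300, 10), (60, 10_000)]


def seconds_until_snapshot(changes: int, save_points=SAVE_POINTS):
    """Return the shortest interval whose change threshold is met, or None."""
    eligible = [
        seconds for seconds, min_changes in save_points if changes >= min_changes
    ]
    return min(eligible) if eligible else None


for changes in (0, 9, 10, 10_000):
    print(changes, seconds_until_snapshot(changes))
```

With 9 changes only the 900-second rule applies, which is the 15-minute window of potential data loss described above.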

**Conclusion**
Enabling both RDB and AOF at the same time gives the best of both worlds
and solves the recovery strategy. Also, after the implementation of the
helm installation for Redis, upgrades and full disaster recovery can be
done via the GitHub Actions.

**Installation and upgrade of redis**
Installation and upgrade of redis-cluster are handled by the GHA
`Redis Cluster - Install/Upgrade`.

![image](https://github.com/user-attachments/assets/cc525402-70b2-437f-8531-2e3820415b30)

**Issues in the Redis Cluster**
The BC Gov troubleshooting guide is available at the link below:

**https://github.com/bcgov/common-service-showcase/wiki/Redis-Troubleshooting**
Also, if the cluster fails completely, we can uninstall redis using the
`helm delete redis-cluster . -n {NAMESPACE}` command, run from the
`/devops/helm/redis-cluster` folder. This removes the cluster but does not
delete the PVCs. So when the redis-cluster is installed again using the
GHA from the previous step, it can be recovered with minimal or no data
loss.

**Migration from Old Redis**
- Scale the old redis pods in the statefulset down to 0.

![image](https://github.com/user-attachments/assets/636c9f28-76a3-4b62-a088-c37664d75357)
- Install redis-cluster using the GHA `Redis Cluster - Install/Upgrade`.
- Deploy the release tag. This ensures all the applications have the
updated redis host and password from the new redis; once the deployment is
successful, the API, queue-consumers and workers connections should work
seamlessly.
- Backup and recovery of the redis keys from the old to the new redis is
currently not requested, but it can be done by port-forwarding the existing
redis locally, backing it up and restoring into the new redis-cluster.

**Rollback Procedures**
- During rollback, the newly created redis-cluster statefulset pods
should be scaled down to 0
![image](https://github.com/user-attachments/assets/0d68f9ec-fefb-446d-a03c-c9c2c632237a)
- Scale the old redis from 0 back to 6
![image](https://github.com/user-attachments/assets/404ba468-0197-48a6-9433-a46e3b505b94)
- Continue the rollback steps in the release notes.
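
The two scaling steps above are shown as console screenshots; a command-line equivalent could look like the sketch below, which only builds and prints the `kubectl scale` commands. The statefulset names (`redis-cluster`, `redis`) and the namespace are assumptions for illustration and should be replaced with the real ones.

```python
import shlex


def scale_command(statefulset: str, replicas: int, namespace: str) -> list:
    """Build the kubectl argv that scales a statefulset to the given replica count."""
    return [
        "kubectl", "scale", "statefulset", statefulset,
        f"--replicas={replicas}", "-n", namespace,
    ]


def rollback_commands(namespace: str,
                      new_name: str = "redis-cluster",  # assumed statefulset name
                      old_name: str = "redis") -> list:  # assumed statefulset name
    """Scale the new cluster down to 0, then bring the old redis back to 6 pods."""
    return [
        scale_command(new_name, 0, namespace),
        scale_command(old_name, 6, namespace),
    ]


commands = rollback_commands("example-namespace")  # hypothetical namespace
for argv in commands:
    print(shlex.join(argv))
```

Printing the commands rather than running them keeps the sketch safe to try; they can be reviewed first and then executed manually.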

**Note:**
Once the deployment is complete and the redis-cluster is in place, the
wiki will be updated.
github-merge-queue bot pushed a commit that referenced this issue Feb 6, 2025
Enable metrics to check alerts in Sysdig.
Currently only the dev namespaces are enabled, to test the metrics.

<img width="851" alt="image"
src="https://github.com/user-attachments/assets/7cbbeb77-bf2c-4f3a-a6de-8deb2c0120b5"
/>

Ignore the files below, as they are copied from bitnami to enable
metrics.
<img width="319" alt="image"
src="https://github.com/user-attachments/assets/46dbd7d0-19cd-45a5-a56c-870e001a5459"
/>

4 participants