Add manual Graceful recovery results for 1.3.0 (#2102)
ciarams87 authored Jun 6, 2024
1 parent 12992e8 commit a5bfdbd
Showing 2 changed files with 74 additions and 91 deletions.
109 changes: 18 additions & 91 deletions tests/graceful-recovery/graceful-recovery.md
@@ -9,8 +9,6 @@ This document describes how we test graceful recovery from restarts on NGF.
- [Steps](#steps)
- [Setup](#setup)
- [Run the tests](#run-the-tests)
- [Restart nginx-gateway container](#restart-nginx-gateway-container)
- [Restart NGINX container](#restart-nginx-container)
- [Restart Node with draining](#restart-node-with-draining)
- [Restart Node without draining](#restart-node-without-draining)
<!-- TOC -->
@@ -21,139 +19,68 @@ Ensure that NGF can recover gracefully from container failures without any user

## Test Environment

- A Kubernetes cluster with 3 nodes on GKE
- Node: e2-medium (2 vCPU, 4GB memory)
- A Kind cluster

## Steps

### Setup

1. Setup GKE Cluster.
2. Clone the repo and change into the nginx-gateway-fabric directory.
3. Check out the latest tag (unless you are installing the edge version from the main branch).
4. Go into `deploy/manifests/nginx-gateway.yaml` and change the following:
1. Deploy a one-Node Kind cluster. You can run `make create-kind-cluster` from the main directory.

2. Go into `deploy/manifests/nginx-gateway.yaml` and change the following:

- `runAsNonRoot` from `true` to `false`: this allows us to insert our ephemeral container as root which enables us to restart the nginx-gateway container.
- Add the `--product-telemetry-disable` argument to the nginx-gateway container args.

5. Follow the [installation instructions](https://github.com/nginxinc/nginx-gateway-fabric/blob/main/site/content/installation/installing-ngf/manifests.md)
to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalancer Service.
6. In a separate terminal track NGF logs.
3. Follow [this guide](https://docs.nginx.com/nginx-gateway-fabric/installation/running-on-kind/) to deploy NGINX Gateway Fabric using manifests and expose it through a NodePort Service.

4. In a separate terminal, track the NGF logs.

```console
kubectl -n nginx-gateway logs -f deploy/nginx-gateway -c nginx-gateway
```

7. In a separate terminal track NGINX container logs.
5. In a separate terminal, track the NGINX container logs.

```console
kubectl -n nginx-gateway logs -f deploy/nginx-gateway -c nginx
```

8. In a separate terminal Exec into the NGINX container inside the NGF pod.
6. In a separate terminal, exec into the NGINX container inside the NGF pod.

```console
kubectl exec -it -n nginx-gateway <NGF_POD> --container nginx -- sh
kubectl exec -it -n nginx-gateway $(kubectl get pods -n nginx-gateway | sed -n '2s/^\([^[:space:]]*\).*$/\1/p') --container nginx -- sh
```

9. In a different terminal, deploy the
7. In a different terminal, deploy the
[https-termination example](https://github.com/nginxinc/nginx-gateway-fabric/tree/main/examples/https-termination).
10. Send traffic through the example application and ensure it is working correctly.
8. Send traffic through the example application and ensure it is working correctly.
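
The pod-name lookup in the exec step above can be wrapped in a small function so it is easier to read and reuse. This is a minimal sketch, not part of the documented procedure: `first_pod_name` is a hypothetical helper, and it assumes the NGF pod is the first pod listed in the `nginx-gateway` namespace, the same assumption the `sed` expression makes.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# the name in the first data row (the line right after the header row).
first_pod_name() {
  awk 'NR == 2 { print $1 }'
}

# Usage against a live cluster (assumes the NGF pod is listed first):
#   NGF_POD=$(kubectl get pods -n nginx-gateway | first_pod_name)
#   kubectl exec -it -n nginx-gateway "$NGF_POD" --container nginx -- sh
```

Storing the name in a variable also lets later commands reuse it without repeating the lookup.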

### Run the tests

#### Restart nginx-gateway container

1. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly.
2. Insert ephemeral container in NGF Pod.

```console
kubectl debug -it -n nginx-gateway <NGF_POD> --image=busybox:1.28 --target=nginx-gateway
```

3. Kill nginx-gateway process through a SIGKILL signal (Process command should start with `/usr/bin/gateway`).

```console
kill -9 <nginx-gateway_PID>
```

4. Check for errors in the NGF and NGINX container logs.
5. When the nginx-gateway container is back up, ensure traffic flows through the example application correctly.
6. Open up the NGF and NGINX container logs and check for errors.
7. Send traffic through the example application and ensure it is working correctly.
8. Check that NGF can still process changes of resources.
1. Delete the HTTPRoute resources.

```console
kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
```

2. Send traffic through the example application using the updated resources and ensure traffic does not flow.
3. Apply the HTTPRoute resources.

```console
kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
```

4. Send traffic through the example application using the updated resources and ensure traffic flows correctly.

#### Restart NGINX container

1. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly.
2. If the terminal inside the NGINX container is no longer running, Exec back into the NGINX container.
3. Inside the NGINX container, kill the nginx-master process through a SIGKILL signal
(Process command should start with `nginx: master process`).

```console
kill -9 <nginx-master_PID>
```

4. When NGINX container is back up, ensure traffic flows through the example application correctly.
5. Open up the NGINX container logs and check for errors.
6. Check that NGF can still process changes of resources.
1. Delete the HTTPRoute resources.

```console
kubectl delete -f ../../examples/https-termination/cafe-routes.yaml
```

2. Send traffic through the example application using the updated resources and ensure traffic does not flow.
3. Apply the HTTPRoute resources.

```console
kubectl apply -f ../../examples/https-termination/cafe-routes.yaml
```

4. Send traffic through the example application using the updated resources and ensure traffic flows correctly.

#### Restart Node with draining

1. Switch over to a one-Node Kind cluster. Can run `make create-kind-cluster` from main directory.
2. Run steps 4-11 of the [Setup](#setup) section above using
[this guide](https://docs.nginx.com/nginx-gateway-fabric/installation/running-on-kind/) for running on Kind.
3. Ensure NGF and NGINX container logs are set up and traffic flows through the example application correctly.
4. Drain the Node of its resources.
1. Drain the Node of its resources.

```console
kubectl drain kind-control-plane --ignore-daemonsets --delete-emptydir-data
```

5. Delete the Node.
2. Delete the Node.

```console
kubectl delete node kind-control-plane
```

6. Restart the Docker container.
3. Restart the Docker container.

```console
docker restart kind-control-plane
```

7. Check the logs of the old and new NGF and NGINX containers for errors.
8. Send traffic through the example application and ensure it is working correctly.
9. Check that NGF can still process changes of resources.
4. Check the logs of the old and new NGF and NGINX containers for errors.
5. Send traffic through the example application and ensure it is working correctly.
6. Check that NGF can still process changes of resources.
1. Delete the HTTPRoute resources.

```console
@@ -171,4 +98,4 @@ to deploy NGINX Gateway Fabric using manifests and expose it through a LoadBalan

#### Restart Node without draining

1. Repeat the above test but remove steps 4-5 which include draining and deleting the Node.
1. Repeat the above test but skip steps 1-2, which drain and delete the Node.
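
For the log checks in the tests above, a small filter can make errors easier to spot. This is a minimal sketch, not part of the documented procedure: `find_error_lines` is a hypothetical helper, and its patterns assume that NGF writes JSON logs with a `"level":"error"` field and that NGINX tags severity as `[error]`, `[crit]`, `[alert]`, or `[emerg]`. Adjust the patterns if the actual log format differs.

```shell
# Hypothetical helper: read log lines on stdin and print only those that
# look like errors. The patterns are assumptions about the log formats:
#   - NGF:   JSON lines containing "level":"error"
#   - NGINX: bracketed severity such as [error] or [emerg]
find_error_lines() {
  # `|| true` keeps the exit status at 0 when no line matches.
  grep -E '"level":"error"|\[(error|crit|alert|emerg)\]' || true
}

# Usage against a live cluster:
#   kubectl -n nginx-gateway logs deploy/nginx-gateway -c nginx-gateway | find_error_lines
#   kubectl -n nginx-gateway logs deploy/nginx-gateway -c nginx | find_error_lines
```

To inspect the old containers after a restart, the same filter can be applied to `kubectl logs --previous`, which asks for the logs of the prior container instance if one exists.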
56 changes: 56 additions & 0 deletions tests/graceful-recovery/results/1.3.0/1.3.0.md
@@ -0,0 +1,56 @@
# Results for v1.3.0

<!-- TOC -->
- [Results for v1.3.0](#results-for-v130)
- [Summary](#summary)
- [Versions](#versions)
- [Tests](#tests)
- [Restart Node with draining](#restart-node-with-draining)
- [Restart Node without draining](#restart-node-without-draining)
- [Future Improvements](#future-improvements)
<!-- TOC -->


## Summary

- No new issues since 1.1.
- Known issue https://github.com/nginxinc/nginx-gateway-fabric/issues/1108 still exists.

## Versions

NGF version:


```text
"version":"edge"
"commit":"c5f8dbe112ca1be261f73b9f5b4925cda3d5860a"
"date":"2024-06-06T04:07:01Z"
```

with NGINX:

```text
nginx/1.27.0
built by gcc 13.2.1 20231014 (Alpine 13.2.1_git20231014)
OS: Linux 6.6.26-linuxkit
```

Kubernetes:

```text
v1.30.0
```

## Tests

### Restart Node with draining

No errors.

### Restart Node without draining

Same issue as 1.1 where NGF is unable to recover: https://github.com/nginxinc/nginx-gateway-fabric/issues/1108

## Future Improvements

- None
