Automate remaining graceful recovery tests #2140

bjee19 · 2024-06-13T21:03:16Z

Automate the remaining graceful recovery tests, which involve restarting the Node that NGF is running on.

Problem: Need to automate the remaining graceful recovery tests.

Solution: Automate the remaining graceful recovery tests.

Testing: Test works correctly locally and on the pipeline.

Closes #1901

Checklist

Before creating a PR, run through this checklist and mark each as complete.

I have read the CONTRIBUTING doc
I have added tests that prove my fix is effective or that my feature works
I have checked that all unit tests pass after adding my changes
I have updated necessary documentation
I have rebased my branch onto main
I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

Release notes

If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.

codecov · 2024-06-17T18:34:51Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.00%. Comparing base (7654cb6) to head (c22fe8c).
Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2140      +/-   ##
==========================================
+ Coverage   87.61%   95.00%   +7.38%     
==========================================
  Files          96        1      -95     
  Lines        6695      220    -6475     
  Branches       50       50              
==========================================
- Hits         5866      209    -5657     
+ Misses        773       11     -762     
+ Partials       56        0      -56

tests/suite/graceful_recovery_test.go

bjee19 · 2024-06-17T20:55:44Z

This test drains and deletes the node and then restarts the kind container, so if it fails or errors mid-process, it will cause the following tests to fail. If this is not okay, I see two options:

Make sure this test is run last in the functional test pipeline
Run this test (and maybe just all the graceful recovery tests) in a different pipeline.

Does anyone have any thoughts on this, and if this is even something to worry about?

sjberman · 2024-06-17T21:10:29Z

If this test fails/errors mid-process it will cause following tests to fail
@bjee19 Is this an intermittent issue that you see? How do the following tests fail? If there is some condition we could check, maybe we can Skip() any other test in that case?
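
A minimal sketch of the kind of condition check suggested here, assuming the relevant signal is whether the cluster still has a Ready node. `nodeStatus` and `skipReason` are hypothetical names for illustration, not part of the NGF test suite; in a Ginkgo suite, a non-empty reason could be passed to `Skip()` from a `BeforeEach`.

```go
package main

import "fmt"

// nodeStatus is a hypothetical, simplified view of a cluster node.
type nodeStatus struct {
	Name  string
	Ready bool
}

// skipReason returns a non-empty reason when the cluster is not in a state
// where later tests can run: either no node exists or a node is not Ready.
func skipReason(nodes []nodeStatus) string {
	if len(nodes) == 0 {
		return "cluster has no nodes"
	}
	for _, n := range nodes {
		if !n.Ready {
			return fmt.Sprintf("node %q is not Ready", n.Name)
		}
	}
	return ""
}

func main() {
	// Simulates the state after a failed recovery: the node was deleted
	// and never came back.
	if reason := skipReason(nil); reason != "" {
		fmt.Println("skipping:", reason)
	}
}
```

This only helps if the node list itself can still be fetched; if the API server is unreachable, the check would have to treat that error as a reason to skip as well.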

bjee19 · 2024-06-17T22:01:47Z

@sjberman

Here are some sample errors I got from the following tests when I drain the node and delete it, but then fail to restart the Docker container.

[BeforeEach] /home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/tests/suite/sample_test.go:28
  [It] /home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/tests/suite/sample_test.go:45

  [FAILED] Expected success, but got an error:
      <*fmt.wrapError | 0xc000103560>: 
      client rate limiter Wait returned an error: context deadline exceeded
      {
          msg: "client rate limiter Wait returned an error: context deadline exceeded",
          err: <context.deadlineExceededError>{},
      }
  In [BeforeEach] at: /home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/tests/suite/sample_test.go:37 @ 06/17/24 21:43:50.069

[FAILED] in [BeforeEach] - /home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/tests/suite/tracing_test.go:53 @ 06/17/24 21:54:02.502
• [FAILED] [612.532 seconds]
Tracing [BeforeEach] sends tracing spans for one policy attached to one route [functional, tracing]
  [BeforeEach] /home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/tests/suite/tracing_test.go:45
  [It] /home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/tests/suite/tracing_test.go:165

  [FAILED] Error: INSTALLATION FAILED: context deadline exceeded

So it seems to fail in the setup section: we drain the node and delete it, but if we error there and don't finish by restarting the container, the following tests are left without a node. I'm not sure what we can do to mitigate this with our AfterAll in the graceful-recovery test, because if we failed to reset inside the test, I think we would also fail to clean up/reset in the AfterAll.

sjberman · 2024-06-18T14:01:53Z

@bjee19 We currently run the telemetry test on its own in the pipeline, maybe we just do something similar with these couple of tests that require draining. It uses different labels than the rest.

bjee19 · 2024-06-20T15:34:42Z

@sjberman I added the graceful-recovery test to run on its own in the pipeline. The most recent run is an intentional failure. When this test fails, the pipeline no longer runs the rest of the functional tests; is that by design? If so, I think this solves the issue above: any failure in the graceful recovery test that breaks the Kubernetes node or container will be caught in the graceful recovery test and won't propagate to the following tests, because the pipeline will exit.

sjberman · 2024-06-20T18:47:44Z

@bjee19 Yeah, I think that works fine.

github-actions bot added the tests Pull requests that update tests label Jun 13, 2024

bjee19 force-pushed the tests/automate-remaining-graceful-recovery-tests branch from c22fe8c to 330d50b Compare June 17, 2024 19:01

bjee19 changed the title from "draft, do not review: Automate remaining graceful recovery tests" to "Automate remaining graceful recovery tests" Jun 17, 2024

bjee19 marked this pull request as ready for review June 17, 2024 20:55

bjee19 requested a review from a team as a code owner June 17, 2024 20:55

github-actions bot added the documentation Improvements or additions to documentation label Jun 20, 2024

kate-osborn requested changes Jun 20, 2024

bjee19 requested a review from kate-osborn June 21, 2024 18:37

kate-osborn requested changes Jun 21, 2024

bjee19 force-pushed the tests/automate-remaining-graceful-recovery-tests branch from f3f82b9 to ebc4b04 Compare June 24, 2024 20:21

bjee19 requested a review from kate-osborn June 25, 2024 22:32

bjee19 commented Jun 25, 2024

kate-osborn requested changes Jun 26, 2024

bjee19 requested a review from kate-osborn June 26, 2024 17:57

kate-osborn approved these changes Jun 26, 2024

bjee19 requested a review from sjberman June 26, 2024 18:22

sjberman reviewed Jun 26, 2024

sjberman approved these changes Jun 26, 2024

bjee19 added 4 commits June 26, 2024 14:13

Automate remaining graceful recovery tests

28c753d

Remove debugging statements

6b039b8

Run pipeline with failing test

0ea4682

Revert the test to make it work

1971d96

bjee19 added 26 commits June 26, 2024 14:13

Run pipeline with failing test

7f52961

Remove manual test document

ea8fa3d

Run pipeline

b104e2b

Add separate functions for draining and abrupt restart tests

f7932bb

Add review feedback

9ad9fb8

Refactor docker inspect command

23f6ff0

Use cluster name passed in from flag

d145601

Add cluster name flag to right make command

5c26b39

Correct flag name

a19bce4

Remove comments

e752d21

Rebase with fixes to skipped failing tests

dd25200

Teardown NGF between each test

c79b644

Run pipeline

3d2151b

Use BeEmpty instead of empty string

71affc5

Remove functional label

7fe4691

Add stable readiness check

53a6529

Add back in extended timeout

43597ad

Extend timout duration for waiting on NGF

666f35f

Add skip if test is running on GKE

13ae858

Increase stable readiness count

88ac69a

Adjust readiness count

1a7027f

Update comment

4c305d0

Add nil check for clusterName

bb392b1

Adjust wording on error

9769303

Add MustPassRepeatedly to Eventually check

18d153b

Move checks on clusterName earlier

9370b25

bjee19 force-pushed the tests/automate-remaining-graceful-recovery-tests branch from b867bef to 9370b25 Compare June 26, 2024 21:13

Re-run pipeline

df28aa1

bjee19 merged commit aee7a26 into nginxinc:main Jun 26, 2024
41 checks passed

bjee19 deleted the tests/automate-remaining-graceful-recovery-tests branch June 26, 2024 22:07
