
Global timer makes litmus create verdict sooner than the AUT is ready causing the node drain test to fail #2098

Closed
sysarch-repo opened this issue Jul 2, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@sysarch-repo

sysarch-repo commented Jul 2, 2024

Describe the bug
During the node drain test, a global "run-to-completion" timer of 90 seconds appears to be applied. An AUT that is not ready within 90 seconds after the node drain causes the verdict to be an error, which makes the node drain test fail even though the AUT eventually returns to operation.
Note that this global timer contradicts #1838, which mentions a timeout of 30 minutes.

To Reproduce
Steps to reproduce the behavior:

CNTI testsuite 1.3.0

  1. Deploy an AUT that takes an extended time to terminate, e.g. due to a large terminationGracePeriodSeconds value and a preStop hook that delays the shutdown (a minimal pod spec is sketched after these steps)
  2. Run the node drain test (debug level)
  3. Observe failed test
  4. Inspect the debug log for the litmus verdict mentioning a global timer of 90 seconds (global timeout reached: 1m30s):
    "status": {
        "experimentStatus": {
            "errorOutput": {
                "errorCode": "CHAOS_INJECT_ERROR",
                "reason": "{\"errorCode\":\"CHAOS_INJECT_ERROR\",\"phase\":\"ChaosInject\",\"reason\":\"failed to drain the target node: WARNING: ignoring DaemonSet-managed Pods: cnf-testsuite/cluster-tools-m8svp, kube-system/aws-node-nwjrf, kube-system/ebs-csi-node-rzpfr, kube-system/kube-proxy-vvwn8; deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: cnf-testsuite/dockerd\\nI0702 12:31:11.515935      23 request.go:668] Waited for 1.097972754s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/api/v1/namespaces/ns1/pods/rel1-pod1-7c44f78ff-sqlrk\\nThere are pending pods in node \\\"ip-10-0-82-127.ec2.internal\\\" when an error occurred: [error when waiting for pod \\\"rel1-pod2-0-0\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"rel1-pod1-7c44f78ff-sqlrk\\\" terminating: global timeout reached: 1m30s]\\npod/rel1-pod1-7c44f78ff-sqlrk\\npod/rel1-pod2-0-0\\nerror: unable to drain node \\\"ip-10-0-82-127.ec2.internal\\\", aborting command...\\n\\nThere are pending nodes to be drained:\\n ip-10-0-82-127.ec2.internal\\nerror when waiting for pod \\\"rel1-pod2-0-0\\\" terminating: global timeout reached: 1m30s\\nerror when waiting for pod \\\"rel1-pod1-7c44f78ff-sqlrk\\\" terminating: global timeout reached: 1m30s\\n\",\"target\":\"{node: ip-10-0-82-127.ec2.internal}\"}"
            },
            "phase": "Error",
            "probeSuccessPercentage": "0",
            "verdict": "Error"
        }, 
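For reference, a minimal pod spec along the lines of step 1 could look like the sketch below. This is illustrative only; the Deployment name, image, grace period, and sleep duration are assumptions chosen so that termination exceeds the 90-second timer, not values taken from the actual AUT:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: slow-terminating-aut   # hypothetical name
  namespace: ns1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: slow-terminating-aut
  template:
    metadata:
      labels:
        app: slow-terminating-aut
    spec:
      # longer than the 90s global timer seen in the litmus verdict
      terminationGracePeriodSeconds: 300
      containers:
        - name: app
          image: nginx:1.25
          lifecycle:
            preStop:
              exec:
                # delay shutdown beyond the 90s drain timeout
                command: ["sh", "-c", "sleep 180"]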

Note the TOTAL_CHAOS_DURATION env value in the generated ChaosEngine resource, which shows the 90-second test duration this ticket is concerned with:

ubuntu@ip-10-0-17-124:~$ cat /home/ubuntu/auts/aut/dns/node-drain-chaosengine.yml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: rel1-pod1-378d4229
  namespace: ns1
spec:
  appinfo:
    appns: 'ns1'
    applabel: 'app.name/pod-group=rel1-pod1'
    appkind: 'deployment'
  # It can be delete/retain
  jobCleanUpPolicy: 'delete'
  # It can be active/stop
  engineState: 'active'
  chaosServiceAccount: node-drain-sa
  experiments:
    - name: node-drain
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '90'
            - name: TARGET_NODE
              value: 'ip-10-0-83-53.ec2.internal'

Expected behavior
The global timer of 1m30s should be configurable via an environment variable or the CNF config, or relaxed as mentioned in #1838. Note that the node drain impacts all pods running on the drained node, not only the AUT components. Some components (e.g. clustered workloads) need an extended time for an ordered termination and start.

Device (please complete the following information):
$ uname -a
Linux ip-10-0-17-74 6.5.0-1020-aws #20 SMP Wed May 1 16:10:50 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

@sysarch-repo sysarch-repo added the bug Something isn't working label Jul 2, 2024
@sysarch-repo sysarch-repo changed the title Global timer of 90 sec makes litmus create verdict sooner than the AUT is ready making the node drain test fail Global timer makes litmus create verdict sooner than the AUT is ready causing the node drain test to fail Jul 3, 2024
@martin-mat
Collaborator

@martin-mat martin-mat self-assigned this Jul 4, 2024
@sysarch-repo
Author

@martin-mat, excellent and thanks for the investigation.
Maybe the env concepts outlined in USAGE.md can be re-used to resolve this ticket:

Environment variables for timeouts:
Timeouts are controlled by these environment variables; set them if the default values aren't suitable:
CNF_TESTSUITE_GENERIC_OPERATION_TIMEOUT=60
CNF_TESTSUITE_RESOURCE_CREATION_TIMEOUT=120
CNF_TESTSUITE_NODE_READINESS_TIMEOUT=240
CNF_TESTSUITE_POD_READINESS_TIMEOUT=180
CNF_TESTSUITE_LITMUS_CHAOS_TEST_TIMEOUT=1800
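For example, to give the AUT more time during chaos tests, the relevant variables could be exported before running the suite. The values and the exact invocation below are assumptions for illustration, not taken from the documentation:

export CNF_TESTSUITE_LITMUS_CHAOS_TEST_TIMEOUT=1800
export CNF_TESTSUITE_POD_READINESS_TIMEOUT=600
./cnf-testsuite node_drain verbose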

martin-mat added a commit to martin-mat/cnf-testsuite that referenced this issue Jul 9, 2024
Add possibility to change duration of node drain
litmus chaos test. This is needed for CNFs with
longer startup/shutdown.
Additionally, fix litmus waiter code and timeout module.
Slight refactor of LitmusManager module.

Refs: cnti-testcatalog#2098

Signed-off-by: Martin Matyas <[email protected]>
@martin-mat
Collaborator

After a thorough analysis, a few different issues were found:

  • litmus waiting/timeouts were not working correctly
  • generic waiting for timeouts was not working correctly
  • the combination of the two issues above made the litmus tests appear to work
  • the parameter that makes node drain work is the duration of the litmus chaos test, not any test-case timeout; an environment variable was added to change it

All fixes in #2102
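For context, the practical effect is that the ChaosEngine generated by the testsuite can carry a larger chaos duration than the 90 seconds shown earlier in this ticket, e.g. (illustrative value only):

            - name: TOTAL_CHAOS_DURATION
              value: '600'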

Tested:
[screenshot attached]

martin-mat added a commit that referenced this issue Jul 10, 2024