
Global timer makes litmus create verdict sooner than the AUT is ready causing the node drain test to fail #2098

Closed
sysarch-repo opened this issue Jul 2, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@sysarch-repo

sysarch-repo commented Jul 2, 2024

Describe the bug
During the node drain test, a global "run-to-completion" timer of 90 seconds appears to be applied. An AUT that is not ready within 90 seconds after the node drain causes the verdict to be an error, which makes the node drain test fail even though the AUT eventually returns to operation.
Note that this global timer contradicts #1838, which mentions a timeout of 30 minutes.

To Reproduce
Steps to reproduce the behavior:

CNTI testsuite 1.3.0

  1. Deploy an AUT that takes an extended time to terminate, e.g. due to a large terminationGracePeriodSeconds value and a preStop hook that delays the shutdown (a minimal pod spec is sketched after these steps)
  2. Run the node drain test (debug level)
  3. Observe failed test
  4. Inspect the debug log for the litmus verdict mentioning a global timer of 90 seconds (global timeout reached: 1m30s):
    "status": {
        "experimentStatus": {
            "errorOutput": {
                "errorCode": "CHAOS_INJECT_ERROR",
                "reason": "{\"errorCode\":\"CHAOS_INJECT_ERROR\",\"phase\":\"ChaosInject\",\"reason\":\"failed to drain the target node: WARNING: ignoring DaemonSet-managed Pods: cnf-testsuite/cluster-tools-m8svp, kube-system/aws-node-nwjrf, kube-system/ebs-csi-node-rzpfr, kube-system/kube-proxy-vvwn8; deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: cnf-testsuite/dockerd\\nI0702 12:31:11.515935      23 request.go:668] Waited for 1.097972754s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/api/v1/namespaces/ns1/pods/rel1-pod1-7c44f78ff-sqlrk\\nThere are pending pods in node \\\"ip-10-0-82-127.ec2.internal\\\" when an error occurred: [error when waiting for pod \\\"rel1-pod2-0-0\\\" terminating: global timeout reached: 1m30s, error when waiting for pod \\\"rel1-pod1-7c44f78ff-sqlrk\\\" terminating: global timeout reached: 1m30s]\\npod/rel1-pod1-7c44f78ff-sqlrk\\npod/rel1-pod2-0-0\\nerror: unable to drain node \\\"ip-10-0-82-127.ec2.internal\\\", aborting command...\\n\\nThere are pending nodes to be drained:\\n ip-10-0-82-127.ec2.internal\\nerror when waiting for pod \\\"rel1-pod2-0-0\\\" terminating: global timeout reached: 1m30s\\nerror when waiting for pod \\\"rel1-pod1-7c44f78ff-sqlrk\\\" terminating: global timeout reached: 1m30s\\n\",\"target\":\"{node: ip-10-0-82-127.ec2.internal}\"}"
            },
            "phase": "Error",
            "probeSuccessPercentage": "0",
            "verdict": "Error"
        }, 
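For reference, a minimal pod spec along the lines of step 1 could look like the sketch below. This is illustrative only; the Deployment name, image, grace period, and sleep duration are assumptions chosen so that termination exceeds the 90-second timer, not values taken from the actual AUT:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: slow-terminating-aut   # hypothetical name
  namespace: ns1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: slow-terminating-aut
  template:
    metadata:
      labels:
        app: slow-terminating-aut
    spec:
      # longer than the 90s global timer seen in the litmus verdict
      terminationGracePeriodSeconds: 300
      containers:
        - name: app
          image: nginx:1.25
          lifecycle:
            preStop:
              exec:
                # delay shutdown beyond the 90s drain timeout
                command: ["sh", "-c", "sleep 180"]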

Note the TOTAL_CHAOS_DURATION env value in the generated ChaosEngine resource, which shows the 90-second test duration this ticket is concerned with:

ubuntu@ip-10-0-17-124:~$ cat /home/ubuntu/auts/aut/dns/node-drain-chaosengine.yml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: rel1-pod1-378d4229
  namespace: ns1
spec:
  appinfo:
    appns: 'ns1'
    applabel: 'app.name/pod-group=rel1-pod1'
    appkind: 'deployment'
  # It can be delete/retain
  jobCleanUpPolicy: 'delete'
  # It can be active/stop
  engineState: 'active'
  chaosServiceAccount: node-drain-sa
  experiments:
    - name: node-drain
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '90'
            - name: TARGET_NODE
              value: 'ip-10-0-83-53.ec2.internal'

Expected behavior
The global timer of 1m30s should be configurable via an environment variable or the CNF config, or relaxed as mentioned in #1838. Note that the node drain impacts all pods running on the drained node, not only the AUT components. Some components (e.g. clustered workloads) need an extended time for an ordered termination and start.

Device (please complete the following information):
$ uname -a
Linux ip-10-0-17-74 6.5.0-1020-aws #20 SMP Wed May 1 16:10:50 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

@sysarch-repo sysarch-repo added the bug Something isn't working label Jul 2, 2024
@sysarch-repo sysarch-repo changed the title Global timer of 90 sec makes litmus create verdict sooner than the AUT is ready making the node drain test fail Global timer makes litmus create verdict sooner than the AUT is ready causing the node drain test to fail Jul 3, 2024
@martin-mat
Collaborator

@martin-mat martin-mat self-assigned this Jul 4, 2024
@sysarch-repo
Author

@martin-mat, excellent and thanks for the investigation.
Maybe the env concepts outlined in USAGE.md can be re-used to resolve this ticket:

Environment variables for timeouts:
Timeouts are controlled by these environment variables; set them if the default values aren't suitable:
CNF_TESTSUITE_GENERIC_OPERATION_TIMEOUT=60
CNF_TESTSUITE_RESOURCE_CREATION_TIMEOUT=120
CNF_TESTSUITE_NODE_READINESS_TIMEOUT=240
CNF_TESTSUITE_POD_READINESS_TIMEOUT=180
CNF_TESTSUITE_LITMUS_CHAOS_TEST_TIMEOUT=1800
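For example, to give the AUT more time during chaos tests, the relevant variables could be exported before running the suite. The values and the exact invocation below are assumptions for illustration, not taken from the documentation:

export CNF_TESTSUITE_LITMUS_CHAOS_TEST_TIMEOUT=1800
export CNF_TESTSUITE_POD_READINESS_TIMEOUT=600
./cnf-testsuite node_drain verbose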

martin-mat added a commit to martin-mat/cnf-testsuite that referenced this issue Jul 9, 2024
Add possibility to change duration of node drain
litmus chaos test. This is needed for CNFs with
longer startup/shutdown.
Additionally, fix litmus waiter code and timeout module.
Slight refactor of LitmusManager module.

Refs: cnti-testcatalog#2098

Signed-off-by: Martin Matyas <[email protected]>
@martin-mat
Collaborator

After a thorough analysis, a few different issues were found:

  • litmus waiting/timeouts were not working correctly
  • generic waiting for timeouts was not working correctly
  • the combination of the two issues above made the litmus tests appear to work
  • the parameter that makes node drain work is the duration of the litmus chaos test, not any test-case timeout; an environment variable was added to change it

All fixes in #2102
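For context, the practical effect is that the ChaosEngine generated by the testsuite can carry a larger chaos duration than the 90 seconds shown earlier in this ticket, e.g. (illustrative value only):

            - name: TOTAL_CHAOS_DURATION
              value: '600'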

Tested:
[screenshot attached]

martin-mat added a commit that referenced this issue Jul 10, 2024