
Added node chaos scenarios#10

Merged
chaitanyaenr merged 1 commit into krkn-chaos:master from yashashreesuresh:node_scenarios
Aug 27, 2020

Conversation

@yashashreesuresh
Contributor

@yashashreesuresh yashashreesuresh commented Apr 30, 2020

This commit:

  • Adds a node scenario to stop and start an instance
  • Adds a node scenario to terminate an instance
  • Adds a node scenario to reboot an instance
  • Adds a node scenario to stop the kubelet
  • Adds a node scenario to crash the node

Fixes: #8

@yashashreesuresh yashashreesuresh force-pushed the node_scenarios branch 4 times, most recently from 3c74396 to b07c525 on May 5, 2020
@rht-perf-ci

Can one of the admins verify this patch?

@yashashreesuresh yashashreesuresh force-pushed the node_scenarios branch 5 times, most recently from 4d2b36d to 5220416 on May 11, 2020
@yashashreesuresh yashashreesuresh mentioned this pull request Jun 11, 2020
@yashashreesuresh
Contributor Author

The current implementation uses an abstract class; the class for each cloud type inherits from it. With this, the node scenario functions can be called only after creating an object of the class. The advantage of using an abstract class is that it ensures all the cloud types support all the scenarios. But I think it would be easier to call the functions directly without creating an object. Thoughts? @mffiedler @paigerube14
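For readers following along, the trade-off being discussed can be sketched like this (class and method names are illustrative, not necessarily Kraken's actual ones). The point in favor of the abstract class is that a cloud type which forgets to implement a scenario fails loudly at instantiation:

```python
# Sketch of the abstract-class approach under discussion: a base class
# defines the scenario interface and each cloud type implements it.
# All names here are illustrative, not Kraken's actual API.
from abc import ABC, abstractmethod


class AbstractNodeScenarios(ABC):
    """Interface that every cloud type must implement."""

    @abstractmethod
    def node_stop_start_scenario(self, instance_kill_count, node, timeout):
        pass

    @abstractmethod
    def node_termination_scenario(self, instance_kill_count, node, timeout):
        pass


class AWSNodeScenarios(AbstractNodeScenarios):
    def node_stop_start_scenario(self, instance_kill_count, node, timeout):
        return "aws stop/start on %s" % node

    def node_termination_scenario(self, instance_kill_count, node, timeout):
        return "aws terminate %s" % node


# A subclass that omits a scenario raises TypeError when instantiated,
# which is how the abstract class guarantees full scenario coverage.
scenario = AWSNodeScenarios()
print(scenario.node_termination_scenario(1, "node-1", 120))
```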

@mffiedler
Collaborator

I think the use of the abstract class to allow for cloud-specific scenario implementations where the cloud instance APIs are required is the correct approach. We also need to allow for scenarios which are not cloud specific (see the node crash scenarios which use oc debug node in the PR @paigerube14 started in #18 for an example). Where possible, having a default/common implementation of the scenario would be beneficial and it can be overridden by a cloud-specific implementation where needed. Hope that makes sense.

@paigerube14
Collaborator

paigerube14 commented Jun 18, 2020

I completely agree with Mike; I think the abstract class might actually make it easier for all the cloud providers to have a common layout.

@yashashreesuresh did you want me to add my 2 functions (kubelet_action and crash_node) for the oc debug parts that Mike is talking about to your branch? How did you want to do that?

@yashashreesuresh
Contributor Author

yashashreesuresh commented Jun 19, 2020

> I completely agree with Mike; I think the abstract class might actually make it easier for all the cloud providers to have a common layout.
>
> @yashashreesuresh did you want me to add my 2 functions (kubelet_action and crash_node) for the oc debug parts that Mike is talking about to your branch? How did you want to do that?

@paigerube14 I have added the kubelet scenario as https://github.com/openshift-scale/kraken/pull/10/files#diff-d7c8b18dc1f8fd23bc1e62bdee0ccdceL8. subprocess.check_output doesn't run to completion because once the kubelet is stopped, the debug pod never terminates. Therefore, I had to use subprocess.run instead: https://github.com/openshift-scale/kraken/pull/10/files#diff-d7c8b18dc1f8fd23bc1e62bdee0ccdceR15.
I was trying your node_crash scenario, and I observed that many times the command executes but the node status doesn't change from Ready -> NotReady -> Ready. Sometimes the command hangs and doesn't run to completion, and I had to use Ctrl+C. I have still added your command here: https://github.com/openshift-scale/kraken/pull/10/files#diff-54651a20a8f57ef1c067dbaac8fdc0a6R63. It needs to be changed.
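The check_output-versus-run behavior described here can be reproduced in isolation. This sketch uses `sleep 30` as a stand-in for the actual oc debug command (which lives in the linked diff); one way to keep a never-exiting child from hanging the scenario is subprocess.run's timeout parameter:

```python
# subprocess.check_output blocks until the child exits, so a debug pod
# that never terminates hangs the caller forever. subprocess.run with a
# timeout bounds the wait. "sleep 30" stands in for the real oc debug
# command from the PR; it is not the actual scenario command.
import subprocess

try:
    subprocess.run(["sleep", "30"], timeout=2, check=False)
except subprocess.TimeoutExpired:
    print("command did not finish within the timeout; moving on")
```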

@yashashreesuresh
Contributor Author

All the scenarios are added, PTAL @chaitanyaenr

@paigerube14
Collaborator

paigerube14 commented Jun 23, 2020

> > I completely agree with Mike; I think the abstract class might actually make it easier for all the cloud providers to have a common layout.
> >
> > @yashashreesuresh did you want me to add my 2 functions (kubelet_action and crash_node) for the oc debug parts that Mike is talking about to your branch? How did you want to do that?
>
> @paigerube14 I have added the kubelet scenario as https://github.com/openshift-scale/kraken/pull/10/files#diff-d7c8b18dc1f8fd23bc1e62bdee0ccdceL8. subprocess.check_output doesn't run to completion because once the kubelet is stopped, the debug pod never terminates. Therefore, I had to use subprocess.run instead: https://github.com/openshift-scale/kraken/pull/10/files#diff-d7c8b18dc1f8fd23bc1e62bdee0ccdceR15.
> I was trying your node_crash scenario, and I observed that many times the command executes but the node status doesn't change from Ready -> NotReady -> Ready. Sometimes the command hangs and doesn't run to completion, and I had to use Ctrl+C. I have still added your command here: https://github.com/openshift-scale/kraken/pull/10/files#diff-54651a20a8f57ef1c067dbaac8fdc0a6R63. It needs to be changed.

I did have to use subprocess.Popen to get the stop-kubelet scenario to return properly. I'm not sure about the node_crash scenario or whether the node will go to the NotReady state; I was going off the commands Mike had laid out in the original issue. For node_crash I also had to use subprocess.Popen to get a proper response. I thought either subprocess function worked, so I didn't add it to my pull request, sorry about that. I tried the node_crash scenario from my branch and the Popen returns properly.

It might take some time, but the node I tried to kill did end up getting into the NotReady state:

```
Try invoking oc debug node/ip-10-0-203-2.us-east-2.compute.internal -- chroot /host dd if=/dev/urandom of=/proc/sysrq-trigger
2020-06-23 11:01:31,957 [INFO] Scenario: {'node_scenarios': [{'name': 'Fork bomb the node', 'actions': ['node_crash'], 'label_selector': 'node-role.kubernetes.io/worker', 'instance_kill_count': 1, 'timeout': 20, 'cloud_type': 'aws'}]} has been successfully injected!
2020-06-23 11:01:31,957 [INFO] Waiting for the specified duration: 60
(venv3) prubenda@prubenda-mac kraken % oc get nodes
NAME                                         STATUS     ROLES    AGE    VERSION
ip-10-0-132-58.us-east-2.compute.internal    Ready      worker   125m   v1.18.3+91d0edd
ip-10-0-138-72.us-east-2.compute.internal    Ready      master   136m   v1.18.3+91d0edd
ip-10-0-176-148.us-east-2.compute.internal   Ready      master   136m   v1.18.3+91d0edd
ip-10-0-183-154.us-east-2.compute.internal   Ready      worker   126m   v1.18.3+91d0edd
ip-10-0-203-2.us-east-2.compute.internal     NotReady   worker   126m   v1.18.3+91d0edd
ip-10-0-210-102.us-east-2.compute.internal   Ready      master   135m   v1.18.3+91d0edd
```
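The Popen behavior described above, where the call returns while the child keeps running, can be seen in isolation (again using `sleep 30` as a stand-in for the real oc debug invocation):

```python
# subprocess.Popen starts the child process and returns immediately,
# without waiting for it to exit. That is why it "returns properly"
# where check_output would block. "sleep 30" is a stand-in command.
import subprocess
import time

start = time.time()
proc = subprocess.Popen(["sleep", "30"])
elapsed = time.time() - start
print("Popen returned after %.2f seconds" % elapsed)

proc.kill()  # clean up the child in this sketch
proc.wait()
```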

@yashashreesuresh yashashreesuresh changed the title from "Added node scenarios to stop and terminate instance" to "Added node chaos scenarios" on Jun 24, 2020
@yashashreesuresh
Contributor Author

yashashreesuresh commented Jun 24, 2020

> I did have to use subprocess.Popen to get the stop-kubelet scenario to return properly. I'm not sure about the node_crash scenario or whether the node will go to the NotReady state; I was going off the commands Mike had laid out in the original issue. For node_crash I also had to use subprocess.Popen to get a proper response. I thought either subprocess function worked, so I didn't add it to my pull request, sorry about that. I tried the node_crash scenario from my branch and the Popen returns properly.

I tried with Popen, and it's not working as I expected. The subprocess.Popen("oc debug node/" + node + " -- chroot /host dd if=/dev/urandom of=/proc/sysrq-trigger") command finishes, and only after a few seconds does the node go to the NotReady state. I expected the node to go NotReady while the Popen command was still executing, but it went NotReady after the debug pod was removed. A few times, the node state doesn't change at all.
I am not sure what's happening.

Anyway, I have removed the node_crash scenario. You can add your code here https://github.com/openshift-scale/kraken/pull/10/files#diff-54651a20a8f57ef1c067dbaac8fdc0a6R58 once this PR is merged, since it worked right for you.

@yashashreesuresh
Contributor Author

For node crash scenarios, nodes will not immediately show as NotReady in the API. They have to miss a readiness check interval from the API server's perspective. If a node is in trouble with memory or disk pressure but still running, it can self-report the issue immediately; but for a crashed node, the cluster needs to notice it is gone before marking it NotReady.

Added back the node crash scenario that was removed.

@paigerube14
Collaborator

Is it expected that we set up the boto3 region and some other attributes before running? I did not set up boto3, and I am getting the error botocore.exceptions.NoRegionError: You must specify a region. It might be good to add this to the README, or it might be nice to specify those either in the config or in the scenario itself where you set the cloud type to aws.

@yashashreesuresh
Contributor Author

> Is it expected that we set up the boto3 region and some other attributes before running? I did not set up boto3, and I am getting the error botocore.exceptions.NoRegionError: You must specify a region. It might be good to add this to the README, or it might be nice to specify those either in the config or in the scenario itself where you set the cloud type to aws.

The boto3 client is created in the default region specified in the ~/.aws/config file. This file is created when the AWS CLI is configured. Which version of the AWS CLI are you using?
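For reference, the region boto3 falls back to lives in ~/.aws/config; a minimal file looks like this (the region value below is only an example). Alternatively, a region can be passed explicitly when the client is created, e.g. boto3.client("ec2", region_name="us-east-2").

```ini
# Minimal ~/.aws/config; boto3 reads the default region from here.
# us-east-2 is an example value, not a requirement.
[default]
region = us-east-2
```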

@paigerube14
Collaborator

> > Is it expected that we set up the boto3 region and some other attributes before running? I did not set up boto3, and I am getting the error botocore.exceptions.NoRegionError: You must specify a region. It might be good to add this to the README, or it might be nice to specify those either in the config or in the scenario itself where you set the cloud type to aws.
>
> The boto3 client is created in the default region specified in the ~/.aws/config file. This file is created when the AWS CLI is configured. Which version of the AWS CLI are you using?

Ah, I didn't have the AWS CLI installed. I definitely think this should be put in the README as an extra setup step, because it is not clear. Maybe just specify that if you are using the cloud-specific node scenarios, you'll need to set up the AWS CLI with a certain config. I followed https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html if you want to use this in the documentation.

@yashashreesuresh
Contributor Author

> Ah, I didn't have the AWS CLI installed. I definitely think this should be put in the README as an extra setup step, because it is not clear. Maybe just specify that if you are using the cloud-specific node scenarios, you'll need to set up the AWS CLI with a certain config. I followed https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html if you want to use this in the documentation.

Yeah, I will add it in the README. Thank you!

@yashashreesuresh yashashreesuresh force-pushed the node_scenarios branch 2 times, most recently from 76a768e to 8c7f59a on June 30, 2020
Collaborator

@paigerube14 paigerube14 left a comment


LGTM

@mffiedler
Collaborator

/hold
I need some time to review and test this; it has gotten pretty big. In the future, I suggest we add one scenario per PR.

@yashashreesuresh
Contributor Author

> /hold
> I need some time to review and test this; it has gotten pretty big. In the future, I suggest we add one scenario per PR.

Sure.

[Cerberus](https://github.com/openshift-scale/cerberus) can be used to monitor the cluster under test; the aggregated go/no-go signal it generates can be consumed by Kraken to determine pass/fail. This ensures the Kubernetes/OpenShift environment is healthy at the cluster level instead of just at the level of the targeted components. It is highly recommended to turn on the Cerberus health check feature available in Kraken after installing and setting up Cerberus. To do that, set cerberus_enabled to True and cerberus_url to the URL where Cerberus publishes the go/no-go signal in the config file.
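The two settings just described might look like the following in the Kraken config file (a sketch only; the key placement and the URL are illustrative, not taken from the repository):

```yaml
# Illustrative Kraken config fragment enabling the Cerberus health check.
# Key nesting and the URL are assumptions for this sketch.
cerberus_enabled: True                 # consume the Cerberus go/no-go signal
cerberus_url: http://localhost:8080    # where Cerberus publishes the signal
```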

### Kubernetes/OpenShift node chaos scenarios supported
The following node chaos scenarios are supported:
Collaborator


We might want to mention that the cloud-related scenarios like start/stop/reboot are supported only on AWS as of now.

Contributor Author


Added it.

```
self.node_reboot_scenario(instance_kill_count, node, timeout)
logging.info("stop_start_kubelet_scenario has been successfully injected!")
```

# Node scenario to crash the node
Collaborator


We might want to add a note in the readme that the node needs to be rebooted to recover from node crash and that it can be added as an action in the node scenarios config.

Contributor Author


Added it.

README.md Outdated

**NOTE**: With aws as the cloud type, make sure [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) is installed.

### Kubernetes/OpenShift chaos scenarios supported
Collaborator


We might want to move the node scenarios under "Kubernetes/OpenShift chaos scenarios supported" section.

Contributor Author


Done.

This commit:
- Adds a node scenario to stop and start an instance
- Adds a node scenario to terminate an instance
- Adds a node scenario to reboot an instance
- Adds a node scenario to stop the kubelet
- Adds a node scenario to crash the node
Collaborator

@chaitanyaenr chaitanyaenr left a comment


LGTM, tested all the scenarios and they work well as expected.

@chaitanyaenr
Collaborator

@paigerube14 @mffiedler Thoughts?



```
node_scenarios:
```
Collaborator


Should we move this to a separate README for the node scenarios?

Contributor Author


I think it would be better to keep everything in the main README, so the user gets to know about the different scenarios supported by Kraken.

@mffiedler
Collaborator

I think we should merge this and do any follow-up work needed as a separate PR.

LGTM

@chaitanyaenr chaitanyaenr merged commit 31f06b8 into krkn-chaos:master Aug 27, 2020
@yashashreesuresh yashashreesuresh deleted the node_scenarios branch November 18, 2020 15:32