
Added node chaos scenarios#10

Merged
chaitanyaenr merged 1 commit into krkn-chaos:master from yashashreesuresh:node_scenarios
Aug 27, 2020

Conversation

@yashashreesuresh
Contributor

@yashashreesuresh yashashreesuresh commented Apr 30, 2020

This commit:

  • Adds a node scenario to stop and start an instance
  • Adds a node scenario to terminate an instance
  • Adds a node scenario to reboot an instance
  • Adds a node scenario to stop the kubelet
  • Adds a node scenario to crash the node

Fixes: #8

@yashashreesuresh yashashreesuresh force-pushed the node_scenarios branch 4 times, most recently from 3c74396 to b07c525 on May 5, 2020
@rht-perf-ci

Can one of the admins verify this patch?

@yashashreesuresh yashashreesuresh force-pushed the node_scenarios branch 5 times, most recently from 4d2b36d to 5220416 on May 11, 2020
@yashashreesuresh yashashreesuresh mentioned this pull request Jun 11, 2020
@yashashreesuresh
Contributor Author

The current implementation uses an abstract class; the class for each cloud type inherits from it. With this, the node scenario functions can be called only after creating an object of the class. The advantage of using an abstract class is that it ensures all the cloud types support all the scenarios. But I think it would be easier to call the functions directly without creating an object. Thoughts? @mffiedler @paigerube14
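For readers following along, the trade-off being discussed can be sketched like this (class and method names are illustrative, not necessarily Kraken's actual ones). The point in favor of the abstract class is that a cloud type which forgets to implement a scenario fails loudly at instantiation:

```python
# Sketch of the abstract-class approach under discussion: a base class
# defines the scenario interface and each cloud type implements it.
# All names here are illustrative, not Kraken's actual API.
from abc import ABC, abstractmethod


class AbstractNodeScenarios(ABC):
    """Interface that every cloud type must implement."""

    @abstractmethod
    def node_stop_start_scenario(self, instance_kill_count, node, timeout):
        pass

    @abstractmethod
    def node_termination_scenario(self, instance_kill_count, node, timeout):
        pass


class AWSNodeScenarios(AbstractNodeScenarios):
    def node_stop_start_scenario(self, instance_kill_count, node, timeout):
        return "aws stop/start on %s" % node

    def node_termination_scenario(self, instance_kill_count, node, timeout):
        return "aws terminate %s" % node


# A subclass that omits a scenario raises TypeError when instantiated,
# which is how the abstract class guarantees full scenario coverage.
scenario = AWSNodeScenarios()
print(scenario.node_termination_scenario(1, "node-1", 120))
```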

@mffiedler
Collaborator

I think the use of the abstract class to allow for cloud-specific scenario implementations where the cloud instance APIs are required is the correct approach. We also need to allow for scenarios which are not cloud specific (see the node crash scenarios which use oc debug node in the PR @paigerube14 started in #18 for an example). Where possible, having a default/common implementation of the scenario would be beneficial and it can be overridden by a cloud-specific implementation where needed. Hope that makes sense.

@paigerube14
Collaborator

paigerube14 commented Jun 18, 2020

I completely agree with Mike; I think the abstract class might actually make it easier for all the cloud providers to have a common layout.

@yashashreesuresh did you want me to add my 2 functions (kubelet_action and crash_node) for the oc debug parts that Mike is talking about to your branch? How did you want to do that?

@yashashreesuresh
Contributor Author

yashashreesuresh commented Jun 19, 2020

> I completely agree with Mike; I think the abstract class might actually make it easier for all the cloud providers to have a common layout.
>
> @yashashreesuresh did you want me to add my 2 functions (kubelet_action and crash_node) for the oc debug parts that Mike is talking about to your branch? How did you want to do that?

@paigerube14 I have added the kubelet scenario as https://github.com/openshift-scale/kraken/pull/10/files#diff-d7c8b18dc1f8fd23bc1e62bdee0ccdceL8. subprocess.check_output doesn't run to completion because once the kubelet is stopped, the debug pod never terminates. Therefore, I had to use subprocess.run instead: https://github.com/openshift-scale/kraken/pull/10/files#diff-d7c8b18dc1f8fd23bc1e62bdee0ccdceR15.
I was trying your node_crash scenario, and I observed that many times the command executes but the node status doesn't change from Ready -> NotReady -> Ready. Sometimes the command hangs and doesn't run to completion, and I had to use Ctrl+C. I have still added your command here: https://github.com/openshift-scale/kraken/pull/10/files#diff-54651a20a8f57ef1c067dbaac8fdc0a6R63. It needs to be changed.
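The check_output-versus-run behavior described here can be reproduced in isolation. This sketch uses `sleep 30` as a stand-in for the actual oc debug command (which lives in the linked diff); one way to keep a never-exiting child from hanging the scenario is subprocess.run's timeout parameter:

```python
# subprocess.check_output blocks until the child exits, so a debug pod
# that never terminates hangs the caller forever. subprocess.run with a
# timeout bounds the wait. "sleep 30" stands in for the real oc debug
# command from the PR; it is not the actual scenario command.
import subprocess

try:
    subprocess.run(["sleep", "30"], timeout=2, check=False)
except subprocess.TimeoutExpired:
    print("command did not finish within the timeout; moving on")
```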

@yashashreesuresh
Contributor Author

All the scenarios are added, PTAL @chaitanyaenr

@paigerube14
Collaborator

paigerube14 commented Jun 23, 2020

> > I completely agree with Mike; I think the abstract class might actually make it easier for all the cloud providers to have a common layout.
> >
> > @yashashreesuresh did you want me to add my 2 functions (kubelet_action and crash_node) for the oc debug parts that Mike is talking about to your branch? How did you want to do that?
>
> @paigerube14 I have added the kubelet scenario as https://github.com/openshift-scale/kraken/pull/10/files#diff-d7c8b18dc1f8fd23bc1e62bdee0ccdceL8. subprocess.check_output doesn't run to completion because once the kubelet is stopped, the debug pod never terminates. Therefore, I had to use subprocess.run instead: https://github.com/openshift-scale/kraken/pull/10/files#diff-d7c8b18dc1f8fd23bc1e62bdee0ccdceR15.
> I was trying your node_crash scenario, and I observed that many times the command executes but the node status doesn't change from Ready -> NotReady -> Ready. Sometimes the command hangs and doesn't run to completion, and I had to use Ctrl+C. I have still added your command here: https://github.com/openshift-scale/kraken/pull/10/files#diff-54651a20a8f57ef1c067dbaac8fdc0a6R63. It needs to be changed.

I did have to use subprocess.Popen to get the stop-kubelet scenario to return properly. I'm not sure about the node_crash scenario or whether the node will go to the NotReady state; I was going off the commands Mike had laid out in the original issue. For node_crash I also had to use subprocess.Popen to get a proper response. I thought either subprocess function worked, so I didn't add it to my pull request, sorry about that. I tried the node_crash scenario from my branch and the Popen returns properly.

It might take some time, but the node I tried to kill did end up getting into the NotReady state:

```
Try invoking oc debug node/ip-10-0-203-2.us-east-2.compute.internal -- chroot /host dd if=/dev/urandom of=/proc/sysrq-trigger
2020-06-23 11:01:31,957 [INFO] Scenario: {'node_scenarios': [{'name': 'Fork bomb the node', 'actions': ['node_crash'], 'label_selector': 'node-role.kubernetes.io/worker', 'instance_kill_count': 1, 'timeout': 20, 'cloud_type': 'aws'}]} has been successfully injected!
2020-06-23 11:01:31,957 [INFO] Waiting for the specified duration: 60
(venv3) prubenda@prubenda-mac kraken % oc get nodes
NAME                                         STATUS     ROLES    AGE    VERSION
ip-10-0-132-58.us-east-2.compute.internal    Ready      worker   125m   v1.18.3+91d0edd
ip-10-0-138-72.us-east-2.compute.internal    Ready      master   136m   v1.18.3+91d0edd
ip-10-0-176-148.us-east-2.compute.internal   Ready      master   136m   v1.18.3+91d0edd
ip-10-0-183-154.us-east-2.compute.internal   Ready      worker   126m   v1.18.3+91d0edd
ip-10-0-203-2.us-east-2.compute.internal     NotReady   worker   126m   v1.18.3+91d0edd
ip-10-0-210-102.us-east-2.compute.internal   Ready      master   135m   v1.18.3+91d0edd
```
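The Popen behavior described above, where the call returns while the child keeps running, can be seen in isolation (again using `sleep 30` as a stand-in for the real oc debug invocation):

```python
# subprocess.Popen starts the child process and returns immediately,
# without waiting for it to exit. That is why it "returns properly"
# where check_output would block. "sleep 30" is a stand-in command.
import subprocess
import time

start = time.time()
proc = subprocess.Popen(["sleep", "30"])
elapsed = time.time() - start
print("Popen returned after %.2f seconds" % elapsed)

proc.kill()  # clean up the child in this sketch
proc.wait()
```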

@yashashreesuresh yashashreesuresh changed the title from "Added node scenarios to stop and terminate instance" to "Added node chaos scenarios" on Jun 24, 2020
@yashashreesuresh
Contributor Author

yashashreesuresh commented Jun 24, 2020

> I did have to use subprocess.Popen to get the stop-kubelet scenario to return properly. I'm not sure about the node_crash scenario or whether the node will go to the NotReady state; I was going off the commands Mike had laid out in the original issue. For node_crash I also had to use subprocess.Popen to get a proper response. I thought either subprocess function worked, so I didn't add it to my pull request, sorry about that. I tried the node_crash scenario from my branch and the Popen returns properly.

I tried with Popen, and it's not working as I expected. The subprocess.Popen("oc debug node/" + node + " -- chroot /host dd if=/dev/urandom of=/proc/sysrq-trigger") command finishes, and only after a few seconds does the node go to the NotReady state. I expected the node to go NotReady while the Popen command was still executing, but it went NotReady after the debug pod was removed. A few times, the node state doesn't change at all.
I am not sure what's happening.

Anyway, I have removed the node_crash scenario. You can add your code here https://github.com/openshift-scale/kraken/pull/10/files#diff-54651a20a8f57ef1c067dbaac8fdc0a6R58 once this PR is merged, since it worked right for you.

@yashashreesuresh
Contributor Author

For node crash scenarios, nodes will not immediately show as NotReady in the API. They have to miss a readiness check interval from the API server's perspective. If a node is in trouble with memory or disk pressure but still running, it can self-report the issue immediately; but for a crashed node, the cluster needs to notice it is gone before marking it NotReady.

Added back the node crash scenario that was removed.

@paigerube14
Collaborator

Is it expected that we set up the boto3 region and some other attributes before running? I did not set up boto3, and I am getting the error botocore.exceptions.NoRegionError: You must specify a region. It might be good to add this to the README, or it might be nice to specify those either in the config or in the scenario itself where you set the cloud type to aws.

@yashashreesuresh
Contributor Author

> Is it expected that we set up the boto3 region and some other attributes before running? I did not set up boto3, and I am getting the error botocore.exceptions.NoRegionError: You must specify a region. It might be good to add this to the README, or it might be nice to specify those either in the config or in the scenario itself where you set the cloud type to aws.

The boto3 client is created in the default region specified in the ~/.aws/config file. This file is created when the AWS CLI is configured. Which version of the AWS CLI are you using?
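For reference, the region boto3 falls back to lives in ~/.aws/config; a minimal file looks like this (the region value below is only an example). Alternatively, a region can be passed explicitly when the client is created, e.g. boto3.client("ec2", region_name="us-east-2").

```ini
# Minimal ~/.aws/config; boto3 reads the default region from here.
# us-east-2 is an example value, not a requirement.
[default]
region = us-east-2
```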

@paigerube14
Collaborator

> > Is it expected that we set up the boto3 region and some other attributes before running? I did not set up boto3, and I am getting the error botocore.exceptions.NoRegionError: You must specify a region. It might be good to add this to the README, or it might be nice to specify those either in the config or in the scenario itself where you set the cloud type to aws.
>
> The boto3 client is created in the default region specified in the ~/.aws/config file. This file is created when the AWS CLI is configured. Which version of the AWS CLI are you using?

Ah, I didn't have the AWS CLI installed. I definitely think this should be put in the README as an extra setup step, because it is not clear. Maybe just specify that if you are using the cloud-specific node scenarios, you'll need to set up the AWS CLI with a certain config. I followed https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html if you want to use this in the documentation.

@yashashreesuresh
Contributor Author

> Ah, I didn't have the AWS CLI installed. I definitely think this should be put in the README as an extra setup step, because it is not clear. Maybe just specify that if you are using the cloud-specific node scenarios, you'll need to set up the AWS CLI with a certain config. I followed https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html if you want to use this in the documentation.

Yeah, I will add it in the README. Thank you!

@yashashreesuresh yashashreesuresh force-pushed the node_scenarios branch 2 times, most recently from 76a768e to 8c7f59a on June 30, 2020
Collaborator

@paigerube14 paigerube14 left a comment


LGTM

@mffiedler
Collaborator

/hold
I need some time to review and test this; it has gotten pretty big. In the future, I suggest we add one scenario per PR.

@yashashreesuresh
Contributor Author

> /hold
> I need some time to review and test this; it has gotten pretty big. In the future, I suggest we add one scenario per PR.

Sure.

[Cerberus](https://github.com/openshift-scale/cerberus) can be used to monitor the cluster under test; the aggregated go/no-go signal it generates can be consumed by Kraken to determine pass/fail. This ensures the Kubernetes/OpenShift environment is healthy at the cluster level instead of just at the level of the targeted components. It is highly recommended to turn on the Cerberus health check feature available in Kraken after installing and setting up Cerberus. To do that, set cerberus_enabled to True and cerberus_url to the URL where Cerberus publishes the go/no-go signal in the config file.
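The two settings just described might look like the following in the Kraken config file (a sketch only; the key placement and the URL are illustrative, not taken from the repository):

```yaml
# Illustrative Kraken config fragment enabling the Cerberus health check.
# Key nesting and the URL are assumptions for this sketch.
cerberus_enabled: True                 # consume the Cerberus go/no-go signal
cerberus_url: http://localhost:8080    # where Cerberus publishes the signal
```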

### Kubernetes/OpenShift node chaos scenarios supported
The following node chaos scenarios are supported:
Collaborator


We might want to mention that the cloud-related scenarios like start/stop/reboot are supported only on AWS as of now.

Contributor Author


Added it.

```
self.node_reboot_scenario(instance_kill_count, node, timeout)
logging.info("stop_start_kubelet_scenario has been successfully injected!")
```

# Node scenario to crash the node
Collaborator


We might want to add a note in the readme that the node needs to be rebooted to recover from node crash and that it can be added as an action in the node scenarios config.

Contributor Author


Added it.

README.md Outdated

**NOTE**: With aws as the cloud type, make sure [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) is installed.

### Kubernetes/OpenShift chaos scenarios supported
Collaborator


We might want to move the node scenarios under "Kubernetes/OpenShift chaos scenarios supported" section.

Contributor Author


Done.

This commit:
- Adds a node scenario to stop and start an instance
- Adds a node scenario to terminate an instance
- Adds a node scenario to reboot an instance
- Adds a node scenario to stop the kubelet
- Adds a node scenario to crash the node
Collaborator

@chaitanyaenr chaitanyaenr left a comment


LGTM, tested all the scenarios and they work well as expected.

@chaitanyaenr
Collaborator

@paigerube14 @mffiedler Thoughts?



```
node_scenarios:
```
Collaborator


Should we move this to a separate README for the node scenarios?

Contributor Author


I think it would be better to keep everything in the main README, so the user gets to know about the different scenarios supported by Kraken.

@mffiedler
Collaborator

I think we should merge this and do any follow-up work needed as a separate PR.

LGTM

@chaitanyaenr chaitanyaenr merged commit 31f06b8 into krkn-chaos:master Aug 27, 2020
@yashashreesuresh yashashreesuresh deleted the node_scenarios branch November 18, 2020 15:32