Added node chaos scenarios #10
Can one of the admins verify this patch?
The current implementation uses an abstract class; the class for each cloud type inherits from it. With this, the node scenario functions can be called only after creating an object of the class. The advantage of using an abstract class is that it ensures every cloud type supports all the scenarios. But I think it would be easier to call the functions directly without creating an object. Thoughts? @mffiedler @paigerube14
I think using the abstract class to allow for cloud-specific scenario implementations, where the cloud instance APIs are required, is the correct approach. We also need to allow for scenarios that are not cloud specific (for an example, see the node crash scenarios that use oc debug node in the PR @paigerube14 started in #18). Where possible, a default/common implementation of the scenario would be beneficial, and it can be overridden by a cloud-specific implementation where needed. Hope that makes sense.
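The layout described above (abstract methods for cloud-specific scenarios, plus a common default that a cloud class may override) can be sketched roughly as follows. The class and method names here are hypothetical and chosen only for illustration; they are not taken from the PR.

```python
import logging
from abc import ABC, abstractmethod


class NodeScenarios(ABC):
    """Hypothetical base class; names are illustrative, not Kraken's actual API."""

    @abstractmethod
    def node_stop_scenario(self, node, timeout):
        """Cloud-specific: every cloud type must provide its own implementation."""

    def stop_kubelet_scenario(self, node, timeout):
        """Common default: not cloud specific, so subclasses inherit it unchanged."""
        logging.info("Stopping kubelet on %s through a cloud-agnostic path", node)
        return "common"


class AwsNodeScenarios(NodeScenarios):
    """Hypothetical cloud class that only fills in the cloud-specific part."""

    def node_stop_scenario(self, node, timeout):
        logging.info("Stopping the instance backing %s through the cloud API", node)
        return "aws"


class IncompleteScenarios(NodeScenarios):
    """Omits node_stop_scenario, so creating an instance raises TypeError."""
```

With this shape, a cloud type that forgets a required scenario fails at instantiation time, while scenarios that do not need a cloud API are written once in the base class.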
I completely agree with Mike. I think the abstract class might actually end up making it easier for all the cloud providers to share a common layout. @yashashreesuresh did you want me to add my 2 functions (kubelet_action and crash_node) for the oc debug parts that Mike is talking about to your branch? How did you want to do that?
@paigerube14 I have added the kubelet scenario. The subprocess.check_output call (https://github.com/openshift-scale/kraken/pull/10/files#diff-d7c8b18dc1f8fd23bc1e62bdee0ccdceL8) never runs to completion because once the kubelet is stopped, the debug pod never terminates. Therefore, I had to use the subprocess.run command (https://github.com/openshift-scale/kraken/pull/10/files#diff-d7c8b18dc1f8fd23bc1e62bdee0ccdceR15) instead.
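One general way to keep a caller from blocking on a command that never exits is to give subprocess.run a timeout. The sketch below is an assumption about how that could look, not a copy of what the PR does; run_with_timeout is a hypothetical helper, and a sleeping Python process stands in for the hanging debug pod.

```python
import subprocess
import sys


def run_with_timeout(command, timeout):
    """Run command (a list of arguments).

    Returns True if the process exited within `timeout` seconds, or False if
    subprocess.run killed it after the timeout expired.
    """
    try:
        subprocess.run(
            command,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return True
    except subprocess.TimeoutExpired:
        return False


# Stand-in for a command that does not return on its own.
hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
finished = run_with_timeout(hanging, timeout=1)
```

When the timeout expires, subprocess.run kills the child process and raises subprocess.TimeoutExpired, which the helper turns into a False return value.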
All the scenarios are added, PTAL @chaitanyaenr
I did have to use subprocess.Popen to get the stop-kubelet command to return properly. I'm not sure about the node_crash scenario, or whether the node will go to the NotReady state; I was going off the commands Mike had laid out in the original issue. I think I also had to use subprocess.Popen for the node_crash scenario to get a proper response. I thought either subprocess function worked, so I didn't add it to my pull request, sorry about that. I tried the node_crash scenario from my branch and the Popen call returns properly. It might take some time, but the node I tried to kill did end up in the NotReady state:
I tried with Popen, but it's not working as I expected. The subprocess.Popen("oc debug node/" + node + " -- chroot /host dd if=/dev/urandom of=/proc/sysrq-trigger") command finishes, and only a few seconds later does the node go to the NotReady state. I expected the node to go to NotReady while the Popen command was still executing, but it went to NotReady only after the debug pod was removed. A few times, the node state didn't change at all. Anyway, I have removed the node_crash_scenario. Since it worked for you, you can add your code here https://github.com/openshift-scale/kraken/pull/10/files#diff-54651a20a8f57ef1c067dbaac8fdc0a6R58 once this PR is merged.
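Because the node may change state some seconds after the command returns, a scenario can poll for the expected status instead of assuming it changes while the command runs. The following is a minimal sketch under that assumption; wait_for_node_status and its get_status callable are hypothetical names, and how the status is actually read from the cluster is left to the caller.

```python
import time


def wait_for_node_status(get_status, expected, timeout, interval=1.0):
    """Poll get_status() until it returns `expected` or `timeout` seconds pass.

    get_status is any zero-argument callable that returns the node's current
    status string. Returns True if the expected status was seen, else False.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_status() == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Returning False on timeout lets the caller decide whether an unchanged node state counts as a failed injection.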
Added back the node crash scenario that was removed.
Is it expected that we set up the boto3 region and some other attributes before running? I did not set up boto3 and I am getting an error.
boto3 client is created in the |
Ah, I didn't have the AWS CLI installed. I definitely think this should go in the README as an extra setup step, because it is not clear. Maybe just specify that to use the cloud-specific node scenarios you'll need to set up the AWS CLI with a certain config. I followed https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html if you want to use this in the documentation.
Yeah, I will add it in the README. Thank you! |
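Until the README covers it, a preflight check along these lines could surface the missing setup earlier. This is only a sketch: aws_region_configured is a hypothetical helper, and it relies on the fact that boto3 can take its region from the AWS_DEFAULT_REGION environment variable or from the shared config file (~/.aws/config by default, or the path in AWS_CONFIG_FILE).

```python
import configparser
import os


def aws_region_configured(environ=None, config_path=None):
    """Return True if a default AWS region appears to be configured.

    Checks the AWS_DEFAULT_REGION environment variable first, then the
    `region` key of the active profile in the shared AWS config file.
    """
    environ = os.environ if environ is None else environ
    if environ.get("AWS_DEFAULT_REGION"):
        return True

    path = (
        config_path
        or environ.get("AWS_CONFIG_FILE")
        or os.path.expanduser("~/.aws/config")
    )
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently ignored

    # Named profiles are stored as "[profile NAME]"; the default is "[default]".
    profile = environ.get("AWS_PROFILE", "default")
    section = "default" if profile == "default" else "profile " + profile
    return parser.has_option(section, "region")
```

A caller could log a clear message pointing at the AWS CLI setup instructions when this returns False, rather than failing later inside the scenario.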
/hold
Sure.
| [Cerberus](https://github.com/openshift-scale/cerberus) can be used to monitor the cluster under test and the aggregated go/no-go signal generated by it can be consumed by Kraken to determine pass/fail. This is to make sure the Kubernetes/OpenShift environments are healthy on a cluster level instead of just the targeted components level. It is highly recommended to turn on the Cerberus health check feature available in Kraken after installing and setting up Cerberus. To do that, set cerberus_enabled to True and cerberus_url to the URL where Cerberus publishes the go/no-go signal in the config file.
| ### Kubernetes/OpenShift node chaos scenarios supported
| Following node chaos scenarios are supported: |
We might want to mention that the cloud-related scenarios like start/stop/reboot are supported only for AWS as of now.
| self.node_reboot_scenario(instance_kill_count, node, timeout)
| logging.info("stop_start_kubelet_scenario has been successfully injected!")
| # Node scenario to crash the node |
We might want to add a note in the README that the node needs to be rebooted to recover from a node crash, and that the reboot can be added as an action in the node scenarios config.
README.md
| **NOTE**: With aws as the cloud type, make sure [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) is installed.
| ### Kubernetes/OpenShift chaos scenarios supported |
We might want to move the node scenarios under the "Kubernetes/OpenShift chaos scenarios supported" section.
This commit:
- Adds a node scenario to stop and start an instance
- Adds a node scenario to terminate an instance
- Adds a node scenario to reboot an instance
- Adds a node scenario to stop the kubelet
- Adds a node scenario to crash the node
chaitanyaenr left a comment
LGTM, tested all the scenarios and they work as expected.
@paigerube14 @mffiedler Thoughts?
| ``` | ||
| node_scenarios: |
Should we put this in a separate README for the node scenarios?
I think it would be better to keep everything in the main README, so that the user learns about all the scenarios Kraken supports in one place.
I think we should merge this and do any follow-up work in a separate PR. LGTM
This commit:
Fixes: #8