
Node failures 1 #18

Closed
paigerube14 wants to merge 15 commits into krkn-chaos:master from paigerube14:node_failures_1

@paigerube14
Collaborator

This covers the stop kubelet and node crash scenarios Mike mentioned in issue #8.

I based much of the general setup of the node kill scenario YAML file on the pull request below; the two could be combined at some point: https://github.com/openshift-scale/kraken/pull/10/files

These 2 scenarios are not cloud specific.
This is a first pass at stopping the kubelet. I was hoping to add the ability to stop, wait, and then restart the kubelet, but I was not able to get it to work properly.
I also had to find a different command than the one Mike mentioned for the fork bomb scenario.

@rht-perf-ci

Can one of the admins verify this patch?

@mffiedler
Collaborator

@paigerube14 please check the travis-ci failure - looks like a style issue: https://travis-ci.com/github/openshift-scale/kraken/builds/170789518

@yashashreesuresh
Contributor

Hey @paigerube14, I had been working on the stop kubelet and node crash scenarios, as I was assigned the node chaos scenarios issue #8. However, I was not able to push that work to my PR because I got sidetracked with issues in cerberus.

@paigerube14
Collaborator Author

Hey @yashashreesuresh, ah sorry about that. I had seen your pull request that covered the first couple of scenarios and figured you had not gotten to the other 2 yet so was going to try to help out. What do you want to do here? I would love to see what you were working on here and I can add in some things you had worked on to get this pull request merged for these 2 scenarios. Or I can just drop this pull request and you can fully cover the issue in your pull request. @mffiedler @chaitanyaenr Would love some input here.

@mffiedler
Collaborator

mffiedler commented Jun 11, 2020

@chaitanyaenr @yashashreesuresh Thoughts on moving forward? Keep working in this PR, or move the work over to #10?

(we should also assign issues to avoid a scenario like this - feel free to self-assign when you start work)

@chaitanyaenr
Collaborator

Agreed that we should assign the issues to ourselves to avoid any confusion. I totally forgot to do that, sorry, my bad.

I think both PRs are valid and we can keep working on them in parallel since they cover different node scenarios. #10 covers the node reboot, stop, and terminate scenarios, while this PR covers stopping the kubelet and fork bombing a node. We might want to sync up on the code structure/design for consuming node scenarios before moving forward, though, to make sure the two stay consistent.

Thoughts?

@paigerube14
Collaborator Author

Definitely agree that we need to sync up the code for all node scenarios. It sounded like the pull requests cover different scenarios, but Yashashree might already have been working on the 2 I covered as well, just not in the pull request.

@chaitanyaenr
Collaborator

Sorry I missed that, we can move the work to #10 then if it's okay with everyone.

@yashashreesuresh
Contributor

yashashreesuresh commented Jun 12, 2020

> Hey @yashashreesuresh, ah sorry about that. I had seen your pull request that covered the first couple of scenarios and figured you had not gotten to the other 2 yet so was going to try to help out. What do you want to do here? I would love to see what you were working on here and I can add in some things you had worked on to get this pull request merged for these 2 scenarios. Or I can just drop this pull request and you can fully cover the issue in your pull request. @mffiedler @chaitanyaenr Would love some input here.

I have been working on AWS node scenarios, using boto to start, stop, restart, and terminate the nodes. In each of my node scenarios I execute the scenario and then check that the node is healthy again. For the kubelet scenario, I have used the restart module of boto to restart the kubelet after it is stopped, so the entire stop-and-restart kubelet scenario becomes cloud specific. However, if you could add the functions kubelet_action and crash_node under the same directory structure, kraken/node_actions/common_node_functions.py, I can call them from my functions, which include the restart and the check that the node is healthy again. This way, both of us can contribute to the kubelet and node crash scenarios. :)
There’s no option for me to add a WIP label to the PR or to assign the issue to myself; otherwise I would have done so, which would have avoided this confusion.
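
For illustration only, a shared, cloud-agnostic helper of the kind proposed above might look roughly like the sketch below. Only the name `kubelet_action` comes from this thread; its signature, the `run` and `sleep` parameters, the helpers `run_command`, `node_is_ready`, and `wait_for_ready`, and the use of `oc debug node/<name> -- chroot /host systemctl ...` to reach the kubelet are all assumptions, not the code from this PR or from #10. `crash_node` is deliberately left out.

```python
import subprocess
import time
from typing import Callable, Sequence

# Hypothetical type: any callable that runs a command and returns its stdout.
Runner = Callable[[Sequence[str]], str]


def run_command(cmd: Sequence[str]) -> str:
    """Run a command, returning stdout; raises CalledProcessError on failure."""
    result = subprocess.run(list(cmd), check=True, capture_output=True, text=True)
    return result.stdout


def kubelet_action(action: str, node: str, run: Runner = run_command) -> str:
    """Apply a systemctl action to the kubelet on `node` (assumed approach)."""
    if action not in ("stop", "start", "restart"):
        raise ValueError(f"unsupported kubelet action: {action}")
    # Assumption: the node is reached through an OpenShift debug pod.
    cmd = ["oc", "debug", f"node/{node}", "--",
           "chroot", "/host", "systemctl", action, "kubelet"]
    return run(cmd)


def node_is_ready(node: str, run: Runner = run_command) -> bool:
    """Return True when the node's Ready condition reports status 'True'."""
    jsonpath = 'jsonpath={.status.conditions[?(@.type=="Ready")].status}'
    return run(["oc", "get", "node", node, "-o", jsonpath]).strip() == "True"


def wait_for_ready(node: str, timeout: float = 300, interval: float = 5,
                   run: Runner = run_command,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the node is Ready or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if node_is_ready(node, run):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Passing the command runner in as a parameter keeps the helper independent of any cloud SDK, so a cloud-specific scenario could call `kubelet_action("stop", node)`, perform its own restart, and then reuse `wait_for_ready(node)` for the health check.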

@mffiedler
Collaborator

@paigerube14 @chaitanyaenr I think we can close this in favor of the current AWS node scenario support. Agree? We can continue Azure and GCP support using #40 and #41. If there are no objections, I will close this out.

@paigerube14
Collaborator Author

> @paigerube14 @chaitanyaenr I think we can close this in favor of the current AWS node scenario support. Agree? We can continue Azure and GCP support using #40 and #41 . If no objections I will close this out.

Yes, I agree that this can be closed. This was partially included in Yashashree's PR.

@chaitanyaenr
Collaborator

+1, I agree that this can be closed as well.

5 participants