Added cluster shut down scenario #25
yashashreesuresh wants to merge 1 commit into krkn-chaos:master
Conversation
This PR is based on node_scenarios PR #10, as I am reusing the AWS node stop and start functions.

Can one of the admins verify this patch?
kraken/invoke/command.py
Outdated
    try:
        subprocess.run(command, shell=True, universal_newlines=True, timeout=45)
    except Exception:
        logging.error("Failed to run %s" % (command))
I think you should keep a logging statement here and not just pass
This function was added just for the stop_kubelet node scenario. Once this command is executed, the kubelet stops even if the command exceeds the timeout (set here to 45s), so an exception indicating 'failed to run command' wouldn't be meaningful in that case.
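The behavior described above can be sketched as a small wrapper that treats a timeout as benign for fire-and-forget commands (stopping the kubelet severs the session, so the command never returns cleanly even though it succeeded) while still logging genuine failures. This is a hedged sketch, not the PR's exact code; the function name and defaults are illustrative.

```python
import logging
import subprocess

def run_fire_and_forget(command, timeout=45):
    """Run a shell command where hitting the timeout is expected and benign."""
    try:
        subprocess.run(command, shell=True, universal_newlines=True,
                       timeout=timeout)
    except subprocess.TimeoutExpired:
        # Expected for commands like 'systemctl stop kubelet': the effect
        # has already landed, so note it at info level and move on.
        logging.info("Command '%s' hit the %ss timeout (expected)", command, timeout)
    except Exception:
        logging.error("Failed to run %s", command)
```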
    try:
        ret = cli.list_node(pretty=True)
        if label_selector:
            ret = cli.list_node(pretty=True, label_selector=label_selector)
I do not see anywhere where you call list_nodes and pass it a label_selector. Is that correct?
Yeah, previously list_nodes was used at that line instead of list_killable_node.
I have just a few small comments, but I was able to run this cluster shut down scenario the other day and it worked perfectly!
    for node in nodes:
        cloud_object.stop_instances(node_id[node])
    logging.info("Waiting for 250s to shut down all the nodes")
    time.sleep(250)
Are we able to have the user set this in their config or use 250 as a default? Is there a specific reason we chose 250 seconds here?
There's no specific reason; it took around 2 minutes on a 10-node cluster, so I chose 250 seconds to accommodate clusters of bigger sizes. The start_instance function can be called on a node only when it is in the stopped state, otherwise it throws an error. However, I have added a try/except so that we sleep for 10 additional seconds when a node isn't in the stopped state even after 250 seconds.
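An alternative to a fixed 250-second sleep plus a 10-second retry is a poll loop that waits until every node reports stopped, up to a deadline. A minimal sketch, assuming a caller-supplied `is_stopped` predicate (hypothetical) that wraps the cloud API's instance-state query:

```python
import time

def wait_until_stopped(is_stopped, node_ids, timeout=250, poll=10):
    """Poll until is_stopped(node) is True for every node, or the deadline
    passes. Returns True when all nodes stopped in time, False otherwise."""
    deadline = time.time() + timeout
    pending = set(node_ids)
    while pending and time.time() < deadline:
        # Drop nodes that have reached the stopped state
        pending = {n for n in pending if not is_stopped(n)}
        if pending:
            time.sleep(poll)
    return not pending
```

This keeps small clusters fast while still bounding the wait for large ones, and start_instance would then only be called on nodes the cloud API confirms are stopped.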
    stopped_nodes = nodes - restarted_nodes
    logging.info("Waiting for 250s to allow cluster component initialization")
    time.sleep(250)
    logging.info("Successfully injected cluster_shut_down scenario!")
Are we able to add in a verification here that the nodes are all back up and ready? Or is that too much for kraken, and it should just be handled in cerberus? Thoughts?
I think this part would be handled by cerberus. With cerberus integration, when we receive a true, kraken proceeds with the next scenario, indicating all the nodes are ready; with a false, we terminate kraken, indicating some components aren't healthy. But it can be explicitly specified after this line when cerberus integration is enabled, if needed. Thoughts?
@yashashreesuresh Can we rebase the code please?
This commit adds a scenario to shut down all the nodes including the masters and restarts them after a specified duration.
@chaitanyaenr I have rebased the code, but I was not able to test it completely as I don't have a cluster. After this PR gets merged, cluster_shut_down can be moved to a separate file where cluster_shut_down scenarios for other cloud types can be added as well.
@yashashreesuresh Thanks for rebasing the code. No worries about the cluster; will test the scenario and let you know how it goes.
    Key Members(slack_usernames): paigerube14, rook, mffiedler, mohit, dry923, rsevilla, ravi
    * [**#sig-scalability on Kubernetes Slack**](https://kubernetes.slack.com)
    * [**#forum-perfscale on CoreOS Slack**](https://coreos.slack.com)
Is this an open Slack channel that anyone can get on?
    - litmus_scenarios:    # List of litmus scenarios to load
        - - https://hub.litmuschaos.io/api/chaos/1.10.0?file=charts/generic/node-cpu-hog/rbac.yaml
          - scenarios/node_hog_engine.yaml
    - cluster_shut_down_scenario:
The scenario type in run_kraken is looking for a scenario type that ends with an "s". Just need to add an s here to make it: cluster_shut_down_scenarios
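With that fix, the config entry would look something like the following sketch (the runs, shut_down_duration, and cloud_type keys appear in the diff below; the nesting and values are illustrative, not kraken's exact schema):

```yaml
- cluster_shut_down_scenarios:
    - cluster_shut_down_scenario:
        runs: 1                    # number of times to run the scenario
        shut_down_duration: 120    # seconds to keep the cluster down
        cloud_type: aws
```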
    runs = shut_down_config["runs"]
    shut_down_duration = shut_down_config["shut_down_duration"]
    cloud_type = shut_down_config["cloud_type"]
    if cloud_type == "aws":
Can we add in the other cloud types that have been added?
In addition, could we add an else case for when the cloud type is not supported, so the scenario doesn't run and instead exits with an error message?
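The suggested else case could look roughly like this sketch (the SUPPORTED_CLOUDS set and the function name are illustrative, not kraken's actual API):

```python
import logging
import sys

# Cloud types with a cluster_shut_down implementation (illustrative set)
SUPPORTED_CLOUDS = {"aws", "azure", "gcp", "openstack"}

def validate_cloud_type(cloud_type):
    """Exit with a clear error up front instead of failing later when the
    configured cloud type has no cluster_shut_down implementation."""
    if cloud_type not in SUPPORTED_CLOUDS:
        logging.error("Cloud type %s is not supported for cluster_shut_down", cloud_type)
        sys.exit(1)
    return cloud_type
```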
Based most of my coding on this PR and got it merged for AWS, Azure, OpenStack and GCP