Added cluster shut down scenario #25
yashashreesuresh wants to merge 1 commit into krkn-chaos:master
Conversation
This PR is based on node_scenarios PR #10, as I am reusing the AWS node stop and start functions.

Can one of the admins verify this patch?
kraken/invoke/command.py
Outdated
    try:
        subprocess.run(command, shell=True, universal_newlines=True, timeout=45)
    except Exception:
        logging.error("Failed to run %s" % (command))
I think you should keep a logging statement here and not just pass
This function was added just for the stop_kubelet node scenario. Once this command is executed, the kubelet stops even if the command exceeds the timeout (set here to 45s), so an exception indicating 'failed to run command' wouldn't be meaningful in that case.
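The behavior described above can be sketched as a small wrapper that treats a timeout as benign for fire-and-forget commands (stopping the kubelet severs the session, so the command never returns cleanly even though it succeeded) while still logging genuine failures. This is a hedged sketch, not the PR's exact code; the function name and defaults are illustrative.

```python
import logging
import subprocess

def run_fire_and_forget(command, timeout=45):
    """Run a shell command where hitting the timeout is expected and benign."""
    try:
        subprocess.run(command, shell=True, universal_newlines=True,
                       timeout=timeout)
    except subprocess.TimeoutExpired:
        # Expected for commands like 'systemctl stop kubelet': the effect
        # has already landed, so note it at info level and move on.
        logging.info("Command '%s' hit the %ss timeout (expected)", command, timeout)
    except Exception:
        logging.error("Failed to run %s", command)
```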
    try:
        ret = cli.list_node(pretty=True)
        if label_selector:
            ret = cli.list_node(pretty=True, label_selector=label_selector)
I do not see anywhere where you call list_nodes and pass it a label_selector. Is that correct?
Yeah, previously list_nodes was used at that line instead of list_killable_node.
I have just a few small comments, but I was able to run this cluster shut down scenario the other day and it worked perfectly!
    for node in nodes:
        cloud_object.stop_instances(node_id[node])
    logging.info("Waiting for 250s to shut down all the nodes")
    time.sleep(250)
Are we able to have the user set this in their config or use 250 as a default? Is there a specific reason we chose 250 seconds here?
There's no specific reason; it took around 2 minutes on a 10-node cluster, so I chose 250 seconds to accommodate clusters of bigger sizes. The start_instance function can be called on a node only when it is in the stopped state, otherwise it throws an error. However, I have added a try/except so that we sleep for 10 additional seconds when a node isn't in the stopped state even after 250 seconds.
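An alternative to a fixed 250-second sleep plus a 10-second retry is a poll loop that waits until every node reports stopped, up to a deadline. A minimal sketch, assuming a caller-supplied `is_stopped` predicate (hypothetical) that wraps the cloud API's instance-state query:

```python
import time

def wait_until_stopped(is_stopped, node_ids, timeout=250, poll=10):
    """Poll until is_stopped(node) is True for every node, or the deadline
    passes. Returns True when all nodes stopped in time, False otherwise."""
    deadline = time.time() + timeout
    pending = set(node_ids)
    while pending and time.time() < deadline:
        # Drop nodes that have reached the stopped state
        pending = {n for n in pending if not is_stopped(n)}
        if pending:
            time.sleep(poll)
    return not pending
```

This keeps small clusters fast while still bounding the wait for large ones, and start_instance would then only be called on nodes the cloud API confirms are stopped.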
    stopped_nodes = nodes - restarted_nodes
    logging.info("Waiting for 250s to allow cluster component initialization")
    time.sleep(250)
    logging.info("Successfully injected cluster_shut_down scenario!")
Are we able to add in a verification here that the nodes are all back up and ready? Or is that too much for kraken, and it should just be handled in cerberus? Thoughts?
I think this part would be handled by cerberus. With cerberus integration, when we receive a true, kraken proceeds with the next scenario, indicating all the nodes are ready; with a false, we terminate kraken, indicating some components aren't healthy. But it can be explicitly specified after this line when cerberus integration is enabled, if needed. Thoughts?
@yashashreesuresh Can we rebase the code please?
This commit adds a scenario to shut down all the nodes including the masters and restarts them after a specified duration.
@chaitanyaenr I have rebased the code, but I was not able to test it completely as I don't have a cluster. After this PR gets merged, cluster_shut_down can be moved to a separate file where cluster_shut_down scenarios for other cloud types can be added as well.
@yashashreesuresh Thanks for rebasing the code. No worries about the cluster; will test the scenario and let you know how it goes.
    Key Members(slack_usernames): paigerube14, rook, mffiedler, mohit, dry923, rsevilla, ravi
    * [**#sig-scalability on Kubernetes Slack**](https://kubernetes.slack.com)
    * [**#forum-perfscale on CoreOS Slack**](https://coreos.slack.com)
Is this an open Slack channel that anyone can get on?
    - litmus_scenarios:    # List of litmus scenarios to load
        - - https://hub.litmuschaos.io/api/chaos/1.10.0?file=charts/generic/node-cpu-hog/rbac.yaml
          - scenarios/node_hog_engine.yaml
    - cluster_shut_down_scenario:
The scenario type in run_kraken is looking for a scenario type that ends with an "s". Just need to add an s here to make it: cluster_shut_down_scenarios
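With that fix, the config entry would look something like the following sketch (the runs, shut_down_duration, and cloud_type keys appear in the diff below; the nesting and values are illustrative, not kraken's exact schema):

```yaml
- cluster_shut_down_scenarios:
    - cluster_shut_down_scenario:
        runs: 1                    # number of times to run the scenario
        shut_down_duration: 120    # seconds to keep the cluster down
        cloud_type: aws
```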
    runs = shut_down_config["runs"]
    shut_down_duration = shut_down_config["shut_down_duration"]
    cloud_type = shut_down_config["cloud_type"]
    if cloud_type == "aws":
Can we add in the other cloud types that have been added?
In addition, could we add an else case for when the cloud type is not supported, so the scenario doesn't run and instead exits with an error message?
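The suggested else case could look roughly like this sketch (the SUPPORTED_CLOUDS set and the function name are illustrative, not kraken's actual API):

```python
import logging
import sys

# Cloud types with a cluster_shut_down implementation (illustrative set)
SUPPORTED_CLOUDS = {"aws", "azure", "gcp", "openstack"}

def validate_cloud_type(cloud_type):
    """Exit with a clear error up front instead of failing later when the
    configured cloud type has no cluster_shut_down implementation."""
    if cloud_type not in SUPPORTED_CLOUDS:
        logging.error("Cloud type %s is not supported for cluster_shut_down", cloud_type)
        sys.exit(1)
    return cloud_type
```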
Based most of my coding on this PR and got it merged for AWS, Azure, OpenStack and GCP