HDDS-7645. Kubernetes check should fail fast if cluster cannot start #5028

adoroszlai · 2023-07-06T07:47:21Z

What changes were proposed in this pull request?

The Kubernetes check currently proceeds to execute tests even if the cluster is unable to start. It should exit without trying to run the tests.

**** Waiting until the k8s cluster is running ****

...
4 pods are running out from the 5
100 'all_pods_are_running' is failed...

**** Executing robot tests scm-0 ****

Defaulted container "scm" out of: scm, init (init)
Unable to use a TTY - input is not a terminal or the right kind of file
error: unable to upgrade connection: container not found ("scm")

This change skips the tests if the cluster fails to start up.
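The intended control flow can be sketched roughly as follows. This is only an illustration under stated assumptions: `start_cluster`, `run_tests`, `collect_logs`, `stop_cluster` and the `START_RESULT` variable are hypothetical stand-ins, not names taken from the Ozone scripts.

```shell
#!/usr/bin/env bash
# Sketch only: every function below is a hypothetical stand-in,
# not code from kubernetes.sh or testlib.sh.

start_cluster() { return "${START_RESULT:-0}"; }  # non-zero simulates a failed startup
run_tests()     { echo "running tests"; }
collect_logs()  { echo "collecting logs"; }
stop_cluster()  { echo "stopping cluster"; }

run_check() {
  local rc=0
  start_cluster || rc=$?
  if [ "$rc" -eq 0 ]; then
    # Tests run only when startup succeeded.
    run_tests || rc=$?
  else
    echo "cluster failed to start, skipping tests" >&2
  fi
  # Logs are collected and the cluster is shut down in both cases.
  collect_logs
  stop_cluster
  return "$rc"
}
```

The point of the shape is that the startup result decides whether tests run at all, while log collection and teardown stay unconditional.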

Also:

Fix the "-1 pods are running" message (caused by a hard-coded subtraction intended to account for the header row of kubectl get pod output)
Allow a custom number of retry attempts (for easier testing)
Reduce code duplication in test scripts
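One way to avoid the off-by-one is to count pods from header-free output instead of subtracting 1 for the header row. The sketch below is an assumption about how that could look, not the code in this patch: `count_pods` and `pods_status` are hypothetical names, and the input is expected to be in the shape produced by `kubectl get pod --no-headers` (status in the third column).

```shell
#!/usr/bin/env bash
# Sketch only: count_pods and pods_status are hypothetical helpers.

# Reads a header-free pod listing on stdin, prints "<running> <all>".
# Blank lines are ignored, so empty input yields "0 0" rather than -1.
count_pods() {
  awk 'NF { all++; if ($3 == "Running") running++ }
       END { printf "%d %d\n", running, all }'
}

# Reads "<running> <all>" on stdin and reports progress.
# Returns 0 only when at least one pod exists and all pods are running.
pods_status() {
  local running all
  read -r running all
  if [ "${all:-0}" -eq 0 ]; then
    echo "0 pods are running. Waiting for more."
    return 1
  elif [ "$running" -ne "$all" ]; then
    echo "$running / $all pods are running"
    return 2
  fi
  return 0
}

# Possible usage (requires a cluster, so it is left as a comment):
#   kubectl get pod --no-headers | count_pods | pods_status
```

Because the count never depends on a header line being present, an empty listing is reported as 0 pods.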

https://issues.apache.org/jira/browse/HDDS-7645

How was this patch tested?

Triggered a cluster startup "error" by setting a low number of retry attempts. Verified that tests are not attempted, logs are collected, and the cluster is shut down:

$ RETRY_ATTEMPTS=5 OZONE_TEST_SELECTOR=getting-started ./hadoop-ozone/dev-support/checks/kubernetes.sh
...

**** Applying k8s resources from getting-started ****

...

**** Waiting until the k8s cluster is running ****

No resources found in default namespace.
0 pods are running. Waiting for more.
1 'all_pods_are_running' is failed...
3 pods are running out from the 5
2 'all_pods_are_running' is failed...
5 pods are running out from the 6
3 'all_pods_are_running' is failed...
1 'grep_log scm-0 SCM exiting safe mode.' is failed...
2 'grep_log scm-0 SCM exiting safe mode.' is failed...
3 'grep_log scm-0 SCM exiting safe mode.' is failed...
4 'grep_log scm-0 SCM exiting safe mode.' is failed...
5 'grep_log scm-0 SCM exiting safe mode.' is failed...

**** Collecting container logs ****


**** Deleting k8s resources ****

configmap "config" deleted
service "datanode" deleted
service "datanode-public" deleted
service "om" deleted
service "om-public" deleted
service "s3g" deleted
service "s3g-public" deleted
service "scm" deleted
service "scm-public" deleted
statefulset.apps "datanode" deleted
statefulset.apps "om" deleted
statefulset.apps "s3g" deleted
statefulset.apps "scm" deleted

$ ls -1 hadoop-ozone/dist/target/ozone-1.4.0-SNAPSHOT/kubernetes/examples/getting-started/logs
pod-datanode-0.log
pod-datanode-1.log
pod-datanode-2.log
pod-om-0.log
pod-s3g-0.log
pod-scm-0-init.log
pod-scm-0.log

With even fewer retries:

$ RETRY_ATTEMPTS=2 OZONE_TEST_SELECTOR=getting-started ./hadoop-ozone/dev-support/checks/kubernetes.sh  
...

**** Waiting until the k8s cluster is running ****

No resources found in default namespace.
0 pods are running. Waiting for more.
1 'all_pods_are_running' is failed...
3 pods are running out from the 5
2 'all_pods_are_running' is failed...

**** Collecting container logs ****

...
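The numbered "is failed..." lines in the output above are consistent with a bounded retry loop whose limit comes from `RETRY_ATTEMPTS`. A minimal sketch of such a loop is shown below. It is not the implementation in testlib.sh: the default of 100 is an assumption based on the attempt count in the original failure log, and `RETRY_SLEEP` is a hypothetical variable added here so the delay can be shortened.

```shell
#!/usr/bin/env bash
# Sketch only: not the retry function from testlib.sh.

retry() {
  local attempts="${RETRY_ATTEMPTS:-100}"  # assumed default
  local sleep_s="${RETRY_SLEEP:-3}"        # hypothetical variable
  local n=0
  until [ "$n" -ge "$attempts" ]; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    n=$((n + 1))
    echo "$n '$*' is failed..."
    sleep "$sleep_s"
  done
  # All attempts were used up: report failure to the caller.
  return 1
}
```

Returning non-zero after the last attempt is what lets the caller skip the tests instead of continuing against a cluster that never became ready.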

Regular CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/5472252928

GeorgeJahad · 2023-07-06T20:38:59Z

hadoop-ozone/dist/src/main/k8s/examples/testlib.sh

-   elif [ "$RUNNING_COUNT" -ne "$ALL_COUNT" ]; then
-      echo "$RUNNING_COUNT pods are running out from the $ALL_COUNT"
+   elif [ "$running" -ne "$all" ]; then
+      echo "$running pods are running out from the $all"


Suggested change

-      echo "$running pods are running out from the $all"
+      echo "$running pods are running out of $all"

adoroszlai · Jul 6, 2023

Good point, this has been bothering me for a while. How about:

echo "$running / $all pods are running"

?

GeorgeJahad · Jul 6, 2023

Yes, that is good too. I just found the current message confusing and had to read it a couple of times to follow it.

GeorgeJahad

lgtm

adoroszlai · 2023-07-07T08:27:07Z

Thanks @GeorgeJahad for the review.

* master: (36 commits)
  HDDS-8990. Intermittent timeout waiting on datanode4 9856 to become available (apache#5039)
  Revert "HDDS-7750. Incorrect WRITE ACL check. (apache#4992)"
  HDDS-7750. Incorrect WRITE ACL check. (apache#4992)
  HDDS-8985. Intermittent timeout exiting safe mode in HA secure tests (apache#5033)
  HDDS-8593. Add RootCARotationPoller to CertClient (apache#5030)
  HDDS-7645. Kubernetes check should fail fast if cluster cannot start (apache#5028)
  HDDS-8981. TestRootedOzoneFileSystem runs out of disk space (apache#5029)
  HDDS-8592. Fetch and save all root certificates during service's certificate rotation. (apache#5025)
  HDDS-8981. Disable TestRootedOzoneFileSystem#testSafeMode
  HDDS-8591. Create scheduler to check for new root ca certificates (apache#4961)
  HDDS-8979. error validating kustomization.yaml (apache#5024)
  HDDS-8973. Ozone SCM HA should not allocates duplicate IDs when transferring leadership (apache#5018)
  HDDS-8970. Snapshot Diff should return path relative to bucket root (apache#5015)
  HDDS-8975. Clarify SCM HA auto-bootstrap doc (apache#5021)
  HDDS-8689. Rotate Root CA and Sub CA in SCM. (apache#4943)
  HDDS-8436. Support setSafeMode(), isFileClosed() FileSystem API (apache#4825)
  HDDS-8880. Intermittent fork timeout in TestOMRatisSnapshots (apache#5022)
  HDDS-8962. Ensure docker env is stopped (apache#5011)
  HDDS-7794. [snapshot] SnapshotDiff should throw better error messages for exception handling (apache#5007)
  HDDS-7922. [FSO] S3G folder support fso layout filestatus s3A compatibility (apache#4448)
  ...

adoroszlai added 3 commits July 6, 2023 08:23

Fix for -1 pods are running

e7222a7

Allow custom number of retry attempts

5a558ad

HDDS-7645. Kubernetes check should fail fast if cluster cannot start

eff6de9

adoroszlai self-assigned this Jul 6, 2023

adoroszlai requested a review from GeorgeJahad July 6, 2023 09:07

GeorgeJahad reviewed Jul 6, 2023

GeorgeJahad approved these changes Jul 6, 2023

adoroszlai added 2 commits July 6, 2023 23:03

improve X pods are running message

5af41c6

Merge remote-tracking branch 'origin/master' into HDDS-7645

60fe693

adoroszlai merged commit 41d9e03 into apache:master Jul 7, 2023

adoroszlai deleted the HDDS-7645 branch July 7, 2023 08:26
