@adoroszlai
Contributor

What changes were proposed in this pull request?

The Kubernetes check currently goes on to execute the tests even if the cluster is unable to start up. It should exit without trying to run them.

**** Waiting until the k8s cluster is running ****

...
4 pods are running out from the 5
100 'all_pods_are_running' is failed...

**** Executing robot tests scm-0 ****

Defaulted container "scm" out of: scm, init (init)
Unable to use a TTY - input is not a terminal or the right kind of file
error: unable to upgrade connection: container not found ("scm")

This change skips the tests if the cluster fails to start up.
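
The fail-fast flow described above could look roughly like the sketch below. It is not the code from the patch: every function name here (start_cluster, run_tests, collect_logs, stop_cluster, run_check) is a hypothetical stand-in, and the stubs exist only so the sketch runs without a cluster. The one behavior taken from the description is that tests are skipped on startup failure while log collection and cleanup still happen.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; none of these names are taken from the patch.

# Stand-ins for the real steps, so the sketch runs without a cluster.
start_cluster() { return 1; }   # pretend the cluster never becomes ready
run_tests()     { echo "running tests"; }
collect_logs()  { echo "collecting logs"; }
stop_cluster()  { echo "deleting resources"; }

run_check() {
  local rc=0
  if start_cluster; then
    # Only reached when startup succeeded.
    run_tests || rc=$?
  else
    rc=1
    echo "cluster failed to start, skipping tests" >&2
  fi
  # Log collection and cleanup happen on both paths.
  collect_logs
  stop_cluster
  return "$rc"
}

run_check || echo "check failed with status $?"
```

Keeping the startup status in a variable, rather than exiting immediately, is one way to make sure logs are still collected and resources are still deleted before the non-zero status is reported.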

Also:

  • Fix the "-1 pods are running" message (caused by a hard-coded subtraction intended to account for the header row of kubectl get pod's output)
  • Allow a custom number of retry attempts (for easier testing)
  • Reduce code duplication in test scripts
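
As an illustration of the first two items, here is a minimal sketch, not the actual change. It assumes pod listings are produced with `kubectl get pod --no-headers`, so no header row needs to be subtracted, and that the retry limit is read from RETRY_ATTEMPTS (the variable used in the test commands below). The function names, the RETRY_SLEEP knob, and the default of 100 attempts are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and defaults are assumptions, not the patch.

# Print "<running> <all>" for `kubectl get pod --no-headers`-style lines
# read from stdin (columns: NAME READY STATUS RESTARTS AGE).
# With no header row there is nothing to subtract, so an empty listing
# gives "0 0" instead of a negative count.
count_pods() {
  awk '{ all++ } $3 == "Running" { running++ } END { print running + 0, all + 0 }'
}

# Run a command until it succeeds, at most RETRY_ATTEMPTS times.
retry() {
  local attempts="${RETRY_ATTEMPTS:-100}"   # default is an assumption
  local delay="${RETRY_SLEEP:-1}"           # hypothetical knob
  local i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    echo "$i '$*' is failed..."
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example with canned input instead of a live cluster.
printf '%s\n' 'scm-0 1/1 Running 0 2m' 'om-0 0/1 Pending 0 2m' | count_pods
```

Against a real cluster the input would come from something like `kubectl get pod --no-headers | count_pods`, and a caller could lower the limit for a quick failure test, for example `RETRY_ATTEMPTS=2 retry some_check`, where some_check is a placeholder.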

https://issues.apache.org/jira/browse/HDDS-7645

How was this patch tested?

Triggered a cluster startup "error" by setting a low number of retry attempts. Verified that tests are not attempted, logs are collected, and the cluster is shut down:

$ RETRY_ATTEMPTS=5 OZONE_TEST_SELECTOR=getting-started ./hadoop-ozone/dev-support/checks/kubernetes.sh
...

**** Applying k8s resources from getting-started ****

...

**** Waiting until the k8s cluster is running ****

No resources found in default namespace.
0 pods are running. Waiting for more.
1 'all_pods_are_running' is failed...
3 pods are running out from the 5
2 'all_pods_are_running' is failed...
5 pods are running out from the 6
3 'all_pods_are_running' is failed...
1 'grep_log scm-0 SCM exiting safe mode.' is failed...
2 'grep_log scm-0 SCM exiting safe mode.' is failed...
3 'grep_log scm-0 SCM exiting safe mode.' is failed...
4 'grep_log scm-0 SCM exiting safe mode.' is failed...
5 'grep_log scm-0 SCM exiting safe mode.' is failed...

**** Collecting container logs ****


**** Deleting k8s resources ****

configmap "config" deleted
service "datanode" deleted
service "datanode-public" deleted
service "om" deleted
service "om-public" deleted
service "s3g" deleted
service "s3g-public" deleted
service "scm" deleted
service "scm-public" deleted
statefulset.apps "datanode" deleted
statefulset.apps "om" deleted
statefulset.apps "s3g" deleted
statefulset.apps "scm" deleted

$ ls -1 hadoop-ozone/dist/target/ozone-1.4.0-SNAPSHOT/kubernetes/examples/getting-started/logs
pod-datanode-0.log
pod-datanode-1.log
pod-datanode-2.log
pod-om-0.log
pod-s3g-0.log
pod-scm-0-init.log
pod-scm-0.log

With even fewer retries:

$ RETRY_ATTEMPTS=2 OZONE_TEST_SELECTOR=getting-started ./hadoop-ozone/dev-support/checks/kubernetes.sh  
...

**** Waiting until the k8s cluster is running ****

No resources found in default namespace.
0 pods are running. Waiting for more.
1 'all_pods_are_running' is failed...
3 pods are running out from the 5
2 'all_pods_are_running' is failed...

**** Collecting container logs ****

...

Regular CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/5472252928

@adoroszlai self-assigned this Jul 6, 2023
@adoroszlai requested a review from GeorgeJahad July 6, 2023 09:07
-  elif [ "$RUNNING_COUNT" -ne "$ALL_COUNT" ]; then
-    echo "$RUNNING_COUNT pods are running out from the $ALL_COUNT"
+  elif [ "$running" -ne "$all" ]; then
+    echo "$running pods are running out from the $all"
Contributor

Suggested change
- echo "$running pods are running out from the $all"
+ echo "$running pods are running out of $all"

Contributor Author

@adoroszlai Jul 6, 2023

Good point, this has been bothering me for a while. How about:

echo "$running / $all pods are running"

?

Contributor

@GeorgeJahad Jul 6, 2023

Yes, that is good too. I found the current message confusing and had to read it a couple of times to follow it.

Contributor

@GeorgeJahad left a comment

lgtm

@adoroszlai merged commit 41d9e03 into apache:master Jul 7, 2023
@adoroszlai deleted the HDDS-7645 branch July 7, 2023 08:26
@adoroszlai
Contributor Author

Thanks @GeorgeJahad for the review.

errose28 added a commit to errose28/ozone that referenced this pull request Jul 10, 2023
* master: (36 commits)
  HDDS-8990. Intermittent timeout waiting on datanode4 9856 to become available (apache#5039)
  Revert "HDDS-7750. Incorrect WRITE ACL check. (apache#4992)"
  HDDS-7750. Incorrect WRITE ACL check. (apache#4992)
  HDDS-8985. Intermittent timeout exiting safe mode in HA secure tests (apache#5033)
  HDDS-8593. Add RootCARotationPoller to CertClient (apache#5030)
  HDDS-7645. Kubernetes check should fail fast if cluster cannot start (apache#5028)
  HDDS-8981. TestRootedOzoneFileSystem runs out of disk space (apache#5029)
  HDDS-8592. Fetch and save all root certificates during service's certificate rotation. (apache#5025)
  HDDS-8981. Disable TestRootedOzoneFileSystem#testSafeMode
  HDDS-8591. Create scheduler to check for new root ca certificates (apache#4961)
  HDDS-8979. error validating kustomization.yaml (apache#5024)
  HDDS-8973. Ozone SCM HA should not allocates duplicate IDs when transferring leadership (apache#5018)
  HDDS-8970. Snapshot Diff should return path relative to bucket root (apache#5015)
  HDDS-8975. Clarify SCM HA auto-bootstrap doc (apache#5021)
  HDDS-8689. Rotate Root CA and Sub CA in SCM. (apache#4943)
  HDDS-8436. Support setSafeMode(), isFileClosed() FileSystem API (apache#4825)
  HDDS-8880. Intermittent fork timeout in TestOMRatisSnapshots (apache#5022)
  HDDS-8962. Ensure docker env is stopped (apache#5011)
  HDDS-7794. [snapshot] SnapshotDiff should throw better error messages for exception handling (apache#5007)
  HDDS-7922. [FSO] S3G folder support fso layout filestatus s3A compatibility (apache#4448)
  ...