HDDS-7645. Kubernetes check should fail fast if cluster cannot start #5028
- elif [ "$RUNNING_COUNT" -ne "$ALL_COUNT" ]; then
-   echo "$RUNNING_COUNT pods are running out from the $ALL_COUNT"
+ elif [ "$running" -ne "$all" ]; then
+   echo "$running pods are running out from the $all"
Suggested change:
- echo "$running pods are running out from the $all"
+ echo "$running pods are running out of $all"
Good point, this has been bothering me for a while. How about the following?
echo "$running / $all pods are running"
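The `running / all` message format proposed above can be sketched as a small shell function. This is a hypothetical helper (`pod_status_summary` is not part of the patch); it assumes its stdin is the output of `kubectl get pod --no-headers`, where the third column is STATUS. Because `--no-headers` suppresses the header row, no line needs to be subtracted from the total.

```shell
#!/bin/sh
# Hypothetical helper: summarize pod status from `kubectl get pod --no-headers`
# output read on stdin, e.g.:
#   kubectl get pod --no-headers | pod_status_summary
pod_status_summary() {
  local all=0 running=0 name ready status rest
  while read -r name ready status rest; do
    [ -z "$name" ] && continue          # ignore blank lines
    all=$((all + 1))                    # every remaining line is one pod
    if [ "$status" = "Running" ]; then  # third column is STATUS
      running=$((running + 1))
    fi
  done
  echo "$running / $all pods are running"
}
```

Counting only the STATUS column is a simplification; a pod can be `Running` while not yet ready, so a stricter check might also inspect the READY column.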
Yes, that is good too. I just found the current message confusing and had to read it a couple of times to follow it.
GeorgeJahad left a comment:
lgtm
Thanks @GeorgeJahad for the review.
What changes were proposed in this pull request?
The Kubernetes check currently proceeds to execute tests even if the cluster is not able to start up. It should exit without trying to run the tests.
This change skips the tests if the cluster fails to start up.
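The fail-fast flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual Ozone script: `retry` and `run_suite` are hypothetical names, and the startup check, test runner and cleanup step are passed in as command names so the control flow can be exercised without a cluster. Tests run only if the startup check succeeds within the allowed attempts, while log collection and shutdown run in either case.

```shell
#!/bin/sh
# Hypothetical: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is exhausted.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    echo "Attempt $attempt/$max failed" >&2
    if [ "$attempt" -lt "$max" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Hypothetical: run_suite MAX_ATTEMPTS DELAY STARTUP_CHECK TEST_CMD CLEANUP_CMD
run_suite() {
  local max="$1" delay="$2" check="$3" tests="$4" cleanup="$5" rc=0
  if retry "$max" "$delay" "$check"; then
    "$tests" || rc=$?          # run tests only after a successful startup
  else
    echo "Cluster failed to start; skipping tests" >&2
    rc=1                       # fail fast instead of running the tests
  fi
  "$cleanup"                   # collect logs and shut down in both cases
  return "$rc"
}
```

In a real check, the startup command would presumably be a function that inspects `kubectl get pod` output and succeeds only when all pods are running; lowering the attempt count, as in the testing notes below, makes the failure path easy to trigger.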
Also fixes the "-1 pods are running" message, caused by a hard-coded subtraction intended to account for the header row of `kubectl get pod`'s output.

https://issues.apache.org/jira/browse/HDDS-7645
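One way the "-1 pods are running" message can arise: if `kubectl get pod` prints nothing on stdout (for example, before any pods exist), there is no header line to discount, so subtracting 1 from the line count goes negative. The hypothetical functions below (not taken from the patch) contrast that pattern with counting the lines of `kubectl get pod --no-headers` output, which needs no correction. They read from stdin so they can be tried without a cluster.

```shell
#!/bin/sh
# Hypothetical reconstruction of the problematic pattern: count all lines,
# then subtract 1 to discount the header row of `kubectl get pod`.
count_with_header_subtraction() {
  printf '%s\n' "$(( $(wc -l) - 1 ))"
}

# Alternative: feed it `kubectl get pod --no-headers` output, so every
# line is a pod and no correction is needed.
count_without_header() {
  wc -l | tr -d ' '
}

# Usage:
#   kubectl get pod | count_with_header_subtraction    # -1 when stdout is empty
#   kubectl get pod --no-headers | count_without_header
```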
How was this patch tested?
Triggered a cluster startup "error" by setting a low number of retry attempts. Verified that tests are not attempted, logs are collected, and the cluster is shut down:
With even fewer retries:
Regular CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/5472252928