@adoroszlai
Contributor

What changes were proposed in this pull request?

Try to avoid intermittent timeout in HA secure tests by improving wait_for_safemode_exit:

  • wait for SCM (and KDC in secure cluster) to be up, then
  • use safemode wait instead of repeated safemode status

This reduces the number of SCM client connections and KDC requests, and accounts for the time until SCM is started.
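
The two steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual test-library code changed by this PR: the helper names `wait_for_port` and `wait_for_safemode_exit`, their parameters, and the default timeout are hypothetical, and the exact spelling of the `ozone admin safemode wait -t <seconds>` invocation is an assumption that should be checked against the Ozone CLI help.

```python
import socket
import subprocess
import time


def wait_for_port(host, port, timeout_s, interval_s=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Service not listening yet; retry until the deadline.
            time.sleep(interval_s)
    return False


def wait_for_safemode_exit(scm_addr, kdc_addr=None, timeout_s=180, run=subprocess.run):
    """Hypothetical sketch: wait for SCM (and KDC, if given), then block once.

    scm_addr and kdc_addr are (host, port) tuples. `run` is injectable so the
    sketch can be exercised without an Ozone installation.
    """
    deadline = time.monotonic() + timeout_s

    # Step 1: wait until SCM (and KDC in a secure cluster) accept connections,
    # so the startup time is counted against the same overall deadline.
    for addr in (scm_addr, kdc_addr):
        if addr is None:
            continue
        host, port = addr
        if not wait_for_port(host, port, deadline - time.monotonic()):
            return False

    # Step 2: issue a single blocking wait instead of polling the status in a
    # loop. Assumed CLI form; each poll would otherwise open a new SCM client
    # connection (and trigger a KDC request in a secure cluster).
    remaining = max(1, int(deadline - time.monotonic()))
    result = run(["ozone", "admin", "safemode", "wait", "-t", str(remaining)])
    return result.returncode == 0
```

Sharing one deadline between both steps means the time spent waiting for SCM to start is deducted from the time allowed for the safe mode wait, rather than being added on top of it.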

https://issues.apache.org/jira/browse/HDDS-8985

How was this patch tested?

Tested cluster startup locally in a few environments.

Full CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/5484345000

@adoroszlai adoroszlai self-assigned this Jul 7, 2023
@adoroszlai adoroszlai added the test label Jul 7, 2023
@adoroszlai adoroszlai requested a review from ChenSammi July 7, 2023 12:27
@Galsza
Contributor

Galsza commented Jul 7, 2023

Thank you @adoroszlai, this should help with the secure clusters. LGTM +1

Contributor

@ChenSammi ChenSammi left a comment

The change looks good. Thanks @adoroszlai.

@ChenSammi ChenSammi merged commit 7e9d3c0 into apache:master Jul 9, 2023
@adoroszlai adoroszlai deleted the HDDS-8985 branch July 10, 2023 07:41
@adoroszlai
Contributor Author

Thanks @ChenSammi, @Galsza for the review.

errose28 added a commit to errose28/ozone that referenced this pull request Jul 10, 2023
* master: (36 commits)
  HDDS-8990. Intermittent timeout waiting on datanode4 9856 to become available (apache#5039)
  Revert "HDDS-7750. Incorrect WRITE ACL check. (apache#4992)"
  HDDS-7750. Incorrect WRITE ACL check. (apache#4992)
  HDDS-8985. Intermittent timeout exiting safe mode in HA secure tests (apache#5033)
  HDDS-8593. Add RootCARotationPoller to CertClient (apache#5030)
  HDDS-7645. Kubernetes check should fail fast if cluster cannot start (apache#5028)
  HDDS-8981. TestRootedOzoneFileSystem runs out of disk space (apache#5029)
  HDDS-8592. Fetch and save all root certificates during service's certificate rotation. (apache#5025)
  HDDS-8981. Disable TestRootedOzoneFileSystem#testSafeMode
  HDDS-8591. Create scheduler to check for new root ca certificates (apache#4961)
  HDDS-8979. error validating kustomization.yaml (apache#5024)
  HDDS-8973. Ozone SCM HA should not allocates duplicate IDs when transferring leadership (apache#5018)
  HDDS-8970. Snapshot Diff should return path relative to bucket root (apache#5015)
  HDDS-8975. Clarify SCM HA auto-bootstrap doc (apache#5021)
  HDDS-8689. Rotate Root CA and Sub CA in SCM. (apache#4943)
  HDDS-8436. Support setSafeMode(), isFileClosed() FileSystem API (apache#4825)
  HDDS-8880. Intermittent fork timeout in TestOMRatisSnapshots (apache#5022)
  HDDS-8962. Ensure docker env is stopped (apache#5011)
  HDDS-7794. [snapshot] SnapshotDiff should throw better error messages for exception handling (apache#5007)
  HDDS-7922. [FSO] S3G folder support fso layout filestatus s3A compatibility (apache#4448)
  ...