@ChenSammi
Contributor

@kerneltime kerneltime requested a review from neils-dev December 12, 2022 17:12
@kerneltime
Contributor

@Galsza can you please take a look?

@captainzmc captainzmc self-requested a review December 16, 2022 03:23
@captainzmc
Member

Thanks @ChenSammi for the patch. I have verified locally that after the HA cluster is started, volume creation, bucket creation, and put key all succeed.
Recon's status may not be correct at present. After I started Recon, I found that it keeps trying to connect to port 9860 (the port of the non-HA SCM client).
I think we'd better add a Recon-ha run configuration that starts with ozone-site-ha.xml.

@captainzmc
Member

captainzmc commented Dec 16, 2022

If possible, we'd better add an OzoneShell-ha as well. Currently, OzoneShell is also non-HA by default. Users need to manually change its ozone-site.xml to ozone-site-ha.xml if they want to operate the HA cluster.

@neils-dev
Contributor

Thanks @ChenSammi for adding SCM HA support to the IntelliJ IDE dev environment. I ran the HA configuration through the run configs in the following order:
i. PrimordialSCMInit-ha
ii. PrimordialSCM-ha
iii. OzoneManagerInit-ha
iv. OzoneManager-ha
Then, for SCM 2 and 3:
v. Scm(2or3)Bootstrap-ha
vi. Scm(2or3)-ha

Then I ran the datanodesX-ha run configs. The Ozone HA cluster came up with the SCM first in safe mode and then exiting safe mode without issue. With the datanodes, I noticed some timeout exceptions in the log, as follows:

java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at org.apache.hadoop.ozone.container.common.states.datanode.RunningDatanodeState.computeNextContainerState(RunningDatanodeState.java:199)

In addition, on starting the DNs in the cluster, I found the following warnings generated by the SCMs:
2022-12-16 16:31:20,543 WARN events.EventQueue (EventQueue.java:fireEvent(216)) - No event handler registered for event TypedEvent{payloadType=DatanodeDetails, name='Datanode_Command_Queue

The cluster in the dev environment successfully ran shell file system operations with both OzoneShell and OzoneFsShell. I also ran the freon load test, okbg, without issue.

For ScmRoles, run from the run config, I receive an error message:

Unmatched argument at index 1: 'scm'
Usage: ozone admin [-hV] [--verbose] [-conf=<configurationPath>] ...

Do I need to add something to the run configuration for it to work? I am running it as is and getting the error.

@captainzmc
Member

captainzmc commented Dec 18, 2022

> i. PrimordialSCMInit-ha
> ii. PrimordialSCM-ha
> iii. OzoneManagerInit-ha
> iv. OzoneManager-ha
> for the scm 2 & 3
> v. Scm(2or3)Bootstrap-ha
> vi.) Scm(2or3)-ha

Hi @neils-dev, the correct boot sequence is:

  1. PrimordialSCMInit-ha
  2. PrimordialSCM-ha
  3. Scm2 Bootstrap-ha
  4. Scm2-ha
  5. Scm3 Bootstrap-ha
  6. Scm3-ha
  7. OzoneManagerInit-ha
  8. OzoneManager-ha
  9. Recon (change its conf from ozone-site.xml to ozone-site-ha.xml first)
  10. Datanode(1~3)-ha
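
For anyone who wants to double-check a start order against the sequence above, here is a minimal Python sketch. It is only an illustration, not part of the patch: the stage list is copied from the list above, the three Datanode names are expanded from "Datanode(1~3)-ha" and are an assumption about the exact run-config names, and `first_out_of_order` is a hypothetical helper.

```python
# Stages follow the boot sequence listed above. The three Datanode names are
# expanded from "Datanode(1~3)-ha" and are an assumption about the exact naming.
BOOT_STAGES = [
    ("PrimordialSCMInit-ha",),
    ("PrimordialSCM-ha",),
    ("Scm2 Bootstrap-ha",),
    ("Scm2-ha",),
    ("Scm3 Bootstrap-ha",),
    ("Scm3-ha",),
    ("OzoneManagerInit-ha",),
    ("OzoneManager-ha",),
    ("Recon",),
    ("Datanode1-ha", "Datanode2-ha", "Datanode3-ha"),
]

# Map each run-config name to the index of its stage.
STAGE_OF = {name: i for i, stage in enumerate(BOOT_STAGES) for name in stage}


def first_out_of_order(started):
    """Return (previously_started, offender) for the first config that was
    started after a config from a later stage, or None if the order is fine.
    Hypothetical helper, for illustration only."""
    latest_stage = -1
    latest_name = None
    for name in started:
        stage = STAGE_OF[name]  # raises KeyError for unknown names
        if stage < latest_stage:
            return (latest_name, name)
        latest_stage, latest_name = stage, name
    return None


# The order tried earlier in this thread: OM was started before SCM 2 and 3.
tried = [
    "PrimordialSCMInit-ha",
    "PrimordialSCM-ha",
    "OzoneManagerInit-ha",
    "OzoneManager-ha",
    "Scm2 Bootstrap-ha",
    "Scm2-ha",
]
print(first_out_of_order(tried))  # ('OzoneManager-ha', 'Scm2 Bootstrap-ha')
```

Configs within the same stage (the datanodes) may be started in any order relative to each other; the check only flags a config started after one from a later stage.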

Member

@captainzmc captainzmc left a comment

+1 The change looks good.

@captainzmc
Member

Let's merge this. Thanks @ChenSammi for the patch, and thanks @neils-dev for the check.

@captainzmc captainzmc merged commit 9d30c9a into apache:master Jan 6, 2023
@ChenSammi
Contributor Author

Thanks @captainzmc and @neils-dev for the code review.

errose28 added a commit to errose28/ozone that referenced this pull request Jan 9, 2023
* master: (176 commits)
  HDDS-7726. EC: Enhance datanode reconstruction log message (apache#4155)
  HDDS-7739. EC: Increase the information in the RM sending command log message (apache#4153)
  HDDS-7652. Volume Quota not enforced during write when bucket quota is not set (apache#4124)
  HDDS-7628. Intermittent failure in TestOzoneContainerWithTLS (apache#4142)
  HDDS-7695. EC metrics related to replication commands don't add up (apache#4152)
  HDDS-7729. EC: ECContainerReplicaCount should handle pending delete of unhealthy replicas (apache#4146)
  HDDS-7738. SCM terminates when adding container to a closed pipeline (apache#4154)
  HDDS-7243. Remove RequestFeatureValidator from echoRPC method which supports only ValidationCondition.OLDER_CLIENT_REQUESTS (apache#4051)
  HDDS-7708. No check for certificate duration config scenarios. (apache#4149)
  HDDS-7727. EC: SCM unregistered event handler for DatanodeCommandCountUpdated (apache#4147)
  HDDS-7606. Add SCM HA support in intellij run (apache#4058)
  HDDS-7666. EC: Unrecoverable EC containers with some remaining replicas may block decommissioning (apache#4118)
  HDDS-7339. Implement Certificate renewal task for services (apache#3982)
  HDDS-7696. MisReplicationHandler does not consider QUASI_CLOSED replicas as sources (apache#4144)
  HDDS-7714. Docker cluster ozone-om-ha fails during docker-compose up (apache#4137)
  HDDS-7716. Log read requests rejected with permission denied in OM audit (apache#4136)
  HDDS-7588. Intermittent failure in TestObjectStoreWithLegacyFS#testFlatKeyStructureWithOBS (apache#4040)
  HDDS-7633. Compile error with Java 11: package com.sun.jmx.mbeanserver is not visible (apache#4077)
  HDDS-7648. Add a servername tag in UGI metrics. (apache#4094)
  HDDS-7564. Update Ozone version after 1.3.0 release (apache#4115)
  ...
@ChenSammi ChenSammi deleted the HDDS-7606 branch February 20, 2023 03:30