Skip to content

Conversation

@GlenGeng-awx
Copy link
Contributor

What changes were proposed in this pull request?

Use SCMHAManagerImpl and SCMRatisServerImpl instead of MockSCMHAManager and MockRatisServer in MiniOzoneCluster.
The MockRatisServer and MockSCMHAManager can not fully test the HA code.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4622

How was this patch tested?

CI

@amaliujia
Copy link
Contributor

O seems like we are more or less doing the similar change :) #1733

if (Strings.isNullOrEmpty(storageDir)) {
storageDir = ServerUtils.getDefaultRatisDirectory(conf);
File metaDirPath = ServerUtils.getOzoneMetaDirPath(conf);
storageDir = (new File(metaDirPath, "scm-ha")).getPath();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's prefered to use SCMHA Configuration to set the storageDir

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After #1739 this can be set through HA config.

@GlenGeng-awx
Copy link
Contributor Author

O seems like we are more or less doing the similar change :) #1733

We can fix this issue together, and you can take away my changes if needed.

public static StorageContainerManager getScmSimple(OzoneConfiguration conf)
throws IOException, AuthenticationException {
SCMConfigurator configurator = new SCMConfigurator();
configurator.setSCMHAManager(MockSCMHAManager.getInstance(true));
Copy link
Contributor

@amaliujia amaliujia Dec 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will need this line:
conf.setBoolean(ScmConfigKeys.OZONE_SCM_HA_ENABLE_KEY, true);

SCMHAManagerImpl will return isLeader immediately without enabling this HA:

    if (!SCMHAUtils.isSCMHAEnabled(conf)) {
      // When SCM HA is not enabled, the current SCM is always the leader.
      return Optional.of((long)0);
    }

Copy link
Contributor

@amaliujia amaliujia Dec 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So currently the cluster.waitForClusterToBeReady returns quickly even the SCM has not become leader yet.

I am guessing CI can still pass become you also have changed the rpc timeout parameters. So either the cluster wait for ready or the parameter change can solve the not leader problem.

I am not sure if we need both. Maybe we only need to keep cluster.waitForClusterToBeReady?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Rpc.setRequestTimeout(properties, TimeDuration.valueOf(
conf.getRatisRequestTimeout(), TimeUnit.MILLISECONDS));
Rpc.setTimeoutMin(properties, TimeDuration.valueOf(
conf.getRatisRequestMinTimeout(), TimeUnit.MILLISECONDS));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you don't need to update these parameters. The cluster.waitForClusterToBeReady is supposed to wait for SCM become leader thus solve the not leader exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, we only need the wait leader in cluster.waitForClusterToBeReady. The problem here is, we've set the min/max timeout un-properly. I will create a separate jira for his fix.

lock.readLock().lock();
try {
return stateManager
return stateManager
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the dead lock solution seems is removing the read write lock? Is there an explanation why we can remove locks here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will create a jira for this issue, can give the explanation of this solution there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants