Skip to content

Conversation

@fapifta
Copy link
Contributor

@fapifta fapifta commented Mar 11, 2023

What changes were proposed in this pull request?

The intent of the renewLock in the DefaultCertificateClient is to ensure no more than one thread changes the on disk certificate material at a given time. The problem is that the lock is belonging to an instance of the CertificateClient implementation, and with that there might be multiple instances from the lock.
As we do not use any of the properties of a ReentrantLock, I simply think we should synchronize the renewal and the on disk operations on the class.
In the meantime, I switched to shutdownNow instead of shutDown in the executor, after HDDS-8134 we do not have multiple executors running at the same time, but the service after close should not schedule any new runs but should finish the current run. This is not a foolproof implementation, but a good enough one for now. We are working on this code and there are further changes to how it is structured, so I think we should wait for the final form to get to a good solution, where only one executor is present per certificate material.

An other warning is also fixed by this patch, as notificationReceivers are also guarded by a synchronized block locking on the Set itself, but the instance is not final. This is not really a problem, as we do not tackle with the instance anymore, but seemed to be something that we can fix together.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8135

How was this patch tested?

Existing tests should suffice.

@kerneltime
Copy link
Contributor

@Galsza can you please take a look.

@kerneltime kerneltime requested a review from ChenSammi March 13, 2023 16:34
@kerneltime
Copy link
Contributor

cc @SaketaChalamchala can you take a look as well.

@kerneltime
Copy link
Contributor

cc @DaveTeng0 can you take a look please.

@kerneltime kerneltime requested a review from neils-dev March 13, 2023 16:35
@fapifta fapifta marked this pull request as draft March 13, 2023 20:07
@fapifta
Copy link
Contributor Author

fapifta commented Mar 13, 2023

Converted back the PR to draft, as I realized that the problem with the tests are not just simple glitches.
In the failing tests, there are still multiple executors at the same time, and locking on the static context of the class combined with an "infinite" (Integer.MAX_VALUE) number of retries in one thread in the test causes the test to timeout whatever it takes.

I will explore some other solutions, and we will discuss with @ChenSammi the original author on how we want to solve this, but the solution itself as it seems will require a deeper thought and a broader change.
I don't like the simple idea of making the executor static as well, and then guard that way for the first sight, but I am open to discuss that as an option as well.

We don't need to rush at the moment, as in a production system, there should not be competing threads as far as I can tell.

@ChenSammi
Copy link
Contributor

Looks like this situation is caused by not closed certificate clients. When all unnecessary clients are closed in HDDS-8134, I think then there will be no such problem. @fapifta

@fapifta fapifta marked this pull request as ready for review April 3, 2023 14:37
@fapifta
Copy link
Contributor Author

fapifta commented Apr 3, 2023

Yes @ChenSammi you were right, after solving the closure of clients properly, the problem with this patch gone away. And as if it is ensured that only one such client exists, then the problem should not come back.

@ChenSammi
Copy link
Contributor

The patch LGTM +1. Thanks @fapifta

@ChenSammi ChenSammi merged commit e22a8f6 into apache:master Apr 4, 2023
errose28 added a commit to errose28/ozone that referenced this pull request Apr 6, 2023
* master: (155 commits)
  update readme (apache#4535)
  HDDS-8374. Disable flaky unit test: TestContainerStateCounts
  HDDS-8016. updated the ozone doc for linked bucket and deletion async limitation (apache#4526)
  HDDS-8237. [Snapshot] loadDb() used by SstFiltering service creates extraneous directories. (apache#4446)
  HDDS-8035. Intermittent timeout in TestOzoneManagerHAWithData.testOMHAMetrics (apache#4362)
  HDDS-8039. Allow container inspector to run from ozone debug. (apache#4337)
  HDDS-8304. [Snapshot] Reduce flakiness in testSkipTrackingWithZeroSnapshot (apache#4487)
  HDDS-7974. [Snapshot] KeyDeletingService to be aware of Ozone snapshots (apache#4486)
  HDDS-8368. ReplicationManager: Create ContainerReplicaOp with correct target Datanode (apache#4532)
  HDDS-8358. Fix the space usage comparator in ContainerBalancerSelectionCriteria (apache#4527)
  HDDS-8359. ReplicationManager: Fix getContainerReplicationHealth() so that it builds ContainerCheckRequest correctly (apache#4528)
  HDDS-8361. Useless object in TestOzoneBlockTokenIdentifier (apache#4517)
  HDDS-8325. Consolidate and refine RocksDB metrics of services (apache#4506)
  HDDS-8135. Incorrect synchronization during certificate renewal in DefaultCertificateClient. (apache#4381)
  HDDS-8127. Exclude deleted containers from Recon container count (apache#4440)
  HDDS-8364. ReadReplicas may give wrong results with topology-aware read enabled (apache#4522)
  HDDS-8354. Avoid WARNING about ObjectEndpoint#get (apache#4515)
  HDDS-8324. DN data cache gets removed randomly asking for data from disk (apache#4499)
  HDDS-8291. Upgrade to Hadoop 3.3.5 (apache#4484)
  HDDS-8355. Mark TestOMRatisSnapshots#testInstallSnapshot as flaky
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants