HDDS-8135. Incorrect synchronization during certificate renewal in DefaultCertificateClient. #4381

fapifta · 2023-03-11T22:10:28Z

What changes were proposed in this pull request?

The intent of the renewLock in the DefaultCertificateClient is to ensure no more than one thread changes the on disk certificate material at a given time. The problem is that the lock is belonging to an instance of the CertificateClient implementation, and with that there might be multiple instances from the lock.
As we do not use any of the properties of a ReentrantLock, I simply think we should synchronize the renewal and the on disk operations on the class.
In the meantime, I switched to shutdownNow instead of shutDown in the executor, after HDDS-8134 we do not have multiple executors running at the same time, but the service after close should not schedule any new runs but should finish the current run. This is not a foolproof implementation, but a good enough one for now. We are working on this code and there are further changes to how it is structured, so I think we should wait for the final form to get to a good solution, where only one executor is present per certificate material.

An other warning is also fixed by this patch, as notificationReceivers are also guarded by a synchronized block locking on the Set itself, but the instance is not final. This is not really a problem, as we do not tackle with the instance anymore, but seemed to be something that we can fix together.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8135

How was this patch tested?

Existing tests should suffice.

kerneltime · 2023-03-13T16:32:46Z

@Galsza can you please take a look.

kerneltime · 2023-03-13T16:34:29Z

cc @SaketaChalamchala can you take a look as well.

kerneltime · 2023-03-13T16:35:14Z

cc @DaveTeng0 can you take a look please.

fapifta · 2023-03-13T20:13:27Z

Converted back the PR to draft, as I realized that the problem with the tests are not just simple glitches.
In the failing tests, there are still multiple executors at the same time, and locking on the static context of the class combined with an "infinite" (Integer.MAX_VALUE) number of retries in one thread in the test causes the test to timeout whatever it takes.

I will explore some other solutions, and we will discuss with @ChenSammi the original author on how we want to solve this, but the solution itself as it seems will require a deeper thought and a broader change.
I don't like the simple idea of making the executor static as well, and then guard that way for the first sight, but I am open to discuss that as an option as well.

We don't need to rush at the moment, as in a production system, there should not be competing threads as far as I can tell.

ChenSammi · 2023-03-14T03:00:37Z

Looks like this situation is caused by not closed certificate clients. When all unnecessary clients are closed in HDDS-8134, I think then there will be no such problem. @fapifta

…faultCertificateClient.

fapifta · 2023-04-03T20:02:45Z

Yes @ChenSammi you were right, after solving the closure of clients properly, the problem with this patch gone away. And as if it is ensured that only one such client exists, then the problem should not come back.

ChenSammi · 2023-04-04T07:11:03Z

The patch LGTM +1. Thanks @fapifta

* master: (155 commits) update readme (apache#4535) HDDS-8374. Disable flaky unit test: TestContainerStateCounts HDDS-8016. updated the ozone doc for linked bucket and deletion async limitation (apache#4526) HDDS-8237. [Snapshot] loadDb() used by SstFiltering service creates extraneous directories. (apache#4446) HDDS-8035. Intermittent timeout in TestOzoneManagerHAWithData.testOMHAMetrics (apache#4362) HDDS-8039. Allow container inspector to run from ozone debug. (apache#4337) HDDS-8304. [Snapshot] Reduce flakiness in testSkipTrackingWithZeroSnapshot (apache#4487) HDDS-7974. [Snapshot] KeyDeletingService to be aware of Ozone snapshots (apache#4486) HDDS-8368. ReplicationManager: Create ContainerReplicaOp with correct target Datanode (apache#4532) HDDS-8358. Fix the space usage comparator in ContainerBalancerSelectionCriteria (apache#4527) HDDS-8359. ReplicationManager: Fix getContainerReplicationHealth() so that it builds ContainerCheckRequest correctly (apache#4528) HDDS-8361. Useless object in TestOzoneBlockTokenIdentifier (apache#4517) HDDS-8325. Consolidate and refine RocksDB metrics of services (apache#4506) HDDS-8135. Incorrect synchronization during certificate renewal in DefaultCertificateClient. (apache#4381) HDDS-8127. Exclude deleted containers from Recon container count (apache#4440) HDDS-8364. ReadReplicas may give wrong results with topology-aware read enabled (apache#4522) HDDS-8354. Avoid WARNING about ObjectEndpoint#get (apache#4515) HDDS-8324. DN data cache gets removed randomly asking for data from disk (apache#4499) HDDS-8291. Upgrade to Hadoop 3.3.5 (apache#4484) HDDS-8355. Mark TestOMRatisSnapshots#testInstallSnapshot as flaky ...

kerneltime requested a review from ChenSammi March 13, 2023 16:34

kerneltime requested a review from neils-dev March 13, 2023 16:35

fapifta marked this pull request as draft March 13, 2023 20:07

HDDS-8135. Incorrect synchronization during certificate renewal in De…

65d3665

…faultCertificateClient.

fapifta force-pushed the HDDS-8135 branch from b35a9d2 to 65d3665 Compare April 3, 2023 14:23

fapifta marked this pull request as ready for review April 3, 2023 14:37

ChenSammi merged commit e22a8f6 into apache:master Apr 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HDDS-8135. Incorrect synchronization during certificate renewal in DefaultCertificateClient. #4381

HDDS-8135. Incorrect synchronization during certificate renewal in DefaultCertificateClient. #4381

Uh oh!

fapifta commented Mar 11, 2023

Uh oh!

kerneltime commented Mar 13, 2023

Uh oh!

kerneltime commented Mar 13, 2023

Uh oh!

kerneltime commented Mar 13, 2023

Uh oh!

fapifta commented Mar 13, 2023

Uh oh!

ChenSammi commented Mar 14, 2023

Uh oh!

fapifta commented Apr 3, 2023

Uh oh!

ChenSammi commented Apr 4, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

HDDS-8135. Incorrect synchronization during certificate renewal in DefaultCertificateClient. #4381

HDDS-8135. Incorrect synchronization during certificate renewal in DefaultCertificateClient. #4381

Uh oh!

Conversation

fapifta commented Mar 11, 2023

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

kerneltime commented Mar 13, 2023

Uh oh!

kerneltime commented Mar 13, 2023

Uh oh!

kerneltime commented Mar 13, 2023

Uh oh!

fapifta commented Mar 13, 2023

Uh oh!

ChenSammi commented Mar 14, 2023

Uh oh!

fapifta commented Apr 3, 2023

Uh oh!

ChenSammi commented Apr 4, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants