-
Notifications
You must be signed in to change notification settings - Fork 587
HDDS-8135. Incorrect synchronization during certificate renewal in DefaultCertificateClient. #4381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@Galsza can you please take a look. |
|
cc @SaketaChalamchala can you take a look as well. |
|
cc @DaveTeng0 can you take a look please. |
|
Converted back the PR to draft, as I realized that the problem with the tests are not just simple glitches. I will explore some other solutions, and we will discuss with @ChenSammi the original author on how we want to solve this, but the solution itself as it seems will require a deeper thought and a broader change. We don't need to rush at the moment, as in a production system, there should not be competing threads as far as I can tell. |
…faultCertificateClient.
|
Yes @ChenSammi you were right, after solving the closure of clients properly, the problem with this patch gone away. And as if it is ensured that only one such client exists, then the problem should not come back. |
|
The patch LGTM +1. Thanks @fapifta |
* master: (155 commits) update readme (apache#4535) HDDS-8374. Disable flaky unit test: TestContainerStateCounts HDDS-8016. updated the ozone doc for linked bucket and deletion async limitation (apache#4526) HDDS-8237. [Snapshot] loadDb() used by SstFiltering service creates extraneous directories. (apache#4446) HDDS-8035. Intermittent timeout in TestOzoneManagerHAWithData.testOMHAMetrics (apache#4362) HDDS-8039. Allow container inspector to run from ozone debug. (apache#4337) HDDS-8304. [Snapshot] Reduce flakiness in testSkipTrackingWithZeroSnapshot (apache#4487) HDDS-7974. [Snapshot] KeyDeletingService to be aware of Ozone snapshots (apache#4486) HDDS-8368. ReplicationManager: Create ContainerReplicaOp with correct target Datanode (apache#4532) HDDS-8358. Fix the space usage comparator in ContainerBalancerSelectionCriteria (apache#4527) HDDS-8359. ReplicationManager: Fix getContainerReplicationHealth() so that it builds ContainerCheckRequest correctly (apache#4528) HDDS-8361. Useless object in TestOzoneBlockTokenIdentifier (apache#4517) HDDS-8325. Consolidate and refine RocksDB metrics of services (apache#4506) HDDS-8135. Incorrect synchronization during certificate renewal in DefaultCertificateClient. (apache#4381) HDDS-8127. Exclude deleted containers from Recon container count (apache#4440) HDDS-8364. ReadReplicas may give wrong results with topology-aware read enabled (apache#4522) HDDS-8354. Avoid WARNING about ObjectEndpoint#get (apache#4515) HDDS-8324. DN data cache gets removed randomly asking for data from disk (apache#4499) HDDS-8291. Upgrade to Hadoop 3.3.5 (apache#4484) HDDS-8355. Mark TestOMRatisSnapshots#testInstallSnapshot as flaky ...
What changes were proposed in this pull request?
The intent of the renewLock in the DefaultCertificateClient is to ensure no more than one thread changes the on disk certificate material at a given time. The problem is that the lock is belonging to an instance of the CertificateClient implementation, and with that there might be multiple instances from the lock.
As we do not use any of the properties of a ReentrantLock, I simply think we should synchronize the renewal and the on disk operations on the class.
In the meantime, I switched to shutdownNow instead of shutDown in the executor, after HDDS-8134 we do not have multiple executors running at the same time, but the service after close should not schedule any new runs but should finish the current run. This is not a foolproof implementation, but a good enough one for now. We are working on this code and there are further changes to how it is structured, so I think we should wait for the final form to get to a good solution, where only one executor is present per certificate material.
An other warning is also fixed by this patch, as notificationReceivers are also guarded by a synchronized block locking on the Set itself, but the instance is not final. This is not really a problem, as we do not tackle with the instance anymore, but seemed to be something that we can fix together.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8135
How was this patch tested?
Existing tests should suffice.