HDDS-8575. Intermittent failure in TestCloseContainerEventHandler.testCloseContainerWithDelayByLeaseManager #4688
…tCloseContainerWithDelayByLeaseManager
Thanks @sumitagrawl for working on this. In repeated runs with the patch, the test still failed in ~3% of runs.
https://github.com/adoroszlai/hadoop-ozone/actions/runs/4933903111
Sorry, wrong link. The correct one is this: https://github.com/adoroszlai/hadoop-ozone/actions/runs/4934872460
szetszwo
left a comment
@sumitagrawl , thanks a lot for working on this! Please see the comments inlined and also https://issues.apache.org/jira/secure/attachment/13058203/4688_review.patch
private final long defaultTimeout;
private final Object monitor = new Object();
private Map<T, Lease<T>> activeLeases;
private BlockingQueue<T> leaseKeyBlockingQueue;
We should use Semaphore in this case; see https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/Semaphore.html
@szetszwo Thanks for the review. I have made the changes as per the patch, but added an extra release in shutdown() to cover a possible concurrency issue, i.e. shutdown() does the following:
- shutdown --> disable MonitorThread (set running to false)
- interrupt the monitor thread
For step 2, if the monitor thread is waiting, the interrupt takes effect; otherwise it is ignored. If the thread then goes on to wait for the semaphore, there is no way for it to exit.
So I added an extra "semaphore.release()" in shutdown() to ensure the thread exits from the tryAcquire() wait in that case.
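The shutdown sequence described in this comment could look roughly like the sketch below. It is not the code from the patch: the class `MonitorShutdownSketch`, its fields, and the one-hour wait are hypothetical names and values chosen for illustration. It only shows why an extra `release()` lets a thread blocked in `tryAcquire()` re-check its running flag and exit.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not the actual LeaseManager implementation.
class MonitorShutdownSketch {
  private final Semaphore semaphore = new Semaphore(0);
  private final AtomicBoolean running = new AtomicBoolean(true);
  private final Thread monitorThread =
      new Thread(this::monitorLoop, "monitor-sketch");

  private void monitorLoop() {
    while (running.get()) {
      try {
        // Wait until new work is signalled or the (long) timeout expires.
        semaphore.tryAcquire(1, TimeUnit.HOURS);
      } catch (InterruptedException e) {
        // Fall through and re-check the running flag.
      }
    }
  }

  void start() {
    monitorThread.start();
  }

  /** Returns true if the monitor thread exited within joinMillis. */
  boolean shutdown(long joinMillis) {
    running.set(false);         // step 1: disable the monitor
    monitorThread.interrupt();  // step 2: interrupt it in case it is waiting
    semaphore.release();        // extra release: wakes a pending tryAcquire()
    try {
      monitorThread.join(joinMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return !monitorThread.isAlive();
  }

  public static void main(String[] args) {
    MonitorShutdownSketch sketch = new MonitorShutdownSketch();
    sketch.start();
    System.out.println("monitor exited: " + sketch.shutdown(5000));
  }
}
```

In this sketch the extra permit is harmless after shutdown, because the loop checks `running` before waiting again.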
checkStatus();
LOG.debug("Shutting down LeaseManager service");
leaseMonitor.disable();
synchronized (monitor) {
shutdown() should be synchronized.
Updated...
@szetszwo I have reverted this for now, since shutdown() is already thread-safe. Making it synchronized would require protecting all member variables to satisfy findbugs, which is not needed.
szetszwo
left a comment
+1 the change looks good.
Thanks @sumitagrawl for updating the patch, test passed in 900 runs. Please take a look at findbugs failure though:
@sumitagrawl , please fix findbugs warnings.
I don't understand findbugs. How does making
Is it the case that findbugs won't warn for existing bugs? If this PR adds
I doubt that. It only sees the current state of the code. We've also seen findbugs being triggered in apparently unrelated files (#4506 (comment)), and the problem persisted even after the PR was merged, until fixed by another PR. I would guess this may be caused by findbugs working on compiled bytecode, not on sources. But I tend to agree with its finding here, we should synchronize consistently (see also #4578 (review)).
Indeed, we should use a nonblocking solution and not synchronize at all. We probably need to redesign
BTW, the following code does not make sense at all. Could a thread in dead state be started? Probably not.
//LeaseManager.start()
leaseMonitorThread.setUncaughtExceptionHandler((thread, throwable) -> {
  // Let us just restart this thread after logging an error.
  // if this thread is not running we cannot handle Lease expiry.
  LOG.error("LeaseMonitor thread encountered an error. Thread: {}",
      thread.toString(), throwable);
  leaseMonitorThread.start();
});
I suggest to commit this PR, if it can fix the test failures, and completely rewrite
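To illustrate the question about restarting a dead thread: in Java, calling `start()` a second time on the same `Thread` object throws `IllegalThreadStateException`, even after the thread has terminated. The class and method names in this small sketch are hypothetical and exist only to demonstrate that behavior.

```java
// Hypothetical demo class; only Thread and IllegalThreadStateException are real APIs.
class ThreadRestartSketch {
  /** Returns true if start() on an already terminated thread throws. */
  static boolean restartFails() {
    Thread t = new Thread(() -> { });
    t.start();
    try {
      t.join();  // wait until the thread has terminated
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    try {
      t.start();  // second start on the same Thread object
      return false;
    } catch (IllegalThreadStateException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("restart fails: " + restartFails());
  }
}
```

A handler that wants to recover would therefore need to create and start a new `Thread` object instead of reusing the old one.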
Triggered, will post the results.
Thanks @sumitagrawl for the patch, @szetszwo for the review. Test passed 100% in 1000 runs.
@sumitagrawl , thanks for working on this! @adoroszlai , thanks a lot for testing it!
What changes were proposed in this pull request?
LeaseManager has a concurrency issue: when a new event is added just before the monitor thread starts waiting, the notification is missed and the object remains in the queue indefinitely, until another new event arrives and notifies the thread.
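The difference from a plain wait/notify can be sketched as follows: a `Semaphore` stores the permit when `release()` is called, so a signal sent before the monitor starts waiting is not lost. This is a minimal illustration only; the class `PermitSignalSketch` and its method names are hypothetical and are not taken from the patch.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of permit-based signalling, not the LeaseManager code.
class PermitSignalSketch {
  private final Semaphore signal = new Semaphore(0);

  /** Called by the producer after adding work; the permit is kept even if nobody waits yet. */
  void notifyMonitor() {
    signal.release();
  }

  /** Called by the monitor; true if a signal was pending or arrived within the timeout. */
  boolean awaitSignal(long timeoutMillis) {
    try {
      boolean acquired = signal.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
      signal.drainPermits();  // collapse several pending signals into one wake-up
      return acquired;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    PermitSignalSketch sketch = new PermitSignalSketch();
    sketch.notifyMonitor();                       // signal arrives before the wait starts
    System.out.println(sketch.awaitSignal(1000)); // prints "true": the permit was remembered
    System.out.println(sketch.awaitSignal(50));   // prints "false": no new signal, times out
  }
}
```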
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8575
How was this patch tested?
Ran the test case repeatedly.