HDDS-8575. Intermittent failure in TestCloseContainerEventHandler.testCloseContainerWithDelayByLeaseManager #4688

sumitagrawl · 2023-05-09T09:57:08Z

What changes were proposed in this pull request?

LeaseManager has a concurrency issue: when a new event is added just before the monitor thread starts waiting, the notification is missed and the object remains in the queue indefinitely, until another event arrives and triggers a notify.

Changed the logic to use a blocking queue for event notification.
Changed the test case to validate lease acquisition and wait with GenericTestUtils.waitFor.
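
The race described above is a lost wake-up. As a minimal sketch (not the LeaseManager code itself; the class and method names are hypothetical), a `notify()` issued before the waiter reaches `wait()` is simply dropped, while a blocking queue keeps the early signal until the waiter polls for it:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not taken from LeaseManager.
class LostWakeupSketch {

  /**
   * Signals first, then waits on a plain monitor.
   * Returns true if the wait ran (roughly) until the timeout,
   * i.e. the early notify() was lost. timeoutMs must be positive.
   */
  static boolean monitorMissesEarlySignal(long timeoutMs)
      throws InterruptedException {
    Object monitor = new Object();
    synchronized (monitor) {
      monitor.notify(); // nobody is waiting yet, so this has no effect
    }
    long start = System.nanoTime();
    synchronized (monitor) {
      monitor.wait(timeoutMs); // cannot see the notify() issued earlier
    }
    long waitedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    return waitedMs >= timeoutMs / 2;
  }

  /**
   * Signals first, then polls a blocking queue.
   * Returns true if the early signal was still delivered.
   */
  static boolean queueKeepsEarlySignal(long timeoutMs)
      throws InterruptedException {
    BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    queue.offer("lease-key"); // stored until someone polls
    return queue.poll(timeoutMs, TimeUnit.MILLISECONDS) != null;
  }
}
```

With the monitor, the waiter sleeps for the full timeout even though the event was already added; with the queue, the poll returns immediately.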

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8575

How was this patch tested?

Ran the test case multiple times.


adoroszlai

Thanks @sumitagrawl for working on this. In repeated runs with the patch, the test still failed ~3% of the time.

~~https://github.com/adoroszlai/hadoop-ozone/actions/runs/4933903111~~

Sorry, wrong link. The correct one is this: https://github.com/adoroszlai/hadoop-ozone/actions/runs/4934872460


szetszwo

@sumitagrawl , thanks a lot for working on this! Please see the comments inlined and also https://issues.apache.org/jira/secure/attachment/13058203/4688_review.patch

szetszwo · 2023-05-15T09:32:23Z

hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/lease/LeaseManager.java

  private final long defaultTimeout;
-  private final Object monitor = new Object();
  private Map<T, Lease<T>> activeLeases;
+  private BlockingQueue<T> leaseKeyBlockingQueue;


We should use Semaphore in this case; see https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/Semaphore.html

@szetszwo Thanks for the review. I have made the changes as per the patch, but added an extra release in shutdown() because of a possible concurrency issue, i.e.:

1. shutdown --> disable MonitorThread (set running to false)
2. interrupt the monitor thread

For step 2, if the MonitorThread is waiting, the interrupt works; otherwise it is ignored. If the thread then moves on to wait for the semaphore, there is no way for the thread to exit.

So I added an extra "semaphore.release()" in shutdown to ensure exit from the tryAcquire() wait in the above case.
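
The shutdown sequence described in this comment can be sketched roughly as follows. This is an illustration under assumptions, not the patched LeaseManager: the class name, the one-minute wait, and the `awaitExit` helper are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a semaphore-driven monitor thread.
class SemaphoreMonitorSketch {
  private final Semaphore semaphore = new Semaphore(0);
  private volatile boolean running = true;
  private final Thread monitorThread =
      new Thread(this::monitorLoop, "lease-monitor-sketch");

  void start() {
    monitorThread.start();
  }

  private void monitorLoop() {
    while (running) {
      try {
        // Assumed fixed wait; a real monitor would derive it from lease expiry.
        boolean signalled = semaphore.tryAcquire(1, TimeUnit.MINUTES);
        if (signalled) {
          // A permit arrived: either a new lease or shutdown().
          // The loop condition decides which by re-checking 'running'.
        }
      } catch (InterruptedException e) {
        // Interrupted while waiting; fall through and re-check 'running'.
      }
    }
  }

  void shutdown() {
    running = false;           // step 1: disable the monitor
    monitorThread.interrupt(); // step 2: interrupt it
    semaphore.release();       // extra release: wakes a tryAcquire() wait
  }

  /** Waits up to timeoutMs for the monitor thread to finish. */
  boolean awaitExit(long timeoutMs) throws InterruptedException {
    monitorThread.join(timeoutMs);
    return !monitorThread.isAlive();
  }
}
```

Whether or not the interrupt alone would be enough, the extra permit gives the loop a second wake-up path that does not depend on the thread's interrupt status.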

szetszwo · 2023-05-15T09:33:07Z

hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/lease/LeaseManager.java

    checkStatus();
    LOG.debug("Shutting down LeaseManager service");
    leaseMonitor.disable();
-    synchronized (monitor) {


shutdown() should be synchronized.

@szetszwo I have currently reverted this, as shutdown is thread-safe. Adding synchronization would require protecting all member variables to satisfy findbugs, which is not required.


szetszwo

+1 the change looks good.

adoroszlai · 2023-05-15T17:10:15Z

Thanks @sumitagrawl for updating the patch, test passed in 900 runs. Please take a look at findbugs failure though:

M M IS: Inconsistent synchronization of org.apache.hadoop.ozone.lease.LeaseManager.activeLeases; locked 66% of time  Unsynchronized access at LeaseManager.java:[line 169]
M C RV: Return value of java.util.concurrent.Semaphore.tryAcquire(long, TimeUnit) ignored in org.apache.hadoop.ozone.lease.LeaseManager$LeaseMonitor.run()  At LeaseManager.java:[line 269]

https://github.com/apache/ozone/actions/runs/4981223788/jobs/8917653084?pr=4688#step:6:2329
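
For illustration, the two warning types (IS and RV) usually go away when every access to the shared map goes through the same lock and the result of `tryAcquire` is actually used. The sketch below is hypothetical and is not how this PR resolved the warnings; the class, its fields, and its method names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the LeaseManager implementation.
class ConsistentLeaseMapSketch {
  private final Map<String, Long> activeLeases = new HashMap<>();
  private final Semaphore semaphore = new Semaphore(0);

  /** Every read and write of activeLeases holds the same lock (this). */
  synchronized void acquire(String key, long expiryMillis) {
    activeLeases.put(key, expiryMillis);
    semaphore.release(); // tell the monitor that a lease was added
  }

  synchronized int activeCount() {
    return activeLeases.size();
  }

  /**
   * Returns true if a new lease was signalled, false if the wait timed out.
   * Passing the result on to the caller means it is not ignored.
   */
  boolean awaitNewLease(long timeoutMs) throws InterruptedException {
    return semaphore.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
  }
}
```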

szetszwo · 2023-05-16T07:15:43Z

@sumitagrawl , please fix findbugs warnings.


adoroszlai · 2023-05-16T15:48:26Z

I don't understand findbugs. How does making shutdown unsynchronized solve the inconsistent synchronization problem? Now activeLeases is locked ~50% of the time.

szetszwo · 2023-05-17T01:25:52Z

... How does making shutdown unsynchronized solve the inconsistent synchronization problem? ...

Is it the case that findbugs won't warn for existing bugs? If this PR adds synchronized, then findbugs will consider the synchronization bug, although pre-existing, to have been added by this PR. Just my guess.

adoroszlai · 2023-05-17T07:14:40Z

Is it the case that findbugs won't warn for existing bugs?

I doubt that. It only sees the current state of the code.

We've also seen findbugs being triggered in apparently unrelated files (#4506 (comment)), and the problem persisted even after the PR was merged, until fixed by another PR.

I would guess this may be caused by findbugs working on compiled bytecode, not on sources.

But I tend to agree with its finding here, we should synchronize consistently (see also #4578 (review)).

szetszwo · 2023-05-17T09:25:02Z

Indeed, we should use a nonblocking solution and not synchronize at all. We probably need to redesign LeaseManager.

BTW, the following code does not make sense at all. Could a thread in dead state be started? Probably not.

//LeaseManager.start()
    leaseMonitorThread.setUncaughtExceptionHandler((thread, throwable) -> {
      // Let us just restart this thread after logging an error.
      // if this thread is not running we cannot handle Lease expiry.
      LOG.error("LeaseMonitor thread encountered an error. Thread: {}",
          thread.toString(), throwable);
      leaseMonitorThread.start();
    });
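
The question can be checked with a small standalone sketch (hypothetical class name, not LeaseManager code). `Thread.start()` is documented to throw `IllegalThreadStateException` if the thread was already started, so the restart attempt inside the handler fails:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch mirroring the handler pattern quoted above.
class ThreadRestartSketch {

  /**
   * Runs a worker that fails, tries to restart it from its own
   * uncaught exception handler, and returns whatever the restart threw
   * (or null if the restart unexpectedly succeeded).
   */
  static Throwable restartFromHandler() throws InterruptedException {
    AtomicReference<Throwable> restartFailure = new AtomicReference<>();
    Thread worker = new Thread(() -> {
      throw new IllegalStateException("simulated monitor failure");
    });
    worker.setUncaughtExceptionHandler((thread, throwable) -> {
      try {
        thread.start(); // same call as in the handler above
      } catch (IllegalThreadStateException e) {
        restartFailure.set(e); // a thread can only be started once
      }
    });
    worker.start();
    worker.join(); // returns after the handler has run
    return restartFailure.get();
  }
}
```

A handler that really needs to resume monitoring would have to create a new `Thread` object rather than call `start()` on the old one.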

I suggest committing this PR if it fixes the test failures, and completely rewriting LeaseManager in the future. @adoroszlai , would you like to trigger the test again?

adoroszlai · 2023-05-17T10:29:20Z

would you like to trigger the test again?

Triggered, will post the results.

adoroszlai · 2023-05-17T11:50:36Z

Thanks @sumitagrawl for the patch, @szetszwo for the review.

Test passed 100% in 1000 runs.

szetszwo · 2023-05-17T14:10:30Z

@sumitagrawl , thanks for working on this!

@adoroszlai , thanks a lot for testing it!


HDDS-8575. Intermittent failure in TestCloseContainerEventHandler.testCloseContainerWithDelayByLeaseManager

224db88

adoroszlai reviewed May 10, 2023


HDDS-8575. Intermittent failure in TestCloseContainerEventHandler.testCloseContainerWithDelayByLeaseManager

f8cbc68

sumitagrawl requested a review from adoroszlai May 11, 2023 16:17

adoroszlai requested a review from szetszwo May 11, 2023 16:23

HDDS-8575. Intermittent failure in TestCloseContainerEventHandler.testCloseContainerWithDelayByLeaseManager

c89abc0

szetszwo reviewed May 15, 2023


HDDS-8575. Intermittent failure in TestCloseContainerEventHandler.testCloseContainerWithDelayByLeaseManager

3d89d98

sumitagrawl requested a review from szetszwo May 15, 2023 14:00

szetszwo approved these changes May 15, 2023


HDDS-8575. Intermittent failure in TestCloseContainerEventHandler.testCloseContainerWithDelayByLeaseManager

c7812ef

adoroszlai merged commit bdd3f4e into apache:master May 17, 2023
