Skip to content

Conversation

@sumitagrawl
Copy link
Contributor

@sumitagrawl sumitagrawl commented May 9, 2023

What changes were proposed in this pull request?

LeaseManager have concurrency issue that, when new event is added just before monitor thread go for wait, then notification will be missed and object remains in queue indefinitely till further new event comes and notify.

  • changed the logic using blocking queue for event notification.
  • test case change to validate lease acquire and wait with GenericTestUtils.waitFor

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8575

How was this patch tested?

running testcase multiple time

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sumitagrawl sumitagrawl requested a review from adoroszlai May 11, 2023 16:17
@adoroszlai adoroszlai requested a review from szetszwo May 11, 2023 16:23
Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sumitagrawl , thanks a lot for working on this! Please see the comments inlined and also https://issues.apache.org/jira/secure/attachment/13058203/4688_review.patch

private final long defaultTimeout;
private final Object monitor = new Object();
private Map<T, Lease<T>> activeLeases;
private BlockingQueue<T> leaseKeyBlockingQueue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szetszwo Thanks for review, I have done changes as per patch, but added extra release in shutdown() as possibility for concurrency issue. i.e.

  1. shutdown --> disable MonitorThread (set running to false)
  2. interrupt Monitor thread

For step 2, if MonitorThread waiting, then interrupt will work else ignored. So later moves to waiting for semaphore, then there is no way to exist thread.

So added extra "semaphore.release()" in shutdown to ensure exit from tryAcquire() wait in above case.

checkStatus();
LOG.debug("Shutting down LeaseManager service");
leaseMonitor.disable();
synchronized (monitor) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shutdown() should be synchronized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szetszwo currently I have reverted this as shutdown is threadsafe. Adding synchronization needs protect all member variables as findbug, which is not required.

@sumitagrawl sumitagrawl requested a review from szetszwo May 15, 2023 14:00
Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good.

@adoroszlai
Copy link
Contributor

Thanks @sumitagrawl for updating the patch, test passed in 900 runs. Please take a look at findbugs failure though:

M M IS: Inconsistent synchronization of org.apache.hadoop.ozone.lease.LeaseManager.activeLeases; locked 66% of time  Unsynchronized access at LeaseManager.java:[line 169]
M C RV: Return value of java.util.concurrent.Semaphore.tryAcquire(long, TimeUnit) ignored in org.apache.hadoop.ozone.lease.LeaseManager$LeaseMonitor.run()  At LeaseManager.java:[line 269]

https://github.com/apache/ozone/actions/runs/4981223788/jobs/8917653084?pr=4688#step:6:2329

@szetszwo
Copy link
Contributor

@sumitagrawl , please fix findbugs warnings.

@adoroszlai
Copy link
Contributor

I don't understand findbugs. How does making shutdown unsynchronized solve the inconsistent synchronization problem? Now activeLeases is locked ~50% of the time.

@szetszwo
Copy link
Contributor

... How does making shutdown unsynchronized solve the inconsistent synchronization problem? ...

Is it the case that findbugs won't warn for existing bugs? If this PR adds synchronized, then findbugs will consider that the synchronization bug, although is existing, is added by this. Just my guess.

@adoroszlai
Copy link
Contributor

adoroszlai commented May 17, 2023

Is it the case that findbugs won't warn for existing bugs?

I doubt that. It only sees the current state of the code.

We've also seen findbugs being triggered in apparently unrelated files (#4506 (comment)), and the problem persisted even after the PR was merged, until fixed by another PR.

I would guess this may be caused by findbugs working on compiled bytecode, not on sources.

But I tend to agree with its finding here, we should synchronize consistently (see also #4578 (review)).

@szetszwo
Copy link
Contributor

Indeed, we should use a nonblocking solution and not synchronize at all. We probably need to redesign LeaseManager.

BTW, the following code does not make sense at all. Could a thread in dead state be started? Probably not.

//LeaseManager.start()
    leaseMonitorThread.setUncaughtExceptionHandler((thread, throwable) -> {
      // Let us just restart this thread after logging an error.
      // if this thread is not running we cannot handle Lease expiry.
      LOG.error("LeaseMonitor thread encountered an error. Thread: {}",
          thread.toString(), throwable);
      leaseMonitorThread.start();
    });

I suggest to commit this PR, if it can fix the test failures, and completely rewrite LeaseManager in the future. @adoroszlai , would you like to trigger the test again?

@adoroszlai
Copy link
Contributor

would you like to trigger the test again?

Triggered, will post the results.

@adoroszlai adoroszlai merged commit bdd3f4e into apache:master May 17, 2023
@adoroszlai
Copy link
Contributor

Thanks @sumitagrawl for the patch, @szetszwo for the review.

Test passed 100% in 1000 runs.

@szetszwo
Copy link
Contributor

@sumitagrawl , thanks for working on this!

@adoroszlai , thanks a lot for testing it!

errose28 added a commit to errose28/ozone that referenced this pull request May 17, 2023
* master: (78 commits)
  HDDS-8575. Intermittent failure in TestCloseContainerEventHandler.testCloseContainerWithDelayByLeaseManager (apache#4688)
  HDDS-7241. EC: Reconstruction could fail with orphan blocks. (apache#4718)
  HDDS-8577. [Snapshot] Disable compaction log when loading metadata for snapshot (apache#4697)
  HDDS-7080. EC: Offline reconstruction needs better logging (apache#4719)
  HDDS-8626. Config thread pool in ReplicationServer (apache#4715)
  HDDS-8616. Underreplication not fixed if all replicas start decommissioning (apache#4711)
  HDDS-8254. Close containers when volume reaches utilisation threshold (apache#4583)
  HDDS-8254. Close containers when volume reaches utilisation threshold (apache#4583)
  HDDS-8615. Explicitly show EC block type in 'ozone debug chunkinfo' command output (apache#4706)
  HDDS-8623. Delete duplicate getBucketInfo in OMKeyCommitRequest (apache#4712)
  HDDS-8339. Recon Show the number of keys marked for Deletion in Recon UI. (apache#4519)
  HDDS-8572. Support CodecBuffer for protobuf v3 codecs. (apache#4693)
  HDDS-8010. Improve DN warning message when getBlock does not find the block. (apache#4698)
  HDDS-8621. IOException is never thrown in SCMRatisServer.getRatisRoles(). (apache#4710)
  HDDS-8463. S3 key uniqueness in deletedTable (apache#4660)
  HDDS-8584. Hadoop client write slowly when stream enabled (apache#4703)
  HDDS-7732. EC: Verify block deletion from missing EC containers (apache#4705)
  HDDS-8581. Avoid random ports in integration tests (apache#4699)
  HDDS-8504. ReplicationManager: Pass used and excluded node separately for Under and Mis-Replication (apache#4694)
  HDDS-8576. Close RocksDB instance in RDBStore if RDBStore's initialization fails after RocksDB instance creation (apache#4692)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants