Skip to content

Conversation

@sumitagrawl
Copy link
Contributor

What changes were proposed in this pull request?

Added check for close of DB before db access, And a counter if any operation in progress
This counter is used for close to make sure operation is closed, and max wait for 5 sec for force close as strategy.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7284

How was this patch tested?

A test is performed to check if IO Exception after close is comming, as earlier it was causing Java crash,

@sumitagrawl
Copy link
Contributor Author

@nandakumar131 Please review

@sumitagrawl sumitagrawl changed the title HDDS-7284 JVM crash for rocksdb for read/write after close HDDS-7284. JVM crash for rocksdb for read/write after close Oct 6, 2022
@ChenSammi
Copy link
Contributor

@sumitagrawl , thanks for reporting and fixing this issue. Since JVM is crashed for there is still read/write after close, which means there is some issue with current Recon code, threads that read/write rocksdb should be stopped before rocksdb close. Can we try to do that first? We should guarantee close RocksDB at the end, than adding a layer of mechanism to mitigate the code issue.

Regarding two solutions, I would prefer solution two, because you cannot make sure how long the wait time is enough for solution one.

@sumitagrawl
Copy link
Contributor Author

@sumitagrawl , thanks for reporting and fixing this issue. Since JVM is crashed for there is still read/write after close, which means there is some issue with current Recon code, threads that read/write rocksdb should be stopped before rocksdb close. Can we try to do that first? We should guarantee close RocksDB at the end, than adding a layer of mechanism to mitigate the code issue.

Regarding two solutions, I would prefer solution two, because you cannot make sure how long the wait time is enough for solution one.

As discussed,

  1. Solution 1 is done to avoid the crash of integration test, till graceful shutdown is properly implemented and without performance impact.
  2. Graceful shutdown for Recon needs to be created but its low priority one

@sumitagrawl
Copy link
Contributor Author

@sumitagrawl , thanks for reporting and fixing this issue. Since JVM is crashed for there is still read/write after close, which means there is some issue with current Recon code, threads that read/write rocksdb should be stopped before rocksdb close. Can we try to do that first? We should guarantee close RocksDB at the end, than adding a layer of mechanism to mitigate the code issue.
Regarding two solutions, I would prefer solution two, because you cannot make sure how long the wait time is enough for solution one.

As discussed,

  1. Solution 1 is done to avoid the crash of integration test, till graceful shutdown is properly implemented and without performance impact.
  2. Graceful shutdown for Recon needs to be created but its low priority one

To avoid jvm dump, infinite wait is added for close till all usages are completed.

@nandakumar131 nandakumar131 merged commit 2f11175 into apache:master Oct 25, 2022
@nandakumar131
Copy link
Contributor

Thanks @sumitagrawl for the fix.

errose28 added a commit to errose28/ozone that referenced this pull request Oct 26, 2022
* master: (718 commits)
  HDDS-7342. Move encryption-related code from MultipartCryptoKeyInputStream to OzoneCryptoInputStream (apache#3852)
  HDDS-7413. Fix logging while marking container state unhealthy (apache#3887)
  Revert "HDDS-7253. Fix exception when '/' in key name (apache#3774)"
  HDDS-7396. Force close non-RATIS containers in ReplicationManager (apache#3877)
  HDDS-7121. Support namespace summaries (du, dist & counts) for legacy FS buckets (apache#3746)
  HDDS-7258. Cleanup the allocated but uncommitted blocks (apache#3778)
  HDDS-7381. Cleanup of VolumeManagerImpl (apache#3873)
  HDDS-7253. Fix exception when '/' in key name (apache#3774)
  HDDS-7182. Add property to control RocksDB max open files (apache#3843)
  HDDS-7284. JVM crash for rocksdb for read/write after close (apache#3801)
  HDDS-7368. [Multi-Tenant] Add Volume Existence check in preExecute for OMTenantCreateRequest (apache#3869)
  HDDS-7403. README Security Improvement (apache#3879)
  HDDS-7199. Implement new mix workload Read/Write Freon command (apache#3872)
  HDDS-7248. Recon: Expand the container status page to show all unhealthy container states (apache#3837)
  HDDS-7141. Recon: Improve Disk Usage Page (apache#3789)
  HDDS-7369. Fix wrong order of command arguments in Nonrolling-Upgrade.md (apache#3866)
  HDDS-6210. EC: Add EC metrics (apache#3851)
  HDDS-7355. non-primordial scm fail to get signed cert from primordial SCM when converting an unsecure cluster to secure (apache#3859)
  HDDS-7356. Update SCM-HA.zh.md to match the English version (apache#3861)
  HDDS-6930. SCM,OM,RECON should not print ERROR and exit with code 1 on successful shutdown (apache#3848)
  ...

Conflicts:
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestLegacyReplicationManager.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants