-
Notifications
You must be signed in to change notification settings - Fork 590
HDDS-7284. JVM crash for rocksdb for read/write after close #3801
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@nandakumar131 Please review |
|
@sumitagrawl , thanks for reporting and fixing this issue. Since JVM is crashed for there is still read/write after close, which means there is some issue with current Recon code, threads that read/write rocksdb should be stopped before rocksdb close. Can we try to do that first? We should guarantee close RocksDB at the end, than adding a layer of mechanism to mitigate the code issue. Regarding two solutions, I would prefer solution two, because you cannot make sure how long the wait time is enough for solution one. |
As discussed,
|
To avoid jvm dump, infinite wait is added for close till all usages are completed. |
|
Thanks @sumitagrawl for the fix. |
* master: (718 commits) HDDS-7342. Move encryption-related code from MultipartCryptoKeyInputStream to OzoneCryptoInputStream (apache#3852) HDDS-7413. Fix logging while marking container state unhealthy (apache#3887) Revert "HDDS-7253. Fix exception when '/' in key name (apache#3774)" HDDS-7396. Force close non-RATIS containers in ReplicationManager (apache#3877) HDDS-7121. Support namespace summaries (du, dist & counts) for legacy FS buckets (apache#3746) HDDS-7258. Cleanup the allocated but uncommitted blocks (apache#3778) HDDS-7381. Cleanup of VolumeManagerImpl (apache#3873) HDDS-7253. Fix exception when '/' in key name (apache#3774) HDDS-7182. Add property to control RocksDB max open files (apache#3843) HDDS-7284. JVM crash for rocksdb for read/write after close (apache#3801) HDDS-7368. [Multi-Tenant] Add Volume Existence check in preExecute for OMTenantCreateRequest (apache#3869) HDDS-7403. README Security Improvement (apache#3879) HDDS-7199. Implement new mix workload Read/Write Freon command (apache#3872) HDDS-7248. Recon: Expand the container status page to show all unhealthy container states (apache#3837) HDDS-7141. Recon: Improve Disk Usage Page (apache#3789) HDDS-7369. Fix wrong order of command arguments in Nonrolling-Upgrade.md (apache#3866) HDDS-6210. EC: Add EC metrics (apache#3851) HDDS-7355. non-primordial scm fail to get signed cert from primordial SCM when converting an unsecure cluster to secure (apache#3859) HDDS-7356. Update SCM-HA.zh.md to match the English version (apache#3861) HDDS-6930. SCM,OM,RECON should not print ERROR and exit with code 1 on successful shutdown (apache#3848) ... Conflicts: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestLegacyReplicationManager.java
What changes were proposed in this pull request?
Added check for close of DB before db access, And a counter if any operation in progress
This counter is used for close to make sure operation is closed, and max wait for 5 sec for force close as strategy.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-7284
How was this patch tested?
A test is performed to check if IO Exception after close is comming, as earlier it was causing Java crash,