HDDS-7284. JVM crash for rocksdb for read/write after close #3801

sumitagrawl · 2022-10-06T06:31:03Z

What changes were proposed in this pull request?

Added check for close of DB before db access, And a counter if any operation in progress
This counter is used for close to make sure operation is closed, and max wait for 5 sec for force close as strategy.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7284

How was this patch tested?

A test is performed to check if IO Exception after close is comming, as earlier it was causing Java crash,

sumitagrawl · 2022-10-06T06:33:01Z

@nandakumar131 Please review

ChenSammi · 2022-10-10T05:02:45Z

@sumitagrawl , thanks for reporting and fixing this issue. Since JVM is crashed for there is still read/write after close, which means there is some issue with current Recon code, threads that read/write rocksdb should be stopped before rocksdb close. Can we try to do that first? We should guarantee close RocksDB at the end, than adding a layer of mechanism to mitigate the code issue.

Regarding two solutions, I would prefer solution two, because you cannot make sure how long the wait time is enough for solution one.

sumitagrawl · 2022-10-11T04:49:02Z

@sumitagrawl , thanks for reporting and fixing this issue. Since JVM is crashed for there is still read/write after close, which means there is some issue with current Recon code, threads that read/write rocksdb should be stopped before rocksdb close. Can we try to do that first? We should guarantee close RocksDB at the end, than adding a layer of mechanism to mitigate the code issue.

Regarding two solutions, I would prefer solution two, because you cannot make sure how long the wait time is enough for solution one.

As discussed,

Solution 1 is done to avoid the crash of integration test, till graceful shutdown is properly implemented and without performance impact.
Graceful shutdown for Recon needs to be created but its low priority one

sumitagrawl · 2022-10-12T13:31:23Z

@sumitagrawl , thanks for reporting and fixing this issue. Since JVM is crashed for there is still read/write after close, which means there is some issue with current Recon code, threads that read/write rocksdb should be stopped before rocksdb close. Can we try to do that first? We should guarantee close RocksDB at the end, than adding a layer of mechanism to mitigate the code issue.
Regarding two solutions, I would prefer solution two, because you cannot make sure how long the wait time is enough for solution one.

As discussed,

Solution 1 is done to avoid the crash of integration test, till graceful shutdown is properly implemented and without performance impact.

Graceful shutdown for Recon needs to be created but its low priority one

To avoid jvm dump, infinite wait is added for close till all usages are completed.

nandakumar131 · 2022-10-25T08:17:03Z

Thanks @sumitagrawl for the fix.

* master: (718 commits) HDDS-7342. Move encryption-related code from MultipartCryptoKeyInputStream to OzoneCryptoInputStream (apache#3852) HDDS-7413. Fix logging while marking container state unhealthy (apache#3887) Revert "HDDS-7253. Fix exception when '/' in key name (apache#3774)" HDDS-7396. Force close non-RATIS containers in ReplicationManager (apache#3877) HDDS-7121. Support namespace summaries (du, dist & counts) for legacy FS buckets (apache#3746) HDDS-7258. Cleanup the allocated but uncommitted blocks (apache#3778) HDDS-7381. Cleanup of VolumeManagerImpl (apache#3873) HDDS-7253. Fix exception when '/' in key name (apache#3774) HDDS-7182. Add property to control RocksDB max open files (apache#3843) HDDS-7284. JVM crash for rocksdb for read/write after close (apache#3801) HDDS-7368. [Multi-Tenant] Add Volume Existence check in preExecute for OMTenantCreateRequest (apache#3869) HDDS-7403. README Security Improvement (apache#3879) HDDS-7199. Implement new mix workload Read/Write Freon command (apache#3872) HDDS-7248. Recon: Expand the container status page to show all unhealthy container states (apache#3837) HDDS-7141. Recon: Improve Disk Usage Page (apache#3789) HDDS-7369. Fix wrong order of command arguments in Nonrolling-Upgrade.md (apache#3866) HDDS-6210. EC: Add EC metrics (apache#3851) HDDS-7355. non-primordial scm fail to get signed cert from primordial SCM when converting an unsecure cluster to secure (apache#3859) HDDS-7356. Update SCM-HA.zh.md to match the English version (apache#3861) HDDS-6930. SCM,OM,RECON should not print ERROR and exit with code 1 on successful shutdown (apache#3848) ... Conflicts: hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/replication/TestLegacyReplicationManager.java

HDDS-7284 JVM crash for rocksdb for read/write after close

4321296

nandakumar131 requested a review from ChenSammi October 6, 2022 07:03

HDDS-7284 JVM crash for rocksdb for read/write after close

4c00b3a

sumitagrawl changed the title ~~HDDS-7284 JVM crash for rocksdb for read/write after close~~ HDDS-7284. JVM crash for rocksdb for read/write after close Oct 6, 2022

HDDS-7284 JVM crash for rocksdb for read/write after close

31d9349

HDDS-7284 JVM crash for rocksdb for read/write after close

40ddf76

adoroszlai requested a review from nandakumar131 October 15, 2022 08:27

nandakumar131 approved these changes Oct 25, 2022

View reviewed changes

nandakumar131 merged commit 2f11175 into apache:master Oct 25, 2022

kaijchen pushed a commit that referenced this pull request Oct 25, 2022

HDDS-7284. JVM crash for rocksdb for read/write after close (#3801)

ffdbf14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HDDS-7284. JVM crash for rocksdb for read/write after close #3801

HDDS-7284. JVM crash for rocksdb for read/write after close #3801

Uh oh!

sumitagrawl commented Oct 6, 2022

Uh oh!

sumitagrawl commented Oct 6, 2022

Uh oh!

ChenSammi commented Oct 10, 2022

Uh oh!

sumitagrawl commented Oct 11, 2022

Uh oh!

sumitagrawl commented Oct 12, 2022

Uh oh!

nandakumar131 commented Oct 25, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

HDDS-7284. JVM crash for rocksdb for read/write after close #3801

HDDS-7284. JVM crash for rocksdb for read/write after close #3801

Uh oh!

Conversation

sumitagrawl commented Oct 6, 2022

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

sumitagrawl commented Oct 6, 2022

Uh oh!

ChenSammi commented Oct 10, 2022

Uh oh!

sumitagrawl commented Oct 11, 2022

Uh oh!

sumitagrawl commented Oct 12, 2022

Uh oh!

nandakumar131 commented Oct 25, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants