
Conversation

@GlenGeng-awx (Contributor) commented Oct 29, 2020

What changes were proposed in this pull request?

The Datanode State Machine Thread should stay alive during the whole lifetime of the Datanode, since it periodically generates heartbeat tasks which trigger the DN to actively talk with SCM. If this thread crashes, the DN becomes a zombie: although it is alive, heartbeats between itself and SCM stop.
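As a rough illustration (hypothetical names, not the actual Ozone code), a periodic loop like the one below dies permanently once its body throws an uncaught Throwable such as OutOfMemoryError: the thread exits, no further heartbeat tasks are scheduled, and the rest of the process keeps running as a zombie.

public class HeartbeatLoop implements Runnable {
  private volatile boolean running = true;

  @Override
  public void run() {
    while (running) {
      scheduleHeartbeatTasks();   // if this throws OutOfMemoryError, run() exits for good
      try {
        Thread.sleep(30_000);     // illustrative heartbeat interval
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  private void scheduleHeartbeatTasks() {
    // placeholder for building HeartbeatEndpointTask-like work
  }
}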

In Tencent's internal production environment, we got several dead DNs which could never come back without a restart.

We found that the thread "Datanode State Machine Thread - 0" did not exist in the jstack output, so no HeartbeatEndpointTask would be created; such a DN soon becomes dead and cannot recover unless restarted.

After checking the .out log, we saw that an OOM occurred in the thread "Datanode State Machine Thread", which should be responsible for this issue:

300010.579: Total time for which application threads were stopped: 3.0848769 seconds, Stopping threads took: 0.0000943 seconds
Exception in thread "Datanode State Machine Thread - 0" java.lang.OutOfMemoryError: Java heap space
300010.579: Application time: 0.0001554 seconds
300010.580: Total time for which application threads were stopped: 0.0015600 seconds, Stopping threads took: 0.0002747 seconds
300010.581: Application time: 0.0004684 seconds
{Heap before GC invocations=13766 (full 11664):
 PSYoungGen total 3441664K, used 3388416K [0x000000072ab00000, 0x0000000800000000, 0x0000000800000000)
 eden space 3388416K, 100% used [0x000000072ab00000,0x00000007f9800000,0x00000007f9800000)
 from space 53248K, 0% used [0x00000007fcc00000,0x00000007fcc00000,0x0000000800000000)
 to space 53248K, 0% used [0x00000007f9800000,0x00000007f9800000,0x00000007fcc00000)
 ParOldGen total 6990848K, used 6990848K [0x0000000580000000, 0x000000072ab00000, 0x000000072ab00000)
 object space 6990848K, 100% used [0x0000000580000000,0x000000072ab00000,0x000000072ab00000)
 Metaspace used 55150K, capacity 57816K, committed 59224K, reserved 1101824K
 class space used 5922K, capacity 6372K, committed 6744K, reserved 1048576K

BTW, after running the DN for more than a week, we saw a lot of "java.lang.OutOfMemoryError: GC overhead limit exceeded" entries in the DN's log. Since we configured a dead Recon, we guess this could be evidence for HDDS-4404.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4408

How was this patch tested?

CI

@mukul1987 (Contributor) commented:

-1, let's analyze the heap dump to understand the root cause of the memory consumption.

@GlenGeng-awx changed the title from "HDDS-4408: Datanode State Machine Thread needs handle OutOfMemoryError" to "Datanode State Machine Thread should keep alive during the whole lifetime of Datanode" on Oct 30, 2020
@GlenGeng-awx (Contributor, Author) commented Oct 30, 2020

@mukul1987 I think the previous description of this PR/Jira was not proper, so I've updated it.

First of all, the key point here is not the OOM. The key point is that the Datanode State Machine Thread should stay alive during the whole lifetime of the Datanode. Without this thread, the DN is just a zombie.

Now for the OOM. Its root cause is quite straightforward: as explained in HDDS-4404, we have a dead Recon, so reports destined for Recon pile up in StateContext; after running for more than a week, the DN finally suffers continual OOM.

BTW, the -Xmx is 10G.
[Screenshot 2020-10-30 3:44:17 PM]
[Screenshot 2020-10-30 3:49:07 PM]
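To make the root cause concrete: while Recon is unreachable, the per-endpoint report queue grows without bound. A minimal sketch of the bounding idea (hypothetical class, not the actual StateContext code) is to evict the oldest report once a fixed capacity is reached, so a dead endpoint cannot exhaust the heap:

import java.util.ArrayDeque;
import java.util.Deque;

public class BoundedReportQueue<T> {
  private static final int MAX_REPORTS = 1000;   // illustrative limit

  private final Deque<T> reports = new ArrayDeque<>();

  public synchronized void add(T report) {
    if (reports.size() >= MAX_REPORTS) {
      reports.pollFirst();                       // drop the oldest report
    }
    reports.addLast(report);
  }

  public synchronized T poll() {
    return reports.pollFirst();
  }
}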

@GlenGeng-awx changed the title from "Datanode State Machine Thread should keep alive during the whole lifetime of Datanode" to "HDDS-4408: Datanode State Machine Thread should keep alive during the whole lifetime of Datanode" on Oct 30, 2020
A reviewer (Contributor) commented:

Would it be better for the DN to just self-terminate if there is an uncaught exception in the state machine thread? Do we know what the exact uncaught exception was?
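A minimal sketch of the self-terminate option (illustrative code, not necessarily how the final patch is implemented): install an uncaught exception handler on the state machine thread that halts the process, so an external supervisor can restart the Datanode instead of leaving it as a zombie.

Runnable stateMachineLoop = () -> {
  // placeholder for the periodic heartbeat/state-machine work
};
Thread stateMachineThread = new Thread(stateMachineLoop,
    "Datanode State Machine Thread");
stateMachineThread.setUncaughtExceptionHandler((t, e) -> {
  // log the fatal error, then halt so a supervisor can restart the process
  System.err.println("Fatal error in " + t.getName() + ": " + e);
  Runtime.getRuntime().halt(1);
});
stateMachineThread.start();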

@GlenGeng-awx (Author) replied:

Yes, we know the uncaught exception: it is the OOM due to queued reports for a dead Recon. You can find the details in the conversation above.

Either an HA solution or a fail-fast solution is fine with me.
BTW, the restart-the-thread solution is already used in cmdProcessThread and leaseMonitorThread.
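For comparison, a minimal sketch of the restart/keep-alive alternative (illustrative names, not the actual cmdProcessThread or leaseMonitorThread code): catch Throwable inside the loop so a single failing cycle does not kill the periodic thread for the rest of the Datanode's lifetime.

public class KeepAliveLoop implements Runnable {
  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        doOneCycle();              // one heartbeat/state-machine cycle
      } catch (Throwable t) {
        // log and continue; the next cycle may succeed
        System.err.println("State machine cycle failed: " + t);
      }
    }
  }

  private void doOneCycle() {
    // placeholder for the real per-cycle work
  }
}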

@prashantpogde (Contributor) commented:

@GlenGeng thank you for the patch.
This can be addressed in other ways as well:

  • let the datanode die and restart
  • handle this at the Recon end: identify the problem at Recon and get it restarted. Why are other datanodes not exhibiting the same behavior? It should eventually be the same for other datanodes as well.
  • do not cache reports if the queue exceeds some limit.

Overall this is an unhealthy situation for the whole cluster. How would keeping the datanode state machine thread alive help? Instead, it would keep reporting in the heartbeat that the datanode is healthy.

@GlenGeng-awx (Contributor, Author) commented Nov 10, 2020

@prashantpogde

Why are other datanodes not exhibiting the same behavior? It should eventually be the same for other datanodes as well.

You are right. All DNs in our cluster are eventually affected by this OOM issue.

let the datanode die and restart

Agreed. Restarting the thread will not help the DN recover to a healthy state for this OOM issue.
It is just a defensive strategy: there might be other runtime exceptions or errors, such as an NPE, after which the DN is still in a healthy state; in those cases restarting this thread would be a decent solution.

@ChenSammi (Contributor) commented:

The latest patch LGTM, +1.

I will merge it soon since we need this patch in our production environment.

Thanks @GlenGeng for the contribution, and @mukul1987 @arp7 @prashantpogde for the review and suggestions.

@ChenSammi merged commit 277a589 into apache:master on Nov 13, 2020
errose28 added a commit to errose28/ozone that referenced this pull request Nov 18, 2020
* master: (53 commits)
  HDDS-4458. Fix Max Transaction ID value in OM. (apache#1585)
  HDDS-4442. Disable the location information of audit logger to reduce overhead (apache#1567)
  HDDS-4441. Add metrics for ACL related operations.(Addendum for HA). (apache#1584)
  HDDS-4081. Create ZH translation of StorageContainerManager.md in doc. (apache#1558)
  HDDS-4080. Create ZH translation of OzoneManager.md in doc. (apache#1541)
  HDDS-4079. Create ZH translation of Containers.md in doc. (apache#1539)
  HDDS-4184. Add Features menu for Chinese document. (apache#1547)
  HDDS-4235. Ozone client FS path validation is not present in OFS. (apache#1582)
  HDDS-4338. Fix the issue that SCM web UI banner shows "HDFS SCM". (apache#1583)
  HDDS-4337. Implement RocksDB options cache for new datanode DB utilities. (apache#1544)
  HDDS-4083. Create ZH translation of Recon.md in doc (apache#1575)
  HDDS-4453. Replicate closed container for random selected datanodes. (apache#1574)
  HDDS-4408: terminate Datanode when Datanode State Machine Thread got uncaught exception. (apache#1533)
  HDDS-4443. Recon: Using Mysql database throws exception and fails startup (apache#1570)
  HDDS-4315. Use Epoch to generate unique ObjectIDs (apache#1480)
  HDDS-4455. Fix typo in README.md doc (apache#1578)
  HDDS-4441. Add metrics for ACL related operations. (apache#1571)
  HDDS-4437. Avoid unnecessary builder conversion in setting volume Quota/Owner request (apache#1564)
  HDDS-4417. Simplify Ozone client code with configuration object (apache#1542)
  HDDS-4363. Add metric to track the number of RocksDB open/close operations. (apache#1530)
  ...
@arp7 requested review from avijayanhwx and swagle on January 11, 2021