Missing detection of disk availability when sending heartbeats. #287

Open
cserwen opened this issue Apr 24, 2023 · 2 comments

cserwen commented Apr 24, 2023

In RocketMQ DLedger, when the disk that stores the data or index files fails, the broker group may end up in a No-Master state.
The following timeline shows how the problem unfolds:

# Process

Initial state: term=2, n2 is the Master node

  • 2023-04-13 02:04:53, the disk on n2 fails; in term=3, n0 becomes the Master
2023-04-13 02:04:53 INFO QuorumAckChecker-n0 - [n0][LEADER] term=3 ledgerBegin=133702137569 ledgerEnd=137547196422 committed=137547196422 watermarks={3:{"n0":137547196422,"n1":137547196422,"n2":-1}}  
2023-04-13 02:04:56 WARN DLedgerServer-ScheduledExecutor - preferredLeaderId = n2 is not online 
  • 2023-04-13 02:05:08, term=3, n0 detected that n2 (the preferred leader) was online again and handed over the Master role.
2023-04-13 02:05:08 INFO DLedgerServer-ScheduledExecutor - preferredLeaderId = n2, which has the smallest fall behind index = 12 and is decided to be transferee.  
2023-04-13 02:05:08 INFO DLedgerServer-ScheduledExecutor - handleLeadershipTransfer: LeadershipTransferRequest{transferId='null', transfereeId='n2', takeLeadershipLedgerIndex=0, group='null', remoteId='null', localId='null', code=200, leaderId='null', term=3}
  • 2023-04-13 02:05:08, term=4, n2 becomes the Master
2023-04-13 02:05:08 INFO StateMaintainer - [n2] [PARSE_VOTE_RESULT] cost=1 term=4 memberNum=3 allNum=2 acceptedNum=2 notReadyTermNum=0 biggerLedgerNum=0 alreadyHasLeader=false maxTerm=4 result=PASSED
2023-04-13 02:05:08 INFO StateMaintainer - [n2] [VOTE_RESULT] has been elected to be the leader in term 4
2023-04-13 02:05:08 INFO StateMaintainer - TakeLeadershipTask finished. request=LeadershipTransferRequest{transferId='n0', transfereeId='n2', takeLeadershipLedgerIndex=137547318699, group='c4cloudsrv-miot-rocketmq-raft8', remoteId='n2', localId='n0', code=200, leaderId='n0', term=3}, response=LeadershipTransferResponse{group='null', remoteId='null', localId='null', code=200, leaderId='null', term=4}, term=4, role=LEADER
2023-04-13 02:05:08 INFO StateMaintainer - [n2] [ChangeRoleToLeader] from term: 4 and currTerm: 4
  • 2023-04-13 02:05:27, term=4, the Broker on n2 failed to handle the role change to Master
2023-04-13 02:05:08 INFO DLegerRoleChangeHandler_1 - Begin handling broker role change term=4 role=LEADER currStoreRole=SLAVE
2023-04-13 02:05:27 INFO DLegerRoleChangeHandler_1 - [MONITOR]Failed handling broker role change term=4 role=LEADER currStoreRole=SLAVE cost=19334
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code

In the final state, DLedger elected n2 as the Master, but the Broker group had no Master, and no new election has been initiated since then.

# Question

  • Why did n0 and n1 not initiate an election after n2 became the Master in term 4?

If the Master stops sending heartbeats to the followers, a follower will trigger an election; but as long as heartbeats keep arriving normally, the slave nodes will not initiate one.

DLedger relies on the memberState object lock to detect disk failures indirectly. When a message is written, the lock is held for the duration of the disk write. If the disk fails, the write hangs and the lock is not released in time, so the heartbeat thread cannot acquire it and heartbeats stop. Writing messages is therefore the trigger for detecting a disk failure: if the client stops writing, the heartbeat thread can always acquire the lock and keeps sending heartbeats, as sketched below.

[Diagram: dledger-lock.drawio]
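To make the dependency concrete, here is a minimal Java sketch of the locking behaviour described above. The names (memberStateLock, appendToDisk, sendHeartbeatToFollowers) are hypothetical and this is not DLedger's actual code, only a model: the append path holds the shared lock while writing, and the heartbeat thread needs the same lock before each heartbeat.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical names throughout; not DLedger's real classes, only a model of
// the lock dependency between the append path and the heartbeat thread.
class LeaderLockSketch {
    private final Object memberStateLock = new Object();
    private volatile boolean running = true;

    // Append path: the lock is held for the whole disk write. On a faulty
    // disk the write can hang, so the lock is never released.
    void handleAppend(byte[] data) {
        synchronized (memberStateLock) {
            appendToDisk(data); // may block indefinitely if the disk has failed
        }
    }

    // Heartbeat path: needs the same lock. A stuck append therefore stops
    // heartbeats, the followers time out, and a new election is triggered.
    // With no client writes in flight, the lock is always free, so a leader
    // whose disk is already dead keeps heart-beating and no election happens.
    void heartbeatLoop() throws InterruptedException {
        while (running) {
            synchronized (memberStateLock) {
                sendHeartbeatToFollowers();
            }
            TimeUnit.MILLISECONDS.sleep(2000); // heartbeat interval
        }
    }

    private void appendToDisk(byte[] data) { /* write to data/index files */ }
    private void sendHeartbeatToFollowers() { /* push heartbeat requests */ }
}
```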

# TODO

Even if no data is written, the node with the faulty disk can still become the Master. Therefore, I think it is necessary to add a task that periodically checks whether the disks are available, in order to avoid this situation. A rough sketch of such a check is given below.
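A possible shape for the periodic check, as a sketch only: the probe file name, the 10-second interval, and the onDiskUnavailable() reaction are assumptions for illustration, not an existing DLedger API.

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: names and parameters are assumptions, not DLedger's API.
class DiskHealthChecker {
    private final File probeFile;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    DiskHealthChecker(String storePath) {
        this.probeFile = new File(storePath, ".disk_probe");
    }

    void start() {
        // Periodically write and fsync a small probe file on the data disk.
        scheduler.scheduleWithFixedDelay(this::checkOnce, 0, 10, TimeUnit.SECONDS);
    }

    private void checkOnce() {
        try (RandomAccessFile raf = new RandomAccessFile(probeFile, "rw")) {
            raf.write("ok".getBytes(StandardCharsets.UTF_8));
            raf.getFD().sync(); // force the write through to the device
        } catch (Exception e) {
            // The disk looks unavailable: stop sending heartbeats (or step
            // down) so that the other nodes can elect a new leader.
            onDiskUnavailable(e);
        }
    }

    void onDiskUnavailable(Exception cause) { /* e.g. mark node unhealthy */ }
}
```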


humkum commented Apr 24, 2023

I'd like to follow this issue. Please assign this issue to me, thanks.

@TheR1sing3un
Contributor

> I'd like to follow this issue. Please assign this issue to me, thanks.

Welcome~ You can write a brief improvement proposal~
