HDFS-17861. The mis-behavior of commitBlockSynchronization may cause standby namenode and observer namenode crash. #8120
base: trunk
…standby namenode and observer namenode crash.
💔 -1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
@Hexiaoqiao @zhangshuyan0 @ayushtkn I have uploaded a unit test. Please help review when you have free time. Thanks a lot.
Hi, @zhtttylz. I found that other merge requests also have the same failed unit test. Could you please check it?
💔 -1 overall
This message was automatically generated.
Thanks for the heads-up! I’ve noticed the same failure in a few other PRs as well. I’ll investigate.
Great! Thanks in advance.
+1. LGTM.
haiyang1987
left a comment
+1. LGTM.
dfs.mkdirs(dir);
dfs.setErasureCodingPolicy(dir, ecPolicy.getName());
}
please remove extra spaces.
@haiyang1987 Have fixed. Thanks a lot for reviewing.
💔 -1 overall
This message was automatically generated.

Description of PR
Recently, our standby NameNode and observer NameNode crashed suddenly. The error stack is as follows:
There are also some other logs:
After diving into the source of FSNamesystem#commitBlockSynchronization, we found the root cause of this critical problem to be as follows:
1. deleteblock == true, so a block of the INodeFile is removed successfully.
2. closeFile == true, so we try to call closeFileCommitBlocks.
3. In the method closeFileCommitBlocks, we hit an IllegalStateException, which causes the edit log write to fail. At this moment, the block metadata on the active NameNode differs from that on the standby (or observer) NameNode.
4. In the next round of block recovery, all operations succeed, and an OP_CLOSE edit log entry is written.
5. The standby NameNode applies that OP_CLOSE edit log entry, finds that the metadata does not match, and crashes.
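The sequence above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Hadoop code: the class EditLogDivergenceSketch, the record ToyCloseOp, and the helpers attempt and replay are all hypothetical names invented for illustration, and the "metadata does not match" check is simplified to a comparison of block lists. It shows how an in-memory change that is never logged leaves the standby unable to replay a later, otherwise valid, edit.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: none of these types are real HDFS classes. */
public class EditLogDivergenceSketch {

    /** Hypothetical edit-log record; NOT the real OP_CLOSE layout. */
    static final class ToyCloseOp {
        final String removedBlock;      // null when this attempt removed nothing
        final List<String> blocksAfter; // active node's block list after the close

        ToyCloseOp(String removedBlock, List<String> blocksAfter) {
            this.removedBlock = removedBlock;
            this.blocksAfter = new ArrayList<>(blocksAfter);
        }
    }

    /** One recovery attempt on the active node; returns true if an edit was logged. */
    static boolean attempt(List<String> activeBlocks, List<ToyCloseOp> editLog,
                           String blockToDelete, boolean closeFails) {
        // Step 1: the in-memory removal happens first.
        boolean removed = activeBlocks.remove(blockToDelete);
        // Steps 2-3: closing the file fails before anything is logged.
        if (closeFails) {
            return false; // simulated exception: memory changed, edit log did not
        }
        // Step 4: a successful attempt logs what it did in this attempt only.
        editLog.add(new ToyCloseOp(removed ? blockToDelete : null, activeBlocks));
        return true;
    }

    /** Step 5: the standby replays the log; returns false on the first mismatch. */
    static boolean replay(List<String> standbyBlocks, List<ToyCloseOp> editLog) {
        for (ToyCloseOp op : editLog) {
            if (op.removedBlock != null) {
                standbyBlocks.remove(op.removedBlock);
            }
            if (!standbyBlocks.equals(op.blocksAfter)) {
                return false; // the real standby would abort here
            }
        }
        return true;
    }

    static boolean standbyStaysConsistent(boolean firstAttemptFails) {
        List<String> active = new ArrayList<>(List.of("blk_1", "blk_2"));
        List<String> standby = new ArrayList<>(active);
        List<ToyCloseOp> editLog = new ArrayList<>();
        if (firstAttemptFails) {
            attempt(active, editLog, "blk_2", true);  // removal is never logged
        }
        attempt(active, editLog, "blk_2", false);     // later attempt succeeds
        return replay(standby, editLog);
    }

    public static void main(String[] args) {
        System.out.println("no failed attempt:  consistent = " + standbyStaysConsistent(false));
        System.out.println("one failed attempt: consistent = " + standbyStaysConsistent(true));
    }
}
```

In this model the standby stays consistent when the removal and the close are logged together, and diverges when a failed attempt mutates the active node's memory without producing an edit. The sketch makes no claim about how the actual patch resolves the problem.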
Below are the key code snippets:
How to test?
Add a unit test. The result is as follows:
Before applying this patch: