HDFS-16598. Fix DataNode FsDatasetImpl lock issue without GS checks.#4366
HDFS-16598. Fix DataNode FsDatasetImpl lock issue without GS checks.#4366Hexiaoqiao merged 4 commits intoapache:trunkfrom
Conversation
|
@Hexiaoqiao @ayushtkn This bug was introduced by HDFS-16534. Because the block GS of client may be smaller than DN when pipeline recovery failed. Please help me review it, thanks. |
|
💔 -1 overall
This message was automatically generated. |
|
@ZanderXu Thanks for your catches. Sorry I don't get why HDFS-16534 could lead GS inconsistency between client and datanode. Thanks again. |
|
In the original PipelineRecovery process, if the pipeline recovery failed, the block GS maybe inconsistency. So during Pipeline Recovery, GS inconsistency is expected. HDFS-16534 has a bug in handling inconsistent GS, and caused All datanodes XXX are bad. Aborting... |
|
Thanks for your feedback.But I don't quite understand why this PR can solves the problem.It seems that you only change getReplicaInfo(x) to getReplicaInfo(x,x), the logic in two method was almost same. |
|
getReplicaInfo(ExtendedBlock b) will check gs, and getReplicaInfo(String bpid, long blkid) will not check the gs. |
|
@ZanderXu LGTM. When we acquire volume lock we should invoke getReplicaInfo(..) to get volume uuid for replica.So the GS is not necessary check for acquire lock stage.The original method will do check GS If GS check is necessary. |
|
@MingXiangLi Thanks for you review. |
@ZanderXu @MingXiangLi |
|
Thanks @slfan1989 for your comment. |
@ZanderXu Thanks for the great catch here.
It is good question. |
@Hexiaoqiao do you mean to change all getReplicaInfo(ExtendedBlock b) to getReplicaInfo(String bpid, long blkid) in fine-grained lock in this PR? |
Sure. It is not necessary to check GS when acquire BP/VOLUME lock which is totally not related with GS IMO. Thanks. |
|
Got, I will do it. |
@Hexiaoqiao @MingXiangLi @slfan1989 I have update the patch, please help me review it. Thanks About this jira, please refer to UT |
| return info; | ||
| } | ||
|
|
||
| ReplicaInfo getReplicaInfoForLock(ExtendedBlock b) |
There was a problem hiding this comment.
Great progress here. How about to integrate getReplicaInfoForLock ang getStorageUuid together to getStorageUuidForLock?
String getStorageUuidForLock(ExtendedBlock b)
throws ReplicaNotFoundException {
return getReplicaInfo(b.getBlockPoolId(), b.getBlockId()).getStorageUuid();
}
There was a problem hiding this comment.
Update the patch, please review again.
…ther to getStorageUuidForLock
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
…ataset.impl.TestWriteToReplica.testAppend
|
@ZanderXu thanks for your contribution. |
|
Yetus reported one failed unit test, |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
@Hexiaoqiao he The failed UT |
Hexiaoqiao
left a comment
There was a problem hiding this comment.
LGTM. +1. Will commit if no other more comments. cc @MingXiangLi
|
LGTM.@ZanderXu thanks for your contribution. |
|
Committed to trunk. Javadoc issue reported by Yetus is not related to this PR. Which is tracking by another ticket. |
|
Thanks @Hexiaoqiao @MingXiangLi and @slfan1989 for your reviews. |
…pache#4366). Contributed by ZanderXu. Reviewed-by: Mingxiang Li <liaiphag0@gmail.com> Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
Detail info please refer HDFS-16598.
All datanodes [DatanodeInfoWithStorage[127.0.0.1:57448,DS-1b5f7e33-a2bf-4edc-9122-a74c995a99f5,DISK]] are bad. Aborting...