HDFS-16533. COMPOSITE_CRC failed between replicated file and striped file due to invalid requested length #4155
Conversation
🎊 +1 overall
This message was automatically generated.
Could you elaborate on what this check does? Looking at the test case, I assume these few lines distinguish replicated vs. striped blocks. Am I right? How about turning them into a helper method that is more readable?
Thanks @jojochuang for your comment.
First, I will explain the goal of the UT:
- Use the same contents to create a replicated file and a striped file.
- Set the conf to use COMPOSITE_CRC.
- Expect getFileChecksum to return the same checksum for the replicated file and the striped file at any requested length (see the sketch below).
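For illustration, here is a minimal sketch of that idea. It is my own sketch, not the patch's actual UT: the policy name, file size, paths, and helper calls are assumptions, and JUnit scaffolding/imports are omitted.

```java
// Hypothetical sketch: write identical data as a replicated file and a
// striped file, enable COMPOSITE_CRC, and expect equal checksums at any
// requested length.
Configuration conf = new HdfsConfiguration();
conf.set("dfs.checksum.combine.mode", "COMPOSITE_CRC");
try (MiniDFSCluster cluster =
    new MiniDFSCluster.Builder(conf).numDataNodes(9).build()) {
  DistributedFileSystem fs = cluster.getFileSystem();
  fs.enableErasureCodingPolicy("RS-6-3-1024k");
  fs.mkdirs(new Path("/striped"));
  fs.setErasureCodingPolicy(new Path("/striped"), "RS-6-3-1024k");

  byte[] data = new byte[8 * 1024 * 1024];
  new Random(0).nextBytes(data);
  Path replicated = new Path("/replicated");
  Path striped = new Path("/striped/f");
  DFSTestUtil.writeFile(fs, replicated, data);
  DFSTestUtil.writeFile(fs, striped, data);

  // The same requested length must yield the same composite checksum.
  for (long len : new long[] {1024, data.length / 2, data.length}) {
    assertEquals(fs.getFileChecksum(replicated, len),
        fs.getFileChecksum(striped, len));
  }
}
```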
Second, I will explain the root cause:
- blockLocations at line 104 contains a list of blocks and a lastLocatedBlock.
- The last block in blocks may not be the same as lastLocatedBlock when the input length is less than the file length.
- So we cannot always use lastLocatedBlock to derive the consumed block length at line 336; the sketch after this list contrasts the two.
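To make the contrast concrete, here is a small hedged sketch. The class and method names are mine; only the DFSClient/LocatedBlocks accessors are real API.

```java
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hdfs.DFSClient;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.hdfs.protocol.LocatedBlocks;

class LastBlockContrast {
  // Illustrative only: contrasts the last block covered by the requested
  // length with the file's last block.
  static void show(DFSClient client, String src, long length)
      throws IOException {
    LocatedBlocks blockLocations = client.getLocatedBlocks(src, 0, length);
    List<LocatedBlock> blocks = blockLocations.getLocatedBlocks();
    // Last block actually covered by `length`.
    LocatedBlock lastInRange = blocks.get(blocks.size() - 1);
    // Always the file's final block, regardless of `length`.
    LocatedBlock lastOfFile = blockLocations.getLastLocatedBlock();
    // When length < file length, these two can differ, so lastOfFile must
    // not be used to derive the consumed length of the last checked block.
  }
}
```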
Whether it is a replicated file or a striped file, we obtain a 4-byte CRC per block, and the actual data length that CRC covers is very important, because line 336 uses it to compute the composite CRC.
Suppose a file has 4 blocks, named block1, block2, block3 and block4 respectively, with sizes 10MB, 10MB, 10MB and 7MB. If I call getFileChecksum(mockFile, 29MB), the correct consumedLastBlockLength at line 336 should be 9MB, but the current logic yields 7MB, taken from the size of the file's last block. So we get an incorrect composite CRC; the demo below walks through the numbers.
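A self-contained demo of that arithmetic (plain Java; the block walk is simplified and only mimics the shape of the real code, it is not the actual FileChecksumHelper logic):

```java
public class ConsumedLengthDemo {
  public static void main(String[] args) {
    final long mb = 1024L * 1024;
    long[] blockSizes = {10 * mb, 10 * mb, 10 * mb, 7 * mb}; // block1..block4
    long length = 29 * mb; // requested checksum length (<= file length)

    // Sum up the blocks that are fully covered by the requested length.
    long sumBlockLengths = 0;
    int i = 0;
    while (i < blockSizes.length && sumBlockLengths + blockSizes[i] < length) {
      sumBlockLengths += blockSizes[i++];
    }

    // Buggy logic: take the size of the file's last block (block4).
    long buggy = blockSizes[blockSizes.length - 1];
    // Fixed logic: cap the remaining length by the current block's size.
    long fixed = Math.min(length - sumBlockLengths, blockSizes[i]);

    System.out.println("buggy = " + buggy / mb + "MB"); // 7MB
    System.out.println("fixed = " + fixed / mb + "MB"); // 9MB
  }
}
```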
@jojochuang @xiao-chen Please help me review the latest PR, thanks.
💔 -1 overall
This message was automatically generated.
Updated the patch to fix checkstyle.
Another way to fix this bug would be something like:
💔 -1 overall
This message was automatically generated.
The failed UTs are not related to this modification; I will create a new PR to fix them.
Both org.apache.hadoop.hdfs.TestClientProtocolForPipelineRecovery.testPipelineRecoveryOnRestartFailure and org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaPlacement.testSynchronousEviction were introduced by HDFS-16534. @jojochuang please help me review it again.
@Hexiaoqiao @jojochuang Could you help me review this patch? The failed UTs are not caused by this modification and have been resolved in another jira.
🎊 +1 overall
This message was automatically generated.
jojochuang
left a comment
I wish the description could be clearer. I think you found a corner case where the length to check is shorter than the actual file length, but the initial jira description wasn't clear about it; it sounded like the failure was due to some randomness. Otherwise it couldn't explain why the existing tests didn't cover this bug.
@jojochuang Sir, I have updated the description and the patch based on the latest trunk. Please help me review it again.
💔 -1 overall
This message was automatically generated.
jojochuang
left a comment
+1. The patch makes sense now. Thank you.
```java
// Old code:
consumedLastBlockLength = length - sumBlockLengths;
}
// New code:
LocatedBlock nextBlock = locatedBlocks.get(i);
long consumedLastBlockLength =
    Math.min(length - sumBlockLengths, nextBlock.getBlockSize());
```
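A nice property of the Math.min form: it covers both the case where the requested length ends inside the block (length - sumBlockLengths wins) and the case where it covers the whole block (the block size wins), without a separate branch for the file's last block.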
Thank you. The new code looks much cleaner/clearer now!
Thanks for your review.
@jojochuang Thank you very much for your review and suggestions.
…file due to invalid requested length. (apache#4155) Co-authored-by: zengqiang.xu <zengqiang.xu@shopee.com>
When I was modifying some code related to getFileChecksum, I found that getFileChecksum will return an incorrect composite checksum for certain requested lengths. The code suspected to have the bug is as follows:
Suppose there is a file with 4 blocks, namely block1, block2, block3 and block4, whose sizes are 10MB, 10MB, 10MB and 7MB respectively.
When I try to get a composite checksum via getFileChecksum with length 29MB, an incorrect checksum is returned. A length of 29MB only involves block1 (10MB), block2 (10MB) and block3 (9MB); it does not involve block4, the file's last block. But the code at lines 323 ~ 331 checks block4, the file's last block, which produces an incorrect composite checksum. In theory, we just need to check the remaining length against block3, not block4; a simplified version of the fixed computation is sketched below.
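For clarity, here is an illustrative, self-contained version of the fixed computation. This is my own simplification, not the literal FileChecksumHelper code; the class and method names are hypothetical.

```java
class ConsumedLength {
  /**
   * Bytes of the last examined block that the requested length actually
   * covers. Illustrative simplification of the patched logic; assumes
   * length <= total file length.
   */
  static long consumedLastBlockLength(long[] blockSizes, long length) {
    long sumBlockLengths = 0;
    for (long blockSize : blockSizes) {
      if (sumBlockLengths + blockSize >= length) {
        // The requested length ends inside (or exactly at the end of)
        // this block: cap the remainder by this block's size.
        return Math.min(length - sumBlockLengths, blockSize);
      }
      sumBlockLengths += blockSize;
    }
    return 0; // unreachable when length <= file length
  }
}
```

With blocks {10, 10, 10, 7} (MB) and length 29MB this returns 9MB, taken from block3, where the old logic returned block4's 7MB.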