HDDS-11475: Verify EC reconstruction correctness #7401
Conversation
Thanks @devabhishekpal for working on this. Please wait for a clean CI run in your fork before opening the PR (or marking it "ready for review").
Since this is controlled by a new config, which defaults to the old behavior, I don't think it is validated by any unit tests.

The approach used here is to take the chunk buffer, which holds the real data just written to the block, and calculate the checksum on it. However, that duplicates work: writing the data through the ECBlockOutputStream already performs that checksum and persists it in the block metadata as part of the putBlock. I have had to look at this for some time to understand the current flow. It's been a long time since this EC code was written, and the checksum code was not written by me. @aswinshakil might be a good person for a second look. Starting in the ECReconstructionCoordinator, there is code where it calls … Inside … I think that approach will work, and it avoids calculating the checksum from the data a second time.
aswinshakil
left a comment
Thanks for working on this @devabhishekpal. I have posted some comments below; it would also be good to add some test cases to validate this.
throws OzoneChecksumException {
  // If we have say 100 bytes per checksum, in the stripe the first 100 bytes should
  // correspond to the first chunk checksum, next 100 should be the second chunk checksum
A chunk can have multiple checksums depending on the size of the chunk and bytesPerCrc.
For example, if we have EC 3-2-1024k, we have a 1 MB chunk. The calculation would be correct if bytesPerCrc were also 1 MB, but bytesPerCrc is configurable, and by default #6331 changes this value to 16KB. That means we would have (1024/16) = 64 checksums for each chunk. We need to take that into account as well.
You can take a look at #7230, where I added changes to split the stripeChecksum into parts. But the core idea is the one I mentioned above.
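The arithmetic in this comment can be sketched as follows. This is an illustrative snippet, not Ozone code; the method name is hypothetical, and it assumes fixed 4-byte CRCs with a ceiling division for a possibly short final segment:

```java
public class ChecksumsPerChunkExample {
  // Number of fixed-size checksums covering one chunk: one CRC per
  // bytesPerCrc bytes, rounded up for a partial trailing segment.
  static int checksumsPerChunk(int chunkSize, int bytesPerCrc) {
    return (chunkSize + bytesPerCrc - 1) / bytesPerCrc; // ceiling division
  }

  public static void main(String[] args) {
    int chunkSize = 1024 * 1024;  // 1 MB chunk, as in EC 3-2-1024k
    int bytesPerCrc = 16 * 1024;  // 16 KB default mentioned above
    // 1024 KB / 16 KB = 64 checksums per chunk, not one per chunk.
    System.out.println(checksumsPerChunk(chunkSize, bytesPerCrc)); // prints 64
  }
}
```

So any offset math that assumes one checksum per chunk is off by a factor of checksumsPerChunk whenever bytesPerCrc is smaller than the chunk size.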
int bytesPerChecksum = checksumData.getBytesPerChecksum();
int checksumIdxStart = (bytesPerChecksum * chunkIndex);
ByteString expectedChecksum = stripeChecksum.substring(checksumIdxStart,
Instead of ByteString and substring, we can use ByteBuffer for fine-grained byte-level buffer manipulation. ECBlockChecksumComputer#computeCompositeCrc() has a similar implementation of this.
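A minimal sketch of the ByteBuffer approach this comment suggests. All names here are illustrative (not from the patch), and it assumes the stripe checksum stores each chunk's CRCs contiguously:

```java
import java.nio.ByteBuffer;

public class StripeChecksumSlice {
  // Extract the checksum bytes for one chunk from the stripe checksum
  // using position/limit on a duplicate, instead of ByteString.substring.
  static byte[] chunkChecksum(ByteBuffer stripeChecksum, int chunkIndex,
                              int checksumsPerChunk, int checksumByteLength) {
    int bytesPerChunkChecksum = checksumsPerChunk * checksumByteLength;
    ByteBuffer view = stripeChecksum.duplicate(); // original stays untouched
    view.position(chunkIndex * bytesPerChunkChecksum);
    view.limit(view.position() + bytesPerChunkChecksum);
    byte[] out = new byte[bytesPerChunkChecksum];
    view.get(out);
    return out;
  }

  public static void main(String[] args) {
    // 3 chunks, 2 CRCs each, 4 bytes per CRC -> 24 bytes of stripe checksum.
    byte[] raw = new byte[24];
    for (int i = 0; i < raw.length; i++) raw[i] = (byte) i;
    byte[] second = chunkChecksum(ByteBuffer.wrap(raw), 1, 2, 4);
    System.out.println(second[0]); // chunk 1's CRCs start at byte 8
  }
}
```

Working on a duplicate() avoids mutating the shared buffer's position and limit, which matters if the same stripe checksum buffer is read from more than one place.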
int bytesPerChecksum = recreatedChunkChecksum.getBytesPerChecksum();
int parityLength = (int) (Math.ceil((double) ecChunkSize / bytesPerChecksum) * 4L * parityCount);
// Ignore the parity bits
stripeChecksum.limit(checksumSize - parityLength);
Why are we limiting due to parity? We could be reconstructing a parity index, and it should have a checksum too. Or does the stripe checksum not contain the parity checksums? I cannot remember how this was designed, but if you are reducing the effective stripeChecksum length to remove parity, then parity is likely included in the stripeChecksum.
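For reference, the quoted parityLength expression evaluates as follows for one plausible configuration. This is a sketch of the arithmetic only (EC 3-2, 1 MB chunks, 16 KB bytesPerCrc, 4-byte CRCs are assumptions, not values from the patch):

```java
public class ParityLengthExample {
  // Mirrors the quoted expression: one 4-byte CRC per bytesPerChecksum
  // segment of a chunk, times the number of parity chunks in the stripe.
  static int parityLength(int ecChunkSize, int bytesPerChecksum, int parityCount) {
    return (int) (Math.ceil((double) ecChunkSize / bytesPerChecksum) * 4L * parityCount);
  }

  public static void main(String[] args) {
    // EC 3-2: 2 parity chunks; 1 MB / 16 KB = 64 CRCs per chunk.
    // 64 CRCs * 4 bytes * 2 parity chunks = 512 bytes trimmed.
    System.out.println(parityLength(1024 * 1024, 16 * 1024, 2)); // prints 512
  }
}
```

If the trimmed tail is nonzero, that is consistent with the parity CRCs being stored at the end of the stripe checksum, which is what the question above is probing.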
// Number of Checksums per Chunk = (chunkSize / bytesPerChecksum)
// So the checksum should start from (numOfBytesPerChecksum * (chunkIdx * numOfChecksumPerChunk))
int checksumIdxStart = (ecChunkSize * chunkIndex);
This does not align with the comment above, as we are not considering numOfBytesPerChecksum or numOfChecksumPerChunk?
Also, I am not sure about the above calculation.
What if the bytes per checksum is 100, and the chunk size is 1000, but only 80 bytes were written? In that case, we would expect a stripe (for EC-3-2) that looks like:
Index_1: 80 bytes of data, 4 bytes of checksum.
Index_2: 0 bytes
Index_3: 0 bytes
Index_4: 80 bytes of data, 4 bytes of checksum.
Index_5: 80 bytes of data, 4 bytes of checksum.
Similarly, if 1080 bytes were written, then indexes 1 and 2 will have data, but index 2 has shorter data and a shorter checksum. The logic is different (and simpler) for a full stripe than for a partial stripe.
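The partial-stripe cases described above can be made concrete with a small sketch. All names are illustrative, and it assumes parity cells are as long as the largest (first) data cell, as in the 80-byte example:

```java
public class PartialStripeLayout {
  // Data bytes landing on each index of one EC stripe:
  // data indexes first, then parity indexes.
  static int[] stripeDataLengths(int bytesWritten, int chunkSize,
                                 int dataCount, int parityCount) {
    int[] lengths = new int[dataCount + parityCount];
    int remaining = bytesWritten;
    for (int i = 0; i < dataCount; i++) {
      lengths[i] = Math.min(remaining, chunkSize);
      remaining -= lengths[i];
    }
    for (int i = dataCount; i < lengths.length; i++) {
      lengths[i] = lengths[0]; // parity matches the longest (first) data cell
    }
    return lengths;
  }

  // Checksum bytes for a cell: one 4-byte CRC per bytesPerCrc (rounded up).
  static int checksumBytes(int cellLength, int bytesPerCrc) {
    return ((cellLength + bytesPerCrc - 1) / bytesPerCrc) * 4;
  }

  public static void main(String[] args) {
    // 80 bytes written (EC 3-2, 1000-byte chunks): [80, 0, 0, 80, 80].
    System.out.println(java.util.Arrays.toString(stripeDataLengths(80, 1000, 3, 2)));
    // 1080 bytes written: index 2 gets a shorter cell: [1000, 80, 0, 1000, 1000].
    System.out.println(java.util.Arrays.toString(stripeDataLengths(1080, 1000, 3, 2)));
    System.out.println(checksumBytes(1000, 100)); // 10 CRCs -> 40 bytes
    System.out.println(checksumBytes(80, 100));   // 1 CRC  -> 4 bytes
  }
}
```

This is why a fixed `ecChunkSize * chunkIndex` offset cannot be right for a partial stripe: both the data lengths and the checksum lengths vary per index.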
Thanks @devabhishekpal for the patch. Please revisit if/when you have time.
What changes were proposed in this pull request?
HDDS-11475: Verify EC reconstruction correctness
Please describe your PR in detail:
private StripeWriteStatus commitStripeWrite(ECChunkBuffers stripe)
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11475
How was this patch tested?
Unit tests