
Conversation

@devabhishekpal
Contributor

What changes were proposed in this pull request?

HDDS-11475: Verify EC reconstruction correctness

Please describe your PR in detail:

  • In the current implementation, the stripe checksum is formed in ECKeyOutputStream, in private StripeWriteStatus commitStripeWrite(ECChunkBuffers stripe).
  • To verify the recreated data we can use the stripe checksum, which is a concatenation of all the chunk checksums in the stripe: compare the recreated chunk's checksum with the slice of the stripe checksum at the recreated index to verify that the correct data was recreated.
For example, for EC 3-2 we will have chunks c1, c2, c3, c4, c5:
(stripe checksum) s1 = concatenation of the checksums of c1 to c5

Let the recreated chunk be c2.
Then:
checksum(c2) == 2nd checksum in the s1 checksum sequence
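The comparison described above can be sketched as follows. This is a toy illustration, not the Ozone implementation: it uses plain CRC32 over byte arrays, whereas the real stripe checksum is carried in the block metadata as protobuf ChecksumData.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

public class StripeChecksumSketch {
  // Toy 4-byte CRC32 checksum of a chunk's bytes.
  static byte[] checksumOf(byte[] chunk) {
    CRC32 crc = new CRC32();
    crc.update(chunk);
    long v = crc.getValue();
    return new byte[] {
        (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v };
  }

  public static void main(String[] args) {
    // EC 3-2: five chunks c1..c5 in one stripe.
    byte[][] chunks = new byte[5][];
    for (int i = 0; i < 5; i++) {
      chunks[i] = ("chunk-" + (i + 1)).getBytes(StandardCharsets.UTF_8);
    }

    // Stripe checksum s1 = concatenation of the per-chunk checksums.
    byte[] stripeChecksum = new byte[5 * 4];
    for (int i = 0; i < 5; i++) {
      System.arraycopy(checksumOf(chunks[i]), 0, stripeChecksum, i * 4, 4);
    }

    // Suppose c2 (index 1) was reconstructed: its checksum must equal
    // the second 4-byte slice of the stripe checksum.
    byte[] recreated = checksumOf(chunks[1]);
    byte[] expected = Arrays.copyOfRange(stripeChecksum, 1 * 4, 2 * 4);
    if (!Arrays.equals(recreated, expected)) {
      throw new AssertionError("reconstructed chunk checksum mismatch");
    }
    System.out.println("match");
  }
}
```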

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11475

How was this patch tested?

Unit tests

@devabhishekpal devabhishekpal marked this pull request as draft November 6, 2024 21:48
@devabhishekpal devabhishekpal marked this pull request as ready for review November 8, 2024 11:00
@adoroszlai adoroszlai marked this pull request as draft November 8, 2024 11:43
@adoroszlai
Contributor

Thanks @devabhishekpal for working on this.

Please wait for clean CI run in fork before opening PR (or marking as "ready for review").

Unit tests

Since this is controlled by a new config, which defaults to the old behavior, I don't think it is validated by any unit tests.

@sodonnel
Contributor

sodonnel commented Nov 8, 2024

The approach used here is to take the chunk buffer, which holds the real data just written to the block, and calculate the checksum on it.

However, that duplicates work: the act of writing the data through the ECBlockOutputStream already performs that checksum and persists it in the block metadata as part of the putBlock.

I have had to look at this for some time to try to understand the current flow. It's been a long time since this EC code was written, and the checksum logic was not written by me. @aswinshakil might be a good person for a second look.

Starting in the ECReconstructionCoordinator, there is code where it calls executePutBlock(...) on the reconstructed streams. Here, I think, is where we can validate that the checksums match the stripe checksum:

        for (ECBlockOutputStream targetStream : allStreams) {

          // You can get the current chunkList and its checksums, calculated while writing. These are what will be written
          // as part of the putBlock call. If we get them here, each chunk has its checksums.
          // Using blockDataGroup, which is all the blockData that existed on the containers prior to any reconstruction, we can
          // search it for one which contains the stripeChecksum. We know it lives in replicaIndex=1 or any parity; however, you
          // may not have index 1 (it could be the one getting reconstructed) or all the parities, but you must have at least one
          // of them for the group to be reconstructable. Therefore you must search until it is found.
          //
          // targetStream.getContainerBlockData().getChunksList().get(0).getChecksumData();
          // blockDataGroup[0].getChunks().get(0).getStripeChecksum();
          //
          // From the above, if you have the chunkList and hence its checksums for the current stream, and you can locate
          // the existing stripe checksum in the blockDataGroup, then you can "simply" iterate the chunkList:
          //
          //  List<Chunk> chunks = targetStream.getContainerBlockData().getChunksList();
          //  List<Chunk> existingChunks = blockDataGroup[0].getChunks();
          //  for (int i = 0; i < chunks.size(); i++) {
          //    validateChecksum(chunks.get(i).getChecksumData(), existingChunks.get(i).getStripeChecksum());
          //  }
          //

          targetStream.executePutBlock(true, true, blockLocationInfo.getLength(), blockDataGroup);
          checkFailures(targetStream, targetStream.getCurrentPutBlkResponseFuture());
        }

Inside validateChecksum() you need to figure out how to index into the stripe checksum to find the relevant part of it to compare against the chunk checksum.

I think that approach will work, and it avoids calculating the checksum from the data a second time.
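A minimal sketch of the validateChecksum() helper suggested above, using plain byte arrays (the name and signature are hypothetical; the real arguments would be the protobuf ChecksumData and the stripe checksum ByteString). Note it assumes one fixed-size checksum per chunk, which, as later comments point out, does not hold when bytesPerCrc is smaller than the chunk size:

```java
import java.util.Arrays;

public class ValidateChecksumSketch {
  // Hypothetical helper: compare one reconstructed chunk's checksum bytes
  // against the corresponding slice of the existing stripe checksum.
  static void validateChecksum(byte[] chunkChecksum, byte[] stripeChecksum,
      int chunkIndex) {
    int len = chunkChecksum.length;
    byte[] expected = Arrays.copyOfRange(
        stripeChecksum, chunkIndex * len, (chunkIndex + 1) * len);
    if (!Arrays.equals(chunkChecksum, expected)) {
      throw new IllegalStateException(
          "Checksum mismatch for reconstructed chunk at index " + chunkIndex);
    }
  }

  public static void main(String[] args) {
    // Stripe checksum = checksum(c0) || checksum(c1) || checksum(c2),
    // each 4 bytes in this toy example.
    byte[] stripe = {1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3};
    validateChecksum(new byte[] {2, 2, 2, 2}, stripe, 1); // matches slice 1
    System.out.println("ok");
  }
}
```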

@aswinshakil
Member

Thanks for working on this @devabhishekpal. I have posted some comments below, some test cases to validate this would be good.

throws OzoneChecksumException {

// If we have say 100 bytes per checksum, in the stripe the first 100 bytes should
// correspond to the first chunk checksum, next 100 should be the second chunk checksum
Member

A chunk can have multiple checksums depending on the size of the chunk and bytesPerCrc.
For example, if we have EC 3-2-1024k, we have a 1 MB chunk. The calculation would be correct if bytesPerCrc were also 1 MB, but bytesPerCrc is configurable, and by default #6331 changes this value to 16 KB. That means we would have (1024/16) = 64 checksums for each chunk. We need to take that into account as well.
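The arithmetic above can be made concrete. This sketch (illustrative values only, assuming 4-byte CRCs) shows how the number of checksums per chunk drives the byte offset of a chunk's checksums inside the stripe checksum:

```java
public class ChecksumOffsetSketch {
  public static void main(String[] args) {
    // EC 3-2-1024k with bytesPerCrc = 16 KB, as in the comment above.
    int chunkSize = 1024 * 1024;       // 1 MB chunk
    int bytesPerChecksum = 16 * 1024;  // 16 KB per CRC
    int checksumLen = 4;               // 4 bytes per CRC32 value

    // ceil(chunkSize / bytesPerChecksum) checksums per full chunk.
    int checksumsPerChunk =
        (chunkSize + bytesPerChecksum - 1) / bytesPerChecksum;
    System.out.println(checksumsPerChunk); // 64, not 1

    // Byte offset of chunk i's checksum run within the stripe checksum:
    // every preceding chunk contributes checksumsPerChunk * checksumLen bytes.
    int chunkIndex = 2;
    int offset = chunkIndex * checksumsPerChunk * checksumLen;
    System.out.println(offset); // 512
  }
}
```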

You can take a look at #7230, where I have added changes to split the stripeChecksum into parts. But the core idea is the one I mentioned above.

int bytesPerChecksum = checksumData.getBytesPerChecksum();

int checksumIdxStart = (bytesPerChecksum * chunkIndex);
ByteString expectedChecksum = stripeChecksum.substring(checksumIdxStart,
Member

Instead of ByteString and substring, we can use ByteBuffer for fine-grained byte-level buffer manipulation. ECBlockChecksumComputer#computeCompositeCrc() has a similar implementation.
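The ByteBuffer position/limit/slice pattern being suggested looks roughly like this (toy data; the real code would wrap the stripe checksum bytes from the block metadata):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ByteBufferSliceSketch {
  public static void main(String[] args) {
    // Stripe checksum bytes for three chunks, one 4-byte checksum each.
    byte[] stripe = {1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3};
    int chunkIndex = 1;
    int checksumLen = 4;

    // position/limit bound a view over chunk 1's checksum without copying
    // the whole array, unlike ByteString.substring.
    ByteBuffer buf = ByteBuffer.wrap(stripe);
    buf.position(chunkIndex * checksumLen);
    buf.limit((chunkIndex + 1) * checksumLen);
    ByteBuffer slice = buf.slice();

    byte[] out = new byte[slice.remaining()];
    slice.get(out);
    System.out.println(Arrays.toString(out)); // [2, 2, 2, 2]
  }
}
```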

int bytesPerChecksum = recreatedChunkChecksum.getBytesPerChecksum();
int parityLength = (int) (Math.ceil((double)ecChunkSize / bytesPerChecksum) * 4L * parityCount);
// Ignore the parity bits
stripeChecksum.limit(checksumSize - parityLength);
Contributor

Why are we limiting due to parity? We could be reconstructing a parity index, and it should have a checksum too. Or does the stripe checksum not contain the parity checksums? I cannot remember how this was designed, but if you are reducing the effective stripeChecksum length to remove parity, then parity is likely included in the stripe checksum.

// Number of Checksums per Chunk = (chunkSize / bytesPerChecksum)
// So the checksum should start from (numOfBytesPerChecksum * (chunkIdx * numOfChecksumPerChunk)

int checksumIdxStart = (ecChunkSize * chunkIndex);
Contributor

This does not align with the comment above, as we are not considering numOfBytesPerChecksum or numOfChecksumPerChunk?

Also, I am not sure about the above calculation.

What if the bytes per checksum is 100, and the chunk size is 1000, but only 80 bytes were written? In that case, we would expect a stripe (for EC 3-2) that looks like:

Index_1: 80 bytes of data, 4 bytes of checksum.
Index_2: 0 bytes
Index_3: 0 bytes
Index_4: 80 bytes of data, 4 bytes of checksum.
Index_5: 80 bytes of data, 4 bytes of checksum.

Similarly, if 1080 bytes are written, then indexes 1 and 2 will have data, but index 2 has shorter data and a shorter checksum. The logic is different (and simpler) for a full stripe than for a partial stripe.
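The partial-stripe cases above can be sketched numerically. This assumes 4-byte CRCs and that each chunk carries ceil(dataLen / bytesPerChecksum) checksums, per the comment's figures (80 bytes of data yielding 4 bytes of checksum):

```java
public class PartialStripeSketch {
  // Bytes of checksum stored for a chunk holding dataLen bytes,
  // with 4-byte CRCs each covering bytesPerChecksum bytes.
  static int checksumBytes(int dataLen, int bytesPerChecksum) {
    if (dataLen == 0) {
      return 0;
    }
    return ((dataLen + bytesPerChecksum - 1) / bytesPerChecksum) * 4;
  }

  public static void main(String[] args) {
    int bytesPerChecksum = 100;

    // 80 bytes written: only index 1 has data; indexes 2 and 3 are empty,
    // while the two parity indexes carry the same 80 bytes and checksum.
    System.out.println(checksumBytes(80, bytesPerChecksum));   // 4
    System.out.println(checksumBytes(0, bytesPerChecksum));    // 0

    // 1080 bytes written (chunk size 1000): index 1 holds a full 1000-byte
    // chunk, index 2 holds only 80 bytes, so its checksum run is shorter.
    System.out.println(checksumBytes(1000, bytesPerChecksum)); // 40
    System.out.println(checksumBytes(80, bytesPerChecksum));   // 4
  }
}
```

This is why indexing into the stripe checksum by a fixed per-chunk stride only works for full stripes.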

@adoroszlai
Contributor

Thanks @devabhishekpal for the patch. Please revisit if/when you have time.

@adoroszlai adoroszlai closed this Mar 27, 2025
@devabhishekpal devabhishekpal deleted the HDDS-11475 branch May 5, 2025 14:38