
Conversation

@devabhishekpal
Contributor

What changes were proposed in this pull request?

HDDS-11475: Verify EC reconstruction correctness

Please describe your PR in detail:

  • In the current implementation, the stripe checksum is formed in ECKeyOutputStream, in private StripeWriteStatus commitStripeWrite(ECChunkBuffers stripe).
  • To verify the recreated data we can use the stripe checksum, which is a concatenation of all the chunk checksums in the stripe: compare the recreated chunk's checksum with the slice of the stripe checksum at the recreated index to verify that the correct data was recreated.
For example, for EC 3-2 we will have chunks c1, c2, c3, c4, c5:
(stripe checksum) s1 = concatenation of the checksums of c1 to c5

Let the recreated chunk be c2.
Then:
checksum(c2) == 2nd checksum in the s1 checksum sequence
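The comparison described above can be sketched as follows. This is a toy illustration, not the Ozone implementation: it uses plain CRC32 over byte arrays, whereas the real stripe checksum is carried in the block metadata as protobuf ChecksumData.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

public class StripeChecksumSketch {
  // Toy 4-byte CRC32 checksum of a chunk's bytes.
  static byte[] checksumOf(byte[] chunk) {
    CRC32 crc = new CRC32();
    crc.update(chunk);
    long v = crc.getValue();
    return new byte[] {
        (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v };
  }

  public static void main(String[] args) {
    // EC 3-2: five chunks c1..c5 in one stripe.
    byte[][] chunks = new byte[5][];
    for (int i = 0; i < 5; i++) {
      chunks[i] = ("chunk-" + (i + 1)).getBytes(StandardCharsets.UTF_8);
    }

    // Stripe checksum s1 = concatenation of the per-chunk checksums.
    byte[] stripeChecksum = new byte[5 * 4];
    for (int i = 0; i < 5; i++) {
      System.arraycopy(checksumOf(chunks[i]), 0, stripeChecksum, i * 4, 4);
    }

    // Suppose c2 (index 1) was reconstructed: its checksum must equal
    // the second 4-byte slice of the stripe checksum.
    byte[] recreated = checksumOf(chunks[1]);
    byte[] expected = Arrays.copyOfRange(stripeChecksum, 1 * 4, 2 * 4);
    if (!Arrays.equals(recreated, expected)) {
      throw new AssertionError("reconstructed chunk checksum mismatch");
    }
    System.out.println("match");
  }
}
```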

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11475

How was this patch tested?

Unit tests

@devabhishekpal devabhishekpal marked this pull request as draft November 6, 2024 21:48
@devabhishekpal devabhishekpal marked this pull request as ready for review November 8, 2024 11:00
@adoroszlai adoroszlai marked this pull request as draft November 8, 2024 11:43
@adoroszlai
Contributor

Thanks @devabhishekpal for working on this.

Please wait for clean CI run in fork before opening PR (or marking as "ready for review").

Unit tests

Since this is controlled by a new config, which defaults to the old behavior, I don't think it is validated by any unit tests.

@sodonnel
Contributor

sodonnel commented Nov 8, 2024

The approach used here is to take the chunk buffer, which holds the real data just written to the block, and calculate the checksum on it.

However, that duplicates work: the act of writing the data through the ECBlockOutputStream already performs that checksum and persists it in the block metadata as part of the putBlock.

I have had to look at this for some time to try to understand the current flow. It's been a long time since this EC code was written, and the checksum logic was not written by me. @aswinshakil might be a good person for a second look.

Starting in the ECReconstructionCoordinator, there is code where it calls executePutBlock(...) on the reconstructed streams. Here, I think, is where we can validate that the checksums match the stripe checksum:

        for (ECBlockOutputStream targetStream : allStreams) {

          // You can get the current chunkList and its checksums, calculated while writing. These are what will be written
          // as part of the putBlock call. If we get them here, each chunk has its checksums.
          // Using blockDataGroup, which is all the blockData that existed on the containers prior to any reconstruction, we can
          // search it for one which contains the stripeChecksum. We know it lives in replicaIndex=1 or any parity; however, you
          // may not have index 1 (it could be the one getting reconstructed) or all the parities, but you must have at least one
          // of them for the group to be reconstructable. Therefore you must search until it is found.
          //
          // targetStream.getContainerBlockData().getChunksList().get(0).getChecksumData();
          // blockDataGroup[0].getChunks().get(0).getStripeChecksum();
          //
          // From the above, if you have the chunkList and hence its checksums for the current stream, and you can locate
          // the existing stripe checksum in the blockDataGroup, then you can "simply" iterate the chunkList:
          //
          //  List<Chunk> chunks = targetStream.getContainerBlockData().getChunksList();
          //  List<Chunk> existingChunks = blockDataGroup[0].getChunks();
          //  for (int i = 0; i < chunks.size(); i++) {
          //    validateChecksum(chunks.get(i).getChecksumData(), existingChunks.get(i).getStripeChecksum());
          //  }
          //

          targetStream.executePutBlock(true, true, blockLocationInfo.getLength(), blockDataGroup);
          checkFailures(targetStream, targetStream.getCurrentPutBlkResponseFuture());
        }

Inside validateChecksum() you need to figure out how to index into the stripe checksum to find the relevant part of it to compare against the chunk checksum.

I think that approach will work, and it avoids calculating the checksum from the data a second time.
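A minimal sketch of the validateChecksum() helper suggested above, using plain byte arrays (the name and signature are hypothetical; the real arguments would be the protobuf ChecksumData and the stripe checksum ByteString). Note it assumes one fixed-size checksum per chunk, which, as later comments point out, does not hold when bytesPerCrc is smaller than the chunk size:

```java
import java.util.Arrays;

public class ValidateChecksumSketch {
  // Hypothetical helper: compare one reconstructed chunk's checksum bytes
  // against the corresponding slice of the existing stripe checksum.
  static void validateChecksum(byte[] chunkChecksum, byte[] stripeChecksum,
      int chunkIndex) {
    int len = chunkChecksum.length;
    byte[] expected = Arrays.copyOfRange(
        stripeChecksum, chunkIndex * len, (chunkIndex + 1) * len);
    if (!Arrays.equals(chunkChecksum, expected)) {
      throw new IllegalStateException(
          "Checksum mismatch for reconstructed chunk at index " + chunkIndex);
    }
  }

  public static void main(String[] args) {
    // Stripe checksum = checksum(c0) || checksum(c1) || checksum(c2),
    // each 4 bytes in this toy example.
    byte[] stripe = {1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3};
    validateChecksum(new byte[] {2, 2, 2, 2}, stripe, 1); // matches slice 1
    System.out.println("ok");
  }
}
```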

@aswinshakil
Member

Thanks for working on this @devabhishekpal. I have posted some comments below, some test cases to validate this would be good.

throws OzoneChecksumException {

// If we have say 100 bytes per checksum, in the stripe the first 100 bytes should
// correspond to the first chunk checksum, next 100 should be the second chunk checksum
Member

A chunk can have multiple checksums depending on the size of the chunk and bytesPerCrc.
For example, if we have EC 3-2-1024k, we have a 1 MB chunk. The calculation would be correct if bytesPerCrc were also 1 MB, but bytesPerCrc is configurable, and by default #6331 changes this value to 16 KB. That means we would have (1024/16) = 64 checksums for each chunk. We need to take that into account as well.
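The arithmetic above can be made concrete. This sketch (illustrative values only, assuming 4-byte CRCs) shows how the number of checksums per chunk drives the byte offset of a chunk's checksums inside the stripe checksum:

```java
public class ChecksumOffsetSketch {
  public static void main(String[] args) {
    // EC 3-2-1024k with bytesPerCrc = 16 KB, as in the comment above.
    int chunkSize = 1024 * 1024;       // 1 MB chunk
    int bytesPerChecksum = 16 * 1024;  // 16 KB per CRC
    int checksumLen = 4;               // 4 bytes per CRC32 value

    // ceil(chunkSize / bytesPerChecksum) checksums per full chunk.
    int checksumsPerChunk =
        (chunkSize + bytesPerChecksum - 1) / bytesPerChecksum;
    System.out.println(checksumsPerChunk); // 64, not 1

    // Byte offset of chunk i's checksum run within the stripe checksum:
    // every preceding chunk contributes checksumsPerChunk * checksumLen bytes.
    int chunkIndex = 2;
    int offset = chunkIndex * checksumsPerChunk * checksumLen;
    System.out.println(offset); // 512
  }
}
```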

You can take a look at #7230, where I have added changes to split the stripeChecksum into parts. But the core idea is the one I mentioned above.

int bytesPerChecksum = checksumData.getBytesPerChecksum();

int checksumIdxStart = (bytesPerChecksum * chunkIndex);
ByteString expectedChecksum = stripeChecksum.substring(checksumIdxStart,
Member

Instead of ByteString and substring, we can use ByteBuffer for fine-grained byte-level buffer manipulation. ECBlockChecksumComputer#computeCompositeCrc() has a similar implementation.
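The ByteBuffer position/limit/slice pattern being suggested looks roughly like this (toy data; the real code would wrap the stripe checksum bytes from the block metadata):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ByteBufferSliceSketch {
  public static void main(String[] args) {
    // Stripe checksum bytes for three chunks, one 4-byte checksum each.
    byte[] stripe = {1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3};
    int chunkIndex = 1;
    int checksumLen = 4;

    // position/limit bound a view over chunk 1's checksum without copying
    // the whole array, unlike ByteString.substring.
    ByteBuffer buf = ByteBuffer.wrap(stripe);
    buf.position(chunkIndex * checksumLen);
    buf.limit((chunkIndex + 1) * checksumLen);
    ByteBuffer slice = buf.slice();

    byte[] out = new byte[slice.remaining()];
    slice.get(out);
    System.out.println(Arrays.toString(out)); // [2, 2, 2, 2]
  }
}
```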

int bytesPerChecksum = recreatedChunkChecksum.getBytesPerChecksum();
int parityLength = (int) (Math.ceil((double)ecChunkSize / bytesPerChecksum) * 4L * parityCount);
// Ignore the parity bits
stripeChecksum.limit(checksumSize - parityLength);
Contributor

Why are we limiting due to parity? We could be reconstructing a parity index, and it should have a checksum too. Or does the stripe checksum not contain the parity checksums? I cannot remember how this was designed, but if you are reducing the effective stripeChecksum length to remove parity, then parity is likely included in the stripe checksum.

// Number of Checksums per Chunk = (chunkSize / bytesPerChecksum)
// So the checksum should start from (numOfBytesPerChecksum * (chunkIdx * numOfChecksumPerChunk)

int checksumIdxStart = (ecChunkSize * chunkIndex);
Contributor

This does not align with the comment above, as we are not considering numOfBytesPerChecksum or numOfChecksumPerChunk?

Also, I am not sure about the above calculation.

What if the bytes per checksum is 100, and the chunk size is 1000, but only 80 bytes were written? In that case, we would expect a stripe (for EC 3-2) that looks like:

Index_1: 80 bytes of data, 4 bytes of checksum.
Index_2: 0 bytes
Index_3: 0 bytes
Index_4: 80 bytes of data, 4 bytes of checksum.
Index_5: 80 bytes of data, 4 bytes of checksum.

Similarly, if 1080 bytes are written, then indexes 1 and 2 will have data, but index 2 has shorter data and a shorter checksum. The logic is different (and simpler) for a full stripe than for a partial stripe.
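The partial-stripe cases above can be sketched numerically. This assumes 4-byte CRCs and that each chunk carries ceil(dataLen / bytesPerChecksum) checksums, per the comment's figures (80 bytes of data yielding 4 bytes of checksum):

```java
public class PartialStripeSketch {
  // Bytes of checksum stored for a chunk holding dataLen bytes,
  // with 4-byte CRCs each covering bytesPerChecksum bytes.
  static int checksumBytes(int dataLen, int bytesPerChecksum) {
    if (dataLen == 0) {
      return 0;
    }
    return ((dataLen + bytesPerChecksum - 1) / bytesPerChecksum) * 4;
  }

  public static void main(String[] args) {
    int bytesPerChecksum = 100;

    // 80 bytes written: only index 1 has data; indexes 2 and 3 are empty,
    // while the two parity indexes carry the same 80 bytes and checksum.
    System.out.println(checksumBytes(80, bytesPerChecksum));   // 4
    System.out.println(checksumBytes(0, bytesPerChecksum));    // 0

    // 1080 bytes written (chunk size 1000): index 1 holds a full 1000-byte
    // chunk, index 2 holds only 80 bytes, so its checksum run is shorter.
    System.out.println(checksumBytes(1000, bytesPerChecksum)); // 40
    System.out.println(checksumBytes(80, bytesPerChecksum));   // 4
  }
}
```

This is why indexing into the stripe checksum by a fixed per-chunk stride only works for full stripes.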

@adoroszlai
Contributor

Thanks @devabhishekpal for the patch. Please revisit if/when you have time.

@adoroszlai adoroszlai closed this Mar 27, 2025
@devabhishekpal devabhishekpal deleted the HDDS-11475 branch May 5, 2025 14:38