Skip to content

Conversation

@siddhantsangwan
Copy link
Contributor

@siddhantsangwan siddhantsangwan commented Apr 12, 2024

What changes were proposed in this pull request?

In this scenario, EC reconstruction is failing in the new Ozone version for EC data written in the old version because the stripeChecksum field in the ChunkInfo proto breaks compatibility between the two versions.
Please check the Jira for a description and analysis of the problem. In this pull request, I changed the code so that reconstruction doesn't fail if stripeChecksum is missing. Also added some logging.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10652

How was this patch tested?

Added a unit test.

…ption: None of the block data have checksum"
break;
} else {
ChunkInfo chunk = chunks.get(0);
LOG.info("The first chunk in block with index {} does not have stripeChecksum. BlockID: {}, Block size: {}." +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The checksum data is only stored in replica indexes 1 and the parity indexes. I think the BlockData[] array could hold any indexes, so it would be expected for some of them to not have it. Therefore this log might be a bit noisy and cause confusion.

I think it would be OK to make this debug, or just depend on the log you added below that warns if there was none found in any indexes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point, I'll make it a debug log.

}

BlockData checksumBlockData = null;
boolean foundStripeChecksum = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use checkSumBlockData == null to check if it was found, rather than adding a new boolean I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using this boolean to log only when stripeChecksum wasn't found, independent of whether checksumData was found or not. It's just to indicate that stripeChecksum wasn't found in this version, which can be helpful when debugging.

  1. Use the boolean to log when stripeChecksum wasn't found in any index.
  2. Instead of throwing, log at line 160 if checksumData was also not found (based on your comment below).

if (!foundStripeChecksum) {
LOG.warn("Could not find stripeChecksum in any index for blockData with BlockID {}, length {} and " +
"blockGroupLength {}.", blockID, blockData.length, blockGroupLength);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to change the exception at line 160 to not throw if nothing has been found too. Infact, this log could just go into the else at line 160, rather than the throw.

@siddhantsangwan
Copy link
Contributor Author

@sodonnel Thanks for reviewing. Addressed your comments in the latest commit, yet to add testing.

.hasStripeChecksum()) {
checksumBlockData = bd;
break;
if (chunks != null && chunks.size() > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original logic had:

if (chunks != null && chunks.size() > 0 && chunks.get(0)
          .hasStripeChecksum()) {

So that checksumBlockData is only set if hasStripeChecksum()) returns true. With the change you have made, checksumBlockData will be set if hasChecksumData(), but it might not have stripeChecksum in it.

Then at line 155 it will enter the IF block and at line 174 I am not sure what will happen if stripeCheckSum is missing.

All the IF statement at 155 does is copy in the stripeChecksum if it exists. So if it does not exist, there is no point in going into that IF at all, as we will just be copying the original chunkChecksum out and back in again.

Following on from this - we only set checksumBlockData if there is a stripeChecksum, then you can also remove the foundStripeChecksum boolean as checksumBlockData != null means the same thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so what you're saying is that there's no point in going inside the if block at line 155 if stripeChecksum isn't found because that code is only setting stripeChecksum. So overall, the only change we need here is to log instead of throw at line 184. Right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, that's it.

}
}

if (!foundStripeChecksum) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this log down to the else statement that is currently at line 184, as that else only executed if foundStripeChecksum is false (via chunkChecksumData == null).

@siddhantsangwan siddhantsangwan marked this pull request as ready for review April 17, 2024 19:29
@siddhantsangwan
Copy link
Contributor Author

I've added a unit test. The PR is ready for another round of reviews.

Copy link
Contributor

@sodonnel sodonnel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@adoroszlai adoroszlai changed the title HDDS-10652. [Upgrade][EC] Reconstruction failing with "java.io.IOException: None of the block data have checksum" HDDS-10652. EC Reconstruction fails with "IOException: None of the block data have checksum" after upgrade Apr 18, 2024
@adoroszlai adoroszlai merged commit 99a5703 into apache:master Apr 18, 2024
@adoroszlai
Copy link
Contributor

Thanks @siddhantsangwan for the fix, @sodonnel for the review.

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request May 29, 2024
…ock data have checksum" after upgrade (apache#6520)

(cherry picked from commit 99a5703)
swamirishi pushed a commit to swamirishi/ozone that referenced this pull request Jun 10, 2024
…ne of the block data have checksum" after upgrade (apache#6520)

(cherry picked from commit 99a5703)

 Conflicts:
	hadoop-hdds/client/src/test/java/org/apache/hadoop/hdds/scm/storage/TestBlockOutputStreamCorrectness.java

Change-Id: I97d0bcc631e1d0b70a1850a1e113623074918ef3
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024
…ock data have checksum" after upgrade (apache#6520)

(cherry picked from commit 99a5703)
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024
…ock data have checksum" after upgrade (apache#6520)

(cherry picked from commit 99a5703)
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024
…ock data have checksum" after upgrade (apache#6520)

(cherry picked from commit 99a5703)
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 18, 2024
…ock data have checksum" after upgrade (apache#6520)

(cherry picked from commit 99a5703)
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 18, 2024
…ock data have checksum" after upgrade (apache#6520)

(cherry picked from commit 99a5703)
vtutrinov pushed a commit to vtutrinov/ozone that referenced this pull request Jul 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants