HDDS-10652. EC Reconstruction fails with "IOException: None of the block data have checksum" after upgrade #6520
| break; | ||
| } else { | ||
| ChunkInfo chunk = chunks.get(0); | ||
| LOG.info("The first chunk in block with index {} does not have stripeChecksum. BlockID: {}, Block size: {}." + |
The checksum data is only stored in replica index 1 and the parity indexes. I think the BlockData[] array could hold any indexes, so it would be expected for some of them to not have it. Therefore this log might be a bit noisy and cause confusion.
I think it would be OK to make this debug, or just depend on the log you added below that warns if there was none found in any indexes.
Yeah good point, I'll make it a debug log.
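To illustrate why a per-index log would be noisy, here is a minimal sketch of which replica indexes are expected to carry the stripe checksum. It is not code from the PR: the class and method names are hypothetical, and it assumes 1-based replica indexes where data replicas occupy 1..dataNum and parity replicas follow (for example, indexes 4 and 5 with 3 data and 2 parity replicas).

```java
public class StripeChecksumIndexes {

  /**
   * Returns true if a replica at the given 1-based index is expected to carry
   * the stripe checksum: replica index 1 or any parity index.
   * Assumption: data replicas use indexes 1..dataNum and parity replicas
   * use the indexes after dataNum.
   */
  public static boolean expectsStripeChecksum(int replicaIndex, int dataNum) {
    if (replicaIndex < 1 || dataNum < 1) {
      throw new IllegalArgumentException("indexes and dataNum must be >= 1");
    }
    return replicaIndex == 1 || replicaIndex > dataNum;
  }

  public static void main(String[] args) {
    int dataNum = 3; // hypothetical layout: 3 data replicas, 2 parity replicas
    for (int index = 1; index <= 5; index++) {
      System.out.println("index " + index + " expects stripeChecksum: "
          + expectsStripeChecksum(index, dataNum));
    }
  }
}
```

Under these assumptions, data indexes 2 and 3 never carry the stripe checksum, so a message logged for every such index would fire during normal operation.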
| } | ||
|
|
||
| BlockData checksumBlockData = null; | ||
| boolean foundStripeChecksum = false; |
We can use checksumBlockData == null to check whether it was found, rather than adding a new boolean, I think.
I'm using this boolean to log only when stripeChecksum wasn't found, independent of whether checksumData was found or not. It's just to indicate that stripeChecksum wasn't found in this version, which can be helpful when debugging.
- Use the boolean to log when stripeChecksum wasn't found in any index.
- Instead of throwing, log at line 160 if checksumData was also not found (based on your comment below).
| if (!foundStripeChecksum) { | ||
| LOG.warn("Could not find stripeChecksum in any index for blockData with BlockID {}, length {} and " + | ||
| "blockGroupLength {}.", blockID, blockData.length, blockGroupLength); | ||
| } |
We also need to change the exception at line 160 so that it does not throw if nothing has been found. In fact, this log could just go into the else at line 160, replacing the throw.
@sodonnel Thanks for reviewing. Addressed your comments in the latest commit, yet to add testing.
| .hasStripeChecksum()) { | ||
| checksumBlockData = bd; | ||
| break; | ||
| if (chunks != null && chunks.size() > 0) { |
The original logic had:
if (chunks != null && chunks.size() > 0 && chunks.get(0)
.hasStripeChecksum()) {
So checksumBlockData is only set if hasStripeChecksum() returns true. With the change you have made, checksumBlockData will be set if hasChecksumData(), but it might not have stripeChecksum in it.
Then at line 155 it will enter the IF block, and at line 174 I am not sure what will happen if stripeChecksum is missing.
All the IF statement at 155 does is copy in the stripeChecksum if it exists. So if it does not exist, there is no point in going into that IF at all, as we will just be copying the original chunkChecksum out and back in again.
Following on from this: if we only set checksumBlockData when there is a stripeChecksum, then you can also remove the foundStripeChecksum boolean, as checksumBlockData != null means the same thing.
Ah, so what you're saying is that there's no point in going inside the if block at line 155 if stripeChecksum isn't found because that code is only setting stripeChecksum. So overall, the only change we need here is to log instead of throw at line 184. Right?
Yea, that's it.
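For readers following the thread, the outcome agreed on here can be sketched roughly as follows. This is not the actual patch: Chunk and Block are hypothetical stand-ins for ChunkInfo and BlockData, the method names are invented for illustration, skipping null array entries is an assumption, and the warning is printed to stderr instead of going through a logger.

```java
import java.util.Arrays;
import java.util.List;

public class StripeChecksumLookup {

  /** Hypothetical stand-in for ChunkInfo; models only whether stripeChecksum is set. */
  public static final class Chunk {
    public final boolean stripeChecksumPresent;

    public Chunk(boolean stripeChecksumPresent) {
      this.stripeChecksumPresent = stripeChecksumPresent;
    }

    public boolean hasStripeChecksum() {
      return stripeChecksumPresent;
    }
  }

  /** Hypothetical stand-in for BlockData. */
  public static final class Block {
    public final int replicaIndex;
    public final List<Chunk> chunks;

    public Block(int replicaIndex, List<Chunk> chunks) {
      this.replicaIndex = replicaIndex;
      this.chunks = chunks;
    }
  }

  /** Returns the first block whose first chunk has a stripe checksum, or null if none does. */
  public static Block findChecksumBlock(Block[] blocks) {
    for (Block bd : blocks) {
      if (bd == null) {
        continue; // assumption: the array may contain empty slots
      }
      List<Chunk> chunks = bd.chunks;
      // Same condition as the original logic quoted in the review.
      if (chunks != null && chunks.size() > 0 && chunks.get(0).hasStripeChecksum()) {
        return bd;
      }
    }
    return null;
  }

  /** Returns true if a stripe checksum was found; warns instead of throwing otherwise. */
  public static boolean prepareStripeChecksum(Block[] blocks) {
    Block checksumBlock = findChecksumBlock(blocks);
    if (checksumBlock != null) {
      // The real code would copy the stripe checksum into the outgoing block here.
      return true;
    }
    // Data written by an older version may have no stripeChecksum in any index.
    System.err.println("WARN: could not find stripeChecksum in any index; continuing without it");
    return false;
  }

  public static void main(String[] args) {
    Block[] writtenByOldVersion = {new Block(1, Arrays.asList(new Chunk(false))), null};
    Block[] writtenByNewVersion = {
        new Block(2, Arrays.asList(new Chunk(false))),
        new Block(4, Arrays.asList(new Chunk(true)))
    };
    System.out.println(prepareStripeChecksum(writtenByOldVersion));
    System.out.println(prepareStripeChecksum(writtenByNewVersion));
  }
}
```

Because a non-null result from findChecksumBlock already means that a stripe checksum exists, no separate foundStripeChecksum flag is needed in this sketch.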
| } | ||
| } | ||
|
|
||
| if (!foundStripeChecksum) { |
I would move this log down to the else statement that is currently at line 184, as that else is only executed if foundStripeChecksum is false (via chunkChecksumData == null).
I've added a unit test. The PR is ready for another round of reviews.
sodonnel
left a comment
LGTM
Thanks @siddhantsangwan for the fix, @sodonnel for the review.
…ock data have checksum" after upgrade (apache#6520) (cherry picked from commit 99a5703)
…ne of the block data have checksum" after upgrade (apache#6520) (cherry picked from commit 99a5703) Conflicts: hadoop-hdds/client/src/test/java/org/apache/hadoop/hdds/scm/storage/TestBlockOutputStreamCorrectness.java Change-Id: I97d0bcc631e1d0b70a1850a1e113623074918ef3
…ock data have checksum" after upgrade (apache#6520)
What changes were proposed in this pull request?
In this scenario, EC reconstruction is failing in the new Ozone version for EC data written in the old version because the stripeChecksum field in the ChunkInfo proto breaks compatibility between the two versions.
Please check the Jira for a description and analysis of the problem. In this pull request, I changed the code so that reconstruction doesn't fail if stripeChecksum is missing. Also added some logging.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10652
How was this patch tested?
Added a unit test.