-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-11887. Recon - Identify container replicas difference based on content checksums #7942
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-11887. Recon - Identify container replicas difference based on content checksums #7942
Conversation
Change-Id: Ic58b5b195c36799104370490c61a7e7dd8466eb7
Change-Id: Iad0013190331886714ba1643bc671cdc0bd27242
Change-Id: I4ebc33680cc17f8f0be4c7f39ccdb0fa6cde00fb
Change-Id: Ie6f7e6d9f869057be371f3466621f5ea6ef03f68
Change-Id: I542a4badc17f21229f6586446e7fbb7b82de3518
Change-Id: Ie7d09f16121a5cf633db23f2c02f4e360f664ca6
Change-Id: Ib8b3412c8080663d176028eeffda1d71f8988aef
Change-Id: I1b33049e75f4a40a7ab84b604625ab93775ffc6a
Change-Id: I75ef680eb643eddc37b33141ffd50f5c33c555b7
Change-Id: I516826fa8bdfeb2ea557081c8b4983c94ba4260e
Change-Id: Iaed2368bcd737aaacaac6a2038bb15fed5aff674
Change-Id: I41ab7a2d5e210fb958bf151383c4c2991950606f
Change-Id: I4f27e5275ccee3150e976e859e964f25bd196172
Change-Id: I4036472a8e2a3c360b19fbee1a5c05ee3065b38a
devmadhuu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dombizita for the patch. Few comments. Pls check.
hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/fsck/ContainerHealthTask.java
Show resolved
Hide resolved
| required int64 lastSeenTime = 3; | ||
| required int64 bcsId = 4; | ||
| optional string state = 5; | ||
| optional int64 dataChecksum = 6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding this new field in proto, will there be any decoding issues for existing data in replica_history table in upgrade case when try to load ? Has this been tested ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this as an optional field, so that is saving us from some problems regarding compatibility (similarly how it was done in #4443). But I'm not sure if we should handle the case when there is no dataChecksum and we call this
Lines 96 to 101 in 2874c87
| public static ContainerReplicaHistory fromProto( | |
| ContainerReplicaHistoryProto proto) { | |
| return new ContainerReplicaHistory(UUID.fromString(proto.getUuid()), | |
| proto.getFirstSeenTime(), proto.getLastSeenTime(), proto.getBcsId(), | |
| proto.getState(), proto.getDataChecksum()); | |
| } |
If we should, than the
state field is also missing this and we should do it for that as well.
errose28
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, just one comment in addition to Devesh's review above.
...op-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/persistence/ContainerHistory.java
Show resolved
Hide resolved
|
#7942 (comment) made me realise that with introducing the Lines 46 to 54 in 59aaa5c
What do you think, how should we handle this? It should be in both states and it should show up in both unhealthy container tables, right? We have stats for each container state count and the overall count, how should we do that? Count it to both states, but we shouldn't count them twice to the overall count? I'll look over my patch and check if I should change anything based on this, as I think this will bring new scenarios into the picture, how the unhealthy containers are handled in Recon. Let me know if I miss something. |
As far as I know, I think , we add the record per container per state in |
Change-Id: I32a5d519087badeeadb6e69fd9dba21cd3170b6e
Thanks for linking these classes @devmadhuu. Based on these I think we should be good storing the |
devmadhuu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dombizita for improving the patch. Changes largely LGTM +1.
ArafatKhan2198
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @dombizita on working on this, based on the patch review and our´ initial offline discussion everything looks good!
f47b407
into
apache:HDDS-10239-container-reconciliation
…anner-builds-mt * HDDS-10239-container-reconciliation: (646 commits) HDDS-11763. Implement container repair logic within datanodes. (apache#7474) HDDS-11887. Recon - Identify container replicas difference based on content checksums (apache#7942) HDDS-12397. Persist putBlock for closed container. (apache#7943) HDDS-12361. Mark testGetSnapshotDiffReportJob as flaky HDDS-12375. Random object created and used only once (apache#7933) HDDS-12335. Fix ozone admin namespace summary to give complete output (apache#7908) HDDS-12364. Require Override annotation for overridden methods (apache#7923) HDDS-11530. Support listMultipartUploads max uploads and markers (apache#7817) HDDS-11867. Remove code paths for non-Ratis SCM. (apache#7911) HDDS-12363. Add import options to IntelliJ IDEA style settings (apache#7921) HDDS-12215. Mark testContainerStateMachineRestartWithDNChangePipeline as flaky HDDS-12331. BlockOutputStream.failedServers is not thread-safe (apache#7885) HDDS-12188. Move server-only upgrade classes from hdds-common to hdds-server-framework (apache#7903) HDDS-12362. Remove temporary checkstyle suppression file (apache#7920) HDDS-12343. Fix spotbugs warnings in Recon (apache#7902) HDDS-12286. Fix license headers and imports for ozone-tools (apache#7919) HDDS-12284. Fix license headers and imports for ozone-s3-secret-store (apache#7917) HDDS-12275. Fix license headers and imports for ozone-integration-test (apache#7904) HDDS-12164. Rename and deprecate DFSConfigKeysLegacy config keys (apache#7803) HDDS-12283. Fix license headers and imports for ozone-recon-codegen (apache#7916) ... Conflicts: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerMerkleTreeWriter.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainerCheck.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/checksum/ContainerMerkleTreeTestUtils.java hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/checksum/TestContainerMerkleTreeWriter.java hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/keyvalue/TestKeyValueContainerCheck.java hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/ozoneimpl/TestBackgroundContainerDataScanner.java hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/ozoneimpl/TestOnDemandContainerDataScanner.java hadoop-ozone/dist/src/main/smoketest/admincli/container.robot hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/dn/scanner/TestBackgroundContainerDataScannerIntegration.java hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/dn/scanner/TestBackgroundContainerMetadataScannerIntegration.java hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/dn/scanner/TestOnDemandContainerDataScannerIntegration.java
What changes were proposed in this pull request?
#6506 introduced a new container replica field,
dataChecksum, which will tell us a specific container replica's checksum. In this patch I'm adding the functionality in Recon to behave based on this value.Recon already processes the container reports from the datanodes and based on these the
ContainerHealthTaskdoes certain things. It'll scan through the containers and mark containers unhealthy if needed (e.g. missing, under replicated, etc.). I'm introducing a new unhealthy container state in Recon, calledREPLICA_MISMATCH. A container will be markedREPLICA_MISMATCHby Recon if at least one of their replicas data checksum is different from the others.To be able to tell which replica has which data checksum value, I added the
dataChecksumfield to ContainerReplicaHistoryProto as an optional field (just like in #4443 thestatefield). In the ContainerEndpoint when we are requesting the /unhealthy containers, we will be able to get the information for each replica (here).This patch only does the backend changes, in the Recon UI we'll add the new table in HDDS-12395.
As I'm adding a new unhealthy container state in Recon, it'll introduce a new state in that column in the DB (where Recon maintains the unhealthy containers). With the help of the Recon Schema Versioning I added a new upgrade action,
UnhealthyContainerReplicaMismatchAction, to handle this.What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11887
How was this patch tested?
Updated and added new unit tests: https://github.com/dombizita/ozone/actions/runs/13454086954