-
Notifications
You must be signed in to change notification settings - Fork 590
HDDS-6128. CLI tool that downloads all the block replicas and creates a manifest file #2987
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…every blocks every replica.
…n key maps its every blocks every replica.
| * Class that for a given key downloads every blocks every replica and creates | ||
| * a manifest file with information about the replicas. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please change the comment to be more clear - Class that downloads every replica for all the blocks associated with a given key. It also generates a manifest file with information about the downloaded replicas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @JyotinderSingh, thanks for the correction, I changed it.
| * a manifest file with information about the replicas. | ||
| */ | ||
| @CommandLine.Command(name = "read-replicas", | ||
| description = "Reads a given keys every blocks every replica.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you modify the description to say Reads every replica for all the blocks associated with a given key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the correction, I changed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you missed changing this description in the updated patch, could you look into it? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I didn't changed it here, I changed it in the documentation at the ClientProtocol interface. Thanks!
|
Please add tests to verify correct behavior for the newly added |
| blockJson.addProperty("block index", blockIndex); | ||
| blockJson.addProperty("container id", block.getKey().getContainerID()); | ||
| blockJson.addProperty("local id", block.getKey().getLocalID()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could you change the property names to be in camel case instead of having spaces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems a reasonable, as it will be easier to work with JSON file. Changed it.
| import org.apache.hadoop.ozone.om.helpers.OzoneFileStatus; | ||
| import org.apache.hadoop.ozone.om.helpers.RepeatedOmKeyInfo; | ||
| import org.apache.hadoop.ozone.om.helpers.S3SecretValue; | ||
| import org.apache.hadoop.ozone.om.helpers.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace star import with individual imports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, replaced it.
| import org.apache.hadoop.ozone.om.helpers.S3SecretValue; | ||
| import org.apache.hadoop.ozone.om.helpers.ServiceInfo; | ||
| import org.apache.hadoop.ozone.om.helpers.ServiceInfoEx; | ||
| import org.apache.hadoop.ozone.om.helpers.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace star import with individual imports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, replaced them.
Hi @JyotinderSingh, thanks for taking your time to look at my work. I didn't add any tests yet, as that is my next subtask on my jira (https://issues.apache.org/jira/browse/HDDS-5739). I will add integration level tests to the read-replicas CLI tool and I am already working on it, I'm planning to finish it this week. |
|
Thanks for the contribution and for updating the patch @dombizita, the changes look good to me. I just had one minor comment that I've added inline. |
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/ReadReplicas.java
Outdated
Show resolved
Hide resolved
hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/ReadReplicas.java
Outdated
Show resolved
Hide resolved
| .getObject(OzoneClientConfig.class); | ||
| clientConfig.setChecksumVerify(false); | ||
| configuration.setFromObject(clientConfig); | ||
| clientProtocolWithoutChecksum = new RpcClient(configuration, null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The configuration object here can be created a bit more easily. The ultimate goal would be to have an identical configuration for the two clients one with checksum verification turned on, the other with it turned off.
You may use the copy constructor of OzoneConfiguration and get the original config by getConf() as the base config to copy, while to set the property you can use the setBoolean(String, boolean) method to set to property value without converting OzoneConfiguration to OzoneClientConfig and then set it back.
An other case I just realized while writing this up is when checksum verification in the client's base config is turned off, as in that case both clients will have it turned off, and none of them will show checksum problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, it was much easier this way, I changed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplification of the code is done, and great, but the other problem I mentioned have been remained.
This code will not give you any error status if in the original configuration that comes from the getConf() method is already has the checksum verification set to false.
Note that this means that the first client we use does not check checksum failures, and then we create a second client that is meant to read data from blocks with checksum failures, but we will not detect any checksum failures with the first client, as checksum verification is turned off there as well.
Ultimately we would like to get the checksum errors, so if the original config turns off checksum verification, then we need a client set to verify the checksum, and use that for initial reads here I guess, so that we still detect checksum errors, and still we can download the data with the client which is set to skip the checksum validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I didn't think it through, that the original configuration can have the checksum verification turned off. Changed the code, so whether the configuration is set to check or not the checksum, I will have two clients, one with checksum verification and one without.
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/rpc/RpcClient.java
Show resolved
Hide resolved
fapifta
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Zita, thank you for contributing this new debug tool, I would like to request one general thing, and also I have a few comments inline.
I would generally suggest to create String constants and put them at the top of the class, for property names used in the manifest file, it would make it easier to see what properties we expect to have in the manifest if they are at one place. This is a stylistic suggestion, I can accept the current code in this regard, so I let this change for you to decide if it is worth it.
Hi @fapifta, thank you so much for the review. I have taken your advices and changed the mentioned things. I also created the String constants, the code is much clearer this way. |
|
Thank you @dombizita to address my comments, I still have one thing that has to be solved, and one suggestion regarding the constant names. With the constants, I would do one more thing, I would try to add a bit more descriptive names, like: |
|
Thank you @dombizita for addressing my comments. +1, let's wait for CI, and if green, I will go ahead and merge this one later on. |
fapifta
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution @dombizita as discussed please follow up on the tests for the new tool under its separate JIRA.
I am merging the changes shortly, as CI is as well green already.
* master: (27 commits) HDDS-6206. Application errors must not flood system log (apache#3001) HDDS-6181. Change SCMHAInvocationHandler#invokeRatis() logging to TRACE (apache#2992) HDDS-5529. For any IOexception from @REPLicated method we should throw it (apache#2788) HDDS-6201. Fix NPE for DataScanner with scanned container deleted by others. (apache#3005) HDDS-6239. ozonesecure-mr failing with No URLs in mirrorlist (apache#3029) HDDS-6227. Test helpers should observe naming conditions (apache#3020) HDDS-6205. Add CLI command to display the latest Replication Manager report (apache#3013) HDDS-6192. feature/Observability.md translated to Chinese (apache#2994) HDDS-6219. Switch to RATIS ReplicationType from STAND_ALONE in our tests. (apache#3014) HDDS-6211. [Docs] Image styling on deployed site does not replicate local builds. (apache#3007) HDDS-6177. Extend container info command to include replica details (apache#2995) HDDS-6128. CLI tool that downloads all the block replicas and creates a manifest file (apache#2987) HDDS-6191. Intermittent failure in TestDeleteWithSlowFollower (apache#3015) HDDS-6216. Move OMOpenKeysDeleteRequest to package om.request.key (apache#3011) HDDS-6203. CleanUp incomplete gz files during Container move (apache#3000) HDDS-3408. Rename ChunkLayOutVersion to ContainerLayoutVersion. (apache#2983) HDDS-6109. Preserve the underlying exception raised in client lib. (apache#2989) HDDS-6135. SCM Container DB bootstrap on Recon startup for SCM HA. (apache#2972) HDDS-6202. Avoid using jmh-generator-annprocess since it is GPL2.0. (apache#2998) HDDS-6163. Fix PATH in docker image (apache#2967) ... Conflicts: hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/keyvalue/TestKeyValueContainerCheck.java
* master: (27 commits) HDDS-6206. Application errors must not flood system log (apache#3001) HDDS-6181. Change SCMHAInvocationHandler#invokeRatis() logging to TRACE (apache#2992) HDDS-5529. For any IOexception from @REPLicated method we should throw it (apache#2788) HDDS-6201. Fix NPE for DataScanner with scanned container deleted by others. (apache#3005) HDDS-6239. ozonesecure-mr failing with No URLs in mirrorlist (apache#3029) HDDS-6227. Test helpers should observe naming conditions (apache#3020) HDDS-6205. Add CLI command to display the latest Replication Manager report (apache#3013) HDDS-6192. feature/Observability.md translated to Chinese (apache#2994) HDDS-6219. Switch to RATIS ReplicationType from STAND_ALONE in our tests. (apache#3014) HDDS-6211. [Docs] Image styling on deployed site does not replicate local builds. (apache#3007) HDDS-6177. Extend container info command to include replica details (apache#2995) HDDS-6128. CLI tool that downloads all the block replicas and creates a manifest file (apache#2987) HDDS-6191. Intermittent failure in TestDeleteWithSlowFollower (apache#3015) HDDS-6216. Move OMOpenKeysDeleteRequest to package om.request.key (apache#3011) HDDS-6203. CleanUp incomplete gz files during Container move (apache#3000) HDDS-3408. Rename ChunkLayOutVersion to ContainerLayoutVersion. (apache#2983) HDDS-6109. Preserve the underlying exception raised in client lib. (apache#2989) HDDS-6135. SCM Container DB bootstrap on Recon startup for SCM HA. (apache#2972) HDDS-6202. Avoid using jmh-generator-annprocess since it is GPL2.0. (apache#2998) HDDS-6163. Fix PATH in docker image (apache#2967) ...
…ansport support (#3074) * HDDS-6149. Remove unused keytabs (#2960) * HDDS-6094. Some unit tests are skipped due to using JUnit4 runner (#2909) * HDDS-6075. OzoneConfiguration constructor overrides input configuration keys. (#2921) * HDDS-4177. SCM Container DB bootstrap on Recon startup (#2942) * HDDS-6086. Compute MD5MD5CRC file checksum using chunk checksums from DataNodes (#2919) * HDDS-6148. Validate ContainerBalancerConfiguration before start ContainerBalancer (#2957) * HDDS-6161. SCM StateMachine failing to reinitialize doesn't terminate the process. (#2971) * HDDS-6134. Move replication-specific config to ReplicationServer (#2943) * HDDS-4010. S3G startup fails when multiple service ids are configured. (#2976) * HDDS-6170. Add metrics to replication manager to track container health states (#2975) * HDDS-3231. Cleanup KeyManagerImpl (#2961) * HDDS-5927. Improve defaults in ContainerBalancerConfiguration (#2892) * HDDS-6157. More consistent synchronization in InputStreams (#2965) * HDDS-4743. [FSO] Add FSO variant of ITestOzoneContractDistcp. (#2980) * HDDS-6114. Intermittent error due to Failed to init RocksDB (#2947) * HDDS-6175. Use s3Auth during proxy during decrypt in RpcClient. (#2981) * HDDS-6175. Use s3Auth during proxy during decrypt in RpcClient. * HDDS-5740. Enable ratis by default for SCM. (#2637) * HDDS-6183. Intermittent failure in TestKeyDeletingService.checkIfDeleteServiceWithFailingSCM. (#2991) * HDDS-4190. Intermittent failure in TestOMVolumeSetOwnerRequest and TestOMVolumeSetQuotaRequest. (#2982) * HDDS-6120. Compute block checksum using chunk checksums (#2930) * HDDS-6147. Add ability in OM to get limited delta updates (#2956) * HDDS-6195. Remove unused jmh-core dependency. (#2997) * HDDS-6167. Update ozone-runner version to 20211202-1 (#2969) * HDDS-6171. Create an API to change Bucket Owner (#2988) * HDDS-6163. Fix PATH in docker image (#2967) * HDDS-6202. Avoid using jmh-generator-annprocess since it is GPL2.0. (#2998) * HDDS-6135. SCM Container DB bootstrap on Recon startup for SCM HA. (#2972) * HDDS-6109. Preserve the underlying exception raised in client lib. (#2989) * HDDS-3408. Rename ChunkLayOutVersion to ContainerLayoutVersion. (#2983) * HDDS-6203. CleanUp incomplete gz files during Container move (#3000) * HDDS-6216. Move OMOpenKeysDeleteRequest to package om.request.key (#3011) * HDDS-6191. Intermittent failure in TestDeleteWithSlowFollower (#3015) * HDDS-6128. CLI tool that downloads all the block replicas and creates a manifest file (#2987) * HDDS-6177. Extend container info command to include replica details (#2995) * HDDS-6211. [Docs] Image styling on deployed site does not replicate local builds. (#3007) * HDDS-6219. Switch to RATIS ReplicationType from STAND_ALONE in our tests. (#3014) * HDDS-6192. feature/Observability.md translated to Chinese (#2994) * HDDS-6205. Add CLI command to display the latest Replication Manager report (#3013) * HDDS-6227. Test helpers should observe naming conditions (#3020) * HDDS-6239. ozonesecure-mr failing with No URLs in mirrorlist (#3029) * HDDS-6201. Fix NPE for DataScanner with scanned container deleted by others. (#3005) * HDDS-5529. For any IOexception from @REPLicated method we should throw it (#2788) * HDDS-6181. Change SCMHAInvocationHandler#invokeRatis() logging to TRACE (#2992) * HDDS-6206. Application errors must not flood system log (#3001) * HDDS-6245. Add BucketLayout logging to Audit Logs (#3040) * HDDS-6238 Reduce memory requirements for list keys. (#3032) * HDDS-2919. Intermittent failure in TestRDBStore (#3028) * HDDS-6253. Unnecessary duplicate smoketest after defaulting to FSO (#3036) * HDDS-6204. Cleanup handling malformed authorization header (#2999) * HDDS-6169. Selective checks: skip junit tests on ozone-runner image update (#2974) * HDDS-6270. Use a dedicated file instead of /etc/passwd for xcompat acceptance test (#3050) * HDDS-6273. Amend doc SecuringTDE.md (#3047) * HDDS-6140. Selective checks: skip unit check for integration-test changes (#2948) * HDDS-6215. Recon get limited delta updates from OM (#3009) * HDDS-6125. Recon get limited delta updates from OM * HDDS-6215. Fix unit test * trigger new CI check * HDDS-6215. Fix typo * trigger new CI check Co-authored-by: Symious <[email protected]> * HDDS-6226. Run tests for selective CI checks in CI (#3019) * HDDS-6247. Avoid logging stack trace for user input problems (#3039) * HDDS-6208. New checkstyle: WhitespaceAround (#3003) * HDDS-6289. Upgrade acceptance test log flooded with parse error (#3063) Co-authored-by: Siyao Meng <[email protected]> * Trigger Build * Fix integration test for added configuation field for selecting OmTransport for s3 gateway - TestOzoneConfigurationFields (added config key not in xml). Co-authored-by: Doroszlai, Attila <[email protected]> Co-authored-by: Lokesh Jain <[email protected]> Co-authored-by: Aswin Shakil Balasubramanian <[email protected]> Co-authored-by: Wei-Chiu Chuang <[email protected]> Co-authored-by: Symious <[email protected]> Co-authored-by: Bharat Viswanadham <[email protected]> Co-authored-by: Stephen O'Donnell <[email protected]> Co-authored-by: GeorgeJahad <[email protected]> Co-authored-by: Siddhant Sangwan <[email protected]> Co-authored-by: Jyotinder Singh <[email protected]> Co-authored-by: Shashikant Banerjee <[email protected]> Co-authored-by: Tsz-Wo Nicholas Sze <[email protected]> Co-authored-by: Kiyoshi Mizumaru <[email protected]> Co-authored-by: Ritesh H Shukla <[email protected]> Co-authored-by: Nibiru <[email protected]> Co-authored-by: Kaijie Chen <[email protected]> Co-authored-by: Zita Dombi <[email protected]> Co-authored-by: Istvan Fajth <[email protected]> Co-authored-by: steinsgateted <[email protected]> Co-authored-by: Gui Hecheng <[email protected]> Co-authored-by: Jackson Yao <[email protected]> Co-authored-by: Keyi Song <[email protected]> Co-authored-by: UENISHI Kota <[email protected]> Co-authored-by: Siyao Meng <[email protected]> Co-authored-by: Symious <[email protected]>
What changes were proposed in this pull request?
In this change I added a debug CLI tool, called read-replicas. The tool for a given key will download every blocks every replicas (from each datanode) and also creates a manifest file with informations about the key, its blocks and replicas.
The command can be used like this:
ozone debug read-replicas /vol1/bucket1/testfileIt will create a directory with the volume, bucket and key name; it also adds a timestamp, so you can re-run it. The downloaded replicas and the manifest file will be in this directory. The manifest file will look something like this:
If there is a checksum mismatch with one of the replicas it will be marked in the manifest file and the corrupt replica is still going to be downloaded. If one of the datanodes is down (and still alive) it will also be marked in the manifest file.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-6128
How was this patch tested?
Built the project successfully and tested the command locally in docker environment. Specific tests will be added as the next sub-task.