HDDS-6109 Preserve the underlying exception raised in client lib. #2989

kerneltime · 2022-01-13T23:34:47Z

What changes were proposed in this pull request?

The underlying exceptions should not be mapped to RuntimeException.
This results in higher layers not retrying on valid failure scenarios.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6109?filter=-1

How was this patch tested?

Included integration tests that introduce a single and dual replica failure both of which should result in the unflushed data to be written to a new location.

kerneltime · 2022-01-13T23:37:03Z

Ref: 565972c #2795

kerneltime · 2022-01-13T23:37:26Z

cc @hanishakoneru

bharatviswa504

Overall LGTM.
Thank You @kerneltime for the root causing and fixing this issue.

bharatviswa504 · 2022-01-14T01:52:17Z

...test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java

There are lot of space changes, is it intentional?? If it is not, can you revert those.

adoroszlai

Thanks @kerneltime for the patch.

...test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java

adoroszlai · 2022-01-14T10:06:17Z

@kerneltime Do we need to restart SCM and datanodes between test cases? If so, do need as many as 10 datanodes?

The previous change in this test class added:

ozone/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java

Lines 174 to 182 in 3eb7235

    
           @BeforeEach 
        
           public void restartDatanode() 
        
               throws InterruptedException, TimeoutException, AuthenticationException, 
        
               IOException { 
        
             for (int i=0; i < cluster.getHddsDatanodes().size(); i++) { 
        
               cluster.restartHddsDatanode(i, true); 
        
             } 
        
             cluster.restartStorageContainerManager(true); 
        
           }

@BeforeEach is from JUnit5, but the rest of the class uses JUnit4 annotations. Thus restartDatanode() is never called. The corresponding JUnit4 annotation is @Before. (Note: I plan to change this test to JUnit5 in HDDS-6095.)

Fixing the annotation would increase test execution time significantly from 100 seconds. I don't know how much, because Surefire fork gets killed at 20 minutes.

Also, if the annotation was fixed, SCM and datanodes would be restarted even before the first test case, which is unnecessary. It may be better to change the cluster to per-test.

master with per-test cluster:

Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 236.074 s - in org.apache.hadoop.ozone.client.rpc.TestContainerStateMachineFailures

This patch with per-test cluster:

Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 298.905 s - in org.apache.hadoop.ozone.client.rpc.TestContainerStateMachineFailures

This patch with per-test cluster with 6 datanodes (not much improvement here):

Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 292.42 s - in org.apache.hadoop.ozone.client.rpc.TestContainerStateMachineFailures

kerneltime · 2022-01-18T04:41:00Z

@kerneltime Do we need to restart SCM and datanodes between test cases? If so, do need as many as 10 datanodes?

The previous change in this test class added:

ozone/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java

Lines 174 to 182 in 3eb7235

@BeforeEach

public void restartDatanode()

throws InterruptedException, TimeoutException, AuthenticationException,

IOException {

for (int i=0; i < cluster.getHddsDatanodes().size(); i++) {

cluster.restartHddsDatanode(i, true);

}

cluster.restartStorageContainerManager(true);

}

@BeforeEach is from JUnit5, but the rest of the class uses JUnit4 annotations. Thus restartDatanode() is never called. The corresponding JUnit4 annotation is @Before. (Note: I plan to change this test to JUnit5 in HDDS-6095.)

Fixing the annotation would increase test execution time significantly from 100 seconds. I don't know how much, because Surefire fork gets killed at 20 minutes.

Also, if the annotation was fixed, SCM and datanodes would be restarted even before the first test case, which is unnecessary. It may be better to change the cluster to per-test.

master with per-test cluster:
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 236.074 s - in org.apache.hadoop.ozone.client.rpc.TestContainerStateMachineFailures
This patch with per-test cluster:
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 298.905 s - in org.apache.hadoop.ozone.client.rpc.TestContainerStateMachineFailures
This patch with per-test cluster with 6 datanodes (not much improvement here):
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 292.42 s - in org.apache.hadoop.ozone.client.rpc.TestContainerStateMachineFailures

Let me look more deeper on the performance.
These tests had 10 but I don't think we need 10 DNs. We would need more than 3. Let me fix the annotation and look into performance. Without the restart, the failures induced can lead to other tests being flaky.

The underlying exceptions should not be mapped to RuntimeException. This results in higher layers not retrying on valid failure scenarios.

kerneltime · 2022-01-19T04:40:11Z

Thanks @adoroszlai for the review and catching the version for the annotation.
I updated the tests and ran them multiple times to make sure they are not flaky. Hopefully that is enough, will see what happens with the CI run. This should avoid a massive increase in overall test time as well as test the intended code path.

As long as the container that the objects were initially flushed to, is moved to state other than open, the retry logic kicks in to find a different pipeline and complete the write and flush. Thus, expecting a count of 2 for the number of locations seems valid. Before the fix if the exception raised was inherited from IOException the retry logic would not kick in and the flush()/close() would fail.

adoroszlai

Thanks @kerneltime for updating the patch.

adoroszlai · 2022-01-21T07:21:24Z

Thanks @kerneltime for the patch, @bharatviswa504 for the review.

@REPLicated

* master: (27 commits) HDDS-6206. Application errors must not flood system log (apache#3001) HDDS-6181. Change SCMHAInvocationHandler#invokeRatis() logging to TRACE (apache#2992) HDDS-5529. For any IOexception from @REPLicated method we should throw it (apache#2788) HDDS-6201. Fix NPE for DataScanner with scanned container deleted by others. (apache#3005) HDDS-6239. ozonesecure-mr failing with No URLs in mirrorlist (apache#3029) HDDS-6227. Test helpers should observe naming conditions (apache#3020) HDDS-6205. Add CLI command to display the latest Replication Manager report (apache#3013) HDDS-6192. feature/Observability.md translated to Chinese (apache#2994) HDDS-6219. Switch to RATIS ReplicationType from STAND_ALONE in our tests. (apache#3014) HDDS-6211. [Docs] Image styling on deployed site does not replicate local builds. (apache#3007) HDDS-6177. Extend container info command to include replica details (apache#2995) HDDS-6128. CLI tool that downloads all the block replicas and creates a manifest file (apache#2987) HDDS-6191. Intermittent failure in TestDeleteWithSlowFollower (apache#3015) HDDS-6216. Move OMOpenKeysDeleteRequest to package om.request.key (apache#3011) HDDS-6203. CleanUp incomplete gz files during Container move (apache#3000) HDDS-3408. Rename ChunkLayOutVersion to ContainerLayoutVersion. (apache#2983) HDDS-6109. Preserve the underlying exception raised in client lib. (apache#2989) HDDS-6135. SCM Container DB bootstrap on Recon startup for SCM HA. (apache#2972) HDDS-6202. Avoid using jmh-generator-annprocess since it is GPL2.0. (apache#2998) HDDS-6163. Fix PATH in docker image (apache#2967) ... Conflicts: hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/keyvalue/TestKeyValueContainerCheck.java

@REPLicated

* master: (27 commits) HDDS-6206. Application errors must not flood system log (apache#3001) HDDS-6181. Change SCMHAInvocationHandler#invokeRatis() logging to TRACE (apache#2992) HDDS-5529. For any IOexception from @REPLicated method we should throw it (apache#2788) HDDS-6201. Fix NPE for DataScanner with scanned container deleted by others. (apache#3005) HDDS-6239. ozonesecure-mr failing with No URLs in mirrorlist (apache#3029) HDDS-6227. Test helpers should observe naming conditions (apache#3020) HDDS-6205. Add CLI command to display the latest Replication Manager report (apache#3013) HDDS-6192. feature/Observability.md translated to Chinese (apache#2994) HDDS-6219. Switch to RATIS ReplicationType from STAND_ALONE in our tests. (apache#3014) HDDS-6211. [Docs] Image styling on deployed site does not replicate local builds. (apache#3007) HDDS-6177. Extend container info command to include replica details (apache#2995) HDDS-6128. CLI tool that downloads all the block replicas and creates a manifest file (apache#2987) HDDS-6191. Intermittent failure in TestDeleteWithSlowFollower (apache#3015) HDDS-6216. Move OMOpenKeysDeleteRequest to package om.request.key (apache#3011) HDDS-6203. CleanUp incomplete gz files during Container move (apache#3000) HDDS-3408. Rename ChunkLayOutVersion to ContainerLayoutVersion. (apache#2983) HDDS-6109. Preserve the underlying exception raised in client lib. (apache#2989) HDDS-6135. SCM Container DB bootstrap on Recon startup for SCM HA. (apache#2972) HDDS-6202. Avoid using jmh-generator-annprocess since it is GPL2.0. (apache#2998) HDDS-6163. Fix PATH in docker image (apache#2967) ...

@REPLicated

…ansport support (#3074) * HDDS-6149. Remove unused keytabs (#2960) * HDDS-6094. Some unit tests are skipped due to using JUnit4 runner (#2909) * HDDS-6075. OzoneConfiguration constructor overrides input configuration keys. (#2921) * HDDS-4177. SCM Container DB bootstrap on Recon startup (#2942) * HDDS-6086. Compute MD5MD5CRC file checksum using chunk checksums from DataNodes (#2919) * HDDS-6148. Validate ContainerBalancerConfiguration before start ContainerBalancer (#2957) * HDDS-6161. SCM StateMachine failing to reinitialize doesn't terminate the process. (#2971) * HDDS-6134. Move replication-specific config to ReplicationServer (#2943) * HDDS-4010. S3G startup fails when multiple service ids are configured. (#2976) * HDDS-6170. Add metrics to replication manager to track container health states (#2975) * HDDS-3231. Cleanup KeyManagerImpl (#2961) * HDDS-5927. Improve defaults in ContainerBalancerConfiguration (#2892) * HDDS-6157. More consistent synchronization in InputStreams (#2965) * HDDS-4743. [FSO] Add FSO variant of ITestOzoneContractDistcp. (#2980) * HDDS-6114. Intermittent error due to Failed to init RocksDB (#2947) * HDDS-6175. Use s3Auth during proxy during decrypt in RpcClient. (#2981) * HDDS-6175. Use s3Auth during proxy during decrypt in RpcClient. * HDDS-5740. Enable ratis by default for SCM. (#2637) * HDDS-6183. Intermittent failure in TestKeyDeletingService.checkIfDeleteServiceWithFailingSCM. (#2991) * HDDS-4190. Intermittent failure in TestOMVolumeSetOwnerRequest and TestOMVolumeSetQuotaRequest. (#2982) * HDDS-6120. Compute block checksum using chunk checksums (#2930) * HDDS-6147. Add ability in OM to get limited delta updates (#2956) * HDDS-6195. Remove unused jmh-core dependency. (#2997) * HDDS-6167. Update ozone-runner version to 20211202-1 (#2969) * HDDS-6171. Create an API to change Bucket Owner (#2988) * HDDS-6163. Fix PATH in docker image (#2967) * HDDS-6202. Avoid using jmh-generator-annprocess since it is GPL2.0. (#2998) * HDDS-6135. SCM Container DB bootstrap on Recon startup for SCM HA. (#2972) * HDDS-6109. Preserve the underlying exception raised in client lib. (#2989) * HDDS-3408. Rename ChunkLayOutVersion to ContainerLayoutVersion. (#2983) * HDDS-6203. CleanUp incomplete gz files during Container move (#3000) * HDDS-6216. Move OMOpenKeysDeleteRequest to package om.request.key (#3011) * HDDS-6191. Intermittent failure in TestDeleteWithSlowFollower (#3015) * HDDS-6128. CLI tool that downloads all the block replicas and creates a manifest file (#2987) * HDDS-6177. Extend container info command to include replica details (#2995) * HDDS-6211. [Docs] Image styling on deployed site does not replicate local builds. (#3007) * HDDS-6219. Switch to RATIS ReplicationType from STAND_ALONE in our tests. (#3014) * HDDS-6192. feature/Observability.md translated to Chinese (#2994) * HDDS-6205. Add CLI command to display the latest Replication Manager report (#3013) * HDDS-6227. Test helpers should observe naming conditions (#3020) * HDDS-6239. ozonesecure-mr failing with No URLs in mirrorlist (#3029) * HDDS-6201. Fix NPE for DataScanner with scanned container deleted by others. (#3005) * HDDS-5529. For any IOexception from @REPLicated method we should throw it (#2788) * HDDS-6181. Change SCMHAInvocationHandler#invokeRatis() logging to TRACE (#2992) * HDDS-6206. Application errors must not flood system log (#3001) * HDDS-6245. Add BucketLayout logging to Audit Logs (#3040) * HDDS-6238 Reduce memory requirements for list keys. (#3032) * HDDS-2919. Intermittent failure in TestRDBStore (#3028) * HDDS-6253. Unnecessary duplicate smoketest after defaulting to FSO (#3036) * HDDS-6204. Cleanup handling malformed authorization header (#2999) * HDDS-6169. Selective checks: skip junit tests on ozone-runner image update (#2974) * HDDS-6270. Use a dedicated file instead of /etc/passwd for xcompat acceptance test (#3050) * HDDS-6273. Amend doc SecuringTDE.md (#3047) * HDDS-6140. Selective checks: skip unit check for integration-test changes (#2948) * HDDS-6215. Recon get limited delta updates from OM (#3009) * HDDS-6125. Recon get limited delta updates from OM * HDDS-6215. Fix unit test * trigger new CI check * HDDS-6215. Fix typo * trigger new CI check Co-authored-by: Symious <[email protected]> * HDDS-6226. Run tests for selective CI checks in CI (#3019) * HDDS-6247. Avoid logging stack trace for user input problems (#3039) * HDDS-6208. New checkstyle: WhitespaceAround (#3003) * HDDS-6289. Upgrade acceptance test log flooded with parse error (#3063) Co-authored-by: Siyao Meng <[email protected]> * Trigger Build * Fix integration test for added configuation field for selecting OmTransport for s3 gateway - TestOzoneConfigurationFields (added config key not in xml). Co-authored-by: Doroszlai, Attila <[email protected]> Co-authored-by: Lokesh Jain <[email protected]> Co-authored-by: Aswin Shakil Balasubramanian <[email protected]> Co-authored-by: Wei-Chiu Chuang <[email protected]> Co-authored-by: Symious <[email protected]> Co-authored-by: Bharat Viswanadham <[email protected]> Co-authored-by: Stephen O'Donnell <[email protected]> Co-authored-by: GeorgeJahad <[email protected]> Co-authored-by: Siddhant Sangwan <[email protected]> Co-authored-by: Jyotinder Singh <[email protected]> Co-authored-by: Shashikant Banerjee <[email protected]> Co-authored-by: Tsz-Wo Nicholas Sze <[email protected]> Co-authored-by: Kiyoshi Mizumaru <[email protected]> Co-authored-by: Ritesh H Shukla <[email protected]> Co-authored-by: Nibiru <[email protected]> Co-authored-by: Kaijie Chen <[email protected]> Co-authored-by: Zita Dombi <[email protected]> Co-authored-by: Istvan Fajth <[email protected]> Co-authored-by: steinsgateted <[email protected]> Co-authored-by: Gui Hecheng <[email protected]> Co-authored-by: Jackson Yao <[email protected]> Co-authored-by: Keyi Song <[email protected]> Co-authored-by: UENISHI Kota <[email protected]> Co-authored-by: Siyao Meng <[email protected]> Co-authored-by: Symious <[email protected]>

kerneltime force-pushed the HDDS-6109 branch from 879f9b5 to b5e6c5c Compare January 13, 2022 23:59

bharatviswa504 reviewed Jan 14, 2022

View reviewed changes

adoroszlai reviewed Jan 14, 2022

View reviewed changes

...test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java Outdated Show resolved Hide resolved

...test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java Outdated Show resolved Hide resolved

HDDS-6109 Preserve the underlying exception raised in client lib.

fb7bab6

The underlying exceptions should not be mapped to RuntimeException. This results in higher layers not retrying on valid failure scenarios.

kerneltime force-pushed the HDDS-6109 branch from b5e6c5c to fb7bab6 Compare January 19, 2022 04:35

adoroszlai approved these changes Jan 19, 2022

View reviewed changes

adoroszlai merged commit 66aadb0 into apache:master Jan 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HDDS-6109 Preserve the underlying exception raised in client lib. #2989

HDDS-6109 Preserve the underlying exception raised in client lib. #2989

Uh oh!

kerneltime commented Jan 13, 2022

Uh oh!

kerneltime commented Jan 13, 2022 •

edited

Loading

Uh oh!

kerneltime commented Jan 13, 2022

Uh oh!

bharatviswa504 left a comment

Uh oh!

bharatviswa504 Jan 14, 2022 •

edited

Loading

Uh oh!

adoroszlai left a comment

Uh oh!

Uh oh!

Uh oh!

adoroszlai commented Jan 14, 2022

Uh oh!

kerneltime commented Jan 18, 2022 •

edited

Loading

Uh oh!

kerneltime commented Jan 19, 2022 •

edited

Loading

Uh oh!

adoroszlai left a comment

Uh oh!

adoroszlai commented Jan 21, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

HDDS-6109 Preserve the underlying exception raised in client lib. #2989

HDDS-6109 Preserve the underlying exception raised in client lib. #2989

Uh oh!

Conversation

kerneltime commented Jan 13, 2022

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

kerneltime commented Jan 13, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kerneltime commented Jan 13, 2022

Uh oh!

bharatviswa504 left a comment

Choose a reason for hiding this comment

Uh oh!

bharatviswa504 Jan 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adoroszlai left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

adoroszlai commented Jan 14, 2022

Uh oh!

kerneltime commented Jan 18, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kerneltime commented Jan 19, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adoroszlai left a comment

Choose a reason for hiding this comment

Uh oh!

adoroszlai commented Jan 21, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kerneltime commented Jan 13, 2022 •

edited

Loading

bharatviswa504 Jan 14, 2022 •

edited

Loading

kerneltime commented Jan 18, 2022 •

edited

Loading

kerneltime commented Jan 19, 2022 •

edited

Loading