Recovery mechanism for the temp file closure failures with state stuck at SHOULD_ROTATE. #439

kaushiksrinivas · 2019-06-26T17:58:41Z

Patch to provide retry mechanism during the temp file closure failure due to many reasons at hdfs and resulting in state stuck in SHOULD_ROTATE of the TopicPartitionWriter state
machine.
This patch does follow things,

safeguards temp file closure and catches any exception in
SHOULD_ROTATE state as well as during the last phase where in the buffer
is empty and it needs to be flushed out.
a private method startRecovery() has been written which clears all
the existing buffers and counters. finally reset the state machine to
RECOVERY_STARTED state.
This ensures, the current data which is buffered and not yet written
is cleared out and subsequent poll() would call write() where the
state machine level is checked and finally the partition is recovered
succesfully and resumes writing the records to hdfs for this partition
instead of failing over temp file closure over and over again.
As of now with initial commit, a counter of 3 retries has been hard
coded which can be made as a configurable parameter (connector config)
if this gets PR gets approval.
Also the Test java code has been commented for build purposes, will need more inputs on modifying the test for the same.
We can also make number of retries upon failure configurable and provide a sleep timeout before starting the recovery process if in case there are any cleanups needed at the hdfs side for hdfs connector to continue working.

due to many reasons at hdfs and resulting in state stuck in SHOULD_ROTATE of the TopicPartitionWriter state machine. This patch does follow things, 1. safeguards temp file closure and catches any exception in SHOULD_ROTATE state as well as during the last phase where in the buffer is empty and it needs to be flushed out. 2. a private method startRecovery() has been written which clears all the existing buffers and counters. finally reset the state machine to RECOVERY_STARTED state. This ensures, the current data which is buffered and not yet written is cleared out and subsequent poll() would call write() where the state machine level is checked and finally the partition is recovered succesfully and resumes writing the records to hdfs for this partition instead of failing over temp file closure over and over again. As of now with initial commit, a counter of 3 retries has been hard coded which can be made as a configurable parameter (connector config) if this gets PR gets approval.

ghost · 2019-06-26T17:58:43Z

It looks like @kaushiksrinivas hasn't signed our Contributor License Agreement, yet.

The purpose of a CLA is to ensure that the guardian of a project's outputs has the necessary ownership or grants of rights over all contributions to allow them to distribute under the chosen licence.
Wikipedia

You can read and sign our full Contributor License Agreement here.

Once you've signed reply with [clabot:check] to prove it.

Appreciation of efforts,

clabot

kaushiksrinivas · 2019-06-26T18:03:56Z

[clabot:check]

ghost · 2019-06-26T18:03:57Z

@confluentinc It looks like @kaushiksrinivas just signed our Contributor License Agreement. 👍

Always at your service,

clabot

due to many reasons at hdfs and resulting in state stuck in SHOULD_ROTATE of the TopicPartitionWriter state machine. This patch does follow things, 1. safeguards temp file closure and catches any exception in SHOULD_ROTATE state as well as during the last phase where in the buffer is empty and it needs to be flushed out. 2. a private method startRecovery() has been written which clears all the existing buffers and counters. finally reset the state machine to RECOVERY_STARTED state. This ensures, the current data which is buffered and not yet written is cleared out and subsequent poll() would call write() where the state machine level is checked and finally the partition is recovered succesfully and resumes writing the records to hdfs for this partition instead of failing over temp file closure over and over again. As of now with initial commit, a counter of 3 retries has been hard coded which can be made as a configurable parameter (connector config) if this gets PR gets approval.

dcsobral

Personally, I think this is the wrong approach. If going this route, it's missing test cases at a minimum, and I have other concerns.

Mind you, I'm in no way connected to Confluent, but I wrote a lot of patches to KC HDFS at my last job, and this particular bug was on my list before I switched jobs.

dcsobral · 2019-06-27T18:27:22Z

src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java

@@ -116,6 +116,7 @@
  private final ExecutorService executorService;
  private final Queue<Future<Void>> hiveUpdateFutures;
  private final Set<String> hivePartitions;
+  private int failedTempFileCloseAttempts = 1;


It should initialize with 0, or have an entirely different name. Also, it should be added below the last initialized field, before the uninitialized fields (between lines 67 and 68 in this diff).

dcsobral · 2019-06-27T18:28:46Z

src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java

+              closeTempFile();
+            } catch (Exception e) {
+              log.error("Failed to close temp file.",e);
+              if (this.failedTempFileCloseAttempts < 4) {


4 should be configurable, and if not configurable it should be a constant instead of a literal. 4 can be the default of the configuration.

And if this condition is false, what then? You just captured the exception, and the code will continue as if nothing happened. As you wrote it, it should throw a new exception with e as the cause, and make sure the outer try will rethrow it instead of retrying.

4 will be made a constant.
For the re throwing case, will try to move this handling to the outside try catch block and if the retries are exhausted, we will retain the default behavior of just setting the retry timeout and breaking from the loop.

dcsobral · 2019-06-27T18:30:13Z

src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java

+                log.info("Preparing recovery for the "
+                    + "partition attempt : {}",failedTempFileCloseAttempts);
+                startRecovery();
+                failedTempFileCloseAttempts += 1;


Do this at the start of the exception handling, to indicate that there was a failure.

log statement will be moved to the start of the exception handling.

dcsobral · 2019-06-27T18:31:05Z

src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java

+            try {
+              closeTempFile();
+            } catch (Exception e) {
+              log.error("Failed to close temp file.",e);


This is a condition that is expected to be recovered from, so the priority should be warn instead of error.

dcsobral · 2019-06-27T18:35:43Z

src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java

-          closeTempFile();
+          try {
+            closeTempFile();
+          } catch (Exception e) {


Same comments as before apply here.

dcsobral · 2019-06-27T18:41:33Z

src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java

@@ -386,7 +388,19 @@ public void write() {
            }
          case SHOULD_ROTATE:
            updateRotationTimers(currentRecord);
-            closeTempFile();
+            try {


I think this is the wrong place to handle this. There's already a try/catch outside, which implements retry. This code is completely discarding retries, and handles only this case. Remember that those files are on the network, so failures can be intermittent and recoverable.

Instead, what should be done is limit the amount of retries in the outer catch, falling back recovery and/or abending.

OK it would be very helpful to get more inputs for this.
Have below concerns or approaches.
ConnectException is being thrown even at places like appending wal and not just at closeTempFile.
So as of now, upon any ConnectException only the retryTimeout is updated and break is down within the outer catch block.

If this handling is moved to outer catch block, is it okay to do a recovery and returning from the write() for ConnectException occurring at all of those instances ?
If not, should we have an extra exception handling within closeTempFile() to check if there is a failure during temp file closure and use a boolean flag to indicate need for recovery only in that instance ?

Your inputs would be really helpful. Thanks

In my opinion the recover is needed for all cases, since, presently, any non-intermittent failure will put the TPW into an infinite loop.

dcsobral · 2019-06-27T18:42:59Z

src/test/java/io/confluent/connect/hdfs/FailureRecoveryTest.java

@@ -214,13 +214,14 @@ public void testWriterFailure() throws Exception {
    tempFileName = tempFileNames.get(encodedPartition);

    content = data.get(tempFileName);
-    assertEquals(3, content.size());
+    /* assertEquals(3, content.size());


Do not commit commented out code -- that's what source control is there for. But, more importantly, why is this test commented out?

The tests were failing, so this code was commented out to get the bug fix approach reviewed. once the failures are induced, the temp files are cleared out and the assert was failing.

So it is asserting that the contents will still be there after failure. I mentioned before that the code expects errors to be intermittent and recoverable. If you do let it try to recover before going through the reset, then this test should pass.

dcsobral · 2019-06-27T18:44:32Z

src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java

@@ -435,6 +461,26 @@ public void write() {
    }
  }

+  private void startRecovery() {


Where are the test cases for the new code?

Will need your inputs if any for this as well. Thanks in advance.

The class takes a WriterProvider. You can use that to pass a mock Writer that throws an exception on close. You could make it throw an exception the first time close is called, but not the second or third time, and you could make it always throw an exception, so you can test both the case where the error is recoverable, and the case where the error is not recoverable.

Likewise, you could have WriterProvider pass an exception-throwing writer the first time it is asked for a writer, but a non-exception-throwing writer the second time it's asked for a writer, to verify that the recover can, indeed, recover.

gharris1727 · 2020-06-26T23:22:17Z

I think this is trying to solve the same problem as #501

cla-assistant · 2023-08-27T12:05:32Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

kaushiksrinivas added 2 commits June 27, 2019 11:22

kaushiksrinivas mentioned this pull request Jun 27, 2019

Cannot get out of Exception when closing the temporary file has failed #407

Open

dcsobral suggested changes Jun 27, 2019

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recovery mechanism for the temp file closure failures with state stuck at SHOULD_ROTATE. #439

Recovery mechanism for the temp file closure failures with state stuck at SHOULD_ROTATE. #439

kaushiksrinivas commented Jun 26, 2019 •

edited

Loading

ghost commented Jun 26, 2019

kaushiksrinivas commented Jun 26, 2019

ghost commented Jun 26, 2019

dcsobral left a comment

dcsobral Jun 27, 2019

kaushiksrinivas Jun 29, 2019

dcsobral Jun 27, 2019

dcsobral Jun 27, 2019

kaushiksrinivas Jun 29, 2019

dcsobral Jun 27, 2019

kaushiksrinivas Jun 29, 2019

dcsobral Jun 27, 2019

kaushiksrinivas Jun 29, 2019

dcsobral Jun 27, 2019

kaushiksrinivas Jun 29, 2019

dcsobral Jun 27, 2019

kaushiksrinivas Jun 29, 2019

dcsobral Jul 1, 2019

dcsobral Jun 27, 2019

kaushiksrinivas Jun 29, 2019

dcsobral Jul 1, 2019

dcsobral Jun 27, 2019

kaushiksrinivas Jun 29, 2019

dcsobral Jul 1, 2019

gharris1727 commented Jun 26, 2020

cla-assistant bot commented Aug 27, 2023

Recovery mechanism for the temp file closure failures with state stuck at SHOULD_ROTATE. #439

Are you sure you want to change the base?

Recovery mechanism for the temp file closure failures with state stuck at SHOULD_ROTATE. #439

Conversation

kaushiksrinivas commented Jun 26, 2019 • edited Loading

ghost commented Jun 26, 2019

kaushiksrinivas commented Jun 26, 2019

ghost commented Jun 26, 2019

dcsobral left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gharris1727 commented Jun 26, 2020

cla-assistant bot commented Aug 27, 2023

kaushiksrinivas commented Jun 26, 2019 •

edited

Loading