Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recovery mechanism for the temp file closure failures with state stuck at SHOULD_ROTATE. #439

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

kaushiksrinivas
Copy link

@kaushiksrinivas kaushiksrinivas commented Jun 26, 2019

Patch to provide retry mechanism during the temp file closure failure due to many reasons at hdfs and resulting in state stuck in SHOULD_ROTATE of the TopicPartitionWriter state
machine.
This patch does follow things,

  1. safeguards temp file closure and catches any exception in
    SHOULD_ROTATE state as well as during the last phase where in the buffer
    is empty and it needs to be flushed out.
  2. a private method startRecovery() has been written which clears all
    the existing buffers and counters. finally reset the state machine to
    RECOVERY_STARTED state.
    This ensures, the current data which is buffered and not yet written
    is cleared out and subsequent poll() would call write() where the
    state machine level is checked and finally the partition is recovered
    succesfully and resumes writing the records to hdfs for this partition
    instead of failing over temp file closure over and over again.
    As of now with initial commit, a counter of 3 retries has been hard
    coded which can be made as a configurable parameter (connector config)
    if this gets PR gets approval.
    Also the Test java code has been commented for build purposes, will need more inputs on modifying the test for the same.
    We can also make number of retries upon failure configurable and provide a sleep timeout before starting the recovery process if in case there are any cleanups needed at the hdfs side for hdfs connector to continue working.

due to many reasons at hdfs and resulting in state stuck in SHOULD_ROTATE of the TopicPartitionWriter state
machine.
This patch does follow things,
1. safeguards temp file closure and catches any exception in
SHOULD_ROTATE state as well as during the last phase where in the buffer
is empty and it needs to be flushed out.
2. a private method startRecovery() has been written which clears all
the existing buffers and counters. finally reset the state machine to
RECOVERY_STARTED state.
  This ensures, the current data which is buffered and not yet written
  is cleared out and subsequent poll() would call write() where the
  state machine level is checked and finally the partition is recovered
  succesfully and resumes writing the records to hdfs for this partition
  instead of failing over temp file closure over and over again.
As of now with initial commit, a counter of 3 retries has been hard
coded which can be made as a configurable parameter (connector config)
if this gets PR gets approval.
@ghost
Copy link

ghost commented Jun 26, 2019

It looks like @kaushiksrinivas hasn't signed our Contributor License Agreement, yet.

The purpose of a CLA is to ensure that the guardian of a project's outputs has the necessary ownership or grants of rights over all contributions to allow them to distribute under the chosen licence.
Wikipedia

You can read and sign our full Contributor License Agreement here.

Once you've signed reply with [clabot:check] to prove it.

Appreciation of efforts,

clabot

@kaushiksrinivas
Copy link
Author

[clabot:check]

@ghost
Copy link

ghost commented Jun 26, 2019

@confluentinc It looks like @kaushiksrinivas just signed our Contributor License Agreement. 👍

Always at your service,

clabot

due to many reasons at hdfs and resulting in state stuck in SHOULD_ROTATE of the TopicPartitionWriter state
machine.
This patch does follow things,
1. safeguards temp file closure and catches any exception in
SHOULD_ROTATE state as well as during the last phase where in the buffer
is empty and it needs to be flushed out.
2. a private method startRecovery() has been written which clears all
the existing buffers and counters. finally reset the state machine to
RECOVERY_STARTED state.
This ensures, the current data which is buffered and not yet written
is cleared out and subsequent poll() would call write() where the
state machine level is checked and finally the partition is recovered
succesfully and resumes writing the records to hdfs for this partition
instead of failing over temp file closure over and over again.
As of now with initial commit, a counter of 3 retries has been hard
coded which can be made as a configurable parameter (connector config)
if this gets PR gets approval.
due to many reasons at hdfs and resulting in state stuck in SHOULD_ROTATE of the TopicPartitionWriter state
machine.
This patch does follow things,
1. safeguards temp file closure and catches any exception in
SHOULD_ROTATE state as well as during the last phase where in the buffer
is empty and it needs to be flushed out.
2. a private method startRecovery() has been written which clears all
the existing buffers and counters. finally reset the state machine to
RECOVERY_STARTED state.
This ensures, the current data which is buffered and not yet written
is cleared out and subsequent poll() would call write() where the
state machine level is checked and finally the partition is recovered
succesfully and resumes writing the records to hdfs for this partition
instead of failing over temp file closure over and over again.
As of now with initial commit, a counter of 3 retries has been hard
coded which can be made as a configurable parameter (connector config)
if this gets PR gets approval.
Copy link
Contributor

@dcsobral dcsobral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think this is the wrong approach. If going this route, it's missing test cases at a minimum, and I have other concerns.

Mind you, I'm in no way connected to Confluent, but I wrote a lot of patches to KC HDFS at my last job, and this particular bug was on my list before I switched jobs.

@@ -116,6 +116,7 @@
private final ExecutorService executorService;
private final Queue<Future<Void>> hiveUpdateFutures;
private final Set<String> hivePartitions;
private int failedTempFileCloseAttempts = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should initialize with 0, or have an entirely different name. Also, it should be added below the last initialized field, before the uninitialized fields (between lines 67 and 68 in this diff).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

closeTempFile();
} catch (Exception e) {
log.error("Failed to close temp file.",e);
if (this.failedTempFileCloseAttempts < 4) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 should be configurable, and if not configurable it should be a constant instead of a literal. 4 can be the default of the configuration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if this condition is false, what then? You just captured the exception, and the code will continue as if nothing happened. As you wrote it, it should throw a new exception with e as the cause, and make sure the outer try will rethrow it instead of retrying.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 will be made a constant.
For the re throwing case, will try to move this handling to the outside try catch block and if the retries are exhausted, we will retain the default behavior of just setting the retry timeout and breaking from the loop.

log.info("Preparing recovery for the "
+ "partition attempt : {}",failedTempFileCloseAttempts);
startRecovery();
failedTempFileCloseAttempts += 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do this at the start of the exception handling, to indicate that there was a failure.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log statement will be moved to the start of the exception handling.

try {
closeTempFile();
} catch (Exception e) {
log.error("Failed to close temp file.",e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a condition that is expected to be recovered from, so the priority should be warn instead of error.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

closeTempFile();
try {
closeTempFile();
} catch (Exception e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments as before apply here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@@ -386,7 +388,19 @@ public void write() {
}
case SHOULD_ROTATE:
updateRotationTimers(currentRecord);
closeTempFile();
try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the wrong place to handle this. There's already a try/catch outside, which implements retry. This code is completely discarding retries, and handles only this case. Remember that those files are on the network, so failures can be intermittent and recoverable.

Instead, what should be done is limit the amount of retries in the outer catch, falling back recovery and/or abending.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK it would be very helpful to get more inputs for this.
Have below concerns or approaches.
ConnectException is being thrown even at places like appending wal and not just at closeTempFile.
So as of now, upon any ConnectException only the retryTimeout is updated and break is down within the outer catch block.

If this handling is moved to outer catch block, is it okay to do a recovery and returning from the write() for ConnectException occurring at all of those instances ?
If not, should we have an extra exception handling within closeTempFile() to check if there is a failure during temp file closure and use a boolean flag to indicate need for recovery only in that instance ?

Your inputs would be really helpful. Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion the recover is needed for all cases, since, presently, any non-intermittent failure will put the TPW into an infinite loop.

@@ -214,13 +214,14 @@ public void testWriterFailure() throws Exception {
tempFileName = tempFileNames.get(encodedPartition);

content = data.get(tempFileName);
assertEquals(3, content.size());
/* assertEquals(3, content.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not commit commented out code -- that's what source control is there for. But, more importantly, why is this test commented out?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests were failing, so this code was commented out to get the bug fix approach reviewed. once the failures are induced, the temp files are cleared out and the assert was failing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it is asserting that the contents will still be there after failure. I mentioned before that the code expects errors to be intermittent and recoverable. If you do let it try to recover before going through the reset, then this test should pass.

@@ -435,6 +461,26 @@ public void write() {
}
}

private void startRecovery() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are the test cases for the new code?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will need your inputs if any for this as well. Thanks in advance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class takes a WriterProvider. You can use that to pass a mock Writer that throws an exception on close. You could make it throw an exception the first time close is called, but not the second or third time, and you could make it always throw an exception, so you can test both the case where the error is recoverable, and the case where the error is not recoverable.

Likewise, you could have WriterProvider pass an exception-throwing writer the first time it is asked for a writer, but a non-exception-throwing writer the second time it's asked for a writer, to verify that the recover can, indeed, recover.

@gharris1727
Copy link
Contributor

I think this is trying to solve the same problem as #501

@cla-assistant
Copy link

cla-assistant bot commented Aug 27, 2023

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants