Skip to content

[Native] Fix race condition that throws "Output buffers not found"#18776

Merged
aditi-pandit merged 1 commit intoprestodb:masterfrom
karteekmurthys:output-buffer-not-found
Jan 12, 2023
Merged

[Native] Fix race condition that throws "Output buffers not found"#18776
aditi-pandit merged 1 commit intoprestodb:masterfrom
karteekmurthys:output-buffer-not-found

Conversation

@karteekmurthys
Copy link
Contributor

@karteekmurthys karteekmurthys commented Dec 7, 2022

The race condition occurs when task is terminated and its PartitionedOutputBuffer is cleared out by downstream driver.
The task manager tries to access this buffer.

Depends on this change: facebookincubator/velox#3502.

== Testing ==
Simulate a scenario where TaskManager processes result requests for a Task and the buffer is unavailable. Ensure that no exception is thrown and the future associated with the result request is invoked.

== RELEASE NOTES ==

Native Worker

  • Fix race condition that may cause the query to fail with unhelpful "Output buffers not found" error. See :issues:3009 for details.

@karteekmurthys karteekmurthys requested a review from a team as a code owner December 7, 2022 08:34
@karteekmurthys karteekmurthys self-assigned this Dec 7, 2022
@karteekmurthys karteekmurthys marked this pull request as draft December 7, 2022 08:34
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karteekmurthys Thank you for debugging and fixing this issue. Would you clarify what was the original error that triggered that scenario ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is reasonable, but programming by exception is an anti pattern. A better design would be to modify PartitionedOutputBufferManager::getData API to not throw, but indicate task-not-found condition somehow else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, document this method.

Use const & for prestoTask and resultRequests parameters.

A better name might be processResultRequests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming: getBufferManager -> bufferManager

Use const & for the return type or return raw pointer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop facebook::

pass planFragment as const &

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test name is quite generic, but I believe this test is verifying a particular scenario. Would you document that scenario and see if there is a more specific test name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tests seems to be very tightly coupled with the TaskManager's implementation. I'm wondering if there is a way to test the desired behavior without such tight coupling.

@karteekmurthys karteekmurthys force-pushed the output-buffer-not-found branch from 853f95e to d5fc9a4 Compare December 11, 2022 09:27
@karteekmurthys
Copy link
Contributor Author

@mbasmanova Thanks for your helpful feedback. I have tried to address them.

@mbasmanova
Copy link
Contributor

@karteekmurthys Karteek, this PR is marked as Draft and there are CI failures. Is this ready for review? If so, would you resolve CI failures and change the status of the PR?

@karteekmurthys karteekmurthys force-pushed the output-buffer-not-found branch 2 times, most recently from 9bf74d2 to 9cf0723 Compare December 18, 2022 06:13
@karteekmurthys karteekmurthys mentioned this pull request Dec 20, 2022
@karteekmurthys karteekmurthys force-pushed the output-buffer-not-found branch 2 times, most recently from bc9039f to bc1564d Compare December 21, 2022 02:30
@karteekmurthys karteekmurthys marked this pull request as ready for review December 21, 2022 07:34
@karteekmurthys
Copy link
Contributor Author

@mbasmanova would you please review this change.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karteekmurthys Karteek, thank you for working on fixing this issue. Some comments below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you document this method to explain how it handles the case when task has finished (failed) before this method was invoked?

Looks like this is a public method. I'd expect it to be private, but I assume it is used in the test. If that's the case, let's comment to clarify that this method is made public only for access from the test and should not be used by any production code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this call suffer from the same problem if buffers have been removed already?

Copy link
Contributor Author

@karteekmurthys karteekmurthys Dec 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova this could hit the same issue as well.

void PartitionedOutputBufferManager::updateBroadcastOutputBuffers(
    const std::string& taskId,
    int numBuffers,
    bool noMoreBuffers) {
  getBuffer(taskId)->updateBroadcastOutputBuffers(numBuffers, noMoreBuffers);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's file an issue to follow up and fix this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you add a TODO here and refer to the GitHub issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop 'facebook::velox::'

ditto other places

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming like 'f2' is an anti-pattern; here, I assume we can just use 'future'

auto future = std::move(semiFuture).via(&executor).thenValue(...);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant semi-colon

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karteekmurthys Please, write a nice commit message following guidelines from item 5 in https://github.com/facebookincubator/velox/blob/main/CONTRIBUTING.md#code

For the release nodes in the PR description, make sure these can be understood by the users of Presto, not developers. Here are some ideas for how to describe this change.

Fix race condition that may cause the query to fail with unhelpful "Output buffers not found" error. See :pr:`18776` for details.

@karteekmurthys karteekmurthys changed the title Handle exceptions after task abort in TaskManager::createOrUpdate Fix race condition that throws unhepful "Output buffers not found" error Dec 22, 2022
@karteekmurthys karteekmurthys changed the title Fix race condition that throws unhepful "Output buffers not found" error [Native] [Bug] Fix race condition that throws unhepful "Output buffers not found" error Dec 22, 2022
@karteekmurthys
Copy link
Contributor Author

@karteekmurthys Please, write a nice commit message following guidelines from item 5 in https://github.com/facebookincubator/velox/blob/main/CONTRIBUTING.md#code

For the release nodes in the PR description, make sure these can be understood by the users of Presto, not developers. Here are some ideas for how to describe this change.

Fix race condition that may cause the query to fail with unhelpful "Output buffers not found" error. See :pr:`18776` for details.

Thanks for the suggestion. I have to tried clean up the commit message and description.

@mbasmanova
Copy link
Contributor

@karteekmurthys I see 2 commits now and do not see improved comment message.

Screenshot 2022-12-22 at 9 32 05 AM

@mbasmanova mbasmanova changed the title [Native] [Bug] Fix race condition that throws unhepful "Output buffers not found" error [Native] Fix race condition that throws unhelpful "Output buffers not found" error Dec 22, 2022
@mbasmanova mbasmanova changed the title [Native] Fix race condition that throws unhelpful "Output buffers not found" error [Native] Fix race condition that throws "Output buffers not found" Dec 22, 2022
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karteekmurthys Looks good to me % commit messages. Please, squash the commits and write a nice commit message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: fetch -> fetches

Would you explain how this method handles the case when task has already finished and there is no buffer in the POBM.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's file an issue to follow up and fix this.

@karteekmurthys karteekmurthys force-pushed the output-buffer-not-found branch from c4d977f to 98c5d76 Compare December 22, 2022 18:13
Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Karteek for working this through.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't seem like you are using this anywhere. Please remove it if so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't a value after "Results size: ". Do you want to add "0" for number of bytes or print something more meaningful ? You could remove the phrase also.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to update this message ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this parameter be const ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could call this method getData(...) also as processResultRequests makes it more vague.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this diff.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::boolalpha << false

Wouldn't it be simple to just include "false" string in the message?

complete: false

@karteekmurthys karteekmurthys force-pushed the output-buffer-not-found branch 2 times, most recently from 063fb27 to 5749763 Compare January 3, 2023 17:54
@karteekmurthys karteekmurthys force-pushed the output-buffer-not-found branch from 5749763 to 4f96a8a Compare January 3, 2023 19:46
Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Karteek. This is looking good minus 2 nits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to update this message ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : Just getData(...) should work also. ResultRequests is redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a getData() method. Will keep this as for readability.

@karteekmurthys karteekmurthys force-pushed the output-buffer-not-found branch from 4f96a8a to 0d8f3cc Compare January 9, 2023 17:20
@karteekmurthys
Copy link
Contributor Author

@mbasmanova would you please review this.

The test failure is not related to this change:

Error:  Failures: 
Error:    TestQueues.testQueuedQueryInteraction » ThreadTimeout Method com.facebook.presto.execution.TestQueues.testQueuedQueryInteraction() didn't finish within the time-out 60000

@mbasmanova
Copy link
Contributor

@karteekmurthys I have things I need to work on today and tomorrow. Will try to review this later this week.

@karteekmurthys karteekmurthys force-pushed the output-buffer-not-found branch from 0d8f3cc to 324d701 Compare January 9, 2023 22:29
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karteekmurthys Looks good overall % a small comment below.

@spershin mentioned that this issue shows up intermittently when running e2e tests. Would you run these tests multiple times (10-20) to confirm that this issue no longer shows up?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove "complete: " << false"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this 'result' intended to simulate 'timeout' result? If so, timeout logic looks different. Would it make sense to extract that logic into a helper method and reuse to avoid divergence?

folly::Future<std::unique_ptr<Result>> TaskManager::getResults(

  auto timeoutFn = [token]() {
    auto result = std::make_unique<Result>();
    result->sequence = result->nextSequence = token;
    result->data = folly::IOBuf::create(0);
    result->complete = false;
    return result;
  };

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed this comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you add a TODO here and refer to the GitHub issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice note.

@mbasmanova
Copy link
Contributor

@karteekmurthys Please, squash commits.

@karteekmurthys karteekmurthys force-pushed the output-buffer-not-found branch 2 times, most recently from 66b499c to 7d6d97b Compare January 10, 2023 19:02
Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Karteek for the changes.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karteekmurthys Looks good to me % comment on the function name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for extracting a helper function. The name suggests that the method does something to respond to a timeout, but it actually just generated a timeout response (without, for example, sending it out). Consider renaming for clarity: createTimeoutResult or getTimeoutResult or makeTimeoutResult.

@mbasmanova
Copy link
Contributor

@spershin mentioned that this issue shows up intermittently when running e2e tests. Would you run these tests multiple times (10-20) to confirm that this issue no longer shows up?

@karteekmurthys Have you tried that? Can you confirm that e2e tests are passing reliably now?

@karteekmurthys karteekmurthys force-pushed the output-buffer-not-found branch from 7d6d97b to 2691ae1 Compare January 11, 2023 23:08
@aditi-pandit aditi-pandit merged commit 8e6d0e2 into prestodb:master Jan 12, 2023
@karteekmurthys
Copy link
Contributor Author

@spershin mentioned that this issue shows up intermittently when running e2e tests. Would you run these tests multiple times (10-20) to confirm that this issue no longer shows up?

@karteekmurthys Have you tried that? Can you confirm that e2e tests are passing reliably now?

I have run it like 5 times and failed to repro. I will try to run longer and see if it reproduces.

@kaikalur
Copy link
Contributor

@spershin mentioned that this issue shows up intermittently when running e2e tests. Would you run these tests multiple times (10-20) to confirm that this issue no longer shows up?

@karteekmurthys Have you tried that? Can you confirm that e2e tests are passing reliably now?

I have run it like 5 times and failed to repro. I will try to run longer and see if it reproduces.

tsan?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants