[Native] Fix race condition that throws "Output buffers not found"#18776
Conversation
mbasmanova
left a comment
There was a problem hiding this comment.
@karteekmurthys Thank you for debugging and fixing this issue. Would you clarify what was the original error that triggered that scenario ?
There was a problem hiding this comment.
This logic is reasonable, but programming by exception is an anti pattern. A better design would be to modify PartitionedOutputBufferManager::getData API to not throw, but indicate task-not-found condition somehow else.
There was a problem hiding this comment.
Please, document this method.
Use const & for prestoTask and resultRequests parameters.
A better name might be processResultRequests.
There was a problem hiding this comment.
naming: getBufferManager -> bufferManager
Use const & for the return type or return raw pointer.
There was a problem hiding this comment.
drop facebook::
pass planFragment as const &
There was a problem hiding this comment.
Test name is quite generic, but I believe this test is verifying a particular scenario. Would you document that scenario and see if there is a more specific test name?
There was a problem hiding this comment.
This tests seems to be very tightly coupled with the TaskManager's implementation. I'm wondering if there is a way to test the desired behavior without such tight coupling.
853f95e to
d5fc9a4
Compare
|
@mbasmanova Thanks for your helpful feedback. I have tried to address them. |
|
@karteekmurthys Karteek, this PR is marked as Draft and there are CI failures. Is this ready for review? If so, would you resolve CI failures and change the status of the PR? |
9bf74d2 to
9cf0723
Compare
bc9039f to
bc1564d
Compare
|
@mbasmanova would you please review this change. |
mbasmanova
left a comment
There was a problem hiding this comment.
@karteekmurthys Karteek, thank you for working on fixing this issue. Some comments below.
There was a problem hiding this comment.
Would you document this method to explain how it handles the case when task has finished (failed) before this method was invoked?
Looks like this is a public method. I'd expect it to be private, but I assume it is used in the test. If that's the case, let's comment to clarify that this method is made public only for access from the test and should not be used by any production code.
There was a problem hiding this comment.
Will this call suffer from the same problem if buffers have been removed already?
There was a problem hiding this comment.
@mbasmanova this could hit the same issue as well.
void PartitionedOutputBufferManager::updateBroadcastOutputBuffers(
const std::string& taskId,
int numBuffers,
bool noMoreBuffers) {
getBuffer(taskId)->updateBroadcastOutputBuffers(numBuffers, noMoreBuffers);
}
There was a problem hiding this comment.
Let's file an issue to follow up and fix this.
There was a problem hiding this comment.
Would you add a TODO here and refer to the GitHub issue?
There was a problem hiding this comment.
drop 'facebook::velox::'
ditto other places
There was a problem hiding this comment.
naming like 'f2' is an anti-pattern; here, I assume we can just use 'future'
auto future = std::move(semiFuture).via(&executor).thenValue(...);
mbasmanova
left a comment
There was a problem hiding this comment.
@karteekmurthys Please, write a nice commit message following guidelines from item 5 in https://github.com/facebookincubator/velox/blob/main/CONTRIBUTING.md#code
For the release nodes in the PR description, make sure these can be understood by the users of Presto, not developers. Here are some ideas for how to describe this change.
Fix race condition that may cause the query to fail with unhelpful "Output buffers not found" error. See :pr:`18776` for details.
Thanks for the suggestion. I have to tried clean up the commit message and description. |
|
@karteekmurthys I see 2 commits now and do not see improved comment message. |
mbasmanova
left a comment
There was a problem hiding this comment.
@karteekmurthys Looks good to me % commit messages. Please, squash the commits and write a nice commit message.
There was a problem hiding this comment.
typo: fetch -> fetches
Would you explain how this method handles the case when task has already finished and there is no buffer in the POBM.
There was a problem hiding this comment.
Let's file an issue to follow up and fix this.
c4d977f to
98c5d76
Compare
aditi-pandit
left a comment
There was a problem hiding this comment.
Thanks Karteek for working this through.
There was a problem hiding this comment.
Doesn't seem like you are using this anywhere. Please remove it if so.
There was a problem hiding this comment.
There isn't a value after "Results size: ". Do you want to add "0" for number of bytes or print something more meaningful ? You could remove the phrase also.
There was a problem hiding this comment.
Do you want to update this message ?
There was a problem hiding this comment.
Should this parameter be const ?
There was a problem hiding this comment.
You could call this method getData(...) also as processResultRequests makes it more vague.
There was a problem hiding this comment.
std::boolalpha << false
Wouldn't it be simple to just include "false" string in the message?
complete: false
063fb27 to
5749763
Compare
5749763 to
4f96a8a
Compare
aditi-pandit
left a comment
There was a problem hiding this comment.
Thanks Karteek. This is looking good minus 2 nits.
There was a problem hiding this comment.
Do you want to update this message ?
There was a problem hiding this comment.
Nit : Just getData(...) should work also. ResultRequests is redundant.
There was a problem hiding this comment.
There is already a getData() method. Will keep this as for readability.
4f96a8a to
0d8f3cc
Compare
|
@mbasmanova would you please review this. The test failure is not related to this change: |
|
@karteekmurthys I have things I need to work on today and tomorrow. Will try to review this later this week. |
0d8f3cc to
324d701
Compare
mbasmanova
left a comment
There was a problem hiding this comment.
@karteekmurthys Looks good overall % a small comment below.
@spershin mentioned that this issue shows up intermittently when running e2e tests. Would you run these tests multiple times (10-20) to confirm that this issue no longer shows up?
There was a problem hiding this comment.
Remove "complete: " << false"
There was a problem hiding this comment.
Is this 'result' intended to simulate 'timeout' result? If so, timeout logic looks different. Would it make sense to extract that logic into a helper method and reuse to avoid divergence?
folly::Future<std::unique_ptr<Result>> TaskManager::getResults(
auto timeoutFn = [token]() {
auto result = std::make_unique<Result>();
result->sequence = result->nextSequence = token;
result->data = folly::IOBuf::create(0);
result->complete = false;
return result;
};
There was a problem hiding this comment.
Addressed this comment.
There was a problem hiding this comment.
Would you add a TODO here and refer to the GitHub issue?
|
@karteekmurthys Please, squash commits. |
66b499c to
7d6d97b
Compare
aditi-pandit
left a comment
There was a problem hiding this comment.
Thanks Karteek for the changes.
mbasmanova
left a comment
There was a problem hiding this comment.
@karteekmurthys Looks good to me % comment on the function name.
There was a problem hiding this comment.
Thank you for extracting a helper function. The name suggests that the method does something to respond to a timeout, but it actually just generated a timeout response (without, for example, sending it out). Consider renaming for clarity: createTimeoutResult or getTimeoutResult or makeTimeoutResult.
@karteekmurthys Have you tried that? Can you confirm that e2e tests are passing reliably now? |
7d6d97b to
2691ae1
Compare
I have run it like 5 times and failed to repro. I will try to run longer and see if it reproduces. |
tsan? |

The race condition occurs when task is terminated and its PartitionedOutputBuffer is cleared out by downstream driver.
The task manager tries to access this buffer.
Depends on this change: facebookincubator/velox#3502.
== Testing ==
Simulate a scenario where TaskManager processes result requests for a Task and the buffer is unavailable. Ensure that no exception is thrown and the future associated with the result request is invoked.
== RELEASE NOTES ==
Native Worker
3009for details.