Skip to content

Do not throw from PartitionedOutputBufferManager::getData#3502

Closed
karteekmurthys wants to merge 4 commits intofacebookincubator:mainfrom
Ahana-Inc:output-buffer-not-found
Closed

Do not throw from PartitionedOutputBufferManager::getData#3502
karteekmurthys wants to merge 4 commits intofacebookincubator:mainfrom
Ahana-Inc:output-buffer-not-found

Conversation

@karteekmurthys
Copy link
Copy Markdown
Contributor

@karteekmurthys karteekmurthys commented Dec 13, 2022

A task being executed by multiple drivers, and if one of the drivers catches an
exception, can terminate the task. This termination will remove any
PartitionedOutputBuffer assigned to that taskId. In a parallel thread, the task
manager may try to access this buffer leading to "Output buffers not found"
exception. This exception is valid, but instead of the exception encountered by
the Driver trickling up to the user, the "Output buffers not found" exception
may sometimes get thrown to the user.

Instead of throwing exception, the getData() will return
a bool indicating whether a matching buffer for a given task id was found or not.

See #3009 for more context.

@karteekmurthys karteekmurthys self-assigned this Dec 13, 2022
@netlify
Copy link
Copy Markdown

netlify bot commented Dec 13, 2022

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 29b779f
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/639cc5c531a0f10009f78d09

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 13, 2022
@karteekmurthys karteekmurthys force-pushed the output-buffer-not-found branch 2 times, most recently from c93639f to baadf75 Compare December 13, 2022 06:34
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karteekmurthys

There are multiple places that call getBuffer. It is not clear why this particular place should handle the exception and others should not. Would you clarify?

Similar to the other PR, we should avoid programming-by-exception pattern and instead introduce a method getBufferIfExists that return nullptr if buffer doesn't exist. This method can then be used whenever the caller wants to handle missing buffer (i.e. deleteResults and maybe here).

The method documentation needs to be updated to describe the new behavior and a test needs to be added.

Copy link
Copy Markdown
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per comments above

@karteekmurthys karteekmurthys force-pushed the output-buffer-not-found branch from cb8096f to 8634093 Compare December 15, 2022 07:41
@karteekmurthys karteekmurthys force-pushed the output-buffer-not-found branch from 8634093 to 9fd12d4 Compare December 15, 2022 07:46
Copy link
Copy Markdown
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karteekmurthys Karteek, thank you for iterating on this PR. The PR description explains the problem well. Thank you for taking the time to write it up.

This is misleading and must be handled correctly.

Agree, but would you elaborate on what does it mean to "handled correctly"? In other words, would you update the description to explain the solution implemented here?

Some comments below.

I'm seeing PartitionedOutputBufferManagerTest crashing. Please, take a look.

// Retrieves the set of buffers for a query.
std::shared_ptr<PartitionedOutputBuffer> getBuffer(const std::string& taskId);

std::shared_ptr<PartitionedOutputBuffer> getBufferIfExists(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice method. Please, add empty line before this method and document it. Let's use this method in PartitionedOutputBufferManager::deleteResults as well.

@@ -265,7 +284,8 @@ class PartitionedOutputBufferManager {
private:
// Retrieves the set of buffers for a query.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's clarify that this method throws is buffer for the specified task does not exist.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we should clarify that this method throws when task is not found.

// Retrieves the set of buffers for the specified task. Throws buffers-no-found exception if not found.

// BufferManager::acknowledge.
using DataAvailableCallback = std::function<
void(std::vector<std::unique_ptr<folly::IOBuf>> pages, int64_t sequence)>;
enum GetDataStatus { SUCCESS, ERR_BUFFER_NOT_FOUND };
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
std::vector<std::unique_ptr<folly::IOBuf>> pages;
auto param = std::make_unique<DataAvailableCallbackParam>(
std::move(pages), sequence, ERR_BUFFER_NOT_FOUND);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be simpler to just return 'false' to indicate that buffers for the task do not exist and therefore the 'notify' callback cannot be queued. See my other comment.

}
}

TEST_F(PartitionedOutputBufferManagerTest, getDataOnTaskWithNoBuffer) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps, rename to getDataOnFailedTask and add comments to explain the scenario this test is verifying.

@karteekmurthys karteekmurthys force-pushed the output-buffer-not-found branch from 7444ed7 to e19b3d6 Compare December 16, 2022 17:55
@karteekmurthys karteekmurthys force-pushed the output-buffer-not-found branch from e19b3d6 to 28f295c Compare December 16, 2022 17:57
@facebook-github-bot
Copy link
Copy Markdown
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Copy Markdown
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karteekmurthys Looks good to me % a few small comments and PR description needing an update to explain the solution implemented here.

}
return it->second;
});
auto buffer = getBufferIfExists(taskId);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a cleaner way to write this is

if (auto buffer = getBufferIfExists(taskId)) {
   buffer->deleteResults(destination);
}

int64_t sequence,
DataAvailableCallback notify) {
getBuffer(taskId)->getData(destination, maxBytes, sequence, notify);
auto buffer = getBufferIfExists(taskId);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

// Retrieves the set of buffers for a query.
std::shared_ptr<PartitionedOutputBuffer> getBuffer(const std::string& taskId);

// Retrieves the set of buffers for a query if exists. If taskId is not found,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query -> task

"If taskId is not found," -> "Returns null if task not found."

@@ -265,7 +284,8 @@ class PartitionedOutputBufferManager {
private:
// Retrieves the set of buffers for a query.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we should clarify that this method throws when task is not found.

// Retrieves the set of buffers for the specified task. Throws buffers-no-found exception if not found.

}
}

TEST_F(PartitionedOutputBufferManagerTest, getDataOnFailedTask) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's update existing calls to getData to assert that they return true.

1,
[&notified](
std::vector<std::unique_ptr<folly::IOBuf>> pages, int64_t sequence) {
notified = true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's simply this using VELOX_UNREACHABLE();

// manager or was removed by a parallel thread must return false. The `notify`
// callback must not be registered.
bool notified = false;
auto ret = bufferManager_->getData(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ret -> found

Copy link
Copy Markdown
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karteekmurthys Thank you, Karteek.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mbasmanova mbasmanova changed the title PartitionedOutputBufferManager::getData notify caller on exception Do not throw from PartitionedOutputBufferManager::getData Dec 16, 2022
@facebook-github-bot
Copy link
Copy Markdown
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@mbasmanova merged this pull request in a91b227.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants