feat: Support multi-threaded asynchronous data upload to object storage. #14472

Open
weixiuli wants to merge 9 commits into facebookincubator:main from weixiuli:multi-threaded-s3FileSystem

Conversation

@weixiuli

@weixiuli weixiuli commented Aug 14, 2025

#14471
Support multi-threaded asynchronous data upload to object storage.
image

This PR also adds a benchmark named S3AsyncUploadBenchmark. Its output is shown below:

============================================================================
[...]/benchmark/S3AsyncUploadBenchmark.cpp     relative  time/iter   iters/s
============================================================================
sync_upload_4M                                             48.82ms     20.48
async_upload_4M                                 108.32%    45.07ms     22.19
sync_upload_8M                                             79.81ms     12.53
async_upload_8M                                 104.41%    76.44ms     13.08
sync_upload_16M                                           138.74ms      7.21
async_upload_16M                                117.11%   118.47ms      8.44
sync_upload_32M                                           260.98ms      3.83
async_upload_32M                                170.03%   153.49ms      6.52
sync_upload_64M                                           509.86ms      1.96
async_upload_64M                                247.14%   206.30ms      4.85
sync_upload_128M                                          998.46ms      1.00
async_upload_128M                               284.82%   350.56ms      2.85
sync_upload_256M                                             2.00s   499.29m
async_upload_256M                               316.53%   632.76ms      1.58
sync_upload_512M                                             4.01s   249.50m
async_upload_512M                               339.35%      1.18s   846.68m
sync_upload_1024M                                            7.84s   127.62m
async_upload_1024M                              342.67%      2.29s   437.32m
sync_upload_2048M                                           16.40s    60.99m
async_upload_2048M                              332.71%      4.93s   202.93m

We ran this PR with Spark + Gluten + Velox (incubator-gluten) and set hive.s3.uploadPartAsync to true in our environment; the average write performance improved by 85%.

Before this PR: Total Time Across All Tasks: 64.3 h
image

After this PR: Total Time Across All Tasks: 32.4 h
image

@weixiuli weixiuli requested a review from majetideepak as a code owner August 14, 2025 09:51
@netlify

netlify bot commented Aug 14, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 413eb73
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/68ca60f85c4ee100089ac2f7

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 14, 2025
@weixiuli weixiuli force-pushed the multi-threaded-s3FileSystem branch 3 times, most recently from 0120710 to 8cba9e9 Compare August 14, 2025 11:06
@jinchengchenghh
Collaborator

What's the config when you got this performance?

> and the average write performance improved by 85%.

@weixiuli
Author

> What's the config when you got this performance?
>
> and the average write performance improved by 85%.

We set hive.s3.uploadPartAsync to true in our environment.

@JkSelf
Collaborator

JkSelf commented Aug 20, 2025

@weixiuli Thanks for your optimization.
Could you please add a benchmark to your code change to demonstrate the performance improvement?

@weixiuli
Author

@JkSelf @jinchengchenghh Thanks for your review, PTAL.

@weixiuli weixiuli force-pushed the multi-threaded-s3FileSystem branch 2 times, most recently from 2d6a94e to 9456784 Compare August 26, 2025 01:32
Collaborator

@JkSelf JkSelf left a comment

Thanks for your work. Leaving some comments.

@JkSelf
Collaborator

JkSelf commented Aug 26, 2025

@weixiuli Please help to resolve the conflict. Thanks.

@weixiuli weixiuli force-pushed the multi-threaded-s3FileSystem branch from 9456784 to 61dcd92 Compare August 28, 2025 02:50
@weixiuli weixiuli changed the title Support multi-threaded asynchronous data upload to object storage. feat:Support multi-threaded asynchronous data upload to object storage. Aug 28, 2025
@weixiuli weixiuli force-pushed the multi-threaded-s3FileSystem branch 2 times, most recently from 2a07c32 to a1217ff Compare September 1, 2025 02:06
@weixiuli
Author

weixiuli commented Sep 1, 2025

@JkSelf @jinchengchenghh @majetideepak Could you help to review this PR? Thanks.

@weixiuli weixiuli requested a review from JkSelf September 1, 2025 09:37
@FelixYBW

FelixYBW commented Sep 3, 2025

@pedroerp could you help to review the PR? @weixiuli confirmed the PR can boost performance in production environment.

@weixiuli weixiuli force-pushed the multi-threaded-s3FileSystem branch from a1217ff to c697ff5 Compare September 3, 2025 13:09
@weixiuli
Author

weixiuli commented Sep 8, 2025

@pedroerp could you help to review the PR? thanks.

Collaborator

@JkSelf JkSelf left a comment

@weixiuli Thanks for your fix. Left two comments.

Collaborator

@majetideepak majetideepak left a comment

@weixiuli Did you evaluate the S3 Transfer Manager?

@weixiuli
Author

> @weixiuli Did you evaluate the S3 Transfer Manager?

Yes. Velox could bypass the local disk and use TransferManager directly for parallel uploads to S3, but that requires implementing a custom sink connector that feeds in-memory serialized Velox data into TransferManager's multipart upload API. That design is more complex, whereas this PR achieves asynchronous upload with a small memory buffer, fewer modifications, and a significant performance improvement.
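
To make the "small memory buffer" idea concrete, here is a minimal sketch of buffering writes into fixed-size parts, where only the last part may be shorter. It is not the PR's code: `PartBuffer`, its callback, and the part size are hypothetical, and the part size is assumed to be greater than zero.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical sketch: accumulates written bytes and hands out fixed-size
// parts. Assumes partSize > 0.
class PartBuffer {
 public:
  using PartCallback = std::function<void(std::string part, bool isLast)>;

  PartBuffer(std::size_t partSize, PartCallback onPart)
      : partSize_(partSize), onPart_(std::move(onPart)) {
    buffer_.reserve(partSize_);
  }

  // Copies data into the buffer and emits every part that becomes full.
  void append(std::string_view data) {
    while (!data.empty()) {
      const std::size_t room = partSize_ - buffer_.size();
      const std::size_t n = data.size() < room ? data.size() : room;
      buffer_.append(data.data(), n);
      data.remove_prefix(n);
      if (buffer_.size() == partSize_) {
        emit(/*isLast=*/false);
      }
    }
  }

  // Emits the remaining bytes as the last part. It may be shorter than
  // partSize, or empty if the data ended exactly on a part boundary.
  void close() {
    emit(/*isLast=*/true);
  }

 private:
  void emit(bool isLast) {
    std::string part;
    part.swap(buffer_);
    buffer_.reserve(partSize_);
    onPart_(std::move(part), isLast);
  }

  std::size_t partSize_;
  PartCallback onPart_;
  std::string buffer_;
};
```

In an asynchronous design, the callback would hand each full part to an upload task instead of uploading it inline, so the writer only ever holds a bounded number of parts in memory.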

@weixiuli weixiuli force-pushed the multi-threaded-s3FileSystem branch from 5576747 to c4daa54 Compare September 11, 2025 06:57

// Upload the part asynchronously.
void uploadPartAsync(const std::string_view part) {
maxConcurrentUploadNum_->wait();
Collaborator

What is the purpose of this wait?

Author

maxConcurrentUploadNum_ is a semaphore and controls the concurrency of asynchronous uploads to
S3 for each S3WriteFile, preventing excessive memory usage.
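
A minimal sketch of the pattern described here, assuming the producer takes a permit before handing off each part and the upload task returns it when done. The `Semaphore` class and `runBoundedUploads` are hypothetical stand-ins built on the standard library, not the semaphore type or upload code used in the PR:

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore (stand-in for the PR's semaphore type).
class Semaphore {
 public:
  explicit Semaphore(std::size_t permits) : permits_(permits) {}

  // Blocks until a permit is available, then takes it.
  void wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return permits_ > 0; });
    --permits_;
  }

  // Returns a permit and wakes one waiter.
  void post() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      ++permits_;
    }
    cv_.notify_one();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::size_t permits_;
};

// Starts numParts fake uploads, never more than maxInFlight at once, and
// returns the highest number of uploads observed running concurrently.
std::size_t runBoundedUploads(std::size_t numParts, std::size_t maxInFlight) {
  Semaphore inFlight(maxInFlight);
  std::atomic<std::size_t> running{0};
  std::atomic<std::size_t> peak{0};
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < numParts; ++i) {
    // The producer blocks here, so the number of buffered parts stays bounded.
    inFlight.wait();
    workers.emplace_back([&] {
      const std::size_t now = ++running;
      std::size_t seen = peak.load();
      while (seen < now && !peak.compare_exchange_weak(seen, now)) {
      }
      // Fake upload.
      std::this_thread::sleep_for(std::chrono::milliseconds(5));
      --running;
      // Release the permit once the part has been uploaded.
      inFlight.post();
    });
  }
  for (auto& worker : workers) {
    worker.join();
  }
  return peak.load();
}
```

Because each in-flight upload holds one part's buffer, capping the permits also caps the memory held by pending parts for that writer.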

Collaborator

I don't understand the control flow here. The futures are run during close(). Don't we end up adding more tasks than maxConcurrentUploadNum_ at this point?

Author

> I don't understand the control flow here. The futures are run during close(). Don't we end up adding more tasks than maxConcurrentUploadNum_ at this point?

@majetideepak maxConcurrentUploadNum_ only limits the concurrency of asynchronous uploads within each S3WriteFile, to prevent excessive memory usage. If multiple S3WriteFile instances exist, more upload tasks can run concurrently, but the total should be less than or equal to hive.s3.upload-threads.

DEFINE_BENCHMARKS(2048)
} // namespace

int main(int argc, char** argv) {
Collaborator

What is the output of this benchmark?

Author

The benchmark output is the same as in the PR description.

@weixiuli
Author

cc @majetideepak

Collaborator

@JkSelf JkSelf left a comment

Thanks for the update. Leaving some nits.

s3Config->uploadThreads(),
std::make_shared<folly::NamedThreadFactory>("upload-thread"));
}
} else {
Collaborator

Could we remove the else branch since we've already initialized it to null in the parameter definition?

Author

Since uploadThreadPool_ is declared as a static member of S3WriteFile::Impl, we need to add an else branch to handle the case where hive.s3.upload-part-async changes from true to false in benchmarks and unit tests. In this branch, we should reset uploadThreadPool_, while keeping maxConcurrentUploadNum_ unchanged (i.e., no reset needed).

Collaborator

Since tests can run in parallel, this can lead to non-deterministic failures.
Since uploadThreadPool_ is static across different filesystem instances, we have a race condition here.
Can we keep it non-static?
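
One way to read this request: if each file system instance owns its upload pool, an instance created with async upload disabled simply has no pool, and nothing shared gets reset. A minimal sketch under that assumption; `UploadPool`, `UploadOptions`, and `FileSystemSketch` are hypothetical stand-ins, not Velox or folly types:

```cpp
#include <cstddef>
#include <memory>

// Stand-in for a real executor such as a thread pool (hypothetical type).
struct UploadPool {
  explicit UploadPool(std::size_t threads) : numThreads(threads) {}
  std::size_t numThreads;
};

// Hypothetical options mirroring the two settings discussed in this thread.
struct UploadOptions {
  bool uploadPartAsync{false};
  std::size_t uploadThreads{0};
};

class FileSystemSketch {
 public:
  explicit FileSystemSketch(const UploadOptions& options) {
    // The pool is an instance member, so constructing another file system
    // with async upload disabled cannot reset a pool this instance is using.
    if (options.uploadPartAsync) {
      uploadPool_ = std::make_shared<UploadPool>(options.uploadThreads);
    }
  }

  // Null when async upload is disabled; no else branch is needed to reset it.
  std::shared_ptr<UploadPool> uploadPool() const {
    return uploadPool_;
  }

 private:
  std::shared_ptr<UploadPool> uploadPool_;
};
```

With this shape, tests that toggle the async setting can run in parallel, since each one constructs its own instance and no static state is mutated.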

kRetryMode,
kUseProxyFromEnv,
kCredentialsProvider,
KUploadPartAsync,
Collaborator

KUploadPartAsync -> KPartUploadAsync, matching the kPartUploadSize

return config_.find(Keys::kCredentialsProvider)->second;
}

bool uploadPartAsync() const {
Collaborator

ditto

@weixiuli weixiuli force-pushed the multi-threaded-s3FileSystem branch from 6589956 to 2faf0f4 Compare September 16, 2025 09:50
Collaborator

@JkSelf JkSelf left a comment

LGTM. Thanks for your fix.

@weixiuli
Author

Thanks @JkSelf cc @jinchengchenghh @majetideepak

@weixiuli
Author

@majetideepak @jinchengchenghh could you help to review the PR again? thanks.

// Only the last part can be less than partUploadSize_.
VELOX_CHECK(isLast || (!isLast && (part.size() == partUploadSize_)));
auto uploadPartSync = [&](const std::string_view partData) {
Aws::S3::Model::CompletedPart completedPart =
Collaborator

use auto?


/// and creates the object.
/// https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
/// https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/s3fs.cc
/// S3WriteFile is not thread-safe.
Collaborator

How did you resolve this issue? Thanks.

Author

> How did you resolve this issue? Thanks.

This PR optimizes the internal processing logic of S3WriteFile. We guard uploadState_.completedParts with uploadStateMutex_ and ensure that the list of parts is in ascending order when the file is closed. With these changes, the internal processing of S3WriteFile is thread-safe.

@stale

stale bot commented Jan 8, 2026

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@stale stale bot added the stale label Jan 8, 2026
@stale stale bot closed this Jan 22, 2026
@FelixYBW

Using the SF2500 TPC-DS tables, performance improved by 10%, and CPU utilization improved noticeably:
Without the PR
image
With the PR:
image

@majetideepak
Collaborator

majetideepak commented Jan 27, 2026

@FelixYBW Some of the review comments have not been addressed.
I also think we should probably use the existing connectorIoExecutor instead of creating another executor pool per WriteFile.

@majetideepak majetideepak reopened this Jan 27, 2026
@stale stale bot removed the stale label Jan 27, 2026
@zhztheplayer
Collaborator

@weixiuli Would you rebase the PR? I am picking it back for internal testing.
