[FEAT] Add upload functionality to binary columns #2461

Merged: jaychia merged 13 commits into main from jay/write-bytes-s3 on Jul 8, 2024


jaychia
Contributor

@jaychia jaychia commented Jul 2, 2024

Adds support for .url.upload()
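As a rough illustration of what a per-row upload on a binary column does, here is a minimal standard-library sketch that targets a local directory. It is not Daft's implementation: the function name `upload_column`, the UUID-based file naming, and the choice to pass null rows through as null are all assumptions made for this example.

```python
import uuid
from pathlib import Path
from typing import Optional


def upload_column(values: list[Optional[bytes]], location: str) -> list[Optional[str]]:
    """Hypothetical stand-in for a per-row upload: one file per binary value."""
    root = Path(location)
    root.mkdir(parents=True, exist_ok=True)
    paths: list[Optional[str]] = []
    for value in values:
        if value is None:
            # Assumption: null rows are not written and stay null in the output.
            paths.append(None)
            continue
        # Assumption: each object gets a unique, randomly generated name.
        target = root / uuid.uuid4().hex
        target.write_bytes(value)
        paths.append(str(target))
    return paths
```

The output has the same length as the input column, so it can be attached back to the rows as a column of paths.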

When uploading 300,000 small files (about 300 KB each), network throughput peaks at about 5 Gbps on a 12 Gbps machine:
image
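To put those figures in context, here is a back-of-the-envelope calculation of the total payload and the transfer time. It assumes "300kb" means 300,000 bytes per file and that throughput is sustained at the stated rate for the whole run, which a peak figure does not guarantee; the variable names are illustrative only.

```python
FILES = 300_000
FILE_BYTES = 300 * 1000  # assumption: "300kb" means 300,000 bytes
PEAK_GBPS = 5            # observed peak from the PR description
LINK_GBPS = 12           # machine's network capacity from the PR description

total_bytes = FILES * FILE_BYTES
total_bits = total_bytes * 8


def transfer_seconds(bits: int, gbps: float) -> float:
    """Time to move `bits` at a constant rate of `gbps` gigabits per second."""
    return bits / (gbps * 1e9)


total_gb = total_bytes / 1e9                                   # 90 GB of payload
seconds_at_peak = transfer_seconds(total_bits, PEAK_GBPS)      # 144 s
seconds_at_line_rate = transfer_seconds(total_bits, LINK_GBPS) # 60 s
utilization = PEAK_GBPS / LINK_GBPS                            # about 0.42

print(f"{total_gb:.0f} GB, {seconds_at_peak:.0f} s at peak, "
      f"{seconds_at_line_rate:.0f} s at line rate, {utilization:.0%} of the link")
```

Under these assumptions the upload uses a little over 40% of the available bandwidth at its peak, which is consistent with the bottleneck being somewhere other than the network.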

Follow-ons:

  • I'm guessing we're compute-bottlenecked by data copies. In this PR we make a copy of the data when we pass the bytes::Bytes object to the AWS SDK's put request builder. More profiling needs to be done to understand issues there.
  • Implement/test GCS puts
  • Implement/test Azure puts
  • Implement/test HTTP puts

@jaychia jaychia changed the title [FEAT] Add upload functinoality to binary columns [FEAT] Add upload functionality to binary columns Jul 2, 2024
@github-actions github-actions bot added the enhancement New feature or request label Jul 2, 2024

codecov bot commented Jul 2, 2024

Codecov Report

Attention: Patch coverage is 59.65909% with 142 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@de1a9a0). Learn more about missing BASE report.
Report is 10 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2461   +/-   ##
=======================================
  Coverage        ?   63.48%           
=======================================
  Files           ?      951           
  Lines           ?   107050           
  Branches        ?        0           
=======================================
  Hits            ?    67960           
  Misses          ?    39090           
  Partials        ?        0           
Files                                       Coverage Δ
src/daft-dsl/src/functions/uri/mod.rs       100.00% <100.00%> (ø)
src/daft-io/src/object_io.rs                76.00% <ø> (ø)
daft/expressions/expressions.py             93.69% <93.75%> (ø)
src/daft-io/src/local.rs                    90.93% <88.88%> (ø)
src/daft-dsl/src/python.rs                  92.15% <86.36%> (ø)
src/daft-io/src/google_cloud.rs             0.00% <0.00%> (ø)
src/daft-io/src/http.rs                     47.71% <0.00%> (ø)
src/daft-io/src/azure_blob.rs               0.00% <0.00%> (ø)
src/daft-io/src/stats.rs                    90.36% <66.66%> (ø)
src/daft-dsl/src/functions/uri/upload.rs    60.00% <60.00%> (ø)
... and 2 more

Contributor

@desmondcheongzx desmondcheongzx left a comment

Some questions, but otherwise LGTM

src/daft-io/src/s3_like.rs (resolved)
tests/integration/io/test_files_roundtrip_s3_minio.py (outdated, resolved)
src/daft-io/src/azure_blob.rs (outdated, resolved)
src/daft-io/src/http.rs (outdated, resolved)
@jaychia jaychia merged commit 9223213 into main Jul 8, 2024
45 checks passed
@jaychia jaychia deleted the jay/write-bytes-s3 branch July 8, 2024 20:42
2 participants