[FEAT] Remote parquet streaming #2620

Merged: 7 commits, Aug 9, 2024

@colin-ho colin-ho commented Aug 5, 2024

Adds streaming reads for remote Parquet files.

The algorithm mirrors the one used for local Parquet files: read bytes into memory -> get an Arrow chunk iterator -> emit one table per chunk.
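
As a rough illustration of that pipeline, here is a minimal sketch using only the standard library. It is not the PR's implementation: `Table`, `ChunkIter`, `stream_tables`, and `encode` are hypothetical names, and decoding fixed-width little-endian `i32` values stands in for the real Arrow chunk iterator over Parquet pages.

```rust
/// Hypothetical stand-in for a table: a single column of decoded values.
#[derive(Debug, PartialEq)]
struct Table {
    values: Vec<i32>,
}

/// Hypothetical stand-in for the Arrow chunk iterator: decodes little-endian
/// i32 values from an in-memory buffer, `chunk_rows` rows at a time.
struct ChunkIter {
    bytes: Vec<u8>,
    offset: usize,
    chunk_rows: usize,
}

impl Iterator for ChunkIter {
    type Item = Table;

    fn next(&mut self) -> Option<Table> {
        // Stop once fewer than 4 bytes (one whole value) remain.
        if self.bytes.len() - self.offset < 4 {
            return None;
        }
        let end = (self.offset + self.chunk_rows * 4).min(self.bytes.len());
        let values: Vec<i32> = self.bytes[self.offset..end]
            .chunks_exact(4)
            .map(|b| i32::from_le_bytes([b[0], b[1], b[2], b[3]]))
            .collect();
        self.offset = end;
        // Emit one table per chunk instead of one table for the whole buffer.
        Some(Table { values })
    }
}

/// Step 1 (bytes already in memory) -> step 2 (chunk iterator).
fn stream_tables(bytes: Vec<u8>, chunk_rows: usize) -> ChunkIter {
    assert!(chunk_rows > 0, "chunk_rows must be positive");
    ChunkIter { bytes, offset: 0, chunk_rows }
}

/// Test helper: encodes values the way `ChunkIter` expects to read them.
fn encode(values: &[i32]) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(values.len() * 4);
    for v in values {
        bytes.extend_from_slice(&v.to_le_bytes());
    }
    bytes
}
```

A consumer can process each `Table` as it is yielded and drop it before pulling the next one, so only one chunk of decoded rows needs to be alive at a time.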

Q6 memory profile:
[image: Screenshot 2024-08-07 at 12 38 18 PM]
Streaming
[image: Screenshot 2024-08-07 at 12 38 41 PM]
Bulk

@github-actions github-actions bot added the enhancement New feature or request label Aug 5, 2024
@colin-ho colin-ho marked this pull request as ready for review August 6, 2024 16:28
@colin-ho colin-ho requested a review from samster25 August 6, 2024 16:31

let mut arr_iters = Vec::with_capacity(self.arrow_schema.fields.len());
for field in self.arrow_schema.fields.iter() {
    let filtered_cols_idx = columns
Member

It looks like this may be computed for every row group. Could we compute it once and reuse it?

Contributor Author

@colin-ho colin-ho Aug 7, 2024

Will they always be the same for all row groups?

Member

I believe so, but you should double check!

Contributor Author

OK, I managed to refactor it so that it no longer needs the indices: it just retrieves the column metadata directly for each field in each row group, which should be unique.
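
A minimal sketch of what a lookup by field rather than by index could look like. All names here (`ColumnChunkMeta`, `RowGroupMeta`, `columns_for_field`, `needed_byte_ranges`, and their fields) are hypothetical and are not taken from the PR or from parquet2; the sketch assumes each column chunk records the top-level field it belongs to.

```rust
use std::ops::Range;

/// Hypothetical metadata for one column chunk; field names are illustrative.
#[derive(Debug, PartialEq)]
struct ColumnChunkMeta {
    /// Name of the top-level schema field this leaf column belongs to.
    root_field: String,
    /// Byte offset of the chunk within the file.
    start: u64,
    /// Length of the chunk in bytes.
    len: u64,
}

/// Hypothetical metadata for one row group.
struct RowGroupMeta {
    columns: Vec<ColumnChunkMeta>,
}

/// Returns the column chunks of `rg` that belong to `field`, looked up by
/// name so that no precomputed column indices are needed.
fn columns_for_field<'a>(rg: &'a RowGroupMeta, field: &str) -> Vec<&'a ColumnChunkMeta> {
    rg.columns.iter().filter(|c| c.root_field == field).collect()
}

/// Collects the byte ranges to fetch for the requested fields in one row group.
fn needed_byte_ranges(rg: &RowGroupMeta, fields: &[&str]) -> Vec<Range<u64>> {
    let mut ranges = Vec::new();
    for field in fields {
        for col in columns_for_field(rg, field) {
            ranges.push(col.start..col.start + col.len);
        }
    }
    ranges
}
```

Because the lookup runs against each row group's own metadata, it does not rely on column positions being identical across row groups, which was the open question in this thread.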


let mut range_readers = Vec::with_capacity(filtered_cols_idx.len());
for range in needed_byte_ranges.into_iter() {
    let range_reader = ranges.get_range_reader(range).await?;
Member

This blocks until the first IO request completes, so you may want to skip the await here and call join_all once you have all the range_readers.
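
To make the difference concrete, here is a standard-library-only sketch that logs when each request starts and finishes. Everything in it is an assumption for illustration: `FakeRequest` stands in for a ranged IO request, `JoinAll` is a hand-rolled minimal join (not the `join_all` from any real crate), and `block_on` is a busy-polling toy executor. Driving each request to completion inside the loop plays the role of `.await` in the loop body.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Shared event log used to observe start/finish order.
type Log = Rc<RefCell<Vec<String>>>;

/// Hypothetical ranged request: logs "start" on its first poll, reports
/// Pending once (as if waiting on the network), then logs "done".
struct FakeRequest {
    id: usize,
    started: bool,
    log: Log,
}

impl FakeRequest {
    fn new(id: usize, log: &Log) -> Self {
        FakeRequest { id, started: false, log: log.clone() }
    }
}

impl Future for FakeRequest {
    type Output = usize;

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<usize> {
        if !self.started {
            self.started = true;
            self.log.borrow_mut().push(format!("start {}", self.id));
            Poll::Pending
        } else {
            self.log.borrow_mut().push(format!("done {}", self.id));
            Poll::Ready(self.id)
        }
    }
}

/// Hand-rolled minimal join: every unfinished request is polled on each pass.
struct JoinAll {
    requests: Vec<FakeRequest>,
    results: Vec<Option<usize>>,
}

impl JoinAll {
    fn new(requests: Vec<FakeRequest>) -> Self {
        let results = vec![None; requests.len()];
        JoinAll { requests, results }
    }
}

impl Future for JoinAll {
    type Output = Vec<usize>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Vec<usize>> {
        let this = &mut *self;
        for (req, slot) in this.requests.iter_mut().zip(this.results.iter_mut()) {
            if slot.is_none() {
                if let Poll::Ready(v) = Pin::new(req).poll(cx) {
                    *slot = Some(v);
                }
            }
        }
        if this.results.iter().all(|r| r.is_some()) {
            Poll::Ready(this.results.iter().map(|r| r.unwrap()).collect())
        } else {
            Poll::Pending
        }
    }
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor that busy-polls; adequate for this demo, not for real IO.
fn block_on<F: Future + Unpin>(mut fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = Pin::new(&mut fut).poll(&mut cx) {
            return out;
        }
    }
}

/// Analogue of awaiting inside the loop: request i+1 is not started until
/// request i has finished.
fn fetch_sequentially(n: usize, log: &Log) -> Vec<usize> {
    (0..n).map(|id| block_on(FakeRequest::new(id, log))).collect()
}

/// Analogue of joining afterwards: create every request first, then drive
/// them together so all of them are in flight at once.
fn fetch_concurrently(n: usize, log: &Log) -> Vec<usize> {
    let requests = (0..n).map(|id| FakeRequest::new(id, log)).collect();
    block_on(JoinAll::new(requests))
}
```

With two requests, the sequential version logs start 0, done 0, start 1, done 1, while the joined version logs start 0, start 1, done 0, done 1. Both return results in request order.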

.collect::<Vec<_>>();

let mut range_readers = Vec::with_capacity(filtered_cols_idx.len());
for range in needed_byte_ranges.into_iter() {
Member

Instead of collecting range_readers into a Vec, you can probably do this in one loop that produces decompressed_iters, ptypes, num_values, etc.
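
A minimal sketch of the single-loop shape, under stated assumptions: `ColumnMeta`, `ColumnInputs`, `open_decompressed`, and `build_inputs` are hypothetical names, and copying the bytes stands in for opening a range reader and decompressing pages. Only the output names `decompressed_iters`, `ptypes`, and `num_values` come from the comment above.

```rust
/// Hypothetical per-column metadata; field names are illustrative only.
struct ColumnMeta {
    ptype: &'static str,
    num_values: usize,
    bytes: Vec<u8>,
}

/// Hypothetical stand-in for "open a range reader and decompress its pages".
/// Here it only copies the bytes; no real decompression happens.
fn open_decompressed(col: &ColumnMeta) -> std::vec::IntoIter<u8> {
    col.bytes.clone().into_iter()
}

/// Parallel per-column outputs, all built in the same pass.
struct ColumnInputs {
    decompressed_iters: Vec<std::vec::IntoIter<u8>>,
    ptypes: Vec<&'static str>,
    num_values: Vec<usize>,
}

/// Builds every per-column output in one loop, with no intermediate Vec of
/// readers that a second loop would have to walk again.
fn build_inputs(cols: &[ColumnMeta]) -> ColumnInputs {
    let mut out = ColumnInputs {
        decompressed_iters: Vec::with_capacity(cols.len()),
        ptypes: Vec::with_capacity(cols.len()),
        num_values: Vec::with_capacity(cols.len()),
    };
    for col in cols {
        out.decompressed_iters.push(open_decompressed(col));
        out.ptypes.push(col.ptype);
        out.num_values.push(col.num_values);
    }
    out
}
```

Pushing into all three vectors in the same iteration keeps them index-aligned by construction.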

codecov bot commented Aug 7, 2024

Codecov Report

Attention: Patch coverage is 84.01487% with 43 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@cac155e). Learn more about missing BASE report.
Report is 7 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| src/daft-parquet/src/stream_reader.rs | 58.42% | 37 Missing ⚠️ |
| src/daft-parquet/src/file.rs | 95.65% | 6 Missing ⚠️ |

@@           Coverage Diff           @@
##             main    #2620   +/-   ##
=======================================
  Coverage        ?   63.88%           
=======================================
  Files           ?      965           
  Lines           ?   111230           
  Branches        ?        0           
=======================================
  Hits            ?    71060           
  Misses          ?    40170           
  Partials        ?        0           
| Files | Coverage Δ |
|---|---|
| src/daft-parquet/src/read.rs | 69.58% <100.00%> (ø) |
| src/parquet2/src/read/page/stream.rs | 36.89% <100.00%> (ø) |
| src/daft-parquet/src/file.rs | 73.01% <95.65%> (ø) |
| src/daft-parquet/src/stream_reader.rs | 56.95% <58.42%> (ø) |

@colin-ho colin-ho requested a review from samster25 August 7, 2024 20:21
@samster25 samster25 merged commit 48632c6 into main Aug 9, 2024
48 checks passed
@samster25 samster25 deleted the colin/remote-parquet-streaming branch August 9, 2024 21:43
2 participants