
[FEAT] Streaming Local Parquet Reads #2592

Merged
colin-ho merged 9 commits into main from colin/streaming-parquet on Aug 3, 2024

Conversation

colin-ho
Contributor

@colin-ho colin-ho commented Jul 31, 2024

This PR implements streaming local file reads for parquet.

Memory profiling results on Q6 (native streaming vs. Python bulk): native streaming achieves almost 2x lower peak memory.
[Screenshots: memory profiles of the native streaming and Python bulk runs]

TPC-H results: overall achieves parity with the Python runner, with some exceptions such as Q1, which achieves a 1.75x speedup.
tpch_result.txt

To-dos for follow-up PRs:

  • Metadata only reads
  • Remote parquet reads

@github-actions github-actions bot added the enhancement New feature or request label Jul 31, 2024

codecov bot commented Jul 31, 2024

Codecov Report

Attention: Patch coverage is 0% with 394 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@ddabd34). Learn more about missing BASE report.
Report is 4 commits behind head on main.

Files Patch % Lines
src/daft-parquet/src/stream_reader.rs 0.00% 205 Missing ⚠️
src/daft-parquet/src/read.rs 0.00% 130 Missing ⚠️
src/daft-micropartition/src/micropartition.rs 0.00% 53 Missing ⚠️
src/daft-parquet/src/file.rs 0.00% 5 Missing ⚠️
src/daft-physical-plan/src/translate.rs 0.00% 1 Missing ⚠️
Additional details and impacted files


@@           Coverage Diff           @@
##             main    #2592   +/-   ##
=======================================
  Coverage        ?   63.63%           
=======================================
  Files           ?      959           
  Lines           ?   109918           
  Branches        ?        0           
=======================================
  Hits            ?    69943           
  Misses          ?    39975           
  Partials        ?        0           
Files Coverage Δ
src/daft-parquet/src/lib.rs 50.00% <ø> (ø)
src/daft-physical-plan/src/translate.rs 0.00% <0.00%> (ø)
src/daft-parquet/src/file.rs 65.83% <0.00%> (ø)
src/daft-micropartition/src/micropartition.rs 76.35% <0.00%> (ø)
src/daft-parquet/src/read.rs 56.88% <0.00%> (ø)
src/daft-parquet/src/stream_reader.rs 49.19% <0.00%> (ø)

// Use block in place to read metadata as the current function is in an asynchronous context.
let metadata = match metadata {
Some(m) => m,
None => read::read_metadata(&mut reader)
Contributor Author

This is a blocking call in an asynchronous context. However, since it is a metadata read, I'm not sure it's worth the overhead of moving it onto a blocking thread, i.e. spawn_blocking / rayon.

Member

This should just be reading the metadata from the local filesystem, which should be pretty quick to do once.

@desmondcheongzx desmondcheongzx self-requested a review August 1, 2024 00:03
@@ -18,7 +18,7 @@ impl InMemorySource {

 impl Source for InMemorySource {
     #[instrument(name = "InMemorySource::get_data", level = "info", skip(self))]
-    fn get_data(&self) -> SourceStream {
+    fn get_data(&self, in_order: bool) -> SourceStream {
Collaborator

I know this already existed within MultiSender, but it's not very apparent from the code what this flag actually represents.

Does it mean:

A: the node does not change the output ordering
B: the data has been ordered
C: the node requires that the data is ordered

Contributor Author

Yeah, still trying to figure out the best abstractions for the new executor, so apologies for the confusion here.

To answer your question: it's supposed to indicate that the parent node requires that the data it receives is ordered.

Collaborator

What about using some enums and additional trait methods, just to make everything a bit more readable?

enum Ordering {
  Unordered,
  Ordered,
  Unknown
}


trait Source {
    fn get_data(&self, ordering: &Ordering) -> Data;
}

pub trait IntermediateOperator {
    fn execute(&self, input: &Arc<MicroPartition>) -> DaftResult<Arc<MicroPartition>>;
    fn name(&self) -> &'static str;
    fn output_ordering(&self) -> Ordering;
    fn required_input_ordering(&self) -> Ordering;
}


impl MultiSender {
  fn required_input_ordering(&self) -> Ordering;
}
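To make the suggestion above concrete, here is a minimal, compilable sketch of the proposed `Ordering` enum with one hypothetical operator implementing the trait methods. `LimitOperator` and its ordering choices are illustrative assumptions, not Daft's actual operators.

```rust
// Sketch of the proposed ordering abstraction; names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ordering {
    Unordered,
    Ordered,
    Unknown,
}

trait IntermediateOperator {
    fn name(&self) -> &'static str;
    // What ordering this node produces on its output.
    fn output_ordering(&self) -> Ordering;
    // What ordering this node requires from its input.
    fn required_input_ordering(&self) -> Ordering;
}

// Hypothetical example: a LIMIT needs ordered input so that "first N rows"
// is deterministic, but makes no claim about its own output ordering.
struct LimitOperator;

impl IntermediateOperator for LimitOperator {
    fn name(&self) -> &'static str {
        "limit"
    }
    fn output_ordering(&self) -> Ordering {
        Ordering::Unknown
    }
    fn required_input_ordering(&self) -> Ordering {
        Ordering::Ordered
    }
}

fn main() {
    let op = LimitOperator;
    println!("{} requires {:?} input", op.name(), op.required_input_ordering());
}
```

Separating "what I produce" from "what I require" answers all three readings (A, B, C) of the boolean flag explicitly, which is the readability win the enum buys.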

src/daft-micropartition/src/micropartition.rs — 5 outdated review threads (resolved)
// No limit, never early-terminate.
None => futures::future::ready(Ok(true)),
}
});
Contributor Author

Just cleaning up some of the streaming CSV code

// No limit, never early-terminate.
None => futures::future::ready(Ok(true)),
}
});
Contributor Author

Just cleaning up some of the streaming json code

rg_metadata,
schema.fields.clone(),
Some(chunk_size),
num_rows,
Member

I believe this should be the num_rows we need from that row group. Say you have 2 row groups of 10 rows each and we request 15 rows: the first request to the row group should be for 10 rows and the second for 5.

I believe row_ranges has this number for each row group.

Contributor Author

Ah, that's an important catch, thanks!
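The per-row-group split the reviewer describes can be sketched as a small helper: walk the row groups in order and hand each one at most the rows still remaining. `rows_per_group` and `row_group_sizes` are illustrative names, not Daft's actual API.

```rust
// Split a global row request across row groups in order.
// E.g. groups of [10, 10] rows with a request for 15 rows -> [10, 5].
fn rows_per_group(row_group_sizes: &[usize], num_rows: usize) -> Vec<usize> {
    let mut remaining = num_rows;
    row_group_sizes
        .iter()
        .map(|&size| {
            let take = size.min(remaining);
            remaining -= take;
            take
        })
        .collect()
}

fn main() {
    println!("{:?}", rows_per_group(&[10, 10], 15));
}
```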

.map(|rg_range| {
let expected_num_chunks =
f32::ceil(rg_range.num_rows as f32 / chunk_size as f32) as usize;
tokio::sync::mpsc::channel(expected_num_chunks)
Member

For channels, we should use the std channels rather than the tokio ones, which are async emulations of those abstractions.

For even better performance, we can use https://docs.rs/crossbeam/latest/crossbeam/channel/index.html

Member

One of the cool things in crossbeam is that we can use it as an iterator.
https://docs.rs/crossbeam/latest/crossbeam/channel/index.html#iteration
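A minimal sketch of the suggestion, using std's bounded `sync_channel` (crossbeam's `bounded` channel is a drop-in, faster alternative with the same iterator ergonomics). The channel is sized to `ceil(num_rows / chunk_size)` as in the snippet above; `stream_chunks` and the chunk payloads are illustrative.

```rust
use std::sync::mpsc;
use std::thread;

// Produce `num_rows` rows in `chunk_size`-row chunks through a bounded
// channel sized to the expected chunk count, and drain it on this thread.
fn stream_chunks(num_rows: usize, chunk_size: usize) -> usize {
    // Same ceiling computation as the snippet above.
    let expected_num_chunks = (num_rows + chunk_size - 1) / chunk_size;
    let (tx, rx) = mpsc::sync_channel::<Vec<u64>>(expected_num_chunks);

    let producer = thread::spawn(move || {
        let mut sent = 0;
        while sent < num_rows {
            let n = chunk_size.min(num_rows - sent);
            // In the real reader each message would be a decoded column chunk.
            tx.send(vec![0; n]).unwrap();
            sent += n;
        }
        // `tx` drops here, closing the channel and ending `rx.iter()` below.
    });

    // Receivers are iterable (std and crossbeam alike); no async runtime needed.
    let total_rows: usize = rx.iter().map(|chunk| chunk.len()).sum();
    producer.join().unwrap();
    total_rows
}

fn main() {
    println!("received {} rows", stream_chunks(10, 3));
}
```

Because the channel capacity equals the number of messages, the producer never blocks even if the consumer lags.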

tokio::sync::mpsc::channel(expected_num_chunks)
})
.unzip();
// Create a channel to send errors to the stream
Member

You shouldn't need a separate channel to send errors. Normally we can just send DaftResult into the output channel.
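A sketch of the single-channel approach: `Result` values travel through the same channel as data, so the consumer sees errors in-band and can stop. `ChunkError` and `demo` are hypothetical stand-ins for `DaftError` and the reader's plumbing.

```rust
use std::sync::mpsc;
use std::thread;

#[derive(Debug)]
struct ChunkError(String);

// Drain the output channel, counting good chunks and stopping at the
// first error, which arrives in-band rather than on a side channel.
fn drain(rx: mpsc::Receiver<Result<Vec<u64>, ChunkError>>) -> (usize, Option<ChunkError>) {
    let mut ok_chunks = 0;
    for msg in rx.iter() {
        match msg {
            Ok(_) => ok_chunks += 1,
            Err(e) => return (ok_chunks, Some(e)),
        }
    }
    (ok_chunks, None)
}

fn demo() -> (usize, bool) {
    let (tx, rx) = mpsc::sync_channel::<Result<Vec<u64>, ChunkError>>(4);
    thread::spawn(move || {
        tx.send(Ok(vec![1, 2, 3])).unwrap();
        // The error travels through the same channel as the data.
        tx.send(Err(ChunkError("corrupt page".into()))).unwrap();
    });
    let (ok_chunks, err) = drain(rx);
    (ok_chunks, err.is_some())
}

fn main() {
    let (ok_chunks, saw_error) = demo();
    println!("{ok_chunks} good chunks, error: {saw_error}");
}
```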

@colin-ho colin-ho requested a review from samster25 August 2, 2024 23:29
Member

@samster25 samster25 left a comment

🔥

@colin-ho colin-ho merged commit b616031 into main Aug 3, 2024
44 checks passed
@colin-ho colin-ho deleted the colin/streaming-parquet branch August 3, 2024 00:43
Labels
enhancement New feature or request
3 participants