[FEAT] Streaming Local Parquet Reads #2592
Conversation
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #2592 +/- ##
=======================================
  Coverage        ?   63.63%
=======================================
  Files           ?      959
  Lines           ?   109918
  Branches        ?        0
=======================================
  Hits            ?    69943
  Misses          ?    39975
  Partials        ?        0
// Use block in place to read metadata as the current function is in an asynchronous context.
let metadata = match metadata {
    Some(m) => m,
    None => read::read_metadata(&mut reader)
This is a blocking call in an asynchronous context. However, since it is a metadata read, I'm not sure if it's worth the overhead of calling it on a blocking thread, i.e. spawn_blocking / rayon.
This should just be reading the metadata from the local filesystem, which should be pretty quick to do once.
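For reference, a minimal sketch of what offloading this onto a blocking thread could look like, assuming tokio and a parquet2-style read::read_metadata; the load_metadata helper and its path argument are illustrative, not the PR's actual code:

use std::path::PathBuf;

use parquet2::metadata::FileMetaData;
use parquet2::read;

// Hypothetical helper: run the synchronous footer read on tokio's blocking
// thread pool so the async executor thread is never stalled.
async fn load_metadata(path: PathBuf) -> anyhow::Result<FileMetaData> {
    let metadata = tokio::task::spawn_blocking(move || {
        // Reopen the file on the blocking thread; `read_metadata` performs
        // synchronous seeks and reads over the parquet footer.
        let mut file = std::fs::File::open(path)?;
        anyhow::Ok(read::read_metadata(&mut file)?)
    })
    .await??; // first `?`: task join error, second `?`: I/O or parquet error
    Ok(metadata)
}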
@@ -18,7 +18,7 @@ impl InMemorySource {

 impl Source for InMemorySource {
     #[instrument(name = "InMemorySource::get_data", level = "info", skip(self))]
-    fn get_data(&self) -> SourceStream {
+    fn get_data(&self, in_order: bool) -> SourceStream {
So I know this already existed within MultiSender, but it's not very apparent from the code what this flag actually represents. Does it mean:
A: the node does not change the output ordering
B: the data has been ordered
C: the node requires that the data is ordered
Yeah... still trying to figure out the best abstractions for the new executor, so apologies for the confusion here.
But to answer your question, it's supposed to indicate that the parent node requires that the data received is ordered.
What about using some enums and additional trait methods, just to make everything a bit more readable?
enum Ordering {
    Unordered,
    Ordered,
    Unknown,
}

trait Source {
    fn get_data(&self, ordering: &Ordering) -> Data;
}

pub trait IntermediateOperator {
    fn execute(&self, input: &Arc<MicroPartition>) -> DaftResult<Arc<MicroPartition>>;
    fn name(&self) -> &'static str;
    fn output_ordering(&self) -> Ordering;
    fn required_input_ordering(&self) -> Ordering;
}

impl MultiSender {
    fn required_input_ordering(&self) -> Ordering;
}
force-pushed from c46d2bc to a3ea37d
        // No limit, never early-terminate.
        None => futures::future::ready(Ok(true)),
    }
});
Just cleaning up some of the streaming CSV code
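Roughly the pattern at play here, as a standalone toy (the stream contents and the row counter are illustrative, not the PR's code):

use futures::stream::{self, TryStreamExt};

#[tokio::main]
async fn main() {
    let limit: Option<usize> = Some(15);
    let mut seen = 0usize;

    // Pretend each element is a chunk carrying 10 rows.
    let chunks = stream::iter(vec![Ok::<usize, std::io::Error>(10), Ok(10), Ok(10)]);

    let limited = chunks.try_take_while(move |rows| {
        let keep = match limit {
            Some(limit) => seen < limit,
            // No limit, never early-terminate.
            None => true,
        };
        seen += *rows;
        futures::future::ready(Ok(keep))
    });

    let total: usize = limited
        .try_fold(0, |acc, rows| async move { Ok(acc + rows) })
        .await
        .unwrap();
    println!("read {total} rows"); // prints 20: the third chunk is never taken
}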
        // No limit, never early-terminate.
        None => futures::future::ready(Ok(true)),
    }
});
Just cleaning up some of the streaming json code
    rg_metadata,
    schema.fields.clone(),
    Some(chunk_size),
    num_rows,
I believe this should be the num_rows we need from that row_group. Let's say you have 2 row groups of 10 rows each and we request 15 rows. The first request to the row group should be 10 rows and the second one should be 5. I believe the row_ranges has this number for each row group.
Ahh, that's an important catch, thanks!
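As a tiny standalone sketch of that split (the helper name is made up; the PR presumably takes these counts from row_ranges):

// Given each row group's row count and a global limit, compute how many rows
// to request from each group: [10, 10] with a limit of 15 yields [10, 5].
fn rows_per_row_group(row_group_sizes: &[usize], num_rows: usize) -> Vec<usize> {
    let mut remaining = num_rows;
    row_group_sizes
        .iter()
        .map(|&size| {
            let take = size.min(remaining);
            remaining -= take;
            take
        })
        .collect()
}

fn main() {
    assert_eq!(rows_per_row_group(&[10, 10], 15), vec![10, 5]);
}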
.map(|rg_range| {
    let expected_num_chunks =
        f32::ceil(rg_range.num_rows as f32 / chunk_size as f32) as usize;
    tokio::sync::mpsc::channel(expected_num_chunks)
For channels, we should use the std channels rather than the tokio ones, which are async emulations of those abstractions. For even better performance, we can use https://docs.rs/crossbeam/latest/crossbeam/channel/index.html
One of the cool things in crossbeam is that we can use it as an iterator.
https://docs.rs/crossbeam/latest/crossbeam/channel/index.html#iteration
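A minimal example of that iteration pattern, assuming the crossbeam crate (the chunk payloads are stand-ins):

use crossbeam::channel;

fn main() {
    // Bounded channel, sized like `expected_num_chunks` in the diff above.
    let (tx, rx) = channel::bounded::<usize>(4);

    std::thread::spawn(move || {
        for chunk in 0..8 {
            tx.send(chunk).unwrap(); // blocks while the buffer is full
        }
        // `tx` is dropped here, which closes the channel...
    });

    // ...so this iterator ends cleanly once every chunk is consumed.
    for chunk in rx.iter() {
        println!("got chunk {chunk}");
    }
}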
    tokio::sync::mpsc::channel(expected_num_chunks)
})
.unzip();
// Create a channel to send errors to the stream
You shouldn't need a separate channel to send errors. Normally we can just send DaftResult into the output channel.
🔥
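A toy version of the in-band error pattern being suggested here; DaftResult and Table are stand-ins for the crate's real types:

use crossbeam::channel;

type DaftResult<T> = Result<T, String>; // placeholder for the real DaftResult
struct Table; // placeholder for a real table/chunk type

fn main() {
    // One channel carries both data and errors.
    let (tx, rx) = channel::bounded::<DaftResult<Table>>(4);

    std::thread::spawn(move || {
        let _ = tx.send(Ok(Table));
        // An error travels the same channel as the data it interrupts.
        let _ = tx.send(Err("decode failed".to_string()));
    });

    for item in rx.iter() {
        match item {
            Ok(_table) => { /* hand the table to the next operator */ }
            Err(e) => {
                eprintln!("propagating error: {e}");
                break; // stop consuming on the first error
            }
        }
    }
}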
This PR implements streaming local file reads for Parquet.
Memory profiling results on Q6 (native streaming vs. Python bulk): native streaming achieves almost 2x lower memory usage.
TPCH results: overall achieves parity with the Python runner, with some exceptions like Q1 achieving a 1.75x speedup.
tpch_result.txt
Todos in follow-up PRs: