-
Notifications
You must be signed in to change notification settings - Fork 297
perf: Improve source buffering #4451
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4451 +/- ##
==========================================
- Coverage 78.25% 76.20% -2.06%
==========================================
Files 846 846
Lines 115257 118828 +3571
==========================================
+ Hits 90196 90554 +358
- Misses 25061 28274 +3213
🚀 New features to boost your workflow:
|
@@ -227,6 +227,28 @@ pub fn get_compute_pool_num_threads() -> usize { | |||
get_or_init_compute_runtime_num_worker_threads() | |||
} | |||
|
|||
// Helper function to combine a stream with a future that returns a result | |||
pub fn combine_stream<T, E>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this out because it's a pretty common pattern used in parquet + warc, now local execution and also in the future distributed execution (which i didn't touch for now).
In the future we can have a common/async utilities crate that has all of the async utils but for now i just put it in runtime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Had a couple of questions for learning sake, will let Desmond approve cause of the IO changes.
@@ -599,6 +598,7 @@ pub async fn local_parquet_stream( | |||
|
|||
let stream_of_streams = | |||
futures::stream::iter(output_receivers.into_iter().map(ReceiverStream::new)); | |||
let parquet_task = async move { parquet_task.await? }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whats the point of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The combine_stream
method accepts an impl<future<Result<(), E>>>
but the parquet task is of type RuntimeTask<DaftResult<()>>
, of which it's future type is impl<future<DaftResult<DaftResult<()>>>>
.
This is me trying to fold the DaftResult<DaftResult<()>>
into a DaftResult<()>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh the ?
gets pushed to the outer function, not the async block. Otherwise async blocks are not like try blocks in nightly. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait i can just do parquet_task.map(|x| x?)
https://docs.rs/futures-util/latest/futures_util/future/trait.FutureExt.html
@@ -120,7 +120,7 @@ impl PipelineNode for SourceNode { | |||
let (destination_sender, destination_receiver) = create_channel(0); | |||
let counting_sender = | |||
CountingSender::new(destination_sender, self.runtime_stats.clone(), progress_bar); | |||
runtime_handle.spawn( | |||
runtime_handle.spawn_local( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my understanding, does this force all spawned tasks to execute on the same thread?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh nvm, just a rename, looks like it always used spawn_local under the hood
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noice, thanks!
Changes Made
Use spawned tasks to forward the scan task stream instead of using flatten unordered, which internally uses
futuresunordered
and doesn't allow the runtime to poll the streams concurrently.Improves performance of reading many small files, such as in the script here: https://gist.github.com/metadaddy/ec9e645fa0929321b626d8be6e11162e
TLDR:
Total record count: 65.28 seconds -> 20.22 seconds
Record count for 2025-03-31: 6.58 seconds -> 2.95 seconds
Capacity: 5.87 seconds -> 2.58 seconds
Top 10 most common drive models: 48.27 seconds -> 25.40 seconds
Before:
After:
Related Issues
Checklist
docs/mkdocs.yml
navigation