20067: fix: race condition in SpillPool caused by buffered stream #240
base: main
Changes from all commits: ccde430, 1b8ef43, 102c121, 186b699, 6139690, 89ebeec, 21d53a2
```diff
@@ -28,7 +28,7 @@ use std::sync::Arc;

 use super::{SpillReaderStream, in_progress_spill_file::InProgressSpillFile};
 use crate::coop::cooperative;
-use crate::{common::spawn_buffered, metrics::SpillMetrics};
+use crate::metrics::SpillMetrics;

 /// The `SpillManager` is responsible for the following tasks:
 /// - Reading and writing `RecordBatch`es to raw files based on the provided configurations.

@@ -41,8 +41,6 @@ pub struct SpillManager {
     env: Arc<RuntimeEnv>,
     pub(crate) metrics: SpillMetrics,
     schema: SchemaRef,
-    /// Number of batches to buffer in memory during disk reads
-    batch_read_buffer_capacity: usize,
     /// general-purpose compression options
     pub(crate) compression: SpillCompression,
 }

@@ -53,18 +51,10 @@ impl SpillManager {
             env,
             metrics,
             schema,
-            batch_read_buffer_capacity: 2,
             compression: SpillCompression::default(),
         }
     }

-    pub fn with_batch_read_buffer_capacity(
-        mut self,
-        batch_read_buffer_capacity: usize,
-    ) -> Self {
-        self.batch_read_buffer_capacity = batch_read_buffer_capacity;
-        self
-    }
-
     pub fn with_compression_type(mut self, spill_compression: SpillCompression) -> Self {
         self.compression = spill_compression;

@@ -186,7 +176,7 @@ impl SpillManager {
             max_record_batch_memory,
         )));

-        Ok(spawn_buffered(stream, self.batch_read_buffer_capacity))
+        Ok(stream)
     }

     /// Same as `read_spill_as_stream`, but without buffering.
```
Augment AI reviewer (Severity: low): After removing …

Owner (Author) reply: value:useful; category:bug; feedback: The Augment AI reviewer is correct! After removing the buffering from …
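For context, DataFusion's `spawn_buffered` (the helper this PR stops calling) drives the source stream from a separate task into a bounded channel so the consumer reads ahead of demand. A minimal standalone sketch of that shape, using std threads and a sync channel in place of the real tokio-based stream machinery (the function name and signature here are illustrative, not the real API):

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative stand-in for a `spawn_buffered`-style wrapper: a background
// thread eagerly pulls items from the source into a bounded channel, so the
// consumer reads from the buffer instead of pulling the source on demand.
fn buffered<I>(source: I, capacity: usize) -> mpsc::IntoIter<I::Item>
where
    I: IntoIterator + Send + 'static,
    I::Item: Send + 'static,
{
    let (tx, rx) = mpsc::sync_channel(capacity);
    thread::spawn(move || {
        for item in source {
            // Stop early if the consumer hung up.
            if tx.send(item).is_err() {
                break;
            }
        }
        // `tx` dropped here: the receiver's iterator ends after draining.
    });
    rx.into_iter()
}
```

The PR's premise is that this eager-pull shape is where the race lives: the prefetch side can observe the spill source as exhausted while a concurrent writer is still appending, something a purely demand-driven reader would not do.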
```diff
@@ -1441,4 +1441,44 @@ mod tests {

         Ok(())
     }

+    #[tokio::test(flavor = "multi_thread", worker_threads = 1)]
+    async fn test_concurrent_writer_reader_race_condition() -> Result<()> {
+        // stress testing the concurncy in the reader and the reader to make sure there is now race condtion
+        // going for 100 iterations with a 5 batches per iteration
```
|
Comment on lines
+1447
to
+1448
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Typos in test comment. Line 1447: "concurncy" → "concurrency", "now" → "no", "condtion" → "condition". Proposed fix- // stress testing the concurncy in the reader and the reader to make sure there is now race condtion
- // going for 100 iterations with a 5 batches per iteration
+ // Stress testing the concurrency between the writer and the reader to make sure there is no race condition.
+ // Going for 100 iterations with 5 batches per iteration.📝 Committable suggestion
Suggested change
🤖 Prompt for AI Agents
Owner
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. value:good-to-have; category:documentation; feedback: The CodeRabbit AI reviewer is correct! There are few minor typos in the comments which need to be fixed.
Comment on lines
+1447
to
+1448
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The comment contains a few typos and could be clarified. It seems to be testing concurrency between a writer and a reader, not two readers.
Suggested change
Owner
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. value:good-to-have; category:documentation; feedback: The Gemini AI reviewer is correct! There are few minor typos in the comments which need to be fixed. |
```diff
+        const NUM_BATCHES: usize = 5;
+        const ITERATIONS: usize = 100;
+
+        for iteration in 0..ITERATIONS {
+            let (writer, mut reader) = create_spill_channel(1024 * 1024);
+
+            let writer_handle = SpawnedTask::spawn(async move {
+                for i in 0..NUM_BATCHES {
+                    let batch = create_test_batch(i as i32 * 10, 10);
+                    writer.push_batch(&batch).unwrap();
+                    tokio::task::yield_now().await;
+                }
+            });
+
+            let reader_handle = SpawnedTask::spawn(async move {
+                let mut batches_read = 0;
+                while let Some(result) = reader.next().await {
```
|
Augment AI reviewer (Severity: medium): This …

Owner (Author) reply: value:annoying; category:bug; feedback: The Augment AI reviewer is not correct! Depending on the writer being dropped is a standard way to decide to stop waiting on a read operation for all kinds of channels in Rust. Receiving a None message is the first/main reason to stop, and the last writer being dropped is the second.
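The owner's point about drop semantics can be demonstrated with a plain std channel, used here as an illustrative stand-in for the spill channel (the function and its name are a sketch, not the SpillPool API): once the last sender is dropped, the receive loop drains whatever is queued and then terminates on its own.

```rust
use std::sync::mpsc;
use std::thread;

// Sketch: the reader loop ends precisely because the writer side is dropped.
// This mirrors the standard Rust channel contract the owner describes.
fn drain_after_writer_drop(num: i32) -> Vec<i32> {
    let (tx, rx) = mpsc::channel();
    let writer = thread::spawn(move || {
        for i in 0..num {
            tx.send(i).unwrap();
        }
        // `tx` is dropped here; after the queue drains, the receiver sees
        // the channel as closed and its iterator terminates.
    });
    // `rx.iter()` yields items until the channel is both closed and empty.
    let collected: Vec<i32> = rx.iter().collect();
    writer.join().unwrap();
    collected
}
```

No batch is lost even though the writer exits before the reader finishes: close-on-drop only takes effect after the buffered items are consumed.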
```diff
+                    let _batch = result.unwrap();
+                    batches_read += 1;
+                    tokio::task::yield_now().await;
+                }
+                batches_read
+            });
+
+            writer_handle.join().await.unwrap();
+            let batches_read = reader_handle.join().await.unwrap();
+
+            assert_eq!(
+                batches_read, NUM_BATCHES,
+                "Iteration {iteration}: Expected {NUM_BATCHES} got {batches_read}."
+            );
+        }
+
+        Ok(())
+    }
+}
```
Gemini AI reviewer: Now that `read_spill_as_stream` is unbuffered, it has become functionally identical to `read_spill_as_stream_unbuffered`. To reduce code duplication and improve clarity, consider removing `read_spill_as_stream_unbuffered` and updating its call sites to use this function instead.

Owner (Author) reply: value:useful; category:bug; feedback: The Gemini AI reviewer is correct! After removing the buffering from `read_spill_as_stream()`, it is now exactly the same as `read_spill_as_stream_unbuffered()`. If the fix for the initial issue is to drop the pre-fetching, then one of these methods should be removed and all its callers should use the other. But maybe this is the wrong way to fix it. Maybe a better way is to find the reason why a reader gets an invalid EOF/None message and does not wait for more messages.