-
Notifications
You must be signed in to change notification settings - Fork 729
test: simplify IO tests #5228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: simplify IO tests #5228
Changes from 8 commits
df5678b
d1bbd5d
3b21d37
e399444
5fb6d64
a7546e0
9c988cd
3542f68
1a8cc1d
eac2b74
94a5088
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| // SPDX-License-Identifier: Apache-2.0 | ||
| // SPDX-FileCopyrightText: Copyright The Lance Authors | ||
|
|
||
| //! IO statistics tracking for dataset operations | ||
|
|
||
| use pyo3::{pyclass, pymethods}; | ||
|
|
||
| /// IO statistics for dataset operations | ||
| /// | ||
| /// This tracks the number of IO operations and bytes transferred for read and write | ||
| /// operations performed on the dataset's object store. | ||
| /// | ||
| /// Note: Calling `io_stats()` returns the statistics accumulated since the last call | ||
| /// and resets the internal counters (incremental stats pattern). | ||
| #[pyclass(name = "IOStats", module = "_lib", get_all)] | ||
| #[derive(Clone, Debug)] | ||
| pub struct IoStats { | ||
|
Comment on lines
+8
to
+17
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It would be nice if we could include these docs in mkdocs. I'll make an issue to figure that out. Then we wouldn't need the lengthy |
||
| /// Number of read IO operations performed | ||
| pub read_iops: u64, | ||
| /// Total bytes read from storage | ||
| pub read_bytes: u64, | ||
| /// Number of write IO operations performed | ||
| pub write_iops: u64, | ||
| /// Total bytes written to storage | ||
| pub write_bytes: u64, | ||
| /// Number of disjoint periods where at least one IO is in-flight | ||
| /// | ||
| /// This metric helps understand IO parallelism. A lower number indicates | ||
| /// more parallel IO operations. | ||
| pub num_hops: u64, | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. calling these hops/network hops isn't clear to me. Would this be network hops behind the S3 endpoint? That's how I would interpret it but I don't think we have that info
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is this just network requests?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe I should give an example in the comment. Imagine a process:
That's a total of 12 requests (list, 10 heads, 1 get). But we do them in 3 "hops". Maybe that's not the best term. Could do "stages" or something else.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see. Makes me think of dependency chains - not sure if that's a helpful term.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree that "hops" is not quite the correct term (though I am used to our usage of it in this way from the Rust unit tests). I also am not aware of any standard term. Is this something we want to expose? Should we mention it is likely meaningless if there are concurrent operations in flight? Or that it can be a somewhat noisy metric?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I'll rename it for |
||
| } | ||
|
|
||
| #[pymethods] | ||
| impl IoStats { | ||
| fn __repr__(&self) -> String { | ||
| format!( | ||
| "IOStats(read_iops={}, read_bytes={}, write_iops={}, write_bytes={}, num_hops={})", | ||
| self.read_iops, self.read_bytes, self.write_iops, self.write_bytes, self.num_hops | ||
| ) | ||
| } | ||
| } | ||
|
|
||
| impl IoStats { | ||
| /// Convert from Lance's internal IoStats type | ||
| pub fn from_lance(stats: lance_io::utils::tracking_store::IoStats) -> Self { | ||
| Self { | ||
| read_iops: stats.read_iops, | ||
| read_bytes: stats.read_bytes, | ||
| write_iops: stats.write_iops, | ||
| write_bytes: stats.write_bytes, | ||
| num_hops: stats.num_hops, | ||
| } | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -26,6 +26,7 @@ use tracing::instrument; | |
|
|
||
| use crate::object_store::DEFAULT_LOCAL_IO_PARALLELISM; | ||
| use crate::traits::{Reader, Writer}; | ||
| use crate::utils::tracking_store::IOTracker; | ||
|
|
||
| /// Convert an [`object_store::path::Path`] to a [`std::path::Path`]. | ||
| pub fn to_local_path(path: &Path) -> String { | ||
|
|
@@ -86,6 +87,9 @@ pub struct LocalObjectReader { | |
|
|
||
| /// Block size, in bytes. | ||
| block_size: usize, | ||
|
|
||
| /// IO tracker for monitoring read operations. | ||
| io_tracker: Option<Arc<IOTracker>>, | ||
| } | ||
|
|
||
| impl DeepSizeOf for LocalObjectReader { | ||
|
|
@@ -107,11 +111,24 @@ impl LocalObjectReader { | |
| } | ||
|
|
||
| /// Open a local object reader, with default prefetch size. | ||
| /// | ||
| /// For backward compatibility with existing code that doesn't need tracking. | ||
| #[instrument(level = "debug")] | ||
| pub async fn open( | ||
| path: &Path, | ||
| block_size: usize, | ||
| known_size: Option<usize>, | ||
| ) -> Result<Box<dyn Reader>> { | ||
| Self::open_with_tracker(path, block_size, known_size, None).await | ||
| } | ||
|
|
||
| /// Open a local object reader with optional IO tracking. | ||
| #[instrument(level = "debug")] | ||
| pub(crate) async fn open_with_tracker( | ||
| path: &Path, | ||
| block_size: usize, | ||
| known_size: Option<usize>, | ||
| io_tracker: Option<Arc<IOTracker>>, | ||
| ) -> Result<Box<dyn Reader>> { | ||
| let path = path.clone(); | ||
| let local_path = to_local_path(&path); | ||
|
|
@@ -129,6 +146,7 @@ impl LocalObjectReader { | |
| block_size, | ||
| size, | ||
| path, | ||
| io_tracker, | ||
| }) as Box<dyn Reader>) | ||
| }) | ||
| .await? | ||
|
|
@@ -171,7 +189,12 @@ impl Reader for LocalObjectReader { | |
| #[instrument(level = "debug", skip(self))] | ||
| async fn get_range(&self, range: Range<usize>) -> object_store::Result<Bytes> { | ||
| let file = self.file.clone(); | ||
| tokio::task::spawn_blocking(move || { | ||
| let io_tracker = self.io_tracker.clone(); | ||
| let path = self.path.clone(); | ||
| let num_bytes = range.len() as u64; | ||
| let range_u64 = (range.start as u64)..(range.end as u64); | ||
|
|
||
| let result = tokio::task::spawn_blocking(move || { | ||
| let mut buf = BytesMut::with_capacity(range.len()); | ||
| // Safety: `buf` is set with appropriate capacity above. It is | ||
| // written to below and we check all data is initialized at that point. | ||
|
|
@@ -187,14 +210,24 @@ impl Reader for LocalObjectReader { | |
| .map_err(|err: std::io::Error| object_store::Error::Generic { | ||
| store: "LocalFileSystem", | ||
| source: err.into(), | ||
| }) | ||
| }); | ||
|
|
||
| // Record the read operation if tracking is enabled | ||
| if let (Ok(_), Some(tracker)) = (&result, io_tracker.as_ref()) { | ||
| tracker.record_read("get_range", path, num_bytes, Some(range_u64)); | ||
| } | ||
|
|
||
| result | ||
| } | ||
|
|
||
| /// Reads the entire file. | ||
| #[instrument(level = "debug", skip(self))] | ||
| async fn get_all(&self) -> object_store::Result<Bytes> { | ||
| let mut file = self.file.clone(); | ||
| tokio::task::spawn_blocking(move || { | ||
| let io_tracker = self.io_tracker.clone(); | ||
| let path = self.path.clone(); | ||
|
|
||
| let result = tokio::task::spawn_blocking(move || { | ||
| let mut buf = Vec::new(); | ||
| file.read_to_end(buf.as_mut())?; | ||
| Ok(Bytes::from(buf)) | ||
|
|
@@ -203,7 +236,14 @@ impl Reader for LocalObjectReader { | |
| .map_err(|err: std::io::Error| object_store::Error::Generic { | ||
| store: "LocalFileSystem", | ||
| source: err.into(), | ||
| }) | ||
| }); | ||
|
|
||
| // Record the read operation if tracking is enabled | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is the optionality required? Elsewhere in the PR it seems to indicate it'll always be enabled
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same question.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can remove it. There is a constructor that doesn't pass a tracker down (used for tests I think). But I can just make it create an empty stats instance and record unconditionally. That can simplify some code. |
||
| if let (Ok(bytes), Some(tracker)) = (&result, io_tracker.as_ref()) { | ||
| tracker.record_read("get_all", path, bytes.len() as u64, None); | ||
| } | ||
|
|
||
| result | ||
| } | ||
| } | ||
|
|
||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.