-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow to specify a read buffer initial capacity when creating ByteStream from a file #1238
Allow to specify a read buffer initial capacity when creating ByteStream from a file #1238
Conversation
This is a great idea, thanks for submitting! We'll get to reviewing this as soon as we're able. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I really like the new functionality here. What I'm less sure about is the potential for a bit of API explosion. (eg. if we ever end up with another parameter on loading bodies from a path) Thinking about some other options, revolving around making a builder for PathBody
.
We would make PathBody
public, and create a builder for it to allow code like:
PathBody::with_capacity(1024).from_file(...).byte_stream()
. For "vanilla" use cases, from_path
and from_file
would still provide a direct interface.
still considering all the possible options though, let me know if you have any other thoughts.
Self::from_file_with_buffer_size(file, DEFAULT_BUFFER_SIZE).await | ||
} | ||
|
||
/// Create a ByteStream from a file, with a specific read buffer initial capacity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs should clarify the unit of the read buffer
@@ -24,19 +24,22 @@ use tokio_util::io::ReaderStream; | |||
pub struct PathBody { | |||
state: State, | |||
len: u64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we rename this to file_size
(now that buffer_size
is also there it's slightly confusing)
Agreed that just adding a function each time there's a new parameter isn't great, and makes for poor discoverability - I like the builder idea, will give it a shot! |
we also recently discussed the need to specify an offset/length into the path. Let's do a builder! |
1852649
to
ab28cfd
Compare
@rcoh Sorry for the delay, updated the PR with a builder. I used the opportunity to also accept the file size as an optional argument, for cases where it's known by the caller (no need to call the metadata in this case). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great! just some small cleanups to finalize
} | ||
let body = PathBodyBuilder::from_path(&file) | ||
.with_buffer_size(16384) | ||
// This isn't the right file length - one shouln't done that in real life |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// This isn't the right file length - one shouln't done that in real life | |
// This isn't the right file length - one shouldn't do that in real life |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
pub fn from_path(path_buf: PathBuf, file_size: u64, buffer_size: usize) -> Self { | ||
PathBody { | ||
state: State::Unloaded(path.to_path_buf()), | ||
len, | ||
state: State::Unloaded(path_buf), | ||
file_size, | ||
buffer_size, | ||
} | ||
} | ||
pub fn from_file(file: File, len: u64) -> Self { | ||
pub fn from_file(file: File, file_size: u64, buffer_size: usize) -> Self { | ||
PathBody { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these two functions don't need to be pub anymore, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe they actually need to be public to be able to call PathBodyBuilder directly (as in the example usage).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these functions are on PathBody
, not PathBodyBuilder
. In the old code, these were used directly by ByteStream
, but now that uses the builder so these don't need to be pub anymore (I also just verified this locally)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I mistakenly thought you were referring to the PathBodyBuilder functions.
|
||
impl PathBodyBuilder { | ||
/// Create a PathBodyBuilder from a path. | ||
/// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should note the default buffer size of 4096
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
self | ||
} | ||
|
||
/// Returns a [ByteStream](crate::byte_stream::ByteStream) from this builder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Returns a [ByteStream](crate::byte_stream::ByteStream) from this builder. | |
/// Returns a [`ByteStream`](crate::byte_stream::ByteStream) from this builder. |
/// If not used, [byte_stream](PathBodyBuilder::byte_stream) will require an extra call to query the file's metadata. | ||
/// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would probably rephrase this as the inverse to be a bit clearer about what this does:
By pre-specifying the length of the file, this API skips an additional call to retrieve the size from file-system metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, updated.
5f61a77
to
0a0021e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your continued iteration on this! The team got a chance to bike shed the API and this is what we came up with in terms of the UX:
// keep the current API
ByteStream::from_path("myfile.txt").await
// deprecate `ByteStream::from_file` and refer people to `read_from`
// New API `ByteStream::read_from` which returns an empty builder:
// - adds `.path` and `.file` methods to the builder
// - renames `with_buffer_size` to `buffer_size`
// - renames `with_file_size` to `file_size`
// - renames `byte_stream` to `build` (since most people will access this _from_ `ByteStream`
ByteStream::read_from().path(path).build().await?
ByteStream::read_from().file(file).build().await?
ByteStream::read_from().path(path).buffer_size(16092).build().await?
// rename to `PathBodyBuilder` to `FsBuilder` (which uses can get with `byte_stream::FsBuilder`
Thanks for your patience while we tweak this!
pub fn from_path(path_buf: PathBuf, file_size: u64, buffer_size: usize) -> Self { | ||
PathBody { | ||
state: State::Unloaded(path.to_path_buf()), | ||
len, | ||
state: State::Unloaded(path_buf), | ||
file_size, | ||
buffer_size, | ||
} | ||
} | ||
pub fn from_file(file: File, len: u64) -> Self { | ||
pub fn from_file(file: File, file_size: u64, buffer_size: usize) -> Self { | ||
PathBody { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these functions are on PathBody
, not PathBodyBuilder
. In the old code, these were used directly by ByteStream
, but now that uses the builder so these don't need to be pub anymore (I also just verified this locally)
/// Specify the length of the file to read (in bytes). | ||
/// | ||
/// By pre-specifying the length of the file, this API skips an additional call to retrieve the size from file-system metadata. | ||
/// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// |
/// | ||
/// Increasing the read buffer capacity to higher values than the default (4096 bytes) can result in a large reduction | ||
/// in CPU usage, at the cost of memory increase. | ||
/// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// |
@@ -23,20 +30,140 @@ use tokio_util::io::ReaderStream; | |||
/// 3. Provide size hint | |||
pub struct PathBody { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PathBody can also be private
…eam from a file The behaviour of the existing ByteStream::from_file / ByteStream::from_path is unchanged (using a default buffer capacity of 4k, which corresponds to Tokio's ReaderStream default buffer capacity). Using higher buffer sizes can result in a large reduction in CPU during S3 uploads, at the cost of memory increase.
This makes the distinction with the buffer size clearer
- renames `with_buffer_size` to `buffer_size` - renames `with_file_size` to `file_size` - renames `byte_stream` to `build`
c2d3b6a
to
f13845c
Compare
Updated the PR to implement this API. I like it, especially as it it's more discoverable directly from ByteStream. It does expose to 2 potential misuses though - calling neither path nor file on the FsBuilder, or calling both of them. Happy to change the implemented behaviour - I decided to panic when neither has been called (as it's probably a programmer error, it seemed better than returning an error) and favor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I just have two suggestion for docs and then I'm ready to approve and merge.
|
||
#[cfg(feature = "rt-tokio")] | ||
#[cfg_attr(docsrs, doc(cfg(feature = "rt-tokio")))] | ||
pub fn read_from() -> FsBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind adding a short doc here sending people to the FsBuilder
docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an example (can be the same one from above) would also be great.
And a small side nit: can you put this method above the two deprecated ones? it makes the docs a little nicer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment + example.
can you put this method above the two deprecated ones?
Moved it higher. To be clear, there's only one deprecated method at the method (from_file). Do you want me to deprecate from_path as well (and encourage the use ofByteStream::read_from().path(x)
instead?
rust-runtime/aws-smithy-http/src/byte_stream/bytestream_util.rs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is looking great! Thanks again!
No functional change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
CI failure is Clippy being upset |
I fixed the clippy errors (hope that's alright). Will try to get this merged today. Thanks again @tanguylebarzic ! |
Missed the clippy errors, thank you for taking care of it and merging! |
Motivation and Context
Tokio's ReaderStream::new called under the hood by ByteStream::from_file / ByteStream::from_path uses a default buffer of 4k. Users may desire to provide a different value, to trade off CPU usage against memory.
Description
This PR adds 2 variations to ByteStream,
ByteStream::from_file_with_buffer_size
andByteStream::from_path_with_buffer_size
, that allow to specify a different buffer size. For a simple app that uploads many large-ish files (65Mb) to S3, using a higher value for the buffer size (8k, 16k, 32k...), I observed a strong decrease in CPU spent (both system and user), at the cost of an increase in memory usage (some figures below).The behaviour of the existing ByteStream::from_file / ByteStream::from_path is unchanged (using a default buffer capacity of 4k, which corresponds to Tokio's ReaderStream default buffer capacity).
Testing
Uploading files to S3 from disk. Some figures from my testing:
As you can see, while there's no reduction in the time to upload these files, using higher buffer sizes reduces the CPU usage considerably (divided by 2.5x), at a cost of more memory usage (2x). In my case, I mostly care about limiting the CPU, so making this trade-off makes sense.
Checklist
CHANGELOG.next.toml
if I made changes to the AWS SDK, generated SDK code, or SDK runtime cratesBy submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.