Allow to specify a read buffer initial capacity when creating ByteStream from a file #1238

Merged

tanguylebarzic
Contributor

@tanguylebarzic tanguylebarzic commented Mar 4, 2022

Motivation and Context

Tokio's ReaderStream::new called under the hood by ByteStream::from_file / ByteStream::from_path uses a default buffer of 4k. Users may desire to provide a different value, to trade off CPU usage against memory.

Description

This PR adds 2 variants to ByteStream, ByteStream::from_file_with_buffer_size and ByteStream::from_path_with_buffer_size, that allow specifying a different buffer size. For a simple app that uploads many large-ish files (65Mb) to S3, using a higher value for the buffer size (8k, 16k, 32k...) gave a strong decrease in CPU time (both system and user), at the cost of an increase in memory usage (some figures below).

The behaviour of the existing ByteStream::from_file / ByteStream::from_path is unchanged (using a default buffer capacity of 4k, which corresponds to Tokio's ReaderStream default buffer capacity).

Testing

Uploading files to S3 from disk. Some figures from my testing:

| Buffer size (bytes) | Wall time (seconds) | User CPU time (seconds) | System CPU time (seconds) | Total CPU (seconds) | Max memory (Mb) |
|---|---|---|---|---|---|
| 4096 (default) | 32.7 | 28.14 | 22.3 | 50.44 | 47 |
| 8196 | 33.06 | 14.44 | 11.69 | 26.13 | 97 |
| 16392 | 32.57 | 14.51 | 12.36 | 26.87 | 89 |
| 32784 | 32.78 | 11.39 | 8.93 | 20.32 | 103 |

As you can see, while there's no reduction in the time to upload these files, larger buffer sizes reduce CPU usage considerably (total CPU time drops from 50.44 to 20.32 seconds, roughly 2.5x), at the cost of higher memory usage (roughly 2x). In my case, I mostly care about limiting CPU, so this trade-off makes sense.
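One plausible contributor to the CPU difference is simply the number of chunks: a larger buffer means fewer reads, and so fewer per-chunk costs. The std-only sketch below counts the reads needed to drain an in-memory source at different buffer sizes. `count_reads` is a hypothetical helper written for illustration; it is not part of the SDK and does not reproduce the measurements above.

```rust
use std::io::{Cursor, Read};

/// Reads `data` to the end through a buffer of `buffer_size` bytes and
/// returns how many non-empty reads were needed.
/// (Hypothetical helper, for illustration only.)
fn count_reads(data: &[u8], buffer_size: usize) -> usize {
    let mut reader = Cursor::new(data);
    let mut buf = vec![0u8; buffer_size];
    let mut reads = 0;
    loop {
        let n = reader
            .read(&mut buf)
            .expect("reading from an in-memory cursor does not fail");
        if n == 0 {
            // End of data (or a zero-sized buffer): stop counting.
            break;
        }
        reads += 1;
    }
    reads
}

fn main() {
    let data = vec![0u8; 1_000_000];
    for size in [4096usize, 8192, 16384, 32768] {
        println!("{:>6} bytes -> {} reads", size, count_reads(&data, size));
    }
}
```

For 1,000,000 bytes this takes 245 reads with a 4096-byte buffer and 31 reads with a 32768-byte buffer. Each buffer is also proportionally larger, which is consistent with the memory increase reported above.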

Checklist

  • I have updated CHANGELOG.next.toml if I made changes to the AWS SDK, generated SDK code, or SDK runtime crates

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@tanguylebarzic tanguylebarzic requested a review from a team as a code owner March 4, 2022 22:34
@Velfi
Contributor

Velfi commented Mar 7, 2022

This is a great idea, thanks for submitting! We'll get to reviewing this as soon as we're able.

@Velfi Velfi requested a review from rcoh March 7, 2022 17:52
Collaborator

@rcoh rcoh left a comment

Overall, I really like the new functionality here. What I'm less sure about is the potential for a bit of API explosion. (eg. if we ever end up with another parameter on loading bodies from a path) Thinking about some other options, revolving around making a builder for PathBody.

We would make PathBody public, and create a builder for it to allow code like:

PathBody::with_capacity(1024).from_file(...).byte_stream(). For "vanilla" use cases, from_path and from_file would still provide a direct interface.

still considering all the possible options though, let me know if you have any other thoughts.

Self::from_file_with_buffer_size(file, DEFAULT_BUFFER_SIZE).await
}

/// Create a ByteStream from a file, with a specific read buffer initial capacity.
Collaborator

docs should clarify the unit of the read buffer

@@ -24,19 +24,22 @@ use tokio_util::io::ReaderStream;
pub struct PathBody {
state: State,
len: u64,
Collaborator

can we rename this to file_size (now that buffer_size is also there it's slightly confusing)

@tanguylebarzic
Contributor Author

> Overall, I really like the new functionality here. What I'm less sure about is the potential for a bit of API explosion. (eg. if we ever end up with another parameter on loading bodies from a path) Thinking about some other options, revolving around making a builder for PathBody.
>
> We would make PathBody public, and create a builder for it to allow code like:
>
> PathBody::with_capacity(1024).from_file(...).byte_stream(). For "vanilla" use cases, from_path and from_file would still provide a direct interface.
>
> still considering all the possible options though, let me know if you have any other thoughts.

Agreed that just adding a function each time there's a new parameter isn't great, and makes for poor discoverability - I like the builder idea, will give it a shot!

@rcoh
Collaborator

rcoh commented Mar 25, 2022

we also recently discussed the need to specify an offset/length into the path. Let's do a builder!

@tanguylebarzic tanguylebarzic force-pushed the tanguy.lebarzic/file-reader-capacity branch from 1852649 to ab28cfd Compare March 27, 2022 20:11
@tanguylebarzic
Contributor Author

@rcoh Sorry for the delay, updated the PR with a builder. I also took the opportunity to accept the file size as an optional argument, for cases where the caller already knows it (no need to query the file's metadata in that case).
Not a fan of the naming (PathBodyBuilder). I also hesitated to add a function from_builder(builder: PathBodyBuilder) in ByteStream, to ease discovery (and have it appear along with from_path and from_file in the doc), but ended up just adding a comment to from_path.

Collaborator

@rcoh rcoh left a comment

This is looking great! just some small cleanups to finalize

}
let body = PathBodyBuilder::from_path(&file)
.with_buffer_size(16384)
// This isn't the right file length - one shouln't done that in real life
Collaborator

Suggested change
// This isn't the right file length - one shouln't done that in real life
// This isn't the right file length - one shouldn't do that in real life

Contributor Author

Fixed

Comment on lines 38 to 46
+pub fn from_path(path_buf: PathBuf, file_size: u64, buffer_size: usize) -> Self {
     PathBody {
-        state: State::Unloaded(path.to_path_buf()),
-        len,
+        state: State::Unloaded(path_buf),
+        file_size,
+        buffer_size,
     }
 }
-pub fn from_file(file: File, len: u64) -> Self {
+pub fn from_file(file: File, file_size: u64, buffer_size: usize) -> Self {
     PathBody {
Collaborator

these two functions don't need to be pub anymore, right?

Contributor Author

I believe they actually need to be public to be able to call PathBodyBuilder directly (as in the example usage).

Collaborator

these functions are on PathBody, not PathBodyBuilder. In the old code, these were used directly by ByteStream, but now that uses the builder so these don't need to be pub anymore (I also just verified this locally)

Contributor Author

You are right, I mistakenly thought you were referring to the PathBodyBuilder functions.


impl PathBodyBuilder {
/// Create a PathBodyBuilder from a path.
///
Collaborator

we should note the default buffer size of 4096

Contributor Author

Done

self
}

/// Returns a [ByteStream](crate::byte_stream::ByteStream) from this builder.
Collaborator

Suggested change
/// Returns a [ByteStream](crate::byte_stream::ByteStream) from this builder.
/// Returns a [`ByteStream`](crate::byte_stream::ByteStream) from this builder.

Comment on lines 90 to 114
/// If not used, [byte_stream](PathBodyBuilder::byte_stream) will require an extra call to query the file's metadata.
///
Collaborator

would probably rephrase this as the inverse to be a bit clearer about what this does:

By pre-specifying the length of the file, this API skips an additional call to retrieve the size from file-system metadata.

Contributor Author

Agreed, updated.
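The behaviour described in this thread can be sketched with the standard library alone: when the caller supplies the length, no metadata lookup is needed. `resolve_len` is a hypothetical helper used for illustration here, not a function from this PR.

```rust
use std::fs::File;
use std::io::{self, Write};

/// Returns the length to use for the body: the caller-supplied value if
/// present, otherwise the size reported by file-system metadata.
/// (Hypothetical helper, not part of aws-smithy-http.)
fn resolve_len(file: &File, file_size: Option<u64>) -> io::Result<u64> {
    match file_size {
        // The caller already knows the size: no extra metadata call.
        Some(len) => Ok(len),
        // Otherwise fall back to querying the file system.
        None => Ok(file.metadata()?.len()),
    }
}

fn main() -> io::Result<()> {
    // Create a small temporary file to query.
    let path = std::env::temp_dir().join("resolve_len_sketch.bin");
    File::create(&path)?.write_all(&[0u8; 10])?;
    {
        let file = File::open(&path)?;
        println!("from metadata: {}", resolve_len(&file, None)?);
        println!("pre-specified: {}", resolve_len(&file, Some(10))?);
    }
    std::fs::remove_file(&path)?;
    Ok(())
}
```

Note that the sketch trusts the caller: a wrong pre-specified length is returned as-is, which is why the test in this PR carries a warning comment about passing an incorrect file length.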

@tanguylebarzic tanguylebarzic force-pushed the tanguy.lebarzic/file-reader-capacity branch from 5f61a77 to 0a0021e Compare March 31, 2022 12:22
Collaborator

@rcoh rcoh left a comment

Thanks for your continued iteration on this! The team got a chance to bike shed the API and this is what we came up with in terms of the UX:

// keep the current API
ByteStream::from_path("myfile.txt").await

// deprecate `ByteStream::from_file` and refer people to `read_from`

// New API `ByteStream::read_from` which returns an empty builder:
// - adds `.path` and `.file` methods to the builder
// - renames `with_buffer_size` to `buffer_size`
// - renames `with_file_size` to `file_size`
// - renames `byte_stream` to `build` (since most people will access this _from_ `ByteStream`)
ByteStream::read_from().path(path).build().await?
ByteStream::read_from().file(file).build().await?
ByteStream::read_from().path(path).buffer_size(16092).build().await?

// rename `PathBodyBuilder` to `FsBuilder` (which users can get with `byte_stream::FsBuilder`)

Thanks for your patience while we tweak this!
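To make the proposed call shape concrete, here is a minimal synchronous sketch using only the standard library. `FsBuilderSketch`, `FsBodySketch` and `next_chunk` are hypothetical stand-ins: the real builder is async, produces a `ByteStream`, and also has `.file` and `.file_size`, which are left out for brevity. Panicking on a missing path is an assumption of this sketch.

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::PathBuf;

/// Default read buffer capacity in bytes, as stated in the PR description.
const DEFAULT_BUFFER_SIZE: usize = 4096;

/// Hypothetical stand-in for the builder returned by `read_from`.
#[derive(Default)]
struct FsBuilderSketch {
    path: Option<PathBuf>,
    buffer_size: Option<usize>,
}

/// What `build` yields in this sketch: an open file plus its read settings.
struct FsBodySketch {
    file: File,
    buffer_size: usize,
}

/// Returns an empty builder, mirroring the proposed `ByteStream::read_from()`.
fn read_from() -> FsBuilderSketch {
    FsBuilderSketch::default()
}

impl FsBuilderSketch {
    fn path(mut self, path: impl Into<PathBuf>) -> Self {
        self.path = Some(path.into());
        self
    }

    fn buffer_size(mut self, bytes: usize) -> Self {
        self.buffer_size = Some(bytes);
        self
    }

    fn build(self) -> io::Result<FsBodySketch> {
        // Assumption of this sketch: a missing path is a programmer error.
        let path = self.path.expect("call `path` before `build`");
        Ok(FsBodySketch {
            file: File::open(path)?,
            buffer_size: self.buffer_size.unwrap_or(DEFAULT_BUFFER_SIZE),
        })
    }
}

impl FsBodySketch {
    /// Reads the next chunk, at most `buffer_size` bytes long.
    fn next_chunk(&mut self) -> io::Result<Vec<u8>> {
        let mut buf = vec![0u8; self.buffer_size];
        let n = self.file.read(&mut buf)?;
        buf.truncate(n);
        Ok(buf)
    }
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("fs_builder_sketch.bin");
    fs::write(&path, vec![0u8; 20_000])?;
    {
        let mut default_body = read_from().path(&path).build()?;
        let mut tuned_body = read_from().path(&path).buffer_size(16384).build()?;
        println!("default chunk: {} bytes", default_body.next_chunk()?.len());
        println!("tuned chunk:   {} bytes", tuned_body.next_chunk()?.len());
    }
    fs::remove_file(&path)?;
    Ok(())
}
```

Keeping every option on one builder means a later parameter (such as the offset/length mentioned earlier in the thread) becomes another method rather than another `from_*` constructor.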

/// Specify the length of the file to read (in bytes).
///
/// By pre-specifying the length of the file, this API skips an additional call to retrieve the size from file-system metadata.
///
Collaborator

Suggested change
///

///
/// Increasing the read buffer capacity to higher values than the default (4096 bytes) can result in a large reduction
/// in CPU usage, at the cost of memory increase.
///
Collaborator

Suggested change
///

@@ -23,20 +30,140 @@ use tokio_util::io::ReaderStream;
/// 3. Provide size hint
pub struct PathBody {
Collaborator

PathBody can also be private

…eam from a file

The behaviour of the existing ByteStream::from_file / ByteStream::from_path is unchanged (using a default buffer capacity of 4k, which corresponds to Tokio's ReaderStream default buffer capacity). Using higher buffer sizes can result in a large reduction in CPU during S3 uploads, at the cost of memory increase.
This makes the distinction with the buffer size clearer
- renames `with_buffer_size` to `buffer_size`
- renames `with_file_size` to `file_size`
- renames `byte_stream` to `build`
@tanguylebarzic tanguylebarzic force-pushed the tanguy.lebarzic/file-reader-capacity branch from c2d3b6a to f13845c Compare April 11, 2022 20:44
@tanguylebarzic
Contributor Author

> Thanks for your continued iteration on this! The team got a chance to bike shed the API and this is what we came up with in terms of the UX:
>
> // keep the current API
> ByteStream::from_path("myfile.txt").await
>
> // deprecate `ByteStream::from_file` and refer people to `read_from`
>
> // New API `ByteStream::read_from` which returns an empty builder:
> // - adds `.path` and `.file` methods to the builder
> // - renames `with_buffer_size` to `buffer_size`
> // - renames `with_file_size` to `file_size`
> // - renames `byte_stream` to `build` (since most people will access this _from_ `ByteStream`)
> ByteStream::read_from().path(path).build().await?
> ByteStream::read_from().file(file).build().await?
> ByteStream::read_from().path(path).buffer_size(16092).build().await?
>
> // rename `PathBodyBuilder` to `FsBuilder` (which users can get with `byte_stream::FsBuilder`)
>
> Thanks for your patience while we tweak this!

Updated the PR to implement this API. I like it, especially as it's more discoverable directly from ByteStream. It does expose 2 potential misuses though: calling neither path nor file on the FsBuilder, or calling both of them. Happy to change the implemented behaviour - I decided to panic when neither has been called (as it's probably a programmer error, it seemed better than returning an error) and to favor path over file when both have been called.
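The two misuse cases described above can be sketched as a single match. `Source` and `resolve_source` are hypothetical names for illustration and do not reflect the actual implementation in the PR; only the behaviour (panic when neither is set, path wins when both are set) is taken from the comment.

```rust
use std::fs::File;
use std::path::PathBuf;

/// Where the bytes will come from once the builder is resolved.
/// (Hypothetical type, for illustration only.)
#[derive(Debug)]
enum Source {
    Path(PathBuf),
    File(File),
}

/// Picks the source from the two optional builder fields.
fn resolve_source(path: Option<PathBuf>, file: Option<File>) -> Source {
    match (path, file) {
        // A path was given: it wins, even if a file was also given.
        (Some(path), _) => Source::Path(path),
        // Only a file was given.
        (None, Some(file)) => Source::File(file),
        // Neither was given: treated as a programmer error.
        (None, None) => panic!("call `path` or `file` before `build`"),
    }
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("resolve_source_sketch.txt");
    let file = File::create(&path)?;
    // Both are set here, so the path is used and the file handle is dropped.
    match resolve_source(Some(path.clone()), Some(file)) {
        Source::Path(p) => println!("using path {}", p.display()),
        Source::File(_) => println!("using file handle"),
    }
    std::fs::remove_file(&path)?;
    Ok(())
}
```

The "path wins" rule silently discards the file handle, so a caller who sets both gets no signal that one of them was ignored.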

Contributor

@Velfi Velfi left a comment

Looks good, I just have two suggestions for docs and then I'm ready to approve and merge.


#[cfg(feature = "rt-tokio")]
#[cfg_attr(docsrs, doc(cfg(feature = "rt-tokio")))]
pub fn read_from() -> FsBuilder {
Contributor

Would you mind adding a short doc here sending people to the FsBuilder docs?

Collaborator

an example (can be the same one from above) would also be great.

And a small side nit: can you put this method above the two deprecated ones? it makes the docs a little nicer

Contributor Author

Added comment + example.

> can you put this method above the two deprecated ones?

Moved it higher. To be clear, there's only one deprecated method at the moment (from_file). Do you want me to deprecate from_path as well (and encourage the use of ByteStream::read_from().path(x) instead)?

Collaborator

@rcoh rcoh left a comment

this is looking great! Thanks again!

No functional change
Collaborator

@rcoh rcoh left a comment

LGTM!

rust-runtime/aws-smithy-http/src/byte_stream.rs (outdated, resolved)
@rcoh rcoh enabled auto-merge (squash) April 18, 2022 14:19
@rcoh
Collaborator

rcoh commented Apr 18, 2022

CI failure is Clippy being upset

@rcoh
Collaborator

rcoh commented Apr 21, 2022

I fixed the clippy errors (hope that's alright). Will try to get this merged today. Thanks again @tanguylebarzic !

@rcoh rcoh merged commit 3073a0a into smithy-lang:main Apr 21, 2022
@tanguylebarzic tanguylebarzic deleted the tanguy.lebarzic/file-reader-capacity branch April 21, 2022 19:50
@tanguylebarzic
Contributor Author

> I fixed the clippy errors (hope that's alright). Will try to get this merged today. Thanks again @tanguylebarzic !

Missed the clippy errors, thank you for taking care of it and merging!
