Allow to specify a read buffer initial capacity when creating ByteStream from a file #1238

tanguylebarzic · 2022-03-04T22:34:35Z

Motivation and Context

Tokio's ReaderStream::new called under the hood by ByteStream::from_file / ByteStream::from_path uses a default buffer of 4k. Users may desire to provide a different value, to trade off CPU usage against memory.

Description

This PR adds two variants to ByteStream, ByteStream::from_file_with_buffer_size and ByteStream::from_path_with_buffer_size, that allow specifying a different buffer size. For a simple app that uploads many large-ish files (65 MB) to S3, using a higher value for the buffer size (8k, 16k, 32k...), I observed a strong decrease in CPU time (both system and user), at the cost of an increase in memory usage (some figures below).

The behaviour of the existing ByteStream::from_file / ByteStream::from_path is unchanged (using a default buffer capacity of 4k, which corresponds to Tokio's ReaderStream default buffer capacity).

Testing

Uploading files to S3 from disk. Some figures from my testing:

| Buffer size (bytes) | Wall time (seconds) | User CPU time (seconds) | System CPU time (seconds) | Total CPU (seconds) | Max memory (MB) |
|---|---|---|---|---|---|
| 4096 (default) | 32.7 | 28.14 | 22.3 | 50.44 | 47 |
| 8196 | 33.06 | 14.44 | 11.69 | 26.13 | 97 |
| 16392 | 32.57 | 14.51 | 12.36 | 26.87 | 89 |
| 32784 | 32.78 | 11.39 | 8.93 | 20.32 | 103 |

As you can see, while there's no reduction in the time to upload these files, using higher buffer sizes reduces CPU usage considerably (by a factor of about 2.5 between 4096 and 32784 bytes), at the cost of more memory usage (about 2x). In my case, I mostly care about limiting the CPU, so making this trade-off makes sense.
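One plausible contributor to the CPU difference is simply that a larger buffer needs fewer read calls to drain the same file. The following is a minimal, std-only sketch of that effect, not the SDK's or Tokio's code path; `count_reads` is a hypothetical helper introduced here for illustration.

```rust
use std::io::{self, Read};

/// Drains `reader` through a scratch buffer of `buffer_size` bytes and
/// returns `(total_bytes, read_calls)`.
/// Hypothetical helper for illustration; not part of this PR.
fn count_reads<R: Read>(mut reader: R, buffer_size: usize) -> io::Result<(u64, u64)> {
    let mut buf = vec![0u8; buffer_size];
    let mut total = 0u64;
    let mut calls = 0u64;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of input; the final zero-length read is not counted
        }
        total += n as u64;
        calls += 1;
    }
    Ok((total, calls))
}
```

For a synthetic 1 MiB input such as `io::repeat(0).take(1 << 20)`, this needs 256 reads with a 4096-byte buffer and 32 reads with a 32768-byte buffer. Real files and sockets may return short reads, so actual counts can be higher.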

Checklist

I have updated CHANGELOG.next.toml if I made changes to the AWS SDK, generated SDK code, or SDK runtime crates

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Velfi · 2022-03-07T17:51:57Z

This is a great idea, thanks for submitting! We'll get to reviewing this as soon as we're able.

rcoh

Overall, I really like the new functionality here. What I'm less sure about is the potential for a bit of API explosion (e.g., if we ever end up with another parameter on loading bodies from a path). I've been thinking about some other options, revolving around making a builder for PathBody.

We would make PathBody public, and create a builder for it to allow code like:

PathBody::with_capacity(1024).from_file(...).byte_stream(). For "vanilla" use cases, from_path and from_file would still provide a direct interface.

still considering all the possible options though, let me know if you have any other thoughts.

rcoh · 2022-03-08T19:49:22Z

rust-runtime/aws-smithy-http/src/byte_stream.rs

+        Self::from_file_with_buffer_size(file, DEFAULT_BUFFER_SIZE).await
+    }
+
+    /// Create a ByteStream from a file, with a specific read buffer initial capacity.


docs should clarify the unit of the read buffer

rcoh · 2022-03-09T01:25:39Z

rust-runtime/aws-smithy-http/src/byte_stream/bytestream_util.rs

@@ -24,19 +24,22 @@ use tokio_util::io::ReaderStream;
 pub struct PathBody {
    state: State,
    len: u64,


can we rename this to file_size (now that buffer_size is also there it's slightly confusing)

tanguylebarzic · 2022-03-10T12:32:51Z


Agreed that just adding a function each time there's a new parameter isn't great, and makes for poor discoverability - I like the builder idea, will give it a shot!

rcoh · 2022-03-25T01:16:42Z

we also recently discussed the need to specify an offset/length into the path. Let's do a builder!
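To make the offset/length idea concrete, here is a small std-only sketch of reading a byte range from a seekable source. It is an assumption about what such an option could look like, not something this PR implements; `read_range` is a hypothetical name.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Reads at most `length` bytes starting at byte `offset` of `source`.
/// Hypothetical helper for illustration; not part of this PR.
fn read_range<R: Read + Seek>(mut source: R, offset: u64, length: u64) -> io::Result<Vec<u8>> {
    // Position the cursor at the start of the requested range.
    source.seek(SeekFrom::Start(offset))?;
    let mut out = Vec::new();
    // `take` caps how many bytes `read_to_end` may consume.
    source.take(length).read_to_end(&mut out)?;
    Ok(out)
}
```

A builder is a natural home for options like this, since each new parameter becomes one more optional setter rather than another `from_*` constructor.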

tanguylebarzic · 2022-03-27T20:19:51Z

@rcoh Sorry for the delay, updated the PR with a builder. I used the opportunity to also accept the file size as an optional argument, for cases where it's known by the caller (no need to query the file's metadata in this case).
Not a fan of the naming (PathBodyBuilder). I also hesitated to add a function from_builder(builder: PathBodyBuilder) in ByteStream, to ease discovery (and have it appear along with from_path and from_file in the doc), but ended up just adding a comment to from_path.

rust-runtime/aws-smithy-http/src/byte_stream.rs

rust-runtime/aws-smithy-http/src/byte_stream/bytestream_util.rs

rcoh

This is looking great! just some small cleanups to finalize

rcoh · 2022-03-27T23:36:57Z

rust-runtime/aws-smithy-http/src/byte_stream.rs

+        }
+        let body = PathBodyBuilder::from_path(&file)
+            .with_buffer_size(16384)
+            // This isn't the right file length - one shouln't done that in real life


Suggested change

// This isn't the right file length - one shouln't done that in real life

// This isn't the right file length - one shouldn't do that in real life

rcoh · 2022-03-29T19:26:53Z

rust-runtime/aws-smithy-http/src/byte_stream/bytestream_util.rs

+    pub fn from_path(path_buf: PathBuf, file_size: u64, buffer_size: usize) -> Self {
        PathBody {
-            state: State::Unloaded(path.to_path_buf()),
-            len,
+            state: State::Unloaded(path_buf),
+            file_size,
+            buffer_size,
        }
    }
-    pub fn from_file(file: File, len: u64) -> Self {
+    pub fn from_file(file: File, file_size: u64, buffer_size: usize) -> Self {
        PathBody {


these two functions don't need to be pub anymore, right?

I believe they actually need to be public to be able to call PathBodyBuilder directly (as in the example usage).

these functions are on PathBody, not PathBodyBuilder. In the old code, these were used directly by ByteStream, but now that uses the builder so these don't need to be pub anymore (I also just verified this locally)

You are right, I mistakenly thought you were referring to the PathBodyBuilder functions.

rust-runtime/aws-smithy-http/src/byte_stream/bytestream_util.rs

rcoh · 2022-03-29T19:28:35Z

rust-runtime/aws-smithy-http/src/byte_stream/bytestream_util.rs

+
+impl PathBodyBuilder {
+    /// Create a PathBodyBuilder from a path.
+    ///


we should note the default buffer size of 4096

rcoh · 2022-03-29T19:32:17Z

rust-runtime/aws-smithy-http/src/byte_stream/bytestream_util.rs

+        self
+    }
+
+    /// Returns a [ByteStream](crate::byte_stream::ByteStream) from this builder.


Suggested change

/// Returns a [ByteStream](crate::byte_stream::ByteStream) from this builder.

/// Returns a [`ByteStream`](crate::byte_stream::ByteStream) from this builder.

rcoh · 2022-03-29T19:35:08Z

rust-runtime/aws-smithy-http/src/byte_stream/bytestream_util.rs

+    /// If not used, [byte_stream](PathBodyBuilder::byte_stream) will require an extra call to query the file's metadata.
+    ///


would probably rephrase this as the inverse to be a bit clearer about what this does:

By pre-specifying the length of the file, this API skips an additional call to retrieve the size from file-system metadata.

Agreed, updated.
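The behaviour described in that doc comment can be sketched with std only: when the caller supplies a size it is used as-is, and the file system is consulted only otherwise. This is an illustration under that assumption, not the code in bytestream_util.rs; `resolve_file_size` is a hypothetical name.

```rust
use std::{fs, io, path::Path};

/// Returns the caller-supplied size when present; otherwise queries the
/// file-system metadata for `path`.
/// Hypothetical helper for illustration; not the PR's implementation.
fn resolve_file_size(path: &Path, known_size: Option<u64>) -> io::Result<u64> {
    match known_size {
        // Trusted as-is: no metadata call is made, even if the value is wrong.
        Some(size) => Ok(size),
        // Fall back to one extra call to the file system.
        None => Ok(fs::metadata(path)?.len()),
    }
}
```

Because the supplied value is trusted, a wrong size goes undetected at this point, which is why the test in this PR notes that passing an incorrect length is not something to do in real code.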

rcoh

Thanks for your continued iteration on this! The team got a chance to bike shed the API and this is what we came up with in terms of the UX:

// keep the current API
ByteStream::from_path("myfile.txt").await

// deprecate `ByteStream::from_file` and refer people to `read_from`

// New API `ByteStream::read_from` which returns an empty builder:
// - adds `.path` and `.file` methods to the builder
// - renames `with_buffer_size` to `buffer_size`
// - renames `with_file_size` to `file_size`
// - renames `byte_stream` to `build` (since most people will access this _from_ `ByteStream`)
ByteStream::read_from().path(path).build().await?
ByteStream::read_from().file(file).build().await?
ByteStream::read_from().path(path).buffer_size(16092).build().await?

// rename `PathBodyBuilder` to `FsBuilder` (which users can get with `byte_stream::FsBuilder`)

Thanks for your patience while we tweak this!

rcoh · 2022-03-31T21:33:33Z

rust-runtime/aws-smithy-http/src/byte_stream/bytestream_util.rs



these functions are on PathBody, not PathBodyBuilder. In the old code, these were used directly by ByteStream, but now that uses the builder so these don't need to be pub anymore (I also just verified this locally)

rcoh · 2022-03-31T21:38:15Z

rust-runtime/aws-smithy-http/src/byte_stream/bytestream_util.rs

+    /// Specify the length of the file to read (in bytes).
+    ///
+    /// By pre-specifying the length of the file, this API skips an additional call to retrieve the size from file-system metadata.
+    ///


Suggested change

///

rcoh · 2022-03-31T21:38:23Z

rust-runtime/aws-smithy-http/src/byte_stream/bytestream_util.rs

+    ///
+    /// Increasing the read buffer capacity to higher values than the default (4096 bytes) can result in a large reduction
+    /// in CPU usage, at the cost of memory increase.
+    ///


Suggested change

///

rcoh · 2022-03-31T21:48:46Z

rust-runtime/aws-smithy-http/src/byte_stream/bytestream_util.rs

@@ -23,20 +30,140 @@ use tokio_util::io::ReaderStream;
 /// 3. Provide size hint
 pub struct PathBody {


PathBody can also be private


tanguylebarzic · 2022-04-11T21:19:17Z


Updated the PR to implement this API. I like it, especially as it's more discoverable directly from ByteStream. It does expose two potential misuses though: calling neither path nor file on the FsBuilder, or calling both of them. Happy to change the implemented behaviour. I decided to panic when neither has been called (as it's probably a programmer error, it seemed better than returning an error) and to favor path over file when both have been called.
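The misuse handling described in this comment can be modelled with a small std-only builder. This is a synchronous stand-in, not the SDK's `FsBuilder` (whose `build` is async and yields a `ByteStream`); `FsBuilderSketch`, `Source`, and the tuple returned by `build` are all hypothetical names chosen for illustration.

```rust
use std::fs::File;
use std::path::{Path, PathBuf};

/// Default read buffer capacity in bytes, as discussed in this PR.
const DEFAULT_BUFFER_SIZE: usize = 4096;

/// Which input `build` settled on (stand-in for the real body type).
#[derive(Debug)]
enum Source {
    Path(PathBuf),
    File(File),
}

/// Hypothetical stand-in for `FsBuilder`; every field is optional.
#[derive(Debug, Default)]
struct FsBuilderSketch {
    path: Option<PathBuf>,
    file: Option<File>,
    file_size: Option<u64>,
    buffer_size: Option<usize>,
}

impl FsBuilderSketch {
    fn new() -> Self {
        Self::default()
    }

    fn path(mut self, path: impl AsRef<Path>) -> Self {
        self.path = Some(path.as_ref().to_path_buf());
        self
    }

    fn file(mut self, file: File) -> Self {
        self.file = Some(file);
        self
    }

    /// File length in bytes, if the caller already knows it.
    fn file_size(mut self, file_size: u64) -> Self {
        self.file_size = Some(file_size);
        self
    }

    /// Read buffer capacity in bytes.
    fn buffer_size(mut self, buffer_size: usize) -> Self {
        self.buffer_size = Some(buffer_size);
        self
    }

    /// Panics if neither `path` nor `file` was set; prefers `path` if both were.
    fn build(self) -> (Source, usize, Option<u64>) {
        let buffer_size = self.buffer_size.unwrap_or(DEFAULT_BUFFER_SIZE);
        let source = match (self.path, self.file) {
            (Some(path), _) => Source::Path(path),
            (None, Some(file)) => Source::File(file),
            (None, None) => panic!("FsBuilderSketch requires a path or a file"),
        };
        (source, buffer_size, self.file_size)
    }
}
```

An alternative design would be to make the "neither" and "both" cases unrepresentable, for example by taking the source as a required argument of the constructor, at the cost of a less uniform builder surface.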

Velfi

Looks good, I just have two suggestions for docs and then I'm ready to approve and merge.

Velfi · 2022-04-12T18:04:37Z

rust-runtime/aws-smithy-http/src/byte_stream.rs

+
+    #[cfg(feature = "rt-tokio")]
+    #[cfg_attr(docsrs, doc(cfg(feature = "rt-tokio")))]
+    pub fn read_from() -> FsBuilder {


Would you mind adding a short doc here sending people to the FsBuilder docs?

an example (can be the same one from above) would also be great.

And a small side nit: can you put this method above the two deprecated ones? it makes the docs a little nicer

Added comment + example.

can you put this method above the two deprecated ones?
Moved it higher. To be clear, there's only one deprecated method at the moment (from_file). Do you want me to deprecate from_path as well (and encourage the use of ByteStream::read_from().path(x) instead)?

rust-runtime/aws-smithy-http/src/byte_stream/bytestream_util.rs

rcoh

this is looking great! Thanks again!


rcoh

LGTM!

rust-runtime/aws-smithy-http/src/byte_stream.rs

a -> an

rcoh · 2022-04-18T17:34:34Z

CI failure is Clippy being upset

rcoh · 2022-04-21T13:27:08Z

I fixed the clippy errors (hope that's alright). Will try to get this merged today. Thanks again @tanguylebarzic !

tanguylebarzic · 2022-04-21T19:52:40Z


Missed the clippy errors, thank you for taking care of it and merging!

tanguylebarzic requested a review from a team as a code owner March 4, 2022 22:34

Velfi requested a review from rcoh March 7, 2022 17:52

rcoh reviewed Mar 9, 2022


tanguylebarzic force-pushed the tanguy.lebarzic/file-reader-capacity branch from 1852649 to ab28cfd Compare March 27, 2022 20:11

Velfi reviewed Mar 28, 2022

rust-runtime/aws-smithy-http/src/byte_stream.rs

Velfi reviewed Mar 28, 2022

rust-runtime/aws-smithy-http/src/byte_stream.rs

Velfi reviewed Mar 28, 2022

rust-runtime/aws-smithy-http/src/byte_stream/bytestream_util.rs

rcoh reviewed Mar 29, 2022


tanguylebarzic force-pushed the tanguy.lebarzic/file-reader-capacity branch from 5f61a77 to 0a0021e Compare March 31, 2022 12:22

rcoh reviewed Mar 31, 2022


tanguylebarzic added 10 commits April 11, 2022 22:43

Rename len to file_size

0a216e3

This makes the distinction with the buffer size clearer

Use a builder to specify advanced options to create a ByteStream

e81119a

Improved comments

85683d1

Specify the unit to use for PathBodyBuilder.with_file_size

b38a34a

Improved comments following review

33a5e62

Rename PathBodyBuilder to FsBuilder

6b94368

Renaming in FsBuilder

401f0b0

- renames `with_buffer_size` to `buffer_size` - renames `with_file_size` to `file_size` - renames `byte_stream` to `build`

Make PathBody private

e0b6650

Updated API for FsBuilder

f13845c

tanguylebarzic force-pushed the tanguy.lebarzic/file-reader-capacity branch from c2d3b6a to f13845c Compare April 11, 2022 20:44

Velfi requested changes Apr 12, 2022

rcoh reviewed Apr 13, 2022

tanguylebarzic added 2 commits April 15, 2022 15:12

Document panic behavior of ByteStream::build

cb564b5

Document ByteStream::read_from

304f555

Move ByteStream::read_from

51e271d

No functional change

rcoh approved these changes Apr 15, 2022

rust-runtime/aws-smithy-http/src/byte_stream.rs

Update rust-runtime/aws-smithy-http/src/byte_stream.rs

67986d9

a -> an

Velfi approved these changes Apr 15, 2022


rcoh added 2 commits April 15, 2022 09:53

Merge branch 'main' into tanguy.lebarzic/file-reader-capacity

d9c3d02

Merge branch 'main' into tanguy.lebarzic/file-reader-capacity

cefb37e

rcoh enabled auto-merge (squash) April 18, 2022 14:19

Fix clippy errors

7e68316

Merge branch 'main' into tanguy.lebarzic/file-reader-capacity

22b98fe

rcoh mentioned this pull request Apr 21, 2022

Added ByteStream method to read from file chunk awslabs/aws-sdk-rust#517

Closed

Merge branch 'main' into tanguy.lebarzic/file-reader-capacity

8171699

rcoh merged commit 3073a0a into smithy-lang:main Apr 21, 2022

tanguylebarzic deleted the tanguy.lebarzic/file-reader-capacity branch April 21, 2022 19:50
