Speed up `Read::bytes` #116651

nnethercote · 2023-10-12T06:59:18Z

nnethercote/perf-book#69 explains that Read::bytes in combination with a BufReader is very slow. This PR speeds it up quite a bit -- on a simple test program the runtime dropped from 320ms to 215ms -- but it's still a lot slower than alternatives. This is basically because BufReader has a certain amount of overhead for each read call, and so a configuration where every single byte requires a read is just a bad one for it.

This greatly increases its speed.

We can reuse this in `next`, avoiding the need to zero-initialize a local variable every time, for a small speed win.

rustbot · 2023-10-12T06:59:28Z

r? @cuviper

(rustbot has picked a reviewer for you, use r? to override)

This is much faster.

nnethercote · 2023-10-12T11:10:08Z

I added another commit that reduces the time further, from 215ms to 140ms.

the8472 · 2023-10-12T14:15:37Z

Have you tried using specialization and reaching into the BufReader buffer?

the8472 · 2023-10-12T17:07:35Z

We can reuse this in next, avoiding the need to zero-initialize a local variable every time

You can read into a MaybeUninit via read_buf_exact

Shnatsel · 2023-10-13T00:23:11Z

These optimizations apply regardless of the underlying reader. For example, a Cursor will implement the BufRead trait but is not an instance BufReader.

Specialization could be worth trying, but that is orthogonal to this PR.

nnethercote · 2023-10-13T01:50:56Z

Specialization was a good idea, and I managed to get it working. But even after trying a dozen different formulations I couldn't get it any faster than the non-specialized version. Surprising! I thought a lot of the code handling variable lengths would boil away when specialized for single byte reads, making a big difference, but no. And the specialized version was more complicated.

So I think this can be reviewed in its current state.

the8472 · 2023-10-13T07:52:59Z

Another thing, Bytes only implements next but not try_fold. If external iteration can't be made any faster then maybe there's still room for improving internal iteration.

Shnatsel · 2023-10-13T10:53:39Z

Indeed, try_fold can likely be improved even without specialization by calling fill_buf and then iterating over the contents, plus a consume call once the buffer runs out and in a drop guard.

This should be faster because most of the time we only need one check per byte read - whether we should stop iterating, and the check if the buffer needs to be refilled is only going to happen once in every fill_buf call.

Shnatsel · 2023-10-13T13:23:44Z

Here's a sketch of bytes() using internal iteration that is only 15% slower than a manual read loop, and is 2x faster than fs::read:

fn custom_try_fold<B, F, I>(mut reader: I, init: B, mut f: F) -> impl Try<Output = B>
where
    I: BufRead,
    F: FnMut(B, u8) -> ControlFlow<B, B>,
{
    let mut accum = init;
    loop {
        match reader.fill_buf() {
            Ok(chunk) => {
                // check for EOF
                if chunk.len() == 0 {
                    return Ok(accum);
                }
                let mut iterator = chunk.iter().copied();
                let result = iterator.try_fold(accum, &mut f);
                let consumed = chunk.len() - iterator.len();
                reader.consume(consumed);
                match result {
                    ControlFlow::Continue(a) => accum = a,
                    ControlFlow::Break(a) => {
                        accum = a;
                        return Ok(accum);
                    },
                }
            },
            Err(e) => return Err(e),
        };
    }
}

It's not fully generic because I don't really understand the nightly-only Try trait, so it cannot be used as-is, but it's a good start. I'm also not sure about correctness on panic in this code, you might also have to add a drop guard that fixes up the consumed amount if the user-supplied function panics.

Here is the same test program as before rewritten using this function, which is what I used for benchmarks: https://gist.github.com/Shnatsel/c13021180a8378a8970fd89006b2dd09

Shnatsel · 2023-10-13T13:28:20Z

But the internal iteration implementation is a lot more complex, so it is probably best to follow up on this in a separate PR.

The changes proposed here are valuable, and will not be superseded by an optimized internal iteration if/when it materializes.

LegionMammal978 · 2023-10-13T14:25:47Z

There is a bit of a semantic issue with implementing Read::bytes() using Read::read_exact(). If a 1-byte read_exact() returns an error, it's unspecified whether or not it also reads a byte:

If this function returns an error, it is unspecified how many bytes it has read, but it will never read more than would be necessary to completely fill the buffer.

This means that filtering the result of Read::bytes() to only Ok values would no longer be guaranteed to return all bytes within the underlying stream, if the reader provides a custom read_exact() implementation.

Also, if Read::read() returns an error with ErrorKind::UnexpectedEof, Read::bytes() will now return None instead of passing it along, since it interprets it as originating from Read::read_exact().

nnethercote · 2023-10-15T22:29:12Z

I have pulled out the inlining patch to #116775, because it's a simple change that gives a big win that I think is worth merging while we work through the other more complicated changes.

nnethercote · 2023-10-16T00:04:20Z

Indeed, try_fold can likely be improved even without specialization by calling fill_buf and then iterating over the contents, plus a consume call once the buffer runs out and in a drop guard.

That does require specialization. fill_buf and consume are methods on BufRead, not on Read.

the8472 · 2023-10-16T00:12:07Z

Another option is to implement fold. Since fold exhausts an iterator it can acquire bytes in larger chunks and thus avoid the per-byte overhead.

Shnatsel · 2023-10-16T00:35:53Z

It seems that fold is implemented in terms of a loop over next() (external iteration) rather than in terms of try_fold() (internal iteration): https://doc.rust-lang.org/stable/src/core/iter/traits/iterator.rs.html#2632-2635

I find that surprising. Shouldn't internal iteration be used whenever possible, since it optimizes better?

nnethercote · 2023-10-16T02:19:57Z

I tried to implement Bytes::try_fold at the Read level, but I think it's actually impossible in a fully generic way. (Likewise for the specialized version at the BufRead level.) Here's what I ended up with:

impl<R: Read> Iterator for Bytes<R> {
    type Item = Result<u8>;

    fn try_fold<B, F, Res>(&mut self, init: B, mut f: F) -> Res            
    where                                                                  
        F: FnMut(B, Self::Item) -> Res,                                    
        Res: Try<Output = B>,                                              
    {                                                                      
        let mut buf = [0u8; 256]; // njn: size?                            
        let mut acc = init;                                                
        loop {                                                             
            match self.inner.read(&mut buf) {
                Ok(0) => return Res::from_output(acc),
                Ok(n) => {
                    // njn: need to protect against `f` panicking
                    acc = buf[..n].iter().map(|&i| Ok(i)).try_fold(acc, &mut f)?;
                }
                Err(ref e) if e.is_interrupted() => continue,              

                // njn: impossible? no way to convert an io::Error into a
                // generic residual, because there's no `Try::from_error` method
                Err(e) => ...
            }
        }
    }
}

nnethercote · 2023-10-16T02:22:48Z

On Zulip, @the8472 said:

Other than that, you may not be able to do it like that anyway. try-fold can short-circuit, which means you shouldn't consume more bytes than the closure will actually take. this means you can't do much better than next in the non-specialized implementation

fold should be able to do better.

I agree with the first paragraph. But I don't see how fold can be used when the per-byte operation is fallible.

nnethercote · 2023-10-16T07:38:03Z

I've done some specialization in #116785.

Shnatsel · 2023-10-16T10:29:30Z

@nnethercote I believe you are looking for FromResidual::from_residual for converting back into a generic Try.

This method is defined on a supertrait of Try. The naming and organization of those functions is very confusing, but the operations are there. The documentation on using Try in generic code explains some of it.

Shnatsel · 2023-10-16T10:36:46Z

Regarding this line:

let mut buf = [0u8; 256]; // njn: size?

I don't think buffering inside an iterator is correct for try_fold for a generic Read: if it short-circuits, the bytes in the intermediate buffer will be lost, already gone from the Read impl but not yet yielded by the iterator.

Buffering is feasible for any BufRead. This is why my sketch requires BufRead.

cuviper

r=me after adding a comment about the byte.
(unless you're still experimenting?)

cuviper · 2023-10-18T23:14:40Z

library/std/src/io/mod.rs

@@ -2772,24 +2772,23 @@ impl<T> SizeHint for Take<T> {
 #[derive(Debug)]
 pub struct Bytes<R> {
    inner: R,
+    byte: u8,


The commit explains why this is here, but that deserves a comment too.

nnethercote · 2023-10-19T00:20:57Z

@cuviper: sorry, this ended up being superseded by #116775 and #116785. Plus @LegionMammal978 had a good comment about the read_exact approach being flawed.

I will close this. It's been an interesting ride! I've learned about specialization and various interesting std details.

nnethercote added 2 commits October 12, 2023 17:55

Inline Bytes::next and Bytes::size_hint.

ca7eef7

This greatly increases its speed.

Add Bytes::byte field.

69ad06f

We can reuse this in `next`, avoiding the need to zero-initialize a local variable every time, for a small speed win.

rustbot assigned cuviper Oct 12, 2023

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 12, 2023

nnethercote mentioned this pull request Oct 12, 2023

I/O: Read::bytes() is very slow, even on a BufReader nnethercote/perf-book#69

Closed

Use read_exact in Bytes::next.

ed2a587

This is much faster.

nnethercote force-pushed the opt-Bytes-read branch from 1c12a4e to ed2a587 Compare October 12, 2023 11:10

Shnatsel mentioned this pull request Oct 13, 2023

Document external .next() vs internal .try_fold() iteration nnethercote/perf-book#70

Open

cuviper approved these changes Oct 18, 2023

View reviewed changes

nnethercote closed this Oct 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up `Read::bytes` #116651

Speed up `Read::bytes` #116651

nnethercote commented Oct 12, 2023

rustbot commented Oct 12, 2023

nnethercote commented Oct 12, 2023

the8472 commented Oct 12, 2023

the8472 commented Oct 12, 2023

Shnatsel commented Oct 13, 2023

nnethercote commented Oct 13, 2023

the8472 commented Oct 13, 2023

Shnatsel commented Oct 13, 2023

Shnatsel commented Oct 13, 2023 •

edited

Loading

Shnatsel commented Oct 13, 2023

LegionMammal978 commented Oct 13, 2023 •

edited

Loading

nnethercote commented Oct 15, 2023

nnethercote commented Oct 16, 2023

the8472 commented Oct 16, 2023

Shnatsel commented Oct 16, 2023

nnethercote commented Oct 16, 2023 •

edited

Loading

nnethercote commented Oct 16, 2023

nnethercote commented Oct 16, 2023

Shnatsel commented Oct 16, 2023 •

edited

Loading

Shnatsel commented Oct 16, 2023

cuviper left a comment

cuviper Oct 18, 2023

nnethercote commented Oct 19, 2023

Speed up Read::bytes #116651

Speed up Read::bytes #116651

Conversation

nnethercote commented Oct 12, 2023

rustbot commented Oct 12, 2023

nnethercote commented Oct 12, 2023

the8472 commented Oct 12, 2023

the8472 commented Oct 12, 2023

Shnatsel commented Oct 13, 2023

nnethercote commented Oct 13, 2023

the8472 commented Oct 13, 2023

Shnatsel commented Oct 13, 2023

Shnatsel commented Oct 13, 2023 • edited Loading

Shnatsel commented Oct 13, 2023

LegionMammal978 commented Oct 13, 2023 • edited Loading

nnethercote commented Oct 15, 2023

nnethercote commented Oct 16, 2023

the8472 commented Oct 16, 2023

Shnatsel commented Oct 16, 2023

nnethercote commented Oct 16, 2023 • edited Loading

nnethercote commented Oct 16, 2023

nnethercote commented Oct 16, 2023

Shnatsel commented Oct 16, 2023 • edited Loading

Shnatsel commented Oct 16, 2023

cuviper left a comment

Choose a reason for hiding this comment

cuviper Oct 18, 2023

Choose a reason for hiding this comment

nnethercote commented Oct 19, 2023

Speed up `Read::bytes` #116651

Speed up `Read::bytes` #116651

Shnatsel commented Oct 13, 2023 •

edited

Loading

LegionMammal978 commented Oct 13, 2023 •

edited

Loading

nnethercote commented Oct 16, 2023 •

edited

Loading

Shnatsel commented Oct 16, 2023 •

edited

Loading