Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up Read::bytes #116651

Closed
wants to merge 3 commits into from
Closed

Conversation

nnethercote
Copy link
Contributor

nnethercote/perf-book#69 explains that Read::bytes in combination with a BufReader is very slow. This PR speeds it up quite a bit -- on a simple test program the runtime dropped from 320ms to 215ms -- but it's still a lot slower than alternatives. This is basically because BufReader has a certain amount of overhead for each read call, and so a configuration where every single byte requires a read is just a bad one for it.

This greatly increases its speed.
We can reuse this in `next`, avoiding the need to zero-initialize a
local variable every time, for a small speed win.
@rustbot
Copy link
Collaborator

rustbot commented Oct 12, 2023

r? @cuviper

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 12, 2023
This is much faster.
@nnethercote
Copy link
Contributor Author

I added another commit that reduces the time further, from 215ms to 140ms.

@the8472
Copy link
Member

the8472 commented Oct 12, 2023

Have you tried using specialization and reaching into the BufReader buffer?

@the8472
Copy link
Member

the8472 commented Oct 12, 2023

We can reuse this in next, avoiding the need to zero-initialize a local variable every time

You can read into a MaybeUninit via read_buf_exact

@Shnatsel
Copy link
Member

These optimizations apply regardless of the underlying reader. For example, a Cursor will implement the BufRead trait but is not an instance BufReader.

Specialization could be worth trying, but that is orthogonal to this PR.

@nnethercote
Copy link
Contributor Author

Specialization was a good idea, and I managed to get it working. But even after trying a dozen different formulations I couldn't get it any faster than the non-specialized version. Surprising! I thought a lot of the code handling variable lengths would boil away when specialized for single byte reads, making a big difference, but no. And the specialized version was more complicated.

So I think this can be reviewed in its current state.

@the8472
Copy link
Member

the8472 commented Oct 13, 2023

Another thing, Bytes only implements next but not try_fold. If external iteration can't be made any faster then maybe there's still room for improving internal iteration.

@Shnatsel
Copy link
Member

Indeed, try_fold can likely be improved even without specialization by calling fill_buf and then iterating over the contents, plus a consume call once the buffer runs out and in a drop guard.

This should be faster because most of the time we only need one check per byte read - whether we should stop iterating, and the check if the buffer needs to be refilled is only going to happen once in every fill_buf call.

@Shnatsel
Copy link
Member

Shnatsel commented Oct 13, 2023

Here's a sketch of bytes() using internal iteration that is only 15% slower than a manual read loop, and is 2x faster than fs::read:

fn custom_try_fold<B, F, I>(mut reader: I, init: B, mut f: F) -> impl Try<Output = B>
where
    I: BufRead,
    F: FnMut(B, u8) -> ControlFlow<B, B>,
{
    let mut accum = init;
    loop {
        match reader.fill_buf() {
            Ok(chunk) => {
                // check for EOF
                if chunk.len() == 0 {
                    return Ok(accum);
                }
                let mut iterator = chunk.iter().copied();
                let result = iterator.try_fold(accum, &mut f);
                let consumed = chunk.len() - iterator.len();
                reader.consume(consumed);
                match result {
                    ControlFlow::Continue(a) => accum = a,
                    ControlFlow::Break(a) => {
                        accum = a;
                        return Ok(accum);
                    },
                }
            },
            Err(e) => return Err(e),
        };
    }
}

It's not fully generic because I don't really understand the nightly-only Try trait, so it cannot be used as-is, but it's a good start. I'm also not sure about correctness on panic in this code, you might also have to add a drop guard that fixes up the consumed amount if the user-supplied function panics.

Here is the same test program as before rewritten using this function, which is what I used for benchmarks: https://gist.github.com/Shnatsel/c13021180a8378a8970fd89006b2dd09

@Shnatsel
Copy link
Member

But the internal iteration implementation is a lot more complex, so it is probably best to follow up on this in a separate PR.

The changes proposed here are valuable, and will not be superseded by an optimized internal iteration if/when it materializes.

@LegionMammal978
Copy link
Contributor

LegionMammal978 commented Oct 13, 2023

There is a bit of a semantic issue with implementing Read::bytes() using Read::read_exact(). If a 1-byte read_exact() returns an error, it's unspecified whether or not it also reads a byte:

If this function returns an error, it is unspecified how many bytes it has read, but it will never read more than would be necessary to completely fill the buffer.

This means that filtering the result of Read::bytes() to only Ok values would no longer be guaranteed to return all bytes within the underlying stream, if the reader provides a custom read_exact() implementation.

Also, if Read::read() returns an error with ErrorKind::UnexpectedEof, Read::bytes() will now return None instead of passing it along, since it interprets it as originating from Read::read_exact().

@nnethercote
Copy link
Contributor Author

I have pulled out the inlining patch to #116775, because it's a simple change that gives a big win that I think is worth merging while we work through the other more complicated changes.

@nnethercote
Copy link
Contributor Author

Indeed, try_fold can likely be improved even without specialization by calling fill_buf and then iterating over the contents, plus a consume call once the buffer runs out and in a drop guard.

That does require specialization. fill_buf and consume are methods on BufRead, not on Read.

@the8472
Copy link
Member

the8472 commented Oct 16, 2023

Another option is to implement fold. Since fold exhausts an iterator it can acquire bytes in larger chunks and thus avoid the per-byte overhead.

@Shnatsel
Copy link
Member

It seems that fold is implemented in terms of a loop over next() (external iteration) rather than in terms of try_fold() (internal iteration): https://doc.rust-lang.org/stable/src/core/iter/traits/iterator.rs.html#2632-2635

I find that surprising. Shouldn't internal iteration be used whenever possible, since it optimizes better?

@nnethercote
Copy link
Contributor Author

nnethercote commented Oct 16, 2023

I tried to implement Bytes::try_fold at the Read level, but I think it's actually impossible in a fully generic way. (Likewise for the specialized version at the BufRead level.) Here's what I ended up with:

impl<R: Read> Iterator for Bytes<R> {
    type Item = Result<u8>;

    fn try_fold<B, F, Res>(&mut self, init: B, mut f: F) -> Res            
    where                                                                  
        F: FnMut(B, Self::Item) -> Res,                                    
        Res: Try<Output = B>,                                              
    {                                                                      
        let mut buf = [0u8; 256]; // njn: size?                            
        let mut acc = init;                                                
        loop {                                                             
            match self.inner.read(&mut buf) {
                Ok(0) => return Res::from_output(acc),
                Ok(n) => {
                    // njn: need to protect against `f` panicking
                    acc = buf[..n].iter().map(|&i| Ok(i)).try_fold(acc, &mut f)?;
                }
                Err(ref e) if e.is_interrupted() => continue,              

                // njn: impossible? no way to convert an io::Error into a
                // generic residual, because there's no `Try::from_error` method
                Err(e) => ...
            }
        }
    }
} 

@nnethercote
Copy link
Contributor Author

On Zulip, @the8472 said:

Other than that, you may not be able to do it like that anyway. try-fold can short-circuit, which means you shouldn't consume more bytes than the closure will actually take. this means you can't do much better than next in the non-specialized implementation

fold should be able to do better.

I agree with the first paragraph. But I don't see how fold can be used when the per-byte operation is fallible.

@nnethercote
Copy link
Contributor Author

I've done some specialization in #116785.

@Shnatsel
Copy link
Member

Shnatsel commented Oct 16, 2023

@nnethercote I believe you are looking for FromResidual::from_residual for converting back into a generic Try.

This method is defined on a supertrait of Try. The naming and organization of those functions is very confusing, but the operations are there. The documentation on using Try in generic code explains some of it.

@Shnatsel
Copy link
Member

Regarding this line:

let mut buf = [0u8; 256]; // njn: size?

I don't think buffering inside an iterator is correct for try_fold for a generic Read: if it short-circuits, the bytes in the intermediate buffer will be lost, already gone from the Read impl but not yet yielded by the iterator.

Buffering is feasible for any BufRead. This is why my sketch requires BufRead.

Copy link
Member

@cuviper cuviper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r=me after adding a comment about the byte.
(unless you're still experimenting?)

@@ -2772,24 +2772,23 @@ impl<T> SizeHint for Take<T> {
#[derive(Debug)]
pub struct Bytes<R> {
inner: R,
byte: u8,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit explains why this is here, but that deserves a comment too.

@nnethercote
Copy link
Contributor Author

@cuviper: sorry, this ended up being superseded by #116775 and #116785. Plus @LegionMammal978 had a good comment about the read_exact approach being flawed.

I will close this. It's been an interesting ride! I've learned about specialization and various interesting std details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants