
Fix gzip + chunk pool problem #550

Closed
wants to merge 6 commits into from

Conversation

algesten
Owner

Close #549

This commit adds two intermediary wrappers to the gzip/brotli decoding
cases, DidReadZero and CompressionRead.

The inner, DidReadZero, keeps track of whether the wrapped stream has
read to n=0. The wrapped stream will typically be a
`PoolReturnRead(ChunkedDecoder)` or `PoolReturnRead(LimitRead)`, both
of which are fine to read multiple times to 0.

The outer CompressionRead will, on n=0, unpack the inner DidReadZero
and ensure we do read to 0.
@algesten algesten marked this pull request as ready for review October 19, 2022 21:37
This commit adds two intermediary wrappers to the gzip/brotli decoding
cases, DidReadZero and PreciseRead.

The inner, DidReadZero, keeps track of whether the wrapped stream has
read to n=0. The wrapped stream will typically be a
`PoolReturnRead(ChunkedDecoder)` or `PoolReturnRead(LimitRead)`, both
of which are fine to read multiple times to 0.

The outer PreciseRead will, on n=0, unpack the inner DidReadZero
and ensure it reads exactly nothing one more time.
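A minimal sketch of the mechanism the commit messages describe (DidReadZero and the idea behind PreciseRead come from the messages; the helper ensure_read_to_zero and the error text are illustrative assumptions, and the real ureq code may differ):

use std::io::{Error, ErrorKind, Read, Result};

// Inner wrapper: remembers whether the wrapped transport reader has
// returned n=0, i.e. whether PoolReturnRead has had a chance to fire.
struct DidReadZero<R: Read> {
    inner: R,
    saw_zero: bool,
}

impl<R: Read> DidReadZero<R> {
    fn new(inner: R) -> Self {
        Self { inner, saw_zero: false }
    }

    fn did_read_zero(&self) -> bool {
        self.saw_zero
    }
}

impl<R: Read> Read for DidReadZero<R> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        let n = self.inner.read(buf)?;
        if n == 0 {
            self.saw_zero = true;
        }
        Ok(n)
    }
}

// What the outer wrapper (PreciseRead) does once the decoder reports
// n=0: recover the DidReadZero (via the decoder's into_inner) and, if
// it has not yet hit n=0 itself, require one more read to yield
// exactly nothing.
fn ensure_read_to_zero<R: Read>(mut inner: DidReadZero<R>) -> Result<()> {
    if inner.did_read_zero() {
        return Ok(());
    }
    match inner.read(&mut [0u8; 1])? {
        0 => Ok(()),
        _ => Err(Error::new(ErrorKind::InvalidData, "extra data after body")),
    }
}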
@algesten
Owner Author

@jsha gentle re-ping 🤗

@jsha
Collaborator

jsha commented Oct 24, 2022

I will take another look at this soon! Thanks for the ping.

@jsha
Collaborator

jsha commented Oct 30, 2022

This isn't quite what I had in mind with #549 (comment). The DidReadZero thing is cool, but we can get away without it by wrapping another reader only on the outside, not on both the outside and the inside.

I wrote up a demo to try and show what I meant here:

use std::io::{Read, Error, ErrorKind};

const BUFFER_SIZE: usize = 8000;

pub(crate) struct BrotliReader<R: Read> {
    inner: Option<brotli_decompressor::Decompressor<R>>
}

impl<R: Read> BrotliReader<R> {
    fn new(inner: R) -> Self {
        Self { inner: Some(brotli_decompressor::Decompressor::new(inner, BUFFER_SIZE)) }
    }

    // Once the decompressor reports EOF, unwrap it and check that the
    // underlying reader has also been fully consumed.
    fn verify_wrapped_reader_done(&mut self) -> std::io::Result<usize> {
        let r = match self.inner.take() {
            Some(r) => r,
            None => return Ok(0),
        };
        let mut inner = r.into_inner();
        match inner.read(&mut [0u8; 1]) {
            Ok(0) => Ok(0),
            Err(e) => Err(e),
            Ok(_) => Err(Error::new(ErrorKind::InvalidData, "extra data after valid Brotli stream")),
        }
    }
}

impl<R: Read> Read for BrotliReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        let inner = match &mut self.inner {
            Some(d) => d,
            None => return Ok(0),
        };
        match inner.read(buf) {
            // EOF from the decompressor: confirm the wrapped reader is
            // also done before reporting Ok(0).
            Ok(0) => self.verify_wrapped_reader_done(),
            Ok(n) => Ok(n),
            r => r,
        }
    }
}

#[cfg(test)]
mod tests {
    use std::io::{Cursor, ErrorKind};
    use std::error::Error;
    use super::BrotliReader;

    #[test]
    fn test_no_leftovers() -> Result<(), Box<dyn Error>> {
        // Output from this command:
        // (brotli -c <<<"hello"; echo goodbye) | xxd -p | sed 's/../\\x&/g'
        let compressed_with_extra = b"\x8f\x02\x80\x68\x65\x6c\x6c\x6f\x0a\x03\x67\x6f\x6f\x64\x62\x79\x65\x0a";
        let mut reader = BrotliReader::new(Cursor::new(compressed_with_extra));
        let output = std::io::read_to_string(&mut reader);
        assert!(output.is_err());
        assert_eq!(output.err().unwrap().kind(), ErrorKind::InvalidData);
        Ok(())
    }
}

Unfortunately, the test fails! But if you change BUFFER_SIZE to 10 (the size of the "good" Brotli bytes), it passes. This is because of the internal buffering in the Brotli decompressor. With a BUFFER_SIZE of 8000, it has already read the full 18-byte input into its buffer, and then into_inner drops the unused 8 bytes on the floor, so reading from the underlying Cursor returns Ok(0) as if everything were fine. I filed an upstream issue: dropbox/rust-brotli#84. Without a fix for that, I don't think we can properly fix this for Brotli on our side. I think we should do the simple fix for gzip (switch to MultiGzDecoder) and wait for an upstream fix for Brotli.
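For reference, a minimal sketch of the gzip half of that plan, assuming the flate2 crate (the data and names here are illustrative, not ureq's code): GzDecoder stops after the first gzip member, while MultiGzDecoder keeps decoding members until the underlying stream reaches EOF, so it drains the whole response body.

use flate2::read::{GzDecoder, MultiGzDecoder};
use flate2::write::GzEncoder;
use flate2::Compression;
use std::io::{Read, Write};

// Gzip-compress a byte string into a single gzip member.
fn gzip(data: &[u8]) -> Vec<u8> {
    let mut enc = GzEncoder::new(Vec::new(), Compression::default());
    enc.write_all(data).unwrap();
    enc.finish().unwrap()
}

fn main() -> std::io::Result<()> {
    // Two concatenated gzip members, which a server may legitimately send.
    let mut body = gzip(b"hello ");
    body.extend(gzip(b"world"));

    // GzDecoder reports EOF after the first member, leaving the rest
    // of the input unread.
    let mut first_only = String::new();
    GzDecoder::new(&body[..]).read_to_string(&mut first_only)?;
    assert_eq!(first_only, "hello ");

    // MultiGzDecoder reads the input to its real end.
    let mut all = String::new();
    MultiGzDecoder::new(&body[..]).read_to_string(&mut all)?;
    assert_eq!(all, "hello world");
    Ok(())
}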

@algesten
Owner Author

What guarantee do we have that the extra read isn't going to sync wait for the next byte from the server?

@algesten
Owner Author

algesten commented Oct 30, 2022

Or put another way: I think technically in Rust you should only read to zero once.

There's no guarantee that multiple reads to 0 will be a good state.

We can guarantee that ourselves by knowing exactly what deeper readers are hiding behind that into_inner.

My fix was trying to ensure we don't rely on such deeper behavior.

@jsha
Collaborator

jsha commented Oct 30, 2022

What guarantee do we have that the extra read isn't going to sync wait for the next byte from the server?

The next layer down is one of:

  1. a chunked_transfer::Decoder
  2. a LimitedRead
  3. a plain stream (for close-delimited responses, in HTTP/1.0)

For all of these, if the response is done, they will return Ok(0), not sync wait. If the response is not done, they should sync wait and when the next byte eventually comes through, the response should error out (due to extra junk at the end). For (3), that sync wait might result in an eventual Ok(0) when the TCP connection is closed.

I think technically in Rust you should only read to zero once.

This is a good point. Specifically, io::Read::read makes no guarantee that Ok(0) means a stream is permanently done, and in some cases it explicitly doesn't mean done:

for File, it is possible to reach the end of file and get zero as result, but if more data is appended to the file, future calls to read will return more data.

One solution to this is the equivalent of std::iter::Fuse but for Read. I'm surprised this doesn't exist in the stdlib! But it would be easy to write our own and wrap LimitedRead / Decoder in it. Alternatively, we could have a marker trait that indicates "this reader implements fused semantics" (which LimitedRead does, but I think chunked_transfer::Decoder does not, out of the box).
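A minimal sketch of such a fused wrapper (FusedReader is a name invented here, not ureq's; it is just the std::iter::Fuse idea transplanted to io::Read):

use std::io::{Read, Result};

// After the inner reader first returns Ok(0), every later read returns
// Ok(0) without touching the inner reader again.
struct FusedReader<R: Read> {
    inner: R,
    done: bool,
}

impl<R: Read> FusedReader<R> {
    fn new(inner: R) -> Self {
        Self { inner, done: false }
    }
}

impl<R: Read> Read for FusedReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        if self.done {
            return Ok(0);
        }
        let n = self.inner.read(buf)?;
        if n == 0 {
            self.done = true;
        }
        Ok(n)
    }
}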

@jsha jsha mentioned this pull request Nov 26, 2022
@algesten algesten closed this Nov 26, 2022
@algesten algesten deleted the fix/read-exact-gzip branch August 13, 2024 17:02
Successfully merging this pull request may close these issues:

Compressed, chunked response inhibits connection reuse