Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad performance after introducing zeroing of UninitSlice #6

Closed
kkovaacs opened this issue Jun 11, 2021 · 8 comments · Fixed by #9
Closed

Bad performance after introducing zeroing of UninitSlice #6

kkovaacs opened this issue Jun 11, 2021 · 8 comments · Fixed by #9
Assignees

Comments

@kkovaacs
Copy link

After the following commit zeroing of UninitSlice completely dominates websocket CPU time:

commit 81d6b33c0607dec5f02418bab80cfbd88fa79b53
Author: Sebastian Dröge <[email protected]>
Date:   Tue Dec 10 18:14:00 2019 +0200

    Fix UB in DoRead::read_from()
    
    We need to initialize the buffer first before creating a byte slice from
    it and passing it to Read::read().

It's very easy to reproduce by running the server example from tungstenite-rs and generating some websocket traffic (I've used websocat to send a large file via websocket):

tungstenite-rs [master●] % cargo build --release --example server           
    Finished release [optimized] target(s) in 0.02s
tungstenite-rs [master●] % perf record target/release/examples/server
websocat --binary readfile:/data/iso/cdrom_6.10.0.alpha1.iso  ws://127.0.0.1:3012/

The profile looks like this:

tungstenite-rs [master●] % perf report --percent-limit=1 | cat
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 15K of event 'cycles:u'
# Event count (approx.): 11805093888
#
# Overhead  Command  Shared Object       Symbol                                                                                                                        
# ........  .......  ..................  ..............................................................................................................................
#
    52.76%  server   server              [.] bytes::buf::uninit_slice::UninitSlice::write_byte
    37.22%  server   server              [.] input_buffer::DoRead::read_from
     7.31%  server   libc-2.33.so        [.] __memmove_avx_unaligned_erms
     1.45%  server   server              [.] tungstenite::protocol::WebSocket<Stream>::read_message


#
# (Tip: Show user configuration overrides: perf config --user --list)
#

It seems that the generated code is very suboptimal: it's writing zero bytes into the uninitialized buffer one byte at a time and thus causes a huge performance overhead in a websocket server.

@daniel-abramov
Copy link
Member

Interesting, so it was introduced with #3

We have not run any benchmarks back then, but we might need to investigate it. Thanks for the report!

@kkovaacs
Copy link
Author

It would be great if this could be fixed. I'm using tungstenite via wrap, which uses tokio and tokio-tungstenite under the hood. Tokio now has proper support for reading into uninitialized memory (https://tokio.rs/blog/2020-10-tokio-0-3) so this performance issue is definitely something that could be avoided IMHO.

@JakkuSakura
Copy link

JakkuSakura commented Jun 26, 2021

Here is my benchmark.
https://github.com/qiujiangkun/input_buffer/blob/master/benches/read_data.rs

throughput/input_buffer time:   [35.198 ms 37.608 ms 40.192 ms]                                    
                        thrpt:  [199.04 MiB/s 212.72 MiB/s 227.29 MiB/s]

throughput/extend_from_slice                                                                            
                        time:   [221.84 us 222.59 us 223.36 us]
                        thrpt:  [34.977 GiB/s 35.098 GiB/s 35.217 GiB/s]

throughput/with_capacity                                                                            
                        time:   [222.57 us 223.88 us 225.39 us]
                        thrpt:  [34.662 GiB/s 34.896 GiB/s 35.102 GiB/s]

throughput/with_capacity_unsafe                                                                            
                        time:   [165.58 us 166.01 us 166.50 us]
                        thrpt:  [46.921 GiB/s 47.060 GiB/s 47.182 GiB/s]

After removing quadratic zeroing hack, the performance improves about 160x.

throughput/input_buffer time:   [226.32 us 227.91 us 229.67 us]                                    
                        thrpt:  [34.016 GiB/s 34.278 GiB/s 34.520 GiB/s]

Why is it a quadratic zeroing?

If you have space of 10MiB and trying to read data of 1 MiB chunk with it.

You'd be:

  1. zeroing 10 MiB
  2. zeroing 9 MiB
  3. zeroing 8 MiB
  4. ...
  5. zeroing 1 MiB

In the end, you have zeroed 55 MiB, 4.5x times larger than the actual data you read.

@daniel-abramov
Copy link
Member

Not sure I got the idea behind "quadratic zeroing", but thanks for doing some benchmarks. So it seems that the changes introduced in above mentioned PR [while added some more safety] had a significant performance impact. The change was done for the safety purposes: input_buffer works with Read which recommends [but not guarantees] to not read any data from the provided buffer, it also points out that Read is not unsafe, meaning that in theory the implementer might try to read from the stream causing an UB. Whether that would happen in real scenario is another question though...

It would be great if this could be fixed. I'm using tungstenite via wrap, which uses tokio and tokio-tungstenite under the hood. Tokio now has proper support for reading into uninitialized memory (https://tokio.rs/blog/2020-10-tokio-0-3) so this performance issue is definitely something that could be avoided IMHO.

That's right, however the problem that we have is that tokio-tungstenite is not an independent version but merely a wrapper for tungstenite-rs which means that tokio-tungstenite has a transitive dependency of input_buffer, i.e. tungstenite-rs works with any "abstract stream" (any entity that implements the Read / Write). This has its flaws when it comes to the implementation of tokio-tungstenite that works with AsyncRead and AsyncWrite instead (they are not compatible with Read or Write), so tokio-tungstenite has a compatibility layer implemented to handle Tokio traits (that was discussed again recently).

I see 2 ways to solve this:

  1. Quick fix: revert the change introduced in Fix UB in DoRead::read_from() #3.
  2. Somewhat more complicated (arguably more correct though) would involve touching both crates, i.e. changing the API surface of tungstenite-rs making it independent from Read and Write that would eliminate the dependencies on Read and current input_buffer as such.

Related issues:

CC: @agalakhov @jxs

@JakkuSakura
Copy link

By quadratic zeroing, I mean we are zeroing buffer more than necessary. The larger the buffer, the more bytes we zero(grows quadratically). If we carefully track where we have zeroed, we can zero every byte exactly once and acheive high performance.

@njsmith
Copy link

njsmith commented Jun 28, 2021

Note that if you need to allocate a large zero-initialized buffer, then calling calloc is much more efficient than calling malloc+manually zeroing. Not sure how viable that is in this particular case, though it looks like Vec::from_elem does use it when possible: rust-lang/rust#38723

@daniel-abramov
Copy link
Member

daniel-abramov commented Jun 29, 2021

By quadratic zeroing, I mean we are zeroing buffer more than necessary. The larger the buffer, the more bytes we zero(grows quadratically). If we carefully track where we have zeroed, we can zero every byte exactly once and acheive high performance.

Are you sure we zero more than necessary? We only zero the buffer whenever we increase the capacity, i.e. if you read a big chunk of data from the stream, then each call to the read_from will essentially allocate new memory in the vector (increase its capacity) and zero the uninitialized memory, after which it's passed to the stream's Read::read.


Ok fellows, I tried to analyze the information that @qiujiangkun provided in benchmarks to understand what's going on with our performance and I think I found the culprit.

First thing that I did was fetching the benchmarks that @qiujiangkun provided and running them locally using the same buffer size (MIN_READ) for both input_buffer and extend_from_slice version.

And indeed the difference was quite significant:

throughput/input_buffer time:   [24.596 ms 24.627 ms 24.661 ms]
                        thrpt:  [324.40 MiB/s 324.84 MiB/s 325.26 MiB/s]

throughput/extend_from_slice
                        time:   [1.0731 ms 1.0809 ms 1.0897 ms]
                        thrpt:  [7.1694 GiB/s 7.2277 GiB/s 7.2801 GiB/s]

I was really surprised by such a huge difference given that we do the same things [algorithmically], i.e. in case of input_buffer reading a big message from the stream until the stream is ended. Essentially we do the following steps:

  1. Reserve additional capacity in the vector that would store the complete result.
  2. Initialize the uninitialized reserved memory at the end of the vector by zeroing it (i.e. by writing the reserved chunk).
  3. Call to the Read::read on the stream passing the slice to that reserved "free" memory at the end of the vector (i.e. stream will write data to the slice, returning the amount of bytes written that we then use to advance the length of the vector).

The extend_from_slice buffer does pretty much the same in terms of algorithm, though in a different order:

  1. Pass the fixed buffer to the stream's Read::read to make it write to the buffer, returning the amount of bytes written.
  2. Reserve the capacity in the vector that would store the complete result.
  3. Initialize the uninitialized reserved memory at the end of the vector by writing (copying) the bytes from the fixed buffer in the step (1).

I.e. in both cases we have a reservation of memory to store the final result + 2 write operations of the same size (considering we're using the same benchmark that @qiujiangkun provided, i.e. reading from a "mock" stream until the end).

Such huge difference in performance meant that the resulted compiled code differs heavily / optimized differently. I could sort of figure some performance benefit from having a fixed buffer of a known size on stack, but could not imagine such huge difference between them. Checking the reserve step did not make sense as it's identical in both cases, checking the writing (Read::read on stream) also did not make sense (well, the code of Read::read is not changing here), so I started checking the second writing step that is in our case zeroing the memory (the slowest part according to the benchmarks) vs the internal implementation of the Vec::extend_from_slice.

The function relies on spec_extend, and it seems like there are several of spec_extend implementations based on the type of the iterator passed to the function. There is an extend_desugared:

...
        while let Some(element) = iterator.next() {
            let len = self.len();
            if len == self.capacity() {
                let (lower, _) = iterator.size_hint();
                self.reserve(lower.saturating_add(1));
            }
            unsafe {
                ptr::write(self.as_mut_ptr().add(len), element);
                // NB can't overflow since we would have had to alloc the address space
                self.set_len(len + 1);
            }
        }
...

and the one which is default impl for the iterator that is TrustedLen (which I believe is chosen in our case since the size of the slice is known), this one is:

...
            self.reserve(additional);
            unsafe {
                let mut ptr = self.as_mut_ptr().add(self.len());
                let mut local_len = SetLenOnDrop::new(&mut self.len);
                iterator.for_each(move |element| {
                    ptr::write(ptr, element);
                    ptr = ptr.offset(1);
                    // NB can't overflow since we would have had to alloc the address space
                    local_len.increment_len(1);
                });
            }
...

The first big difference that I spotted is that it does not rely on write_byte(). The doc explicitly tells that the write_byte() will panic if index is out of range which in turn means that there is some boundary checking there (additional operation), apart from that I could not spot a significant difference.

So I prepared a version of the input_buffer that does not use write_byte(), but instead calls to std::ptr::write(data.as_mut_ptr(), 0) or data.as_mut_ptr().write(0). To my surprise it did not really bring significant performance benefits, only minor improvements. I decided to move the data.as_mut_ptr() outside of the loop, so that I only call add(i).write(0) inside the loop and tested it. This change effectively brought huge performance benefit:

[OLD UNSOUND IMPLEMENTATION]
Unsound implementation:
throughput/input_buffer time:   [1.1085 ms 1.1181 ms 1.1281 ms]
                        thrpt:  [6.9253 GiB/s 6.9871 GiB/s 7.0478 GiB/s]

[CURRENT SOUND IMPLEMENTATION]
throughput/input_buffer time:   [24.596 ms 24.627 ms 24.661 ms]
                        thrpt:  [324.40 MiB/s 324.84 MiB/s 325.26 MiB/s]

[NEW SOUND IMPLEMENTATION]
throughput/input_buffer time:   [1.3951 ms 1.4025 ms 1.4101 ms]
                        thrpt:  [5.5405 GiB/s 5.5706 GiB/s 5.6001 GiB/s]

I was quite curious about that since as_mut_ptr() does not do anything tricky except just casting the pointer type that I believe should be NOOP after the optimization, but it seems like the resulted code that compiler generates is still quite different, so the change is very significant.

Compared to our current implementation, the new implementation that does zeroing in a slightly different way is much more efficient reaching almost the same performance as our old unsound implementation that skipped the zeroing step. I created a PR to address this issue.


However there is an important thing that the one can spot immediately: our input_buffer implementation with all its "remove garbage" / "dynamic reservation" logic is still slower than a trivial version with extend_from_slice that just uses a single buffer (array of a fixed size) passed to the Read::read and then copies the result from the buffer to the vector that stores the whole message, i.e. it seems that most of the stuff that input_buffer does, not only does not bring any benefits, but also is slower than much simpler implementation that has much simpler logic without any unsafe. I went further and changed the input_buffer locally to use a fixed array buffer (initialized once when InputBuffer is created) of MIN_READ size to be the buffer that we pass to the Read::read and I removed the whole "remove garbage" and DoRead stuff from the implementation, making the API even simpler to tungstenite-rs (no need to pass an explicit hardcoded MIN_READ) and then benchmarked it again. It turned out that such new version is on-par with the extend_from_slice implementation and in fact even significantly faster:

[OLD UNSOUND IMPLEMENTATION]
Unsound implementation:
throughput/input_buffer time:   [1.1085 ms 1.1181 ms 1.1281 ms]
                        thrpt:  [6.9253 GiB/s 6.9871 GiB/s 7.0478 GiB/s]

[CURRENT SOUND IMPLEMENTATION]
throughput/input_buffer time:   [24.596 ms 24.627 ms 24.661 ms]
                        thrpt:  [324.40 MiB/s 324.84 MiB/s 325.26 MiB/s]

[NEW SOUND IMPLEMENTATION]
throughput/input_buffer time:   [1.3951 ms 1.4025 ms 1.4101 ms]
                        thrpt:  [5.5405 GiB/s 5.5706 GiB/s 5.6001 GiB/s]

[REFACTORED]
throughput/input_buffer time:   [704.69 us 712.01 us 720.42 us]
                        thrpt:  [10.844 GiB/s 10.973 GiB/s 11.086 GiB/s]

So it looks like this simplified version of input_buffer without unsafe (that is also easier to read and understand) easily outperforms even our initial old unsound implementation that avoided zeroing the memory.

Because of this, I have the following plan in mind:

  • Merge Fix slow zeroing of the buffer #9 if approved and release a new version of input_buffer that fixes performance issues.
  • Apply the new changes (simplification and refactoring) that speeds up our implementation even further while removing all unsafe code that we have.
  • Deprecate the input_buffer and move that buffer code (quite tiny module) inside tungstenite-rs. This decision is motivated by the fact that currently there are only 2 crates that use input_buffer, one of them (from which most downloads come from) is the tungstenite-rs. So having input_buffer as separate repository does not make sense as any update in it entails updating the tungstenite-rs, in addition we don't have any decent CI for input_buffer in this repository and investing into one just to have one tiny module that hosts trivial logic in a separate repository does not seems to make any sense. This would also allow to easier decouple our Read implementation and AsyncRead implementation that would be required for the tokio-tungstenite version to combat some limitation and speed it up even further (AsyncRead allows passing uninitialized buffer on the API level eliminating the need for the zeroing step).

What do you guys think?

CC: @agalakhov @jxs @strohel @sdroege

@JakkuSakura
Copy link

That's a throughout investigation, especially the safe, sound, speedy solution. I'm happy to see tungstenite geting faster. Though I'm interested in using refactored input_buffer in my own project.(Copy-pasting is also fine)

daniel-abramov added a commit to snapview/tungstenite-rs that referenced this issue Jul 5, 2021
We're also deprecating the usage of `input_buffer` crate, see:
snapview/input_buffer#6 (comment)
daniel-abramov added a commit to snapview/tungstenite-rs that referenced this issue Jul 5, 2021
We're also deprecating the usage of `input_buffer` crate, see:
snapview/input_buffer#6 (comment)
a-miyashita pushed a commit to givery-technology/tungstenite-rs that referenced this issue Jul 20, 2023
We're also deprecating the usage of `input_buffer` crate, see:
snapview/input_buffer#6 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants