
zstd performance and Reader/Writer reuse #248

Closed · vmihailenco opened this issue Mar 26, 2020 · 10 comments · Fixed by #249

@vmihailenco commented Mar 26, 2020

I have a benchmark that produces the following results:

```
BenchmarkCZstd-12      	     664	   1793805 ns/op	 125.62 MB/s	  522419 B/op	       9 allocs/op
BenchmarkGoZstd-12    	     374	   3247447 ns/op	  69.39 MB/s	 1646825 B/op	      12 allocs/op
```

That is roughly two times slower (and results can be worse on smaller payloads), and the allocations are rather large (1,646,825 B to encode & decode 225,339 bytes?). I know that I can achieve better results by using EncodeAll and DecodeAll, but I would like to use Encoder/Decoder as wrappers over bytes.Buffer. So my question is: am I doing anything wrong here with Encoder/Decoder? I've disabled CRC for a fairer comparison based on our previous discussion; I plan to enable it in production.
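For reference, a minimal sketch of what such a streaming benchmark might look like (hypothetical names; `payload` stands in for the ~225 KB test data, and the Reset-based reuse pattern is an assumption about the setup):

```go
package zstdbench

import (
	"bytes"
	"io/ioutil"
	"testing"

	"github.com/klauspost/compress/zstd"
)

var payload []byte // ~225 KB of test data (not shown)

func BenchmarkGoZstd(b *testing.B) {
	enc, _ := zstd.NewWriter(nil, zstd.WithEncoderCRC(false))
	dec, _ := zstd.NewReader(nil)
	b.SetBytes(int64(len(payload)))
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		var buf bytes.Buffer
		enc.Reset(&buf)    // reuse the encoder, writing into a bytes.Buffer
		enc.Write(payload) // a single Write...
		enc.Close()        // ...followed by Close
		if err := dec.Reset(&buf); err != nil {
			b.Fatal(err)
		}
		if _, err := ioutil.ReadAll(dec); err != nil { // decode back through the Reader
			b.Fatal(err)
		}
	}
}
```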

@klauspost (Owner)

Yes. There is some special handling when you decode a *bytes.Buffer, and it does indeed allocate more than it should. I will send a PR shortly.

As you note, the streaming encoders/decoders are not meant for payloads this small.

Also, if you want to eliminate more allocations, you can copy the data to a hash function (or similar) so you avoid the ioutil.ReadAll.
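A sketch of that suggestion, assuming a reused *zstd.Decoder and a compressed buffer: streaming the decoded output into a hash (sha256 here, purely as an example sink) skips the output-buffer allocation that ioutil.ReadAll makes.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"io"

	"github.com/klauspost/compress/zstd"
)

// hashDecoded decodes buf's contents and feeds them straight into a hash,
// so no decompressed copy is ever materialized.
func hashDecoded(dec *zstd.Decoder, buf *bytes.Buffer) ([]byte, error) {
	if err := dec.Reset(buf); err != nil {
		return nil, err
	}
	h := sha256.New()
	if _, err := io.Copy(h, dec); err != nil {
		return nil, err
	}
	return h.Sum(nil), nil
}
```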

@klauspost (Owner)

Thanks for the detailed report 👍

@vmihailenco (Author)

Thanks - now it looks like this:

```
BenchmarkCZstd-12      	     628	   1857834 ns/op	 121.29 MB/s	  522429 B/op	       9 allocs/op
BenchmarkGoZstd-12     	     423	   3165288 ns/op	  71.19 MB/s	  873224 B/op	      11 allocs/op
BenchmarkGoZstd2-12    	     523	   2133557 ns/op	 105.62 MB/s	  583841 B/op	       2 allocs/op
```

What still looks strange is that the EncodeAll/DecodeAll version is faster and allocates less than the one using Write/Read. I guess there is a good reason for that, but from a user's POV I don't understand why there is such a big difference; after all, Encoder.Write can easily be a wrapper over EncodeAll.
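The EncodeAll/DecodeAll pattern that BenchmarkGoZstd2 presumably uses looks something like this (a sketch; the actual benchmark body is not shown in the thread). Reusing one Encoder/Decoder pair and recycling the destination slices is what gets it down to 2 allocs/op:

```go
package main

import "github.com/klauspost/compress/zstd"

// roundTrip compresses src and decompresses it again, reusing both the
// Encoder/Decoder pair and the destination slices across calls.
func roundTrip(enc *zstd.Encoder, dec *zstd.Decoder, src, cBuf, dBuf []byte) ([]byte, []byte, error) {
	cBuf = enc.EncodeAll(src, cBuf[:0]) // the full input is known up front
	dBuf, err := dec.DecodeAll(cBuf, dBuf[:0])
	return cBuf, dBuf, err
}
```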

@klauspost (Owner)

> after all, Encoder.Write can easily be a wrapper over EncodeAll.

No. With EncodeAll you always know how much you are compressing. Using a stream, you don't know whether the user will write more, so you need to buffer the input and cannot start compressing until the stream is closed.

With longer streams you can start compressing once you have enough input for a block.
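In other words (an illustrative sketch, not library internals): from the encoder's point of view, the two Writes below are indistinguishable from a single-shot payload until Close arrives.

```go
package main

import (
	"bytes"

	"github.com/klauspost/compress/zstd"
)

// writeTwice shows why a streaming encoder must buffer: at the time of
// the first Write it cannot know whether a second one is coming.
func writeTwice(part1, part2 []byte) (*bytes.Buffer, error) {
	var out bytes.Buffer
	w, err := zstd.NewWriter(&out)
	if err != nil {
		return nil, err
	}
	w.Write(part1)         // might be the whole payload...
	w.Write(part2)         // ...or more may still be coming
	return &out, w.Close() // only now is the input known to be complete
}
```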

@vmihailenco (Author) commented Mar 27, 2020

There is no buffering in my benchmark; I always write all the data with one Write call, so it is one Write plus one Close call. Even if you have to buffer that data in 1 MB blocks and delay compression, it should not be 30% slower. There is something else that makes the Read/Write version significantly slower, something that takes an extra ~1e6 nanoseconds (about 1 ms).

@klauspost (Owner)

@vmihailenco But there is no way to know that you will not be writing more until you close.

I did add another 'short circuit' though. PR in a minute.

klauspost reopened this Mar 27, 2020
klauspost added a commit that referenced this issue Mar 27, 2020
If writing the header (i.e. first block) and it is the final block, use block compression instead of async.

Addition for #248
@vmihailenco (Author)

Thanks for your patience and efforts 👍

My totally uneducated guess, based on the code I've read, is that you split the stream into blocks and each block is handled by a separate goroutine, plus it looks like you have separate encoder goroutines that encode the blocks coming from the block goroutines. Synchronizing all that is expensive. But it is very likely I got all that wrong; sorry if that is the case.

@klauspost (Owner)

> My totally uneducated guess, based on the code I've read, is that you split the stream into blocks and each block is handled by a separate goroutine, plus it looks like you have separate encoder goroutines that encode the blocks coming from the block goroutines. Synchronizing all that is expensive. But it is very likely I got all that wrong; sorry if that is the case.

Exactly. In streaming mode we collect input until we have enough for the block size we want, 64 KB by default.

When there is enough for a block, we start compressing it in a separate goroutine. This first compresses the block into sequences, which are then handed to another goroutine that encodes the block output. Once the block is handed over, the next block can start.

This means that long streams are quite fast, but small payloads of course suffer from the overhead of having to hand over the data. #251 simply skips all the goroutines and compresses at once when we know there is only a single block to compress (final == true and no frame header written yet).
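A toy model of that hand-off (illustrative only; this is not the library's actual code, just the two-stage channel pattern described above, and in real zstd the hand-off carries sequences, not raw bytes):

```go
package main

// pipeline models the two stages described above: stage 1 stands in for
// per-block match finding, stage 2 for entropy coding and output. The
// channel hand-off between them is the synchronization cost that small,
// single-block inputs pay for nothing.
func pipeline(blocks <-chan []byte, done chan<- int) {
	handoff := make(chan []byte, 1)
	go func() { // stage 1: "compress into sequences"
		for b := range blocks {
			handoff <- b
		}
		close(handoff)
	}()
	go func() { // stage 2: "encode the block output"
		total := 0
		for b := range handoff {
			total += len(b) // placeholder for real encoding work
		}
		done <- total
	}()
}
```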

@vmihailenco (Author) commented Mar 27, 2020

Thanks for confirming; it all makes sense. I am not sure it is such a good idea in the case of an HTTP server (many concurrent clients, each wanting to compress a stream), but I definitely see how it can make a lot of sense if you need to compress one (or a few) large streams.

I understand it is a matter of how much time you are willing to invest (and in my case also a lack of knowledge), but here are some ideas for improving the situation for the HTTP web server use case:

  1. completely disable "multi-threading" in the case of zstd.WithEncoderConcurrency(1) (see the sketch after this list)
  2. rework the API
    • ideally make it more like the snappy/s2 API, where NewReader/NewWriter are just wrappers over Encode/Decode with some buffering; multi-threading would be exposed separately as Encoder and Decoder
    • or keep it as is but expose more of the internal guts so NewReader/NewWriter can be implemented without multi-threading in user space, e.g. EncodeBlock(block []byte, final bool)
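For idea 1, the option itself already exists; the request is that a concurrency of one should skip the goroutine pipeline entirely. At the call site it would look like this (a sketch; `responseWriter` is a stand-in for an HTTP handler's writer):

```go
package main

import (
	"io"

	"github.com/klauspost/compress/zstd"
)

// newSyncWriter asks for a single-goroutine encoder. Whether this fully
// bypasses the internal goroutines is exactly what idea 1 proposes.
func newSyncWriter(responseWriter io.Writer) (*zstd.Encoder, error) {
	return zstd.NewWriter(responseWriter, zstd.WithEncoderConcurrency(1))
}
```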

@klauspost (Owner)

This will be addressed when I add full concurrent compression.

S2 has Writer.EncodeBuffer.
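For reference, a minimal usage sketch of that s2 API (EncodeBuffer hands the whole buffer to the writer at once; the caller must not modify buf until Close or Flush returns):

```go
package main

import (
	"io"

	"github.com/klauspost/compress/s2"
)

// compressBuffer writes buf to dst as an s2 stream in one call.
func compressBuffer(dst io.Writer, buf []byte) error {
	w := s2.NewWriter(dst)
	if err := w.EncodeBuffer(buf); err != nil {
		w.Close()
		return err
	}
	return w.Close()
}
```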

klauspost added a commit that referenced this issue Mar 27, 2020
* zstd: If first block and 'final', encode direct.

If writing the header (i.e. first block) and it is the final block, use block compression instead of async.

Addition for #248
klauspost added a commit that referenced this issue Sep 1, 2021
klauspost added a commit that referenced this issue Sep 1, 2021
klauspost reopened this Sep 1, 2021