
Consider adding support for libdeflate #171

Closed
jrmuizel opened this issue Dec 2, 2015 · 18 comments

@jrmuizel

jrmuizel commented Dec 2, 2015

These alternative implementations of DEFLATE should help the format reach different parts of the compression speed/ratio spectrum:

SLZ (http://git.1wt.eu/git/libslz.git/) claims to be a much faster compressor
zopfli (https://github.com/google/zopfli) tries to push deflate to its compression ratio limit
libdeflate (https://github.com/ebiggers/libdeflate) claims faster compression and decompression

@nemequ
Member

nemequ commented Dec 2, 2015

Thanks for the links! I didn't know about SLZ or libdeflate.

There is already an issue for zopfli: #3. For zlib-compatible compressors there is also the zlib-ng plugin, and there is an issue (#100) for igzip.

Zopfli and SLZ are both compression-only libraries (as is igzip, FWIW), which Squash doesn't currently support. I do plan to add support for such libraries eventually, but it will require some thought about how things will work internally.

libdeflate, OTOH, would be very easy to implement a plugin for. AFAICT it only supports all-in-one compression; there is no streaming interface. That makes it less powerful but easier to write a plugin for (if you're interested, the pithy plugin would be a decent example to follow).

I don't mean to seem dismissive, but I want to be transparent: I probably will not get to this any time soon. It's just not a priority at the moment (see the 0.8 milestone). If you would like to have a go at putting together a plugin for libdeflate (or work on support for unidirectional codecs) I'd be happy to answer any questions you might have, it's just probably not a good idea to wait for me to implement these plugins.

I'm going to edit this issue to be just for libdeflate, and open up a separate issue for SLZ so it's easier to keep track of everything.

@nemequ nemequ changed the title Consider adding support for SLZ, libdeflate and zopfli Consider adding support for libdeflate Dec 2, 2015
@jrmuizel
Author

jrmuizel commented Dec 3, 2015

I had a look at doing this, and one issue I ran into was the lack of any useful functions in libdeflate for implementing get_uncompressed_size and get_max_compressed_size.

@nemequ
Member

nemequ commented Dec 3, 2015

Thanks for looking into this.

You don't actually need to implement get_uncompressed_size; most codecs don't support that. Just don't set the get_uncompressed_size field in the codec initialization and you should be good. get_max_compressed_size, OTOH, is necessary, and libraries not providing anything for it is a surprisingly common occurrence.

Normally what I do is create a function that returns a very safe value (something like (uncompressed_size * 2) + 4096), then modify the random-data test case (see tests/random-data.c:35-36) to tell me how much data it really needed (you can run it with something like SQUASH_PLUGINS=plugins ./tests/random-data --verbose libdeflate:gzip). From there it's pretty easy to come up with a more accurate version, which is usually in the form of uncompressed_size + (uncompressed_size / X) + Y. Y is typically the number of additional bytes required to compress a single byte.
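
For example, a first-pass version might look something like this (just a sketch with an illustrative name, not the actual plugin code):

#include <stddef.h>

/* Deliberately conservative starting point; refine it after checking the
 * random-data test output, typically toward something of the form
 * uncompressed_size + (uncompressed_size / X) + Y. */
static size_t
libdeflate_max_compressed_size (size_t uncompressed_size)
{
  return (uncompressed_size * 2) + 4096;
}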

There is an improved version of that test in the wip/random-data branch (note the additional comment at the beginning of the file); I'm just waiting for a bug fix from ms-compress before I move it into master, but it should work as a drop-in replacement for random-data.c in current master. It would probably be a good idea to check with that version, too.

The other thing you can do is contact the author of the library and point out that it is missing such a function. In my experience the maintainers recognize the usefulness of such a function and are willing to add one (see richgel999/lzham_codec_devel#5 and google/brotli#274 for some recently-filed examples).

Of course you can also do both so we don't have to wait for the upstream library to add the function in order to get the plugin in Squash.

@jrmuizel
Author

jrmuizel commented Dec 3, 2015

I've filed an issue (ebiggers/libdeflate#1) and will look at implementing the plugin without that function, using the strategy you suggested.

@jrmuizel
Author

jrmuizel commented Dec 5, 2015

What size is the decompressed_size parameter to decompress_buffer? deflate_decompress requires passing in the exact size of the decompressed data.

@jibsen
Collaborator

jibsen commented Dec 5, 2015

What size is the decompressed_size parameter to decompress_buffer? deflate_decompress requires passing in the exact size of the decompressed data.

I had to work around this same requirement when adding the BriefLZ plugin (#123). I ended up storing the decompressed size as a varint before the compressed data. But of course that would mean the compressed data is no longer entirely deflate.
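
For illustration, one way to do such a prefix is a LEB128-style varint, roughly like this (a sketch, not the BriefLZ plugin's actual code):

#include <stddef.h>
#include <stdint.h>

/* Write 'size' as a little-endian base-128 varint; returns the number of
 * prefix bytes emitted before the compressed data. */
static size_t
write_size_varint (uint8_t *out, uint64_t size)
{
  size_t n = 0;
  while (size >= 0x80) {
    out[n++] = (uint8_t) ((size & 0x7F) | 0x80);
    size >>= 7;
  }
  out[n++] = (uint8_t) size;
  return n;
}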

@nemequ
Member

nemequ commented Dec 5, 2015

Other than BriefLZ, the only time I remember encountering that particular limitation was with the FARI plugin, and after I brought the issue up the author decided to add an end-of-stream marker to get rid of it.

That's a pretty odd limitation for a deflate/zlib/gzip implementation… for one thing it makes interoperating with zlib (and other implementations) impossible in lots of situations. For example, the only way to decompress a random *.gz file with libdeflate, AFAICT, would be to first decompress it with another implementation to get the decompressed size, then decompress it again with libdeflate.

I have a hard time believing that limitation is supposed to be permanent. It sounds more like the kind of thing someone might do as a shortcut during development, then come back and fix later… @ebiggers, any comment?

I suppose it would be possible to add a flag to the codec info to inform the user that an exact decompressed size is required… The benchmark could be rewritten with a different code path for such plugins. OTOH, I can't see a reasonable way to handle it in the CLI. Basically, anything using the streaming or splicing APIs would have to add extra code to prevent such codecs from being used. Given the rarity of such codecs and the burden it would place on people using the Squash API, I'm disinclined to make such a change…

The only other possibility I can think of would be to have Squash automatically add a small wrapper around certain codecs which just prepends the decompressed size to the output of certain codecs (basically, move the code from the BriefLZ plugin into Squash) but, like Jørgen mentioned, that would break compatibility; you wouldn't be allowed to call the codecs implemented by the libdeflate plugin "deflate", "zlib", and "gzip", and they wouldn't be interoperable with existing deflate/zlib/gzip implementations.

@ebiggers

ebiggers commented Dec 5, 2015

Right now, libdeflate is intended only for block-based compression where the decompressed size is stored elsewhere or is known based on the context. This is useful for transparent filesystem compression and for some types of archivers. It's not currently intended to be used to decompress a random .gz file.

This could change in the future (I am aware that DEFLATE streams include a special end marker).

@nemequ
Member

nemequ commented Dec 5, 2015

It's not currently intended to be used to decompress a random .gz file.

In that case, why bother with support for the gzip format? For things like transparent filesystem compression people should probably be using deflate; I don't even think there are good reasons to use zlib (in which case why bother supporting it, either?).

This could change in the future (I am aware that DEFLATE streams include a special end marker).

That would be nice; if you don't mind I'll file an issue in the libdeflate tracker to track it.

So, assuming this doesn't change in the immediate future, the question is what to do with a potential libdeflate plugin. I'm leaning towards adding a flag to the codec info to request that Squash generate a wrapper around the compressed data, and stealing the wrapper code from the BriefLZ plugin. Does anyone have any thoughts on the matter before I proceed?

@ebiggers

ebiggers commented Dec 5, 2015

I completely agree that raw DEFLATE should be used in applications where the zlib or gzip header is unimportant. But the reality is that people are often confused about this and use zlib or gzip instead of raw DEFLATE. For example: all the filesystems included with the Linux kernel that support transparent DEFLATE-based compression (btrfs, cramfs, isofs, jffs2, logfs, squashfs, ubifs) actually use zlib streams, as far as I can see. So I don't see anything wrong with supporting zlib and gzip streams with block-based compression.

Thinking about it a little more, is "unknown output size" support very useful without streaming support? If you don't have streaming support, you have to allocate all output space ahead of time, and you can't do that without knowing the output size. I suppose you can allocate a buffer of some arbitrary size and keep retrying with larger buffers until there is enough space, but that's inefficient and should be replaced with either true streaming support or a known output size.
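
In code, the retry approach I mean is roughly this (purely illustrative, with a hypothetical codec callback):

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical codec call: returns 0 on success, -1 if 'out' was too small. */
typedef int (*decompress_fn) (void *out, size_t *out_len,
                              const void *in, size_t in_len);

static void *
decompress_unknown_size (decompress_fn decompress,
                         const void *in, size_t in_len, size_t *out_len)
{
  size_t cap = (in_len * 4) + 4096;  /* initial guess */
  for (;;) {
    void *out = malloc (cap);
    if (out == NULL)
      return NULL;
    size_t len = cap;
    if (decompress (out, &len, in, in_len) == 0) {
      *out_len = len;
      return out;
    }
    free (out);
    if (cap > SIZE_MAX / 2)
      return NULL;   /* give up instead of overflowing */
    cap *= 2;        /* too small: retry with a larger buffer */
  }
}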

@nemequ
Member

nemequ commented Dec 5, 2015

I completely agree that raw DEFLATE should be used in applications where the zlib or gzip header is unimportant. But the reality is that people are often confused about this and use zlib or gzip instead of raw DEFLATE. For example: all the filesystems included with the Linux kernel that support transparent DEFLATE-based compression (btrfs, cramfs, isofs, jffs2, logfs, squashfs, ubifs) actually use zlib streams, as far as I can see. So I don't see anything wrong with supporting zlib and gzip streams with block-based compression.

That's unfortunate. I'm not surprised it happens sometimes, but I am surprised (and disappointed) that it seems to be the norm.

Thinking about it a little more, is "unknown output size" support very useful without streaming support? If you don't have streaming support, you have to allocate all output space ahead of time, and you can't do that without knowing the output size. I suppose you can allocate a buffer of some arbitrary size and keep retrying with larger buffers until there is enough space, but that's inefficient and should be replaced with either true streaming support or a known output size.

That's what Squash will do to emulate streaming with codecs which don't support streaming and don't include the decompressed size. It's definitely sub-optimal, though it works surprisingly well in practice (all the files in the benchmark get it on the first try with all codecs), especially on 64-bit platforms. From Squash's perspective, the problem is that our only real choices are:

  • Don't support emulating streaming with codecs which don't support streaming natively, which places a burden on people using the API.
  • Add a wrapper for codecs which don't encode the decompressed size in the compressed data, which breaks compatibility with other software.
  • Require plugins be capable of decompressing to buffers of any size.

I wrestled with this pretty early on, but given how rare the limitation is in compression libraries the third option made the most sense.

I think one of the main reasons it hasn't been a problem is that codecs have to deal with untrusted (possibly corrupted, possibly malicious) input gracefully, so they need some bounds checking regardless. Instead of basing their loop on the amount of output expected and adding an additional check to make sure it doesn't go outside of the input buffer, all they have to do is base it on the amount of input available and add a check to make sure it doesn't go outside of the output buffer. Or perhaps I'm missing something?

@ebiggers

ebiggers commented Dec 5, 2015

Are you saying that when squash emulates streaming, it collects the entire input into a single buffer before calling the codec? I suppose you have no choice if you don't want to add your own framing format, but I don't think that's a proper substitute for the codec supporting streaming itself, and this type of stream emulation probably shouldn't be used in "real" code. Suppose that I stream 10 GB of data to squash using some codec X; is there a way to know when it's doing real streaming as opposed to doing an emulation that requires 10 GB of memory?

Codecs do need to deal with untrusted input. There ordinarily need to be two checks during the decompression loop: one for whether the end of the compressed data buffer has been reached, and one for whether the end of the uncompressed data buffer has been reached. Ordinarily, people will exit the decompression loop in both of these cases. The style I've been using in my decompressors is to just start reading zeroes when the end of the compressed data is reached. Then, you only need one condition for actually exiting the decompression loop: all the uncompressed data you wanted has been decompressed. That makes the code a bit more concise and, probably, a tiny bit faster. The disadvantage is that it can "hide" corruptions, but often the uncompressed data is checksummed anyway so it is irrelevant as corruptions will be detected that way.

I will be keeping alternate approaches in mind, of course. What I'd like to do is make a single codec that works optimally for both buffer-based and stream-based compression. Usually people just implement a buffer-based API on top of a stream-based API and take a performance hit (zlib does this, for example).
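
For reference, "buffer-based on top of stream-based" with zlib's own API looks roughly like this (a sketch of what uncompress() boils down to, not zlib's actual code):

#include <string.h>
#include <zlib.h>

static int
buffer_decompress_via_stream (unsigned char *out, size_t *out_len,
                              const unsigned char *in, size_t in_len)
{
  z_stream strm;
  int ret;

  memset (&strm, 0, sizeof strm);   /* default allocators (zalloc/zfree = NULL) */
  if (inflateInit (&strm) != Z_OK)
    return -1;

  strm.next_in = (Bytef *) in;
  strm.avail_in = (uInt) in_len;
  strm.next_out = out;
  strm.avail_out = (uInt) *out_len;

  ret = inflate (&strm, Z_FINISH);  /* one-shot decompression of a zlib stream */
  *out_len = strm.total_out;
  inflateEnd (&strm);

  return (ret == Z_STREAM_END) ? 0 : -1;
}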

@nemequ
Member

nemequ commented Dec 5, 2015

Are you saying that when squash emulates streaming, it collects the entire input into a single buffer before calling the codec? I suppose you have no choice if you don't want to add your own framing format, but I don't think that's a proper substitute for the codec supporting streaming itself, and this type of stream emulation probably shouldn't be used in "real" code.

Yes. You're right, it's definitely sub-optimal, but there simply isn't an alternative. A framing format isn't an option since we need to remain interoperable. As for using it in real code, as I mentioned it works surprisingly well in practice; from a performance perspective the default values allocate enough room that a second attempt is rarely necessary, and allocating large enough chunks of memory isn't really an issue, especially on 64-bit platforms.

Suppose that I stream 10 GB of data to squash using some codec X; is there a way to know when it's doing real streaming as opposed to doing an emulation that requires 10 GB of memory?

Yes, there is a SQUASH_CODEC_INFO_NATIVE_STREAMING flag which can be used to check that, and implementations are certainly free to use that to only allow codecs which support streaming. For many cases, however, it's nice to just be able to write code once and have it work as expected even for codecs which don't natively support streaming.
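
In other words, something along these lines (the header path and squash_codec_get_info() accessor name here should be treated as assumptions; the flag is the one just mentioned):

#include <stdbool.h>
#include <squash/squash.h>

static bool
codec_streams_natively (SquashCodec* codec)
{
  return (squash_codec_get_info (codec) & SQUASH_CODEC_INFO_NATIVE_STREAMING) != 0;
}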

The style I've been using in my decompressors is to just start reading zeroes when the end of the compressed data is reached. Then, you only need one condition for actually exiting the decompression loop: all the uncompressed data you wanted has been decompressed. That makes the code a bit more concise and, probably, a tiny bit faster.

I'd be surprised if replacing the input with zeros was any faster; you would still need to test to make sure you don't overrun the buffer. Handling what happens when you do shouldn't matter much since it shouldn't happen often, but even if it did I suspect it would be faster to just break out of the loop, return from the function, goto a label, etc., than processing zeroes. There would be a small gain, though, from not having to read excess bytes at the end of the stream.

One interesting consequence of your approach is what happens when the length is controlled by an attacker; they could easily set it to SIZE_MAX (or INT_MAX, or whatever type you use for the length argument). Obviously they could also provide you with however much data they want, but the problem here is the highly asymmetrical nature of the attack; with a few bytes of data they can cause your CPU to try to chew through terabytes of data (all zeros, but still, you have to process it). Most codecs would be able to return quickly, either successfully or with an error code (depending on whether all the input was decoded or there was an error).

@nemequ
Member

nemequ commented Dec 6, 2015

Suppose that I stream 10 GB of data to squash using some codec X; is there a way to know when it's doing real streaming as opposed to doing an emulation that requires 10 GB of memory?

Also, it's worth mentioning that, if you're using the splicing API (decompressing from one FILE* to another), Squash will actually mmap both the input and output if it can. So assuming you have the address space, decompressing a 10 GB file isn't actually a big deal.

@ebiggers

ebiggers commented Dec 6, 2015

I'd be surprised if replacing the input with zeros was any faster; you would still need to test to make sure you don't overrun the buffer. Handling what happens when you do shouldn't matter much since it shouldn't happen often, but even if it did I suspect it would be faster to just break out of the loop, return from the function, goto a label, etc., than processing zeroes. There would be a small gain, though, from not having to read excess bytes at the end of the stream.

It's not quite that simple, actually. In any compression format that uses Huffman codes, you don't know ahead of time how many bits you are going to need to decode a symbol. In DEFLATE, for example, the maximum number of bits in which a match might be encoded is 48. This occurs if a match is encoded with maximum-length codewords and the maximum number of "extra" bits.

On 64-bit platforms, my code in libdeflate will always ensure that there are 48 bits ready before attempting to decode a match; this is the fastest way I found to run the decompressor. Since this may be more than the number of bits actually remaining in the input buffer, this may require "adding" zeroes. This is quite easy to do since the variable containing the "ready" bits will already contain zeroes in the unfilled space (there is no need to explicitly clear bits or shift in zeroes).
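
Conceptually, the refill step looks something like this (a simplified sketch, not libdeflate's actual code):

#include <stddef.h>
#include <stdint.h>

struct bitreader {
  const uint8_t *in;      /* next input byte */
  const uint8_t *in_end;  /* end of the compressed data */
  uint64_t bitbuf;        /* pending bits, LSB-first */
  unsigned bitsleft;      /* number of valid bits in bitbuf */
};

/* Ensure at least 48 bits are available before decoding a match. Once the
 * input is exhausted, zero bytes are (implicitly) shifted in, so the decode
 * loop needs no separate end-of-input branch. */
static void
refill_bits (struct bitreader *br)
{
  while (br->bitsleft < 48) {
    uint64_t next = (br->in < br->in_end) ? *br->in++ : 0;
    br->bitbuf |= next << br->bitsleft;
    br->bitsleft += 8;
  }
}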

So I think that a check for true exhaustion of the compressed data would have to be a separate, entirely new check. Of course, such a check could simply be added; but the goal of my DEFLATE implementation is to be as fast as possible, so every branch in the inner loop counts.

zlib has a slightly different strategy which I've been considering: most of the time it runs a "fast" loop, but when it nears the end of either the input or output buffer it goes into a "slow" loop, where it can be more careful about exactly how many bits are remaining.

One interesting consequence of your approach is what happens when the length is controlled by an attacker; they could easily set it to SIZE_MAX (or INT_MAX, or whatever type you use for the length argument).

An attacker capable of doing such a thing can already (1) cause the application to allocate an arbitrarily large buffer, and (2) provide compressed data with the maximum possible compression ratio, forcing the application to fill the buffer even when providing only a small amount of compressed data. So I don't think my implementation strategy makes much of a difference here. Also, in the applications I have mostly been thinking of, such as filesystem compression, the uncompressed size tends to be carefully controlled; e.g., a file might be divided into fixed-size chunks. Of course, that won't be true of every application.

@ebiggers

Update:

The decompression functions in libdeflate have been updated to optionally provide the actual uncompressed size:

bool
deflate_decompress(struct deflate_decompressor *decompressor,
                   const void *in, size_t in_nbytes,
                   void *out, size_t out_nbytes_avail,
                   size_t *actual_out_nbytes_ret);

This turned out to be easier than I had thought because earlier I was conflating a few different things. In DEFLATE you know the stream has ended when you have finished a block with the BFINAL flag set, so you can just compute the actual uncompressed size at the end without affecting the main decompression loop.

Of course, for the reasons I mentioned earlier it's preferable, whenever possible, not to have to guess the uncompressed size.

@jrmuizel
Author

I have a working implementation of a libdeflate plugin here:
https://github.com/jrmuizel/squash

@nemequ
Member

nemequ commented Jan 30, 2016

PR merged as c2ebcf6, thanks @jrmuizel :)
