Consider adding support for libdeflate #171
Thanks for the links! I didn't know about SLZ or libdeflate. There is already an issue for zopfli: #3. For zlib-compatible compressors there is also the zlib-ng plugin, and there is an issue (#100) for igzip.

Zopfli and SLZ are both compression-only libraries (as is igzip, FWIW), which Squash doesn't currently support. I do plan to add support for such libraries eventually, but it will require some thought about how things will work internally.

libdeflate, OTOH, would be very easy to implement a plugin for. AFAICT it only supports all-in-one compression, there is no streaming interface. That makes it less powerful but easier to implement a plugin for (if you're interested, the pithy plugin would be a decent example to follow).

I don't mean to seem dismissive, but I want to be transparent: I probably will not get to this any time soon. It's just not a priority at the moment (see the 0.8 milestone). If you would like to have a go at putting together a plugin for libdeflate (or work on support for unidirectional codecs) I'd be happy to answer any questions you might have, it's just probably not a good idea to wait for me to implement these plugins.

I'm going to edit this issue to be just for libdeflate, and open up a separate issue for SLZ so it's easier to keep track of everything.
I had a look at doing this, and one issue that I ran into was the lack of any useful functions in libdeflate for implementing get_uncompressed_size and get_max_compressed_size.
Thanks for looking into this. You don't actually need to implement get_uncompressed_size. For get_max_compressed_size, what I normally do is create a function that returns a very safe (deliberately generous) value, then verify it against the random-data test.

There is an improved version of that test in the wip/random-data branch (note the additional comment at the beginning of the file); I'm just waiting for a bug fix from ms-compress before I move it into master, but it should work as a drop-in replacement for random-data.c in current master. It would probably be a good idea to check with that version, too.

The other thing you can do is contact the author of the library and point out that it is missing such a function. In my experience maintainers recognize the usefulness of such a function and are willing to add one (see richgel999/lzham_codec_devel#5 and google/brotli#274 for some recently-filed examples).

Of course, you can also do both, so we don't have to wait for the upstream library to add the function in order to get the plugin into Squash.
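A "very safe value" of the kind described above could be sketched like this (hypothetical function name, and the constants are illustrative, not Squash's actual defaults): deflate can always fall back to stored blocks, which add 5 bytes of overhead per 65535-byte block, so any bound at least that large cannot be exceeded.

```c
#include <stddef.h>

/* Hypothetical sketch of a deliberately generous worst-case bound for
 * deflate output, for use when the library provides no equivalent of
 * zlib's compressBound().  Stored blocks cost 5 bytes per 65535-byte
 * chunk; the extra 18 bytes leave slack for a gzip header/trailer. */
static size_t my_deflate_max_compressed_size(size_t uncompressed_size) {
    return uncompressed_size + (uncompressed_size / 65535 + 1) * 5 + 18;
}
```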
I've filed an issue upstream (ebiggers/libdeflate#1) and will look at implementing the plugin without it, using the strategy that you suggested.
What size is the decompressed_size parameter to decompress_buffer? deflate_decompress requires passing in the exact size of the decompressed data. |
I had to work around this same requirement when adding the BriefLZ plugin (#123). I ended up storing the decompressed size as a varint before the compressed data. But of course that would mean the compressed data is no longer entirely deflate.
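The varint-prefix approach described above could be sketched as follows (hypothetical helpers, not the actual BriefLZ plugin code): the decompressed size is stored LEB128-style, 7 bits per byte with a continuation bit, before the compressed payload.

```c
#include <stdint.h>
#include <stddef.h>

/* Encode 'value' as a base-128 varint; returns bytes written (<= 10
 * for a 64-bit value). */
static size_t varint_encode(uint64_t value, unsigned char *out) {
    size_t n = 0;
    do {
        unsigned char b = value & 0x7f;
        value >>= 7;
        if (value != 0)
            b |= 0x80;          /* continuation bit: more bytes follow */
        out[n++] = b;
    } while (value != 0);
    return n;
}

/* Decode a varint from 'in'; returns bytes consumed, or 0 on
 * malformed/truncated input. */
static size_t varint_decode(const unsigned char *in, size_t in_len,
                            uint64_t *value) {
    uint64_t v = 0;
    for (size_t n = 0; n < in_len && n < 10; n++) {
        v |= (uint64_t)(in[n] & 0x7f) << (7 * n);
        if (!(in[n] & 0x80)) {
            *value = v;
            return n + 1;
        }
    }
    return 0;
}
```

A wrapper would emit `varint_encode(decompressed_size)` followed by the codec's output, and read the prefix back before calling the decompressor, at the cost (as noted above) of no longer being plain deflate.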
Other than BriefLZ, the only time I remember encountering that particular limitation was with the FARI plugin, and after I brought the issue up the author decided to add an end-of-stream marker to get rid of it. That's a pretty odd limitation for a deflate/zlib/gzip implementation; for one thing it makes interoperating with zlib (and other implementations) impossible in lots of situations. For example, the only way to decompress a random *.gz file with libdeflate, AFAICT, would be to first decompress it with another implementation to get the decompressed size, then decompress it again with libdeflate. I have a hard time believing that limitation is supposed to be permanent. It sounds more like the kind of thing someone might do as a shortcut during development, then come back and fix later… @ebiggers, any comment?

I suppose it would be possible to add a flag to the codec info to inform the user that an exact decompressed size is required… The benchmark could be rewritten with a different code path for such plugins. OTOH, I can't see a reasonable way to handle it in the CLI. Basically, anything using the streaming or splicing APIs would have to add extra code to prevent such codecs from being used. Given the rarity of such codecs and the burden it would place on people using the Squash API, I'm disinclined to make such a change…

The only other possibility I can think of would be to have Squash automatically add a small wrapper which prepends the decompressed size to the output of certain codecs (basically, move the code from the BriefLZ plugin into Squash) but, like Jørgen mentioned, that would break compatibility; you wouldn't be allowed to call the codecs implemented by the libdeflate plugin "deflate", "zlib", and "gzip", and they wouldn't be interoperable with existing deflate/zlib/gzip implementations.
Right now, libdeflate is intended only for block-based compression where the decompressed size is stored elsewhere or is known based on the context. This is useful for transparent filesystem compression and for some types of archivers. It's not currently intended to be used to decompress a random .gz file. This could change in the future (I am aware that DEFLATE streams include a special end marker). |
In that case, why bother with support for the gzip format? For things like transparent filesystem compression people should probably be using deflate; I don't even think there are good reasons to use zlib (in which case why bother supporting it, either?).
That would be nice; if you don't mind I'll file an issue in the libdeflate tracker to track it.

So, assuming this doesn't change in the immediate future, the question is what to do with a potential libdeflate plugin. I'm leaning towards adding a flag to the codec info to request that Squash generate a wrapper around the compressed data, and stealing the wrapper code from the BriefLZ plugin. Does anyone have any thoughts on the matter before I proceed?
I completely agree that raw DEFLATE should be used in applications where the zlib or gzip header is unimportant. But the reality is that people are often confused about this and use zlib or gzip instead of raw DEFLATE. For example: all the filesystems included with the Linux kernel that support transparent DEFLATE-based compression (btrfs, cramfs, isofs, jffs2, logfs, squashfs, ubifs) actually use zlib streams, as far as I can see. So I don't see anything wrong with supporting zlib and gzip streams with block-based compression.

Thinking about it a little more, is "unknown output size" support very useful without streaming support? If you don't have streaming support, you have to allocate all output space ahead of time; and you can't do that without knowing the output size. I suppose you can allocate some size buffer arbitrarily and keep retrying with larger buffers until there is enough space, but that's inefficient and should be replaced with either true streaming support or a known output size.
That's unfortunate. I'm not surprised it happens sometimes, but I am surprised (and disappointed) that it seems to be the norm.
That's what Squash will do to emulate streaming with codecs which don't support streaming and don't include the decompressed size. It's definitely sub-optimal, though it works surprisingly well in practice (all the files in the benchmark get it on the first try with all codecs), especially on 64-bit platforms. From Squash's perspective, the problem is that our only real choices are:
I wrestled with this pretty early on, but given how rare the limitation is in compression libraries the third option made the most sense. I think one of the main reasons it hasn't been a problem is that codecs have to deal with untrusted (possibly corrupted, possibly malicious) input gracefully, so they need some bounds checking regardless. Instead of basing their loop on the amount of output expected and adding an additional check to make sure it doesn't go outside of the input buffer, all they have to do is base it on the amount of input available and add a check to make sure it doesn't go outside of the output buffer. Or perhaps I'm missing something?
Are you saying that when squash emulates streaming, it collects the entire input into a single buffer before calling the codec? I suppose you have no choice if you don't want to add your own framing format, but I don't think that's a proper substitute for the codec supporting streaming itself, and this type of stream emulation probably shouldn't be used in "real" code. Suppose that I stream 10 GB of data to squash using some codec X; is there a way to know when it's doing real streaming as opposed to doing an emulation that requires 10 GB of memory?

Codecs do need to deal with untrusted input. There ordinarily need to be two checks during the decompression loop: one for whether the end of the compressed data buffer has been reached, and one for whether the end of the uncompressed data buffer has been reached. Ordinarily, people will exit the decompression loop in both of these cases. The style I've been using in my decompressors is to just start reading zeroes when the end of the compressed data is reached. Then, you only need one condition for actually exiting the decompression loop: all the uncompressed data you wanted has been decompressed. That makes the code a bit more concise and, probably, a tiny bit faster. The disadvantage is that it can "hide" corruptions, but often the uncompressed data is checksummed anyway so it is irrelevant as corruptions will be detected that way.

I will be keeping alternate approaches in mind, of course. What I'd like to do is make a single codec that works optimally for both buffer-based and stream-based compression. Usually people just implement a buffer-based API on top of a stream-based API and take a performance hit (zlib does this, for example).
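The zero-reading style described above might look something like this simplified sketch (hypothetical, not libdeflate's actual bit reader): past the end of the input, the reader just pretends more bits are available, and because the unused high bits of the bit buffer are already zero, those phantom bits read as zeroes with no extra work.

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified bit reader illustrating the "read zeroes past the end"
 * style.  Only one loop-exit condition is then needed in a decoder:
 * "enough output produced". */
typedef struct {
    const unsigned char *next, *end;
    uint64_t bitbuf;    /* low 'bitsleft' bits valid; high bits zero */
    unsigned bitsleft;
} bitreader;

static void ensure_bits(bitreader *br, unsigned n) {
    while (br->bitsleft < n) {
        /* Past the end of input, "read" a zero byte: the unused high
         * bits of bitbuf are already zero, so there is nothing to OR
         * in; just bump the count. */
        if (br->next < br->end)
            br->bitbuf |= (uint64_t)*br->next++ << br->bitsleft;
        br->bitsleft += 8;
    }
}

static unsigned pop_bits(bitreader *br, unsigned n) {
    unsigned bits = (unsigned)(br->bitbuf & (((uint64_t)1 << n) - 1));
    br->bitbuf >>= n;
    br->bitsleft -= n;
    return bits;
}
```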
Yes. You're right, it's definitely sub-optimal, but there simply isn't an alternative. A framing format isn't an option since we need to remain interoperable. As for using it in real code, as I mentioned it works surprisingly well in practice; from a performance perspective the default values allocate enough room that a second attempt is rarely necessary, and allocating a few large chunks of memory isn't really an issue, especially on 64-bit platforms.
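The guess-and-retry emulation discussed here can be sketched as follows (hypothetical helper names, and the initial-guess heuristic is illustrative, not Squash's actual defaults): start from a generous guess and double the output buffer until the codec reports success.

```c
#include <stdlib.h>
#include <string.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical codec callback: returns true on success and sets
 * *out_len; returns false if out_cap was too small.  Stands in for
 * any all-in-one decompress function. */
typedef bool (*decompress_fn)(const unsigned char *in, size_t in_len,
                              unsigned char *out, size_t out_cap,
                              size_t *out_len);

/* Guess-and-retry decompression for codecs that neither stream nor
 * store the decompressed size.  Returns a malloc'd buffer (caller
 * frees) or NULL on failure. */
static unsigned char *decompress_unknown_size(decompress_fn codec,
                                              const unsigned char *in,
                                              size_t in_len,
                                              size_t *out_len) {
    size_t cap = in_len < 4096 ? 4096 : in_len * 4;  /* initial guess */
    for (int attempt = 0; attempt < 16; attempt++) {
        unsigned char *out = malloc(cap);
        if (out == NULL)
            return NULL;
        if (codec(in, in_len, out, cap, out_len))
            return out;
        free(out);
        cap *= 2;   /* buffer was too small: double and retry */
    }
    return NULL;
}

/* Demonstration stand-in codec: "decompresses" any input into 100000
 * bytes of 'x', failing until the buffer is big enough (so it forces
 * a few retries). */
static bool toy_expand(const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t out_cap,
                       size_t *out_len) {
    (void)in; (void)in_len;
    if (out_cap < 100000)
        return false;
    memset(out, 'x', 100000);
    *out_len = 100000;
    return true;
}
```

As noted above, a generous initial guess means the retry path is rarely taken in practice; the doubling only kicks in for unusually high compression ratios.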
Yes, there is a
I'd be surprised if replacing the input with zeros was any faster; you would still need to test to make sure you don't overrun the buffer. Handling what happens in the case you do shouldn't matter much since it shouldn't happen often, but even if it did I suspect it would be faster to just break out of the loop, return from the function, goto a label, etc., than processing zeroes. There would be a small gain, though, from not having to read excess bytes at the end of the stream.

One interesting consequence of your approach is what happens when the length is controlled by an attacker; they could easily set it to SIZE_MAX (or INT_MAX, or whatever type you use for the length argument). Obviously they could also provide you with however much data they want, but the problem here is the highly asymmetrical nature of the attack; with a few bytes of data they can cause your CPU to try to chew through terabytes of data (all zeros, but still, you have to process it). Most codecs would be able to return quickly, either successfully or with an error code (depending on whether all the input was decoded or there was an error).
Also, it's worth mentioning that, if you're using the splicing API (decompressing from one FILE* to another), Squash will actually mmap both the input and output if it can. So assuming you have the address space, decompressing a 10 GB file isn't actually a big deal. |
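For illustration, mapping a whole input file might look like this POSIX sketch (not Squash's actual splicing code): the mapping costs address space, not resident memory, since pages are faulted in on demand.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire input file read-only so an all-in-one codec can
 * consume it without copying.  Returns NULL on error; the caller
 * would munmap(ptr, len) when done. */
static const unsigned char *map_input(const char *path, size_t *len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    *len = (size_t)st.st_size;
    void *p = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping remains valid after close() */
    return p == MAP_FAILED ? NULL : p;
}
```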
It's not quite that simple, actually. In any compression format that uses Huffman codes, you don't know ahead of time how many bits you are going to need to decode a symbol. In DEFLATE, for example, the maximum number of bits in which a match might be encoded is 48. This occurs if a match is encoded with maximum-length codewords and the maximum number of "extra" bits. On 64-bit platforms, my code in libdeflate will always ensure that there are 48 bits ready before attempting to decode a match; this is the fastest way I found to run the decompressor. Since this may be more than the number of bits actually remaining in the input buffer, this may require "adding" zeroes. This is quite easy to do since the variable containing the "ready" bits will already contain zeroes in the unfilled space (there is no need to explicitly clear bits or shift in zeroes).

So I think that a check for true exhaustion of the compressed data would have to be a separate, entirely new check. Of course, such a check could simply be added; but the goal of my DEFLATE implementation is to be as fast as possible, so every branch in the inner loop counts. zlib has a slightly different strategy which I've been considering: most of the time it runs a "fast" loop, but when it nears the end of either the input or output buffer it goes into a "slow" loop, where it can be more careful about exactly how many bits are remaining.
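The fast/slow split mentioned at the end can be illustrated with a toy that merely copies bytes; the point is the control flow, not the decoding (this is not actual zlib or libdeflate code). While both buffers have a safety margin left, the fast loop runs with no per-byte bounds checks; near either end, a careful loop takes over.

```c
#include <stddef.h>
#include <string.h>

#define MARGIN 8  /* stay in the fast loop while this much room remains */

/* Toy "decompressor" demonstrating the fast-loop/slow-loop pattern.
 * A real decoder would decode symbols in the loop bodies; this toy
 * just copies bytes.  Returns the number of bytes produced. */
static size_t copy_fast_slow(const unsigned char *in, size_t in_len,
                             unsigned char *out, size_t out_len) {
    size_t i = 0, o = 0;
    /* Fast loop: both buffers have at least MARGIN bytes left, so the
     * body needs no per-byte bounds checks. */
    while (in_len - i >= MARGIN && out_len - o >= MARGIN) {
        memcpy(out + o, in + i, MARGIN);
        i += MARGIN;
        o += MARGIN;
    }
    /* Slow loop: near the end of either buffer, check exact bounds on
     * every byte. */
    while (i < in_len && o < out_len)
        out[o++] = in[i++];
    return o;
}
```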
An attacker capable of doing such a thing can already (1) cause the application to allocate an arbitrarily large buffer, and (2) provide compressed data with the maximum possible compression ratio, forcing the application to fill the buffer even when providing only a small amount of compressed data. So I don't think my implementation strategy makes much of a difference here. Also, in the applications I have been thinking of most, such as filesystem compression, the uncompressed size tends to be carefully controlled, e.g. a file might be divided into fixed-size chunks. Of course, that won't be true of every application.
Update: The decompression functions in libdeflate have been updated to optionally provide the actual uncompressed size:
This turned out to be easier than I had thought earlier because a few different things were being confused. In DEFLATE you know the stream has ended when you have finished a block with the BFINAL flag set. So you can just compute the actual uncompressed size at the end without affecting the main decompression loop. Of course, for the reasons I mentioned earlier it's preferable, whenever possible, not to have to guess the uncompressed size.
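The BFINAL-based end detection can be illustrated with a toy parser that handles only stored (BTYPE=00) blocks; real DEFLATE also has Huffman-coded block types, which this sketch ignores. Each stored block carries a 3-bit header (BFINAL, BTYPE), then a byte-aligned LEN/NLEN pair and LEN raw bytes, so the actual uncompressed size falls out of the parse without the caller supplying it.

```c
#include <stddef.h>

/* Toy: compute the uncompressed size of a stored-blocks-only DEFLATE
 * stream by walking blocks until one with BFINAL set ends.  Returns
 * the size, or (size_t)-1 on malformed input. */
static size_t deflate_stored_uncompressed_size(const unsigned char *in,
                                               size_t in_len) {
    size_t pos = 0, out_size = 0;
    for (;;) {
        if (pos >= in_len)
            return (size_t)-1;
        unsigned bfinal = in[pos] & 1;          /* bit 0: BFINAL */
        unsigned btype  = (in[pos] >> 1) & 3;   /* bits 1-2: BTYPE */
        if (btype != 0)                         /* toy: stored only */
            return (size_t)-1;
        pos++;                                  /* skip to byte boundary */
        if (pos + 4 > in_len)
            return (size_t)-1;
        unsigned len  = in[pos] | (in[pos + 1] << 8);
        unsigned nlen = in[pos + 2] | (in[pos + 3] << 8);
        if ((len ^ nlen) != 0xFFFF)             /* NLEN must be ~LEN */
            return (size_t)-1;
        pos += 4;
        if (pos + len > in_len)
            return (size_t)-1;
        pos += len;
        out_size += len;
        if (bfinal)         /* BFINAL set: the stream ends here */
            return out_size;
    }
}
```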
I have a working implementation of a libdeflate plugin here: |
These alternative implementations of DEFLATE should help the DEFLATE format reach into different areas of the compression spectrum.
SLZ (http://git.1wt.eu/git/libslz.git/) claims to be a much faster compressor
zopfli (https://github.com/google/zopfli) tries to push deflate to its compression-ratio limit
libdeflate (https://github.com/ebiggers/libdeflate) claims faster compression and decompression