Skip to content

se-mz/decompress

Repository files navigation

decompress - A defensive and fast decompression library in pure Common Lisp

Overview

The title speaks for itself. More formats may be added in the future but for now none are planned. A compressor counterpart is highly unlikely. A download can be found here; there is also a Github mirror, but note that this Readme might not display correctly there. On Quicklisp, the corresponding system is called decompress, although the package it defines is semz.decompress.

Features

  • Supported formats: Deflate, zlib, gzip, bzip2, LZMA, LZMA2, XZ.
  • Safety: Edge cases are properly detected and explicitly handled; malformed data is rejected early.¹ The interface is designed to be hard to misuse; in particular, single-member and multi-member data can be handled with separate functions, defaults try to provide the strongest guarantees, and built-in integrity checks are run without user intervention. No FFI is used anywhere.
  • No mandatory overreads: When reading a single well-formed member from a byte stream, no byte beyond the end of the compressed data will be read, allowing this library to be used as a part of complicated stream processing for other formats.
  • Performance: To my knowledge, this is the second fastest Deflate decoder and the fastest bzip2 decoder in pure Common Lisp. Under SBCL, file-to-file decompression usually hits 30%-80% the speed of the C reference implementations (depending on the format) and among CL libraries, we’re only beaten by 3bz. See the performance section for more details and other CL implementations.

¹ This is a guideline; a hard guarantee would be infeasible.

Quickstart

From example.lisp:

The supported multi-member formats are gzip, bzip2 and XZ. Members of gzip and bzip2 are trivial to handle manually with decompress or make-decompression-stream if you need to do that for some reason; for XZ, you’ll need some contortions to deal with padding, but since the spec explicitly recommends not to use multi-member XZ when embedding into a different format, you probably won’t run into this.

For more details, see the reference.

Comparison to other libraries

The other libraries under consideration are chipz, which is probably the most popular library; deflate, a compact implementation by Pierre Mai that is notably used by Quicklisp; and 3bz, which is notably used by pngload.

decompress-1.2.0chipz-20230418deflate-1.0.43bz-20230521
/<
supported formatsdeflate, zlib, gzip, bzip2, lzma, lzma2, xzdeflate, zlib, gzip, bzip2deflate, zlib, gzipdeflate, zlib, gzip
dependenciesalexandria, trivial‑gray‑streams--alexandria, trivial‑features, nibbles, babel, cffi¹, mmap¹
licenseMITBSD‑3MITMIT
input sourcesvector, streamsimple specialized vector, list, stream, pathnamestream(s. s.) vector, stream, foreign memory
output destinationsnew vectornew vector, provided (s. s.) vector, stream, pathnamestream(s. s.) vector
Gray stream wrapperyesonly around other streamsnono
checksum verificationautomaticautomaticautomatic if enabled, disabled by defaultautomatic
multi-member handlingmanual or automaticsingle-onlymanualmanual
edge case handlingstrictlenientlenientstrict, no library error type
partial inputlibrary errorlibrary errorend-of-file errorstatus flag
trailing dataignored & unread, or library errorignored & readignored & unreadignored; not read with stream input
preset zlib dictionariessupportedunsupportedunsupportedunsupported
bzip2 block randomizationsupportedunsupportedn/an/a

¹ Optional depending on implementation.

Edge case handling

Deflate / zlib / gzip

The Deflate spec has a few rough edges which we deal with as follows.

  • Some decompressors (e.g. GNU gzip) allow back references to go beyond the beginning of the output data, pretending that the octets out of bounds were all zeroes. This one always signals an error.
  • Huffman trees are uniquely determined by the associated sequence of code lengths, but not every such sequence corresponds to a tree - some leave codes unassigned and some make codes ambiguous. The spec isn’t clear on what to do with such sequences (outside of two special cases in §3.2.7). While one can argue that leaving codes unassigned is fine as long as they don’t appear, we always signal an error as soon as a sequence doesn’t correspond to a (complete) Huffman tree.
  • According to §3.2.7, the situation where only a single distance code is used may be represented by having only one code length of one. We enforce that the unique distance code is always encoded as a zero bit and never as a one bit.
  • A dynamic Deflate block can omit the encoding for the end-of-block marker. I am convinced that this is an oversight in the standard since the only situation where an encoder would want to output such a block is if it knows in advance the input is infinite and that the miniscule inefficiency of encoding an EOB would be worse than the inefficiency that arises from being unable to change the Huffman tree (no EOB means no new dynamic block).

    However, there’s little reason to ban this; since Deflate already allows arbitrarily many blocks of arbitrary length, nothing stops a stream from being infinite anyway and a finite input is going to run into EOF either way.

  • Dynamic Deflate blocks can provide encodings for the illegal literal/length codes 286 and 287 as well as the illegal distance codes 30 and 31, although these may never occur in the data. We allow them in blocks since otherwise the fixed Deflate block code wouldn’t be expressible dynamically. Some decompressors (e.g. zlib) do not allow this at all.

Zlib and gzip do not add many edge cases by themselves; note however that the gzip file format can consist of multiple gzip streams which each come with their own filename/etc info in their respective header. You can access these with ~decompress~ if necessary, but most decoders ignore anything except maybe the first header, which is readily accessible via ~decompress-all~.

A more detailed comparison of edge case handling across libraries can be found below. An error is an error of some library-defined type that could be handled by a library user; a basic error is something like a failed assertion or (error "foo"), where the problem is explicitly detected but can’t be reasonably handled by a user; an internal error is something like a type error or an array bounds violation that is signalled by the implementation, but not necessarily reliably so.

Deviations from zlib behavior are boldfaced for convenience; note that deviation is not necessarily a bad thing in itself. We consider internal errors deviation since they may cause behavior changes under some implementations.

edge casezlib (zpipe)decompress3bzdeflatechipz
/<
reserved block typeerrorerrorbasic errorerrorerror
uncompressed zero-length blockhandledhandledhandledhandledhandled
uncompressed block with wrong checksumerrorerrorbasic errorerrorerror
uncompressed zero-length block with wrong checksumerrorerrorbasic errorerroroutputs data
reference goes beyond previous outputerrorerrorbasic erroroutputs dataoutputs data
uses illegal length code 286/287errorerrorbasic errorerrorerror
uses illegal distance code 30/31errorerrorbasic errorerrorerror
dynamic block provides encodings for 286/287 & 30/31erroroutputs datainternal erroroutputs dataoutputs data
dynamic block with one distance code (§3.2.7)handledhandledhandledhandledhandled
dynamic block with one distance code, uses the unassigned codeerrorerrorbasic errorerrorerror
dynamic block without distance codes (§3.2.7)handledhandledhandledhandledhandled
dynamic block without distance codes, tries to use len codeerrorerrorbasic errorerrorerror
Huffman tree with too many itemserrorerrorbasic erroroutputs dataoutputs data
Huffman tree with too few itemserrorerrorbasic erroroutputs dataoutputs data
Huffman tree with too few items, uses an unassigned coden/an/an/aerrorerror
dynamic block has code lengths expand across len/dist boundaryhandledhandledhandledhandledhandled
dynamic block has code lengths expand out of boundserrorerrorbasic errorerrorerror
dynamic block starts by repeating the last code lengtherrorerrorbasic errorerrorerror
reference stays in bounds but goes beyond zlib window sizeoutputs dataerroroutputs dataoutputs dataoutputs data
zlib preset dictionary required but none providederrorerrorbasic errorerroroutputs data
wrong Adler-32 checksumerrorerrorbasic erroroutputs data by defaulterror
wrong gzip magic byteserrorerrorbasic errorerrorerror
gzip data with wrong CRC-32 checksumerrorerrorbasic erroroutputs data by defaulterror
gzip data with right CRC-32 checksum but wrong modular lengtherrorerroroutputs dataoutputs dataoutputs data
gzip data with header checksumhandledhandledhandledhandledhandled
gzip data with incorrect header checksumerrorerrorbasic erroroutputs dataoutputs data
gzip extra fieldshandledhandledbasic errorhandledbasic error
inconsistent length fields in extra fields sectionoutputs dataerrorn/aoutputs datan/a

Bzip2

Bzip2 has no spec, so we mostly try to match the reference implementation. There is one notable exception: The reference implementation accepts underfull Huffman trees and only complains once unassigned codes are used; we reject such trees right off the bat since it simplifies the code.

I am not aware of any compressor that outputs such trees; Joe Tsai claims they exist, but provides no names. There isn’t a good reason why a compressor would output such trees since bzip2 cannot have alphabets of size one, and whenever you have a branch with only one child, it’s more efficient to just give the unique child a code that’s one bit shorter. The resulting reshuffling of codes has negligible overhead (none if you only assign them at the end). I suppose this case is something to think about once someone runs into it while using this library.

Chipz is essentially a translation of the reference implementation, but doesn’t bother to support the block randomization mechanism that has been deprecated for a good two decades. We do because our approach makes it very easy to add. You can find a test file here; it consists of 100MB of highest quality zero octets. The testfile in lbzip2 doesn’t actually test block randomization since it is too short for any effects to show up.

Other than this, the format does not allow for many edge cases to begin with since most of its transformations simply cannot have invalid inputs. Just like the reference implementation, we do not care about wasteful encodings such as e.g. defining six Huffman trees despite only using three.

LZMA

LZMA has a spec, but it’s essentially just a commented C++ reference implementation. It leaves very little freedom to the implementer but nevertheless contains some minor pitfalls:

  • The range decoder’s finishing state is also a valid intermediate state, so it is necessary to define some EOF handling scheme. LZMA defines two and XZ defines another one; there are no other possible schemes. We default to the most permissive scheme (which can handle all valid inputs for the other two) but let the user optionally override this.
  • Two of the EOF handling schemes require that the decompressed size is known in advance. If the last Lempel-Ziv match expands beyond the declared decompressed size, one might either signal an error or truncate it and finish gracefully; we signal an error, just like the reference implementation and XZ Utils.
  • The range decoder’s finishing state cannot result in an EOF marker when treated like an intermediate state. As a result, the “optional EOF marker” scheme (the one we default to) can only be implemented in the way described in the spec: When the declared size is reached, only try to consume an EOF marker if the range decoder is not yet in a finishing state.

LZMA2

LZMA2 has no spec, but is a relatively straightforward wrapper on top of LZMA that adds uncompressed blocks, flush points, and LZMA state resets. We don’t expose flush points to the user.

We follow XZ Utils’ implementation; even though there have been subtle compatibility issues with the LZMA reference implementation in the past, the XZ Utils implementation is stricter and avoids certain edge cases entirely. It is also much cleaner; when testing some of the edge cases below, the LZMA SDK’s implementation would often simply segfault.

LZMA2 is subject to the following edge cases:

  • The data must start with a flush point; the first LZMA block after a flush point must specify new LZMA parameters.
  • Every block declares both compressed and decompressed sizes. These could be used to support partial decompression, so they should be verified. We do verify them even though we don’t support partial decompression.
  • The underlying data of an LZMA block might contain an EOF marker. XZ Utils signals an error when this happens, as do we. If an implementation allows this, care must be taken to either skip trailing data or signal an error if any exists.
  • An LZMA state reset does not reset the posState variable. Furthermore, uncompressed blocks will update it appropriately.

    Note that implementing this correctly only matters in the extremely unusual case where an LZMA block is followed by some uncompressed blocks and then another LZMA block, with no flush points in between, and where the second LZMA block doesn’t reset the LZMA state; if the second LZMA block resets the state, then the value of posState is completely irrelevant because it merely denotes the starting position in an array of probabilities that are all initialized to the same value anyway.

    LZMA2 data with this sequence of blocks is extremely unlikely to be produced in practice since the encoder can’t really know that the second LZMA block benefits from state reuse despite the uncompressed data in the middle. Nevertheless, a base64-encoded XZ test file that triggers this case can be found below. It decodes to the string LOL, but fails to decompress if posState isn’t updated by uncompressed blocks.

    /Td6WFoAAAD/EtlBAgAhAQoAAABTxyq54AAAAAUJACX//AAAAgAAT4AAAAAFACfRR0AAAAABKAM7StLkBnKeegEAAAAAAFla
        

XZ

XZ has a decent spec and is mostly a container format. XZ Utils supports partial decompression and skipping the checksum verification of unknown checksum types; we don’t. Note that we implement all standard filters, not just LZMA2. A list of edge cases is below; most of these can be found in XZ Utils’ elaborate test suite as well.

  • XZ duplicates stream header information and length information for all blocks at the end of the file to support partial decompression; while we do not support partial decompression, we nevertheless verify the lengths at the end.
  • XZ’s variable length integers must be encoded with the shortest possible encoding. This is easy to miss because the spec doesn’t explicitly mention it, but the example code and the test suite both check for it.
  • XZ files can consist of multiple concatenated members, and there may be padding after each member, including the final one. The padding must come in blocks of four and consist of zero bytes. Padding is not allowed at the start of the file.
  • Fields whose size isn’t necessarily a multiple of four are padded with the minimal number of zero bytes to ensure such a size. The sole exception to this is the block header padding, which consists of however many bytes are required to match the declared size.
  • The total compressed and decompressed data of a single member are each limited to at most 263 - 1 bytes (~8 exabytes). XZ Utils detects this before the block that pushes the size over the edge is even decoded; we only detect it once the read/output data exceeds this limit.
  • SHA-256 is limited to 261 - 1 bytes of input and therefore cannot be used with extremely large XZ files. We signal an error when a file using SHA-256 exceeds this limit.
  • LZMA2 can only be the last filter in a filter chain; the other standard filters (i.e. delta and the BCJ filters) can never be the last filter.
  • Embedded LZMA2 data might signal EOF before the declared end of the data. We signal an error in this case, as does XZ Utils.

Performance funny numbers

We’re usually the fastest with a significant improvement over chipz; under SBCL, 3bz is slightly faster for gzip data and noticeably faster for zlib data. Preserving trailing data hurts a lot for Deflate but not much for the other formats. Under ABCL, Mai’s Deflate is faster for heavily compressed files and preserving trailing data is rather expensive.

comparison implementationSBCLCCLECLABCL
/<c><c><c><c>
/<
chipz (deflate/zlib/gzip)1.79x-2.76x1.34x-1.87x1.61x-2.30x1.16x-4.03x
chipz (bzip2)1.01x-1.32x2.06x-2.70x2.17x-2.47x2.52x-8.52x
deflate (with checksum)2.04x-3.41x1.30x-3.91x1.87x-3.24x0.48x-1.80x
deflate (no checksum)1.33x-3.16x1.21x-1.73x1.75x-2.09x0.46x-1.72x
3bz0.63x-0.93x1.07x-1.55x2.12x-7.23x1.20x-4.65x
zlib0.31x-0.54x0.04x-0.12x1.00%-2.65%0.18%-0.28%
bzip20.66x-0.82x0.20x-0.22x4.15%-5.39%1.08%-2.29%
xz0.49x-0.62x0.10x-0.17x1.04%-1.69%0.21%-0.42%

Details

The performance measurements below are taken on an older x86-64 machine, taking the best of five attempts, using each library’s idiomatic method of file-to-file decompression:

  • For decompress, this means with-open-file + Alexandria’s copy-stream. We provide measurements for both single member decompression and multiple member decompression. (None of the benchmark files contain multiple members; this merely demonstrates the performance loss when preserving trailing data.)
  • For chipz, this means decompress with pathname arguments.
  • For Mai’s Deflate, this means with-open-file + the inflate-*-stream functions. We provide measurements with and without checksums to reflect a fair comparison and the default respectively.
  • For 3bz, we use this code. Note that this involves manual buffering. We don’t provide a measurement with the significantly slower stream input interface since it wouldn’t be very informative.

We always use the implementation’s default optimization settings. Note that under ECL, one library’s declamations can easily affect the optimization settings other libraries are compiled under; special care was taken to use ECL’s default ((safety 2) (space 0) (speed 3) (debug 3)) for all libraries.

I tried to keep the benchmark files varied since decompression benchmarks can highly input-dependent, but take the results with a grain of salt regardless. As a rough baseline, measurements with zlib-1.2.13, bzip2-1.0.8, and xz-5.4.2 are provided, respectively using the canonical zpipe example program, bunzip2, and xz.

We use the following real world tarballs:

fileinput sizeoutput sizerefimpl
/<
<r><r><r>
openjdk‑17.0.6_p10.tar.gz105,221,267B672,163,840B2.618s
openh264‑2.3.1.tar.gz60,290,897B73,205,760B0.280s
gimp‑2.10.32.tar.bz231,397,425B207,185,920B6.687s
gcc‑11.3.0.tar.xz81,141,364B824,309,760B7.263s
ccl‑1.12.1‑linuxx86.tar.gz20,872,508B80,752,640B0.427s
sbcl‑2.3.3-source.tar.bz27,408,264B43,386,880B1.408s
Python‑3.11.4.tar.xz19,954,828B99,696,640B1.618s

We also have a bunch of example data that we use to emulate common use cases. html_standard.html is HTML as you’d commonly find it on the web; pepper.dat and mario.dat are PNG pixel data from a well-compressible and and badly-compressible image respectively.

base filebase sizezlib (.z)bzip2 (.bz2)xz (.lzma)xz (.xz)
/<
html_standard.html12.80MB0.053s0.373s0.115s0.121s
pepper.dat51.27MB0.148s0.512s0.238s0.262s
mario.dat3.053MB0.018s0.110s0.065s0.070s

SBCL-2.3.8

SBCL is unsurprisingly the fastest of the tested implementations. We primarily optimize for it; notably we use a different implementation of the various CRCs.

filerefimpl3bzdecom (M)decom (S)chipzdefldefl (C)
/<
<r><r><r><r><r><r><r>
openjdk-17.0.6_p10.tar.gz2.618s5.530s5.981s7.801s13.84s10.07s12.21s
openh264-2.3.1.tar.gz0.280s0.831s0.892s1.333s1.594s2.815s3.042s
gimp-2.10.32.tar.bz26.687s-9.857s10.34s10.19s--
gcc-11.3.0.tar.xz7.263s-14.93s14.94s---
ccl-1.12.1-linuxx86.tar.gz0.427s0.923s1.028s1.386s2.182s1.950s2.206s
sbcl-2.3.3-source.tar.bz21.408s-1.979s2.098s2.507s--
Python-3.11.4.tar.xz1.618s-3.090s3.083s---
html_standard.html.z0.053s0.073s0.107s0.138s0.270s0.180s0.254s
pepper.z0.148s0.171s0.273s0.311s0.754s0.364s0.657s
mario.z0.018s0.032s0.039s0.055s0.091s0.081s0.098s
html_standard.html.bz20.373s-0.566s0.587s0.571s--
pepper.bz20.512s-0.714s0.740s0.802s--
mario.bz20.110s-0.134s0.146s0.177s--
html_standard.html.lzma0.115s-0.186s0.205s---
pepper.lzma0.238s-0.387s0.416s---
mario.lzma0.065s-0.108s0.119s---
html_standard.html.xz0.121s-0.222s0.222s---
pepper.xz0.262s-0.529s0.529s---
mario.xz0.070s-0.117s0.117s---

CCL-1.12.2

CCL has notoriously slow bit operations, resulting in an overall 6.5x runtime for Deflate decompression compared to SBCL and less pronounced differences between libraries. This is apparently being taken care of currently and I’d expect performance to change drastically once these fixes land. Preserving trailing data makes almost no difference.

Bzip2 and LZMA runtimes are about 3-4x those of SBCL, which is pretty typical. XZ runtimes suffer from CCL’s weakness around 64-bit arithmetic, as required by CRC-64.

file3bzdecom (M)decom (S)chipzdefldefl (C)
/<
<r><r><r><r><r><r>
openjdk-17.0.6_p10.tar.gz52.87s36.56s39.65s63.12s48.52s54.23s
openh264-2.3.1.tar.gz7.327s6.236s6.751s9.911s10.04s10.65s
gimp-2.10.32.tar.bz2-33.84s34.74s73.68s--
gcc-11.3.0.tar.xz-67.94s68.16s---
ccl-1.12.1-linuxx86.tar.gz8.518s6.963s7.209s10.18s8.416s9.075s
sbcl-2.3.3-source.tar.bz2-6.693s6.903s18.04s--
Python-3.11.4.tar.xz-13.10s13.10s---
html_standard.html.z0.847s0.646s0.689s0.940s0.895s1.582s
pepper.z1.907s1.227s1.281s2.292s2.128s4.802s
mario.z0.292s0.273s0.293s0.365s0.343s0.503s
html_standard.html.bz2-1.685s1.725s3.779s--
pepper.bz2-2.307s2.363s4.754s--
mario.bz2-0.494s0.515s1.317s--
html_standard.html.lzma-0.696s0.752s---
pepper.lzma-1.418s1.491s---
mario.lzma-0.431s0.458s---
html_standard.html.xz-1.013s1.017s---
pepper.xz-2.602s2.592s---
mario.xz-0.503s0.503s---

ECL-21.2.1

Deflate decompression sits at some 25x-50x the runtime of SBCL, which is rough, but still usable for smaller data. Bzip2 ends up at a nicer 20x-30x runtime. LZMA/XZ decompression suffers from heavy type check overhead and is slower than bzip2 as a result.

file3bzdecom (M)decom (S)chipzdefldefl (C)
/<
<r><r><r><r><r><r>
ccl-1.12.1-linuxx86.tar.gz128.7s41.39s53.99s76.07s76.02s77.23s
sbcl-2.3.3-source.tar.bz2-30.60s34.43s75.46s--
Python-3.11.4.tar.xz-155.4s156.2s---
html_standard.html.z16.27s3.901s4.938s6.539s6.845s8.401s
pepper.z40.32s5.576s6.706s12.83s11.63s18.06s
mario.z3.823s1.804s2.269s2.899s3.211s3.548s
html_standard.html.bz2-6.931s7.612s16.25s--
pepper.bz2-10.15s10.99s23.02s--
mario.bz2-2.652s3.038s5.764s--
html_standard.html.lzma-8.427s9.199s---
pepper.lzma-14.06s15.26s---
mario.lzma-5.468s5.886s---
html_standard.html.xz-10.28s10.33s---
pepper.xz-21.72s21.67s---
mario.xz-5.898s5.897s---

ABCL-1.9.2

ABCL’s replace and manual array ops are obscenely slow, which punishes copying and Lisp-side buffering. As a result, the straightforward unbuffered code of Mai’s Deflate outperforms several buffered implementations, especially for heavily compressed files.

In general, Deflate performance is close to the point of unusability; you’re likely better off using java.util.zip. Our bzip2/XZ performance isn’t that bad compared to ECL, likely due to the lower amount of bit diddling; note however how harshly the runtime increases for the (highly-compressed, hence copy-heavy) pepper.dat.

file3bzdecom (M)decom (S)chipzdefldefl (C)
/<
<r><r><r><r><r><r>
html_standard.html.z62.56s23.40s34.95s55.66s39.57s42.04s
pepper.z98.34s81.41s95.19s94.55s37.10s39.35s
mario.z29.39s6.324s12.18s25.46s10.86s11.01s
html_standard.html.bz2-17.22s26.33s103.5s--
pepper.bz2-47.34s57.43s119.2s--
mario.bz2-4.798s9.454s40.87s--
html_standard.html.lzma-32.37s41.58s---
pepper.lzma-111.3s130.4s---
mario.lzma-17.09s22.34s---
html_standard.html.xz-35.51s34.80s---
pepper.xz-121.5s120.9s---
mario.xz-16.86s16.83s---

Reference

In what follows, an octet vector is a one-dimensional array containing octets, i.e. integers x with 0 ≤ x ≤ 255. Octet vectors that are passed to the library need not be specialized or simple; octet vectors that are returned by the library are always both. When we speak of bounding index designators, we use the term with the same meaning as in Common Lisp, but additionally allow a starting index of nil to denote the beginning. Whenever a function takes optional bounding index designators, the default is to denote the entire sequence.

Function: decompress-all

decompress-all format input &key start end output-size => decompressed-buffer, header
  • format: A format specifier. See list-supported-formats.
  • input: Either a binary-input-stream or an octet vector.
  • start, end: Bounding index designators for input if the latter is a vector. Ignored if input is a stream.
  • output-size: A non-negative integer or nil. The default is nil.
  • decompressed-buffer: An octet vector.
  • header: A property list with keyword keys.

The most convenient interface to this library. Returns a fresh octet vector that contains the decompressed form of the format-compressed data in input. If the format allows multiple members, all members are decompressed and concatenated. Trailing data results in a decompression-error.

header is the header information of the first member. See decompression-stream-header.

output-size, if not nil, should be the predicted size of the decompressed data. If correct, this prevents unnecessary copying at the end.

In addition to the options listed above, decompress-all can take format-specific options.

This function is equivalent to calling decompress with all-members-p set; it exists to prevent the subtle errors that result from a forgotten all-members-p. If you want to decompress single members or preserve trailing data, use decompress instead. To handle large data in a streaming fashion, use make-full-decompression-stream.

Function: decompress

decompress format input &key start end output-size all-members-p allow-overreads-p => decompressed-buffer, header

  • format: A format specifier. See list-supported-formats.
  • input: Either a binary-input-stream or an octet vector.
  • start, end: Bounding index designators for input if the latter is a vector. Ignored if input is a stream.
  • output-size: A non-negative integer or nil. The default is nil.
  • all-members-p: A generalized boolean. The default is nil.
  • allow-overreads-p: A generalized boolean. The default is nil.
  • decompressed-buffer: An octet vector.
  • header: A property list with keyword keys.

Returns a fresh octet vector that contains the decompressed form of the format-compressed data in input. If the format allows multiple members, only the first member is decompressed. Trailing data is ignored and will not be read from stream inputs, so that it can be handled by other code.

header is the header information of the decompressed member. See decompression-stream-header.

output-size, if not nil, should be the predicted size of the decompressed data. If correct, this prevents unnecessary copying at the end.

If all-members-p is true, act like decompress-all instead. If the value of all-members-p is known to be constant and true, it is recommended to use decompress-all over decompress since forgetting to set all-members-p can result in subtle bugs down the line.

If allow-overreads-p is true and input is a stream, trailing data may be read, resulting in a minor speedup. This option rarely has to be specified since it is implied by all-members-p.

In addition to the options listed above, decompress can take format-specific options.

To handle large data in a streaming fashion, use make-decompression-stream. To handle all members at once, use decompress-all.

Condition: decompression-error

Direct superclasses: simple-error

General superclass for errors related to decompression. While the error message gives detailed information about the exact error, we do not distinguish between e.g. failed checksums, out of bounds references or malformed Huffman trees on the type level since this information is basically useless to a programmatic user and usually just an artifact of corruption.

Errors related to trailing data are also a decompression-error since it isn’t possible to distinguish a malformed member from trailing data.

Gross user errors such as passing in invalid bounding index designators or returning invalid dictionaries from a dictionary function do not signal a decompression-error.

Condition: eof

Direct superclasses: decompression-error

Signalled when the input stream/buffer is exhausted. Notably implies that we did not detect errors in the data up until this point, but this is not a hard guarantee that the data can be continued in a valid manner since it would be infeasible to verify this.

Even when the input is a stream, it is this condition which is signalled, not end-of-file.

Condition: unrecognized-zlib-dictionary

Direct superclasses: decompression-error

Signalled when zlib decompression involves a preset dictionary whose checksum isn’t recognized by the dictionary function.

Note: This condition is only signalled when preset dictionaries are enabled. See format-specific options.

Function: unrecognized-zlib-dictionary-checksum

unrecognized-zlib-dictionary-checksum uzd => checksum

Returns the unrecognized checksum that was encountered during zlib decompression, in the form adler-32 outputs. See format-specific options.

Function: list-supported-formats

list-supported-formats => list

  • list: A list of symbols.

Returns a list of symbols that are valid format specifiers, i.e. can be used as format arguments to make-decompression-stream and decompress to specify a compression format.

Class: decompression-stream

Direct superclasses: fundamental-binary-input-stream

Gray stream wrapper that can be used for complicated stream processing. The end of file is considered reached when the underlying compressed data ends and all associated decompressed data has been read. Users need not close these streams.

Function: make-decompression-stream

make-decompression-stream format input &key start end allow-overreads-p all-members-p => decompression-stream

  • format: A format specifier. See list-supported-formats.
  • input: Either a binary-input-stream or an octet vector.
  • start, end: Bounding index designators for input if the latter is a vector. Ignored if input is a stream.
  • all-members-p: A generalized boolean. The default is nil.
  • allow-overreads-p: A generalized boolean. The default is nil.
  • decompression-stream: A decompression-stream.

The most general interface to this library. All other interfaces are implemented in terms of this one.

Returns a decompression-stream whence the decompressed form of the format-compressed data in input can be read. If the format allows multiple members, only the first member is decompressed. Trailing data is ignored and will not be read from stream inputs, so that it can be handled by other code.

input should not be modified or read from until decompression-stream has reached end of file. If input is a stream, then the caller is responsible for closing it.

Calling this function will read header information from input and make it available via decompression-stream-header, but no decompression will be performed until data is read from decompression-stream.

If all-members-p is true, the stream will decompress and concatenate all compressed members from the stream (if the format defines multiple members) and signal a decompression-error if any trailing data is detected, similar to decompress-all. The provided header information is always that of the first member. If the value of all-members-p is known to be constant and true, it is recommended to use make-full-decompression-stream over make-decompression-stream since forgetting to set all-members-p can result in subtle bugs down the line.

If allow-overreads-p is true and input is a stream, trailing data may be read, resulting in a minor speedup. This option rarely has to be specified since it is implied by all-members-p.

In addition to the options listed above, make-decompression-stream can take format-specific options.

Function: make-full-decompression-stream

make-full-decompression-stream format input &key start end => decompression-stream

  • format: A format specifier. See list-supported-formats.
  • input: Either a binary-input-stream or an octet vector.
  • start, end: Bounding index designators for input if the latter is a vector. Ignored if input is a stream.
  • decompression-stream: A decompression-stream.

Wrapper around make-decompression-stream that always passes all-members-p.

Function: decompression-stream-format

decompression-stream-format stream => format

Returns the format argument used to create stream. format is always an element of the list returned by list-supported-formats.

Function: decompression-stream-header

decompression-stream-header stream => header

Returns the header metadata of the data associated to stream. Depending on the format, the possible entries are as below.

Deflate

Deflate data doesn’t have a header.

Zlib

  • window-size: The declared window size, in bytes.
  • level: The “decompression level”, a rough measure of whether speed or size were prioritized during compression. Goes from 0 to 3, with 0 meaning fastest and 3 meaning smallest.
  • dictionary: The Adler-32 checksum (as returned by adler-32) of the preset dictionary, if applicable, or nil.

Gzip

We do not bother converting IDs and enumerations into symbols because this doesn’t meaningfully reduce complexity - the uses for the affected fields are so niche that someone who wants to use them is probably more familiar with the raw data than anything we come up with.

  • textp: Boolean. If true, the data is supposed to be text and might require line ending conversion.
  • extra-fields: Alist with two-character-string keys and octet vector values. See gzip spec. Note that keys may be repeated; they have the same order they had in the file.
  • filename: String or nil. Filename of the original file, if provided.
  • comment: String or nil. Comment field, if provided.
  • modification-time: Integer. Modification time of the original file or time of compression, given in Unix time.
  • extra-flags: Integer. See gzip spec.
  • operating-system: Integer. See gzip spec.

Bzip2

  • block-size: Integer. The BWT block size in bytes.

Raw LZMA

Raw LZMA data doesn’t have a header.

LZMA

  • lc, lp, pb: Integer. The corresponding LZMA parameter of the same name.
  • window-size: The declared window size, in bytes.
  • decompressed-size: The declared output size, in bytes. A decompression-error is signalled when this doesn’t match the actual output size, so you can use it to prepare suitable buffers.

Raw LZMA2

Raw LZMA2 data doesn’t have a header.

LZMA2

  • window-size: The declared window size, in bytes.

XZ

  • checksum-type: An integer from 0 to 15 indicating the checksum used. Preassigned numbers are 0 for none, 1 for CRC-32, 4 for CRC-64, and 10 for SHA-256.

Format-specific options

The following keyword arguments can be passed to make-decompression-stream, make-full-decompression-stream, decompress and decompress-all for specific formats. They are ignored for all other formats.

formatmandatoryoptional
deflate-window-size, prefix, prefix-start, prefix-end
zlib-dictionary
gzip--
raw-lzmalc, lp, pb, window-sizedecompressed-size, eof-mode
lzma-eof-mode
raw-lzma2window-size-
lzma2--
xz--
  • window-size (deflate, raw-lzma, raw-lzma2): A non-negative integer. Determines how far back references can reach; references that go back further result in a decompression-error. For deflate, the default is 215, which results in no restrictions whatsoever. For raw-lzma and raw-lzma2, this parameter has no default; it is mandatory.
  • prefix, prefix-start, prefix-end (deflate): Either nil or an octet vector with optional bounding index designators. If not nil, makes the octets in prefix available to back references as if they were previous decompressor output. The default is nil.
  • dictionary (zlib): A function or nil. Implements preset dictionaries. If nil (the default), preset dictionaries are disabled and signal a generic decompression-error when encountered. The easiest way to use this option is via make-simple-zlib-dictionary.

    In general, a function must take a single (unsigned-byte 32), representing an Adler-32 checksum as returned by adler-32, and return three values prefix, prefix-start, prefix-end that are used like the options of the same name for Deflate. If prefix is nil, an unrecognized-zlib-dictionary condition is signalled.

  • lc, lp, pb (raw-lzma): Integers such that 0 ≤ lc ≤ 8 and 0 ≤ lp, pb ≤ 4. Correspond to the LZMA parameters of the same name. These parameters do not have defaults; they are mandatory.
  • eof-mode (raw-lzma, lzma): One of the keywords :always, :maybe, or :never. Determines whether data whose decompressed size is known is terminated by an end of file marker. The default is :maybe, which can handle all valid inputs from the other two modes; the only reason to change this option is to be stricter about inputs.
  • decompressed-size (raw-lzma): Either nil or a non-negative integer. If not nil, this is the declared decompressed size of the data, as in the header data of the same name for lzma; if the real size differs, a decompression-error is signalled. Note that this interacts with eof-mode. Not to be confused with the output-size option for decompress and decompress-all, which is merely an optimization hint.

Function: make-simple-zlib-dictionary

make-simple-zlib-dictionary buffers => dictionary

  • buffers: A sequence of octet vectors.
  • dictionary: A function.

Returns a function suitable to be passed as a dictionary argument for zlib decompression which recognizes every octet vector in the sequence buffers and no others. The vectors should not be modified afterwards. An error is signalled if multiple vectors with distinct contents have the same checksum. See format-specific options for details and how to make more complicated dictionaries.

Function: adler-32

adler-32 data &key start end => checksum

  • data: An octet vector.
  • start, end: Bounding index designators for data.
  • checksum: An (unsigned-byte 32).

Returns the Adler-32 checksum of the given data, in canonical s2~ × 2^{16} + ~s1 form. This function is provided to help with setting up more complicated zlib dictionaries; see format-specific options.