Proposed new virtual file system vsilz4 #2201
Comments
Can you efficiently seek to a given offset of the uncompressed stream without having to decompress the stream from the beginning up to that offset?
Yes, from what I understand that is the purpose of the frames. https://tukaani.org/xz/format.html describes it as "The data can be split into independently compressed blocks. Every .xz file contains an index of the blocks, which makes limited random-access reading possible when the block size is small enough.". So there is a trade-off: smaller blocks give more efficient random access at the cost of a lower compression ratio.
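To make the block-size trade-off concrete, here is a minimal sketch using Python's standard `lzma` module. It is not the .xz block index itself: each chunk is compressed as its own independent stream and the offsets are kept in an application-level list. The names `BLOCK_SIZE`, `compress_blocks` and `read_at` are made up for illustration.

```python
import lzma

# Hypothetical block size: smaller blocks mean cheaper seeks but a worse ratio.
BLOCK_SIZE = 64 * 1024


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress each fixed-size chunk independently and record where it starts."""
    blobs = []
    offsets = []  # compressed start offset of each block
    pos = 0
    for start in range(0, len(data), block_size):
        blob = lzma.compress(data[start:start + block_size])
        offsets.append(pos)
        blobs.append(blob)
        pos += len(blob)
    offsets.append(pos)  # sentinel marking the end of the last block
    return b"".join(blobs), offsets


def read_at(compressed: bytes, offsets, offset: int, size: int,
            block_size: int = BLOCK_SIZE) -> bytes:
    """Return `size` uncompressed bytes at `offset`, decompressing only
    the blocks that overlap the requested range."""
    first = offset // block_size
    last = (offset + size - 1) // block_size
    last = min(last, len(offsets) - 2)  # clamp to the last real block
    out = bytearray()
    for i in range(first, last + 1):
        out += lzma.decompress(compressed[offsets[i]:offsets[i + 1]])
    skip = offset - first * block_size
    return bytes(out[skip:skip + size])
```

With this layout a read costs at most one extra block of decompression at each end of the range, which is why the block size controls seek cost.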
If I understand correctly, zstd might get official framing support one day (see facebook/zstd#395).
Looks like I have misunderstood lz4: according to the discussion at lz4/lz4#187, it does not support random access (out of the box).
Looks like xz shows more promise, so I'll switch my efforts to that.
The downside of xz is that it isn't particularly fast. Too bad there is no standardized framing support for zstd yet.
FYI, some criticism of the xz format: https://www.nongnu.org/lzip/xz_inadequate.html
xz certainly supports random access without seeking from the start, see: http://libguestfs.org/nbdkit-xz-filter.1.html
Revisiting this, I'm again considering looking into lz4. I think I misunderstood twice... and that it does support random access when compressing with independent blocks, which does seem to be the default.
Still interested in this... and I note liblz4 is now available in GDAL for other purposes.
Hmm, and again I'm back to thinking I probably misunderstood the lz4 frame format. Quote from https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md#introduction - "The data format defined by this specification does not attempt to allow random access to compressed data.".
Yeah, I doubt any compression method has a standardized way of encoding a table that maps uncompressed offsets to the offset of the start of a new frame (or equivalent mechanism).
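For illustration, the kind of table meant here could look like the sketch below. It is codec-agnostic and not any format's actual on-disk index: `FrameEntry`, `build_index` and `locate` are hypothetical names, and the per-frame sizes are assumed to be known from the writer.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameEntry:
    uncompressed_start: int  # offset of the frame's first byte in the original data
    compressed_start: int    # offset of the frame's first byte in the compressed file


def build_index(frame_sizes):
    """Build the table from (uncompressed_len, compressed_len) pairs, one per frame."""
    index = []
    u = c = 0
    for ulen, clen in frame_sizes:
        index.append(FrameEntry(u, c))
        u += ulen
        c += clen
    return index


def locate(index, offset):
    """Return (frame number, compressed start of that frame, offset inside the frame).

    Offsets past the end of the data are not checked here; they resolve
    to the last frame.
    """
    starts = [e.uncompressed_start for e in index]
    i = bisect_right(starts, offset) - 1  # last frame starting at or before offset
    if i < 0:
        raise ValueError("offset is before the first frame")
    entry = index[i]
    return i, entry.compressed_start, offset - entry.uncompressed_start
```

A reader would seek to the returned compressed offset, decompress that one frame, and skip the in-frame offset, so the lookup is a binary search rather than a scan from the start.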
That's exactly what xz has, and the existence proof of this is: https://gitlab.com/nbdkit/nbdkit/-/tree/master/filters/xz We use this routinely to random access the content of xz-compressed disk images without scanning or (fully) uncompressing them.
I would pursue vsixz but the decompression speed is no fun. Seems everyone is moving away from xz these days. But it's definitely sad that random access has moved to application level when xz proved it could be part of the general compression format.
Rationale: it would allow random access to compressed data.
Looks like xz is the only mainstream format that has random access built into the standard. I previously thought lz4 had the capability, but it seems it's a non-standard proof of concept. Looks like both xz and lz4 can do it, but I think lz4 is more attractive due to its impressive speed. I'm interested in making an attempt to implement this.