Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem Extracting 7z Archives #73

Open
hodm opened this issue Jul 26, 2015 · 8 comments
Open

Problem Extracting 7z Archives #73

hodm opened this issue Jul 26, 2015 · 8 comments

Comments

@hodm
Copy link
Contributor

hodm commented Jul 26, 2015

There is a problem when extracting small 7z files packed with ultra compression.
Exception: Data Error StackTrace: at SharpCompress.Compressor.LZMA.Decoder.Code(Int32 dictionarySize, OutWindow outWindow, Decoder rangeDecoder) in b:\Users\Hod\Documents\GitHub\sharpcompress\SharpCompress\Compressor\LZMA\LzmaDecoder.cs:line 357
get 7z file at: http://www.filedropper.com/mg83
inside there is a program.cs file that I used for test,
it uses the added option that I added in pull request for, but you can remove them.

@adamhathcock
Copy link
Owner

I have to confess that 7z is a drop in implementation from another project. If LZMA has changed I will unlikely be able to fix it unless another project has. If it's a matter of reading 7z headers correctly then it's possible but I don't have the time these days to devote to this.

@weltkante
Copy link
Contributor

The link doesn't seem to work for me, if you can refresh it I can take a look.

(I'm the author of the managed-lzma library used in some branches of SharpCompress, but the stack trace seems to come from the lzma-sdk implementation, right? I can take a look anyways since I understand most of the workings of the 7z format and may be able to spot where it goes wrong.)

@Anarchid
Copy link

This issue seems to be responsible here:
image

The archive in this case is http://springrts.com/dl/buildbot/default/develop/102.0.1-19-gfbc1b5b/win32/spring_%7bdevelop%7d102.0.1-19-gfbc1b5b_win32-minimal-portable.7z

Alternative implementations unpack it without a hitch.

@weltkante
Copy link
Contributor

weltkante commented Jul 18, 2016

Surprisingly my own library also has problems reading this particular archive. Didn't expect that since (unlike the sharpcompress version) it's a straight port of the official C/C++ code ... interesting, I'll take a look and try to figure out where the implementations diverge, since the official 7z application seems to be able to unpack it.

@weltkante
Copy link
Contributor

weltkante commented Jul 18, 2016

There's actually two issues with this file, one being that the file attributes are linux and not windows, thats (probably) the big red 81a48020 block going on in your screenshot. I'll need to have a look how the official C/C++ codebase detects and handles linux file attributes.

Then there's the exception when actually decoding the data, which is unrelated to the linux file attributes. I did have a similar exception in my own codebase which resulted from incorrect updating of the incremental lzma decoder when decoding large files. I'll have a look at the sharpcompress codebase if they have the same issue or if it is something else.

@weltkante
Copy link
Contributor

Updating my findings about linux file attributes: they are not part of the 7z spec and won't be written by the official 7z codebase.

Some third party implementations abused the fact that the official implementation is lazy in its error checking and started to hide linux file attributes in unused bits of the 7z metadata. Newer versions of the official 7z UI acknowledge this and try to parse the unofficial linux attributes, but it's still not part of the actual 7z archive codebase, it's only handled by the UI client.

Since it now seems to have become a kind of de-facto standard I'll update my implementation to support this unofficial extension and also prepare a pull request for SharpCompress to not treat this as a broken archive.

(Still need to look at the 2nd exception when decoding a stream.)

@weltkante
Copy link
Contributor

I've fixed the issues with this particular archive in my library, now taking a look at SharpCompress.

First thing I noted those big red blobs at the top of your screenshot are not actual errors, they are debug output from my old prototype 7z parser which someone must have copied over to SharpCompress and its still outputting a dump of the 7z structure every time it reads an archive in a debug build :-)

Anyways, even though thats not an error, since the linux file attributes make no sense in the context of .NET FileAttribute enum I'm still suggesting to strip this unofficial extension so consumers of the 7z archive are not trying to do something stupid. This should be safe because even with the posix extension present the creator of the archive should have set the low 15 bits to a valid Windows/.NET FileAttribute enum combination (which is the case in your example archive). I've done a pull request to make this change.

I'm still looking into the exception triggered during stream decoding, which is not the same as I had in my library. It's coming from the middle of the decoder so there is probably a bug in it. Unfortunately, as far as I know, the lzma decoder used in SharpCompress is based on very old code derived from a one-time Java port and nobody really supports it. I'll give it a try and run it side-by-side with my working lzma decoder, but I can make no promises that I can find the bug. In that case the only realistic short term fix I can suggest is to replace the lzma decoder with a working one.

weltkante added a commit to weltkante/sharpcompress that referenced this issue Jul 20, 2016
7z archives may require alternating reads from multiple substreams so it is important to seek before reading from the underlying stream. To keep performance at an acceptable level it is necessary to perform buffering because seeking on every single one-byte-read will destroy performance.
@weltkante
Copy link
Contributor

weltkante commented Jul 20, 2016

You are lucky, the decoder is not the problem.

7z archives may require multiple "views" into the underlying stream, as you can see in the CreateDecoderStream method which loops over a list of address ranges and creates multiple substreams from them.

In the current code you are creating ReadOnlySubStream instances which seek to the desired position and then are done with it. That's not going to work because all of your streams are operating on the same underlying stream and the last seek wins. All other ReadOnlySubStream instances are going to return garbage when reading data from them.

The obvious fix is to maintain your own position within the underlying stream and seek before every read, however since the RangeCoder classes used by BCJ2 and LZMA decoders both like to do 1-byte-reads this totally destroys performance, having to do a seek for every single byte requested from the underlying stream ;-)

Therefore I added a class BufferedSubStream, being a modified copy of the ReadOnlySubStream, having a builtin 32kb buffer simply to avoid having to seek on every request.

With the pull request applied I'm able to read the linked archive.

adamhathcock pushed a commit that referenced this issue Jul 21, 2016
7z archives may require alternating reads from multiple substreams so it is important to seek before reading from the underlying stream. To keep performance at an acceptable level it is necessary to perform buffering because seeking on every single one-byte-read will destroy performance.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants