
Node's zlib.ungzip does not support concatenated files #4306

Closed · yahao87 opened this issue Dec 16, 2015 · 15 comments
Labels
zlib Issues and PRs related to the zlib subsystem.

Comments

@yahao87

yahao87 commented Dec 16, 2015

Why does zlib.ungzip operate incorrectly?

It works in C#/.NET,
but it doesn't work in Node.js (Windows 7, Node v5.2.0).
Most files are fine, but the problem appears with some files.

in C# (with ICSharpCode)

using System.IO;
using ICSharpCode.SharpZipLib.GZip;

using (FileStream fs = File.OpenRead(sFileFullPath))
using (GZipInputStream gzipStream = new GZipInputStream(fs))
{
    // decompresses the whole file correctly
}

in NodeJs

var fs = require('fs');
var zlib = require('zlib');
var file = fs.readFileSync(sFileFullPath);
var extractBin = zlib.gunzipSync(file);
console.log('Compressed buffer size: ' + file.length);         // 134079
console.log('Uncompressed buffer size: ' + extractBin.length); // 14

Why is the decompressed output so small?
Here's the file in question:

http://static.kafra.kr/file/patch.rgz

Sorry for my bad English.

@mscdex mscdex added the zlib Issues and PRs related to the zlib subsystem. label Dec 16, 2015
@addaleax
Member

Your example gzip file actually consists of multiple streams, i.e. is the concatenation of more than one “raw” gzip file. See e.g. the gzip(1) manpage for some explanations on how the gzip command line program handles this.

Node.js, however, ignores anything after the first stream, which in your case is only 14 bytes long. A simplified reproduction:

var zlib = require('zlib');

var abc = zlib.gzipSync('ABC');
var def = zlib.gzipSync('DEF');
var decompressed = zlib.gunzipSync(Buffer.concat([abc, def]));
console.log(decompressed.toString()); // => ABC (!)

There was a fix for this a while ago, which resulted in a regression and was then reverted. The story so far (as far as I can tell):
nodejs/node-v0.x-archive#6032
nodejs/node-v0.x-archive#6442
nodejs/node-v0.x-archive#8962
nodejs/node-v0.x-archive#8985

@yahao87
Author

yahao87 commented Dec 18, 2015

I understand.
Why doesn't Node.js support multiple streams, like other languages' libraries do?
Hmm...
Thank you for the answer.

@addaleax
Member

I’d totally be in favour of supporting concatenated input files. If people here agree that this would be the intended behaviour, I’d be really interested in looking into it (maybe someone else here has time sooner), but I’m not sure I can make any promises on when I’d find time to do that.

@Fishrock123 Fishrock123 added the good first issue Issues that are suitable for first-time contributors. label Jan 7, 2016
@Fishrock123 Fishrock123 changed the title why did incorrect operate zlib.ungzip? Node's zlib.ungzip does not support concatenated files Jan 7, 2016
@Fishrock123 Fishrock123 removed the good first issue Issues that are suitable for first-time contributors. label Jan 7, 2016
@Fishrock123
Contributor

cc @chrisdickinson, @indutny, @trevnorris, @bnoordhuis -- you all seem to have dealt with the aforementioned issues; can any of you give more context?

@addaleax Thanks for triaging this! :)

@bnoordhuis
Member

IIRC the problem was that the streams paradigm doesn't work so well for concatenated archives? For the sync versions, we could create an overload or alternative that returns an array or an iterator. For the streaming versions, I'm not so sure; maybe an event for the unconsumed data that you can then feed back into a new zlib stream?

@addaleax
Member

addaleax commented Jan 7, 2016

the streams paradigm doesn't work so well for concatenated archives

Not sure whether this helps, but there might be a little confusion of terms here. This is not about providing multiple files to the end user (i.e. it is not about archives in the sense of a single file that in some way contains multiple files); rather, it’s about encoding a single file as the result of multiple, independent compression operations.
One (real-world) example use case would be parallel compression, i.e. splitting a file into multiple parts and compressing them independently, then concatenating the results. When using a command line tool or libraries in other environments (e.g. Python), the decoder then transparently outputs a single stream.
Other compression formats like bzip2 or xz also accept this format (the xz(1) tool has a --single-stream command line option to opt-out of this).

As far as I can tell, the problem with the original patch was that trailing zero bytes (used as padding) caused decompression to fail. I’d still say that the default behaviour should be to allow decompression of a single file consisting of concatenated streams, following the conventions set by the command line tools (and besides, it’s a cool feature of the compression format).

@chrisdickinson
Contributor

Ah, I think I remember this. The terms are confusing, but @bnoordhuis has it right, I think.

There was a patch to make the zlib module compatible with concatenated zlib datastreams, but the problem was that it broke existing code. IIRC this was because:

  • Our current zlib streams stop consuming as soon as they reach the end of the first zlib datastream.
  • Modules in the ecosystem have made assumptions about this (notably, the tarball handling in npm). Edit: it turned out that the problem was garbage bytes trailing a .tar.gz file, and those bytes caused node to hang while the impl tried to restart the zlib datastream over and over.

The workaround, if you're interested in taking it on, is to expose either:

  1. An option to the existing JS zlib streams that turns on concatenated datastream support, OR
  2. A second set of JS zlib transform streams that the user can use with concatenated datastreams.

@addaleax
Member

addaleax commented Jan 8, 2016

@chrisdickinson Yes, you are correct. However, the underlying assumption in npm’s tarball handling was not that only the first datastream from a gzip file would be extracted; npm expects to be able to extract .tar.gz files that are padded with trailing \0 bytes, and that is perfectly normal for gzip files. The original patch (1183ba4) did not account for that, but that’s an issue with the implementation of this feature, not with the feature itself.

So:

  • As already referenced in “Add support for concatenated gzip files” (node-v0.x-archive#6442), RFC 1952 explicitly mentions multiple streams as part of the format (and therefore I’d agree that this would be a bug fix, not really a feature). If Node.js really wants to diverge from the standard here, this should be documented somewhere.
  • Yes, 1183ba4 had incorrect error handling for trailing garbage. That can be fixed, though.
  • I don’t think this should affect decompression of the zlib format, only gzip. In gzip, multiple members are part of the spec, but for the raw zlib format, no such thing is mentioned anywhere, nor do the libraries of other languages allow concatenated zlib streams.

I agree that it’s a good idea to make this optional, but it should be active by default. And yes, if it’s okay with you, I’d be interested in working on this, but I can’t make any promises on when I’d get it done (probably within the next 1 or 2 weeks).

@chrisdickinson
Contributor

I'm definitely 👍 on getting support into core for this, and would love to see a PR for it — I'll note re:

I agree that it’s a good idea to make this optional, but it should be active by default.

The problem with defaulting this way is that the existing behavior has been out in the wild for a long time now, and folks have built programs that expect it; changing it could cause a lot of breakage without a commensurate upside. We're primarily beholden to the code that exists in the wild by way of our users, and we'd need a very good reason to release a version (even a major version) that breaks their code. For something like this we'd have to message the change well in advance (i.e., start warning that we're going to do it in the next major version, then actually flip the default in the major version after that).

@addaleax
Member

addaleax commented Jan 9, 2016

Okay, I’ll start working on it! :)

And I get that this would be semver-major; I just can’t really imagine that there’s any application that relies on the current first-stream-only behaviour. But I do see your point, and if you say this is the way it has to be done, then so be it.

@kthelgason
Contributor

@addaleax have you started working on this issue? If you have difficulty finding the time I can also work on this.

@addaleax
Member

addaleax commented Feb 1, 2016

@kthelgason Feel free to do that – I was busier than expected over the last few weeks, sorry.

@kthelgason
Contributor

@addaleax Great, I'd love to. You've made it really approachable for me with all the groundwork laid out in this thread 😄

@kthelgason
Contributor

If I understand correctly, one problem with the previous implementation of this feature was that NUL-padded .gz files caused failures. According to RFC 1952, the gzip file format is as follows (emphasis mine):

2.2. File format

      A gzip file consists of a series of "members" (compressed data
      sets).  The format of each member is specified in the following
      section.  The members simply appear one after another in the file,
      with no additional information before, between, or **after** them.

Doesn't this explicitly disallow trailing garbage? @addaleax, you state above that padding with \0 is perfectly normal; is this a case of implementations diverging from the standard, or am I misunderstanding something?

@addaleax
Member

addaleax commented Feb 5, 2016

@kthelgason Yes, I’d agree that trailing \0 bytes are a divergence from the quoted section of the standard text.

Then again, the gzip(1) util is also explicitly mentioned in the RFC as a reference implementation, and its behaviour, namely to allow trailing \0 bytes and to warn about (but effectively ignore) other trailing garbage, makes sense to me. There is actually a bit of explanation of this in the manpage.

I have no idea how many .gz files in the wild are actually padded with zero bytes, but if it’s possible, I wouldn’t see any harm in supporting trailing zeroes.

kthelgason added a commit to kthelgason/node that referenced this issue Mar 14, 2016
According to the spec gzipped archives can contain more than one compressed member. Previously Node's gzip implementation would only unzip the first member and throw away the rest of the compressed data. Issue nodejs#4306 is an example of this occurring in daily use.
evanlucas pushed a commit that referenced this issue Mar 15, 2016
According to the spec gzipped archives can contain more than one
compressed member. Previously Node's gzip implementation would only
unzip the first member and throw away the rest of the compressed data.
Issue #4306 is an example of this occurring in daily use.

Fixes: #4306
PR-URL: #5120
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: James M Snell <[email protected]>
rvagg pushed a commit that referenced this issue Mar 16, 2016
According to the spec gzipped archives can contain more than one
compressed member. Previously Node's gzip implementation would only
unzip the first member and throw away the rest of the compressed data.
Issue #4306 is an example of this occurring in daily use.

Fixes: #4306
PR-URL: #5120
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: James M Snell <[email protected]>