
Add new option -sGZIP_EMBEDDINGS that gzip the wasm output #21426

Closed

Conversation

msqr1

@msqr1 msqr1 commented Feb 26, 2024

This new setting gzips the wasm binary before embedding, which can reduce code size; it decompresses at startup via the JS DecompressionStream API. It doesn't work with WASM_ASYNC_COMPILATION off. It can increase startup time, but the code size benefit can be massive for a big project. For mine: -O0, 5.3MB --> 1.7MB; -O3, 3.7MB --> 1.4MB. Implements #21383.

Fixes: #21383

@msqr1 msqr1 changed the title Add new option -sSINGLE_FILE=2 that gzip the wasm output. Add new option -sSINGLE_FILE=2 that gzip the wasm output. (#21383) Feb 26, 2024
Collaborator

@sbc100 sbc100 left a comment


Thanks for working on this. It's great to see this working out.

I wonder if we can do this without adding yet another setting.

Instead, can we just make this the default for SINGLE_FILE whenever async compilation is available?

Also, I'd love to see the effect of this in one of our code size tests.

It looks like the current code size tests we have for SINGLE_FILE are:

  • other.test_minimal_runtime_code_size_hello_webgl2_wasm2js
  • other.test_minimal_runtime_code_size_hello_webgl2_wasm

See:

emscripten/test/test_other.py

Lines 10682 to 10684 in b5b7fed

random_printf_sources = [test_file('hello_random_printf.c'),
'-sMALLOC=none',
'-sSINGLE_FILE']

If you update this PR to make gzip the default then we can expect to see the output size for these tests drop.

Run the tests with --rebase to update the expectations for the output sizes.

Outdated (now resolved) review threads on: system/lib/libcxxabi/src/private_typeinfo.cpp, tools/settings.py, src/base64Utils.js
@curiousdannii
Contributor

The Compression Streams API is supported with these, which are quite a lot later than Emscripten's minimum supported browser versions:

Chrome: 80
Firefox: 113
Safari: 16.4

@msqr1
Author

msqr1 commented Feb 26, 2024

Increased startup time can be unexpected for some people, especially in large projects. That was what made me add another option.

@sbc100
Collaborator

sbc100 commented Feb 26, 2024

Increased startup time can be unexpected for some people, especially in large projects. That was what made me add another option.

Perhaps if we only enable it for -Oz (or at least enable it by default for -Oz) because we know that -Oz users are already opting into small code over fast code?

@sbc100
Collaborator

sbc100 commented Feb 26, 2024

Increased startup time can be unexpected for some people, especially in large projects. That was what made me add another option.

Fair enough. If we do make this opt-in then I suggest we go with a new setting name rather than an = 2 variant of an existing setting. Those = 2 variants have been a maintenance burden and tend to make the code harder to read.

So I suggest:

  1. A new setting
  2. Setting should default to true in -Oz mode when browser support allows.

@msqr1
Author

msqr1 commented Feb 27, 2024

@sbc100 Could you please elaborate on why the =2 variants are hard to maintain? I don't have much experience with Emscripten, sorry. I find adding a new option harder in this case, ngl, because we would have to do all the checks against both -sSINGLE_FILE and -sWASM_ASYNC_COMPILATION. This only works in tandem with -sSINGLE_FILE, so in this case we only need to look for #if SINGLE_FILE and ignore the rest.

@msqr1
Author

msqr1 commented Feb 27, 2024

Increased startup time can be unexpected for some people, especially in large projects. That was what made me add another option.

Perhaps if we only enable it for -Oz (or at least enable it by default for -Oz) because we know that -Oz users are already opting into small code over fast code?

Yes, I will do that too

@msqr1
Author

msqr1 commented Feb 27, 2024

The Compression Streams API is supported with these, which are quite a lot later than Emscripten's minimum supported browser versions:

Chrome: 80
Firefox: 113
Safari: 16.4

We could also have a polyfill (it can be complex IMHO, but could be worth it for large code)

@sbc100
Collaborator

sbc100 commented Feb 27, 2024

The Compression Streams API is supported with these, which are quite a lot later than Emscripten's minimum supported browser versions:
Chrome: 80
Firefox: 113
Safari: 16.4

We could also have a polyfill (it can be complex IMHO, but could be worth it for large code)

Yeah, I think in this case the cost of the polyfill would outweigh the benefit (sadly)

@curiousdannii
Contributor

True, using compression-streams-polyfill is only 17kb of code (7kb gzipped).

Alternatively, fflate could be used directly in non-streaming mode, which is apparently faster than the native API. It would be 4kb gzipped and then we wouldn't need to worry about browser support.

@sbc100
Collaborator

sbc100 commented Feb 27, 2024

True, using compression-streams-polyfill is only 17kb of code (7kb gzipped).

Alternatively, fflate could be used directly in non-streaming mode, which is apparently faster than the native API. It would be 4kb gzipped and then we wouldn't need to worry about browser support.

Oh wow! I stand corrected. If we can potentially save several MB then including a 4kb polyfill might be a really nice idea.

We could define some size threshold for it too, so it wouldn't be included for tiny modules.

@curiousdannii
Contributor

curiousdannii commented Feb 27, 2024

Yeah, fflate is really amazing. I use it for making .zips in the browser in another project.

I haven't used its async functions though. The WASM files might be big enough for that to be worth using.

@msqr1
Author

msqr1 commented Feb 27, 2024

Yeah, fflate is really amazing. I use it for making .zips in the browser in another project.

I haven't used its async functions though. The WASM files might be big enough for that to be worth using.

So that would make it work even with WASM_ASYNC_COMPILATION on! Also, decompression-only fflate takes just 3kb for synchronous and 4kb for async, not 8kb.

@sbc100
Collaborator

sbc100 commented Feb 27, 2024

I'm excited that this new setting might help me land #21217, since it could lessen the cost of the base64 encoding of the data file too.

return bytes;
#else
return new Response(new Blob([bytes]).stream().pipeThrough(new DecompressionStream("gzip"))).arrayBuffer();
Collaborator


Could we continue to allow the above nodejs optimization that avoids atob?

Author


Oh yeah sure, only in nodejs though

Collaborator


One way to do that, I think, would be to have intArrayFromBase64 continue to have its existing well-defined meaning, and we could add an outer gunzipByteArray function?

Contributor


Any chance this code can be made async? If so you can use fetch to decode base64:

export async function parse_base64(data: string, data_type = 'octet-binary'): Promise<Uint8Array> {
    // Parse base64 using a trick from https://stackoverflow.com/a/54123275/2854284
    const response = await fetch(`data:application/${data_type};base64,${data}`)
    if (!response.ok) {
        throw new Error(`Could not parse base64: ${response.status}`)
    }
    return new Uint8Array(await response.arrayBuffer())
}

Collaborator


Not easily I fear, since we need to continue to support synchronous compilation of the wasm file

Collaborator


I think it might be cleaner to layer this decompression on top of the base64 utilities, i.e. have the caller of tryParseAsDataURI take care of decompressing the result?

@msqr1
Author

msqr1 commented Feb 27, 2024

Hmm, one particular test (test_minimal_runtime_code_size_random_printf_wasm2js - test_other.other) is always failing

@sbc100 sbc100 changed the title Add new option -sSINGLE_FILE=2 that gzip the wasm output. (#21383) Add new option -sSINGLE_FILE=2 that gzip the wasm output Feb 27, 2024
@emscripten-core emscripten-core deleted a comment from msqr1 Feb 27, 2024
@msqr1
Author

msqr1 commented Feb 27, 2024

Our PRs are dependent lol

@sbc100
Collaborator

sbc100 commented Feb 27, 2024

Hmm, one particular test (test_minimal_runtime_code_size_random_printf_wasm2js - test_other.other) is always failing

If your change affects code size then you can run those tests with --rebase to update the expectations.

I usually run test/runner other.*code_size* other.*metadce* --rebase

@sbc100
Collaborator

sbc100 commented Feb 27, 2024

@sbc100 Could you please elaborate on why the =2 variants are hard to maintain? I don't have much experience with Emscripten, sorry. I find adding a new option harder in this case, ngl, because we would have to do all the checks against both -sSINGLE_FILE and -sWASM_ASYNC_COMPILATION. This only works in tandem with -sSINGLE_FILE, so in this case we only need to look for #if SINGLE_FILE and ignore the rest.

I've always regretted adding the =2 flavors such as -sWASM=2, -sMAIN_MODULE=2 and -sSIDE_MODULE=2. There are many different reasons, but maybe I can articulate one or two of them here. For one, it's not obvious to the user what they mean. For example, -sSINGLE_FILE=2 doesn't say as much to the reader as -sGZIP_EMBEDDED_DATA. When working on the codebase it also makes it hard to find all the places where the =2 version is used: rather than just grepping for SINGLE_FILE, I now also have to look for things like SINGLE_FILE = 2 or SINGLE_FILE != 1, etc.

I think if you add a new GZIP_DATA_URLS or GZIP_EMBEDDED_DATA option then you will find that it will most likely nest within a SINGLE_FILE block.

e.g.

#if SINGLE_FILE
#if GZIP_EMBEDDED_DATA
callA();
#else
callB();
#endif
#endif

I find this easier to follow than using an == 2 check nested inside the #if for the same option:

#if SINGLE_FILE
#if SINGLE_FILE == 2
callA();
#else
callB();
#endif
#endif

@sbc100
Collaborator

sbc100 commented Feb 27, 2024

Our PRs are dependent lol

OK if I land #21217 now? I think it should probably not conflict too much with this change and might even simplify it.

@sbc100
Collaborator

sbc100 commented Feb 27, 2024

Our PRs are dependent lol

OK if I land #21217 now? I think it should probably not conflict too much with this change and might even simplify it.

Actually I think maybe they won't conflict at all.

@curiousdannii
Contributor

If this was changed to use fflate, can it be used from npm, or would it have to be vendored?

@sbc100
Collaborator

sbc100 commented Feb 28, 2024

If this was changed to use fflate, can it be used from npm, or would it have to be vendored?

We could get it from npm (i.e. put it in the package.json for emscripten), assuming the npm version can easily be copied inline into the resulting module. But it might make more sense to add it as a git submodule.

@msqr1
Author

msqr1 commented Feb 28, 2024

What a change! This took SUPPORT_BASE64_EMBEDDING off my mind. But what is the use of SUPPORT_BASE64_EMBEDDING now anyway, given memory initialization is gone? Just SINGLE_FILE seems enough?

@msqr1
Author

msqr1 commented Mar 4, 2024

When fflate processes a stream, it outputs chunks. I can't find any way to instantiate wasm streaming from the chunks without reassembling them, which defeats the whole purpose of streaming. I love streaming, but it seems impossible here. I will add fflate's MIT license into the folder, and remove the implicit opt-in.

@msqr1
Author

msqr1 commented Mar 4, 2024

Anything else to improve?

@kripken
Member

kripken commented Mar 4, 2024

@msqr1

Oh, gzipping after base64 (what servers would do) is bad because base64 breaks the natural byte alignment that most compression algorithms rely on.

Makes sense. I did a quick test locally and base64+gzip makes a large wasm file (poppler) almost 50% larger than pure gzip.

I still wonder about the implications of the webserver doing gzip anyhow, which I think we can assume most sites use, and especially ones that care about efficiency. If the webserver does gzip then we can use a different encoding than base64 that does not break byte alignment. That is, if we do a more naive encoding of the binary into text, which is not as efficient as base64, but when composed with gzip it would be as efficient as gzip. It seems like that would give us all the efficiency benefits here, assuming the webserver uses gzip, and it would be simpler (no fflate etc.)?

@juj
Collaborator

juj commented Mar 5, 2024

I did some tests on a .wasm file of Unity's Boat Attack demo project, which is 35.5 MB worth of Wasm: BoatAttack.Unity2023.2.wasm.gz

1. On this file, I reproduced that gzip-compressing base64-encoded data performs poorly:

Uncompressed: 35,554 KB
Base64-encoded: 47,406 KB
Base64-encoded then gzipped: 13,701 KB

Gzipped: 9,736 KB
Gzipped then Base64-encoded: 13,387 KB
Gzipped then Base64-encoded then Gzipped: 10,031 KB (-26.8% smaller than Base64-encoded then gzipped)

2. Brotli exhibits similar behavior, but not nearly as drastically:

Uncompressed: 35,554 KB
Base64-encoded: 47,406 KB
Base64-encoded then brotlied: 10,999 KB

Brotlied: 8,163 KB
Gzipped then Base64-encoded then Brotlied: 9,865 KB (only -10.3% smaller than Base64-encoded then brotlied)

3. Out of curiosity, I did an ad hoc python test script that embeds the binary data in strings in a way that doesn't straddle bits across multiple characters.

data = open('uncompressed.wasm', 'rb').read()

#b = []
#for i in range(256):
#  b += [i]
#data = bytearray(b)

f = open('output.js', 'wb')
f.write("var js = '".encode('utf-8'))
for d in data:
  if d == 0: f.write(bytearray([196, 128])) # Replace null byte with UTF-8 \x100 (dec 256) so it can be reconstructed with a simple "x & 0xFF" operation.
  elif d == ord("'"):  f.write("\\'".encode('utf-8')) # Escape single quote ' character with a backslash since we are writing a string inside single quotes. (' -> 2 bytes)
  elif d == ord('\r'): f.write("\\r".encode('utf-8')) # Escape carriage return 0x0D as \r -> 2 bytes
  elif d == ord('\n'): f.write("\\n".encode('utf-8')) # Escape newline 0x0A as \n -> 2 bytes
  elif d == ord('\\'): f.write('\\\\'.encode('utf-8')) # Escape backslash \ as \\ -> 2 bytes
  else:                f.write(f'{chr(d)}'.encode('utf-8')) # Otherwise write the original value encoded in UTF-8 (1 or 2 bytes).

f.write("';".encode('utf-8'))

Decompressing would look like follows:

// Output string from the above compressor
var js = 'Ā�������� \n�\r������������������ !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~��������������������������������� ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ';

// Decompression: read individual byte indices
for(var i = 0; i < js.length; ++i) { console.log(js.charCodeAt(i)&0xFF); }

// Decompression: convert to typed array
var u8array = Uint8Array.from(js, x => x.charCodeAt(0));
console.log(u8array);

(will need more correctness testing, but briefly looking at it, Node, Firefox and Chrome on my system were able to parse the original binary data back from the generated string)

With this kind of an ad hoc "binary-embedded" encoding, I get the following data for the same BoatAttack .wasm file:

Binary-embedded uncompressed .js file: 45,597 KB (+28.2% bigger than original binary file, -3.79% smaller than Base64-encoded)
Binary-embedded gzipped: 10,416 KB (+6.98% bigger than original gzipped, -23.98% smaller than Base64+gzipped)
Binary-embedded brotlied: 8,584 KB (+5.16% bigger than original brotlied, -21.96% smaller than Base64+brotlied)

So this kind of format would still gzip and brotli quite well, and would avoid double compression.

EDIT: A small improvement to the above code would be to offset all values by +1 before encoding, and then when decoding, do a -1 to balance. This avoids the issue that zero (a common number) would get encoded by two bytes, although the original variant has a benefit of data strings being directly human-readable. This shifted variant would yield:

Uncompressed .js file: 41,578 KB (down from 45,597 KB of the &0xFF version)
Binary-embedded gzipped: 10,218 KB (down from 10,416 KB)
Binary-embedded brotlied: 8,450 KB (down from 8,584 KB)
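
The shifted variant described in the EDIT could be sketched like this (hypothetical helper names for illustration, not the prototype's actual code):

```javascript
// Hypothetical sketch of the "+1 shift" variant: each byte b is stored as
// code point b + 1, so the common 0x00 byte encodes as one UTF-8 byte
// instead of the two bytes that U+0100 takes in the &0xFF scheme above.
function encodeShifted(bytes) {
  // Note: when writing this string into a .js file, \ ' " \r \n would
  // still need escaping, as in the encoder script above.
  return Array.from(bytes, b => String.fromCharCode(b + 1)).join('');
}

function decodeShifted(str) {
  // Balance the +1 applied at encode time; & 0xff keeps the result a byte.
  return Uint8Array.from(str, ch => (ch.charCodeAt(0) - 1) & 0xff);
}

const input = new Uint8Array([0, 1, 13, 39, 92, 255]);
console.log(decodeShifted(encodeShifted(input)).join(',')); // 0,1,13,39,92,255
```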

@curiousdannii
Contributor

curiousdannii commented Mar 5, 2024

Simple hex encoding is also an option, at the cost of increasing the ungzipped JS file by 50% (compared to base64). But it would then be HTTP compressed effectively. And in theory you could probably process it 4 bytes at a time (8 characters)? Don't know if that would be faster or slower than one at a time charCodeAt.
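
As a minimal sketch of the hex idea (illustrative only; byte-at-a-time rather than the speculated 4-bytes-at-a-time variant):

```javascript
// Illustrative hex decode: two characters per byte (50% larger than base64
// before compression), but byte-aligned, so HTTP compression handles it well.
function hexToBytes(hex) {
  const out = new Uint8Array(hex.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return out;
}

console.log(Array.from(hexToBytes('0061736d'))); // [ 0, 97, 115, 109 ]
```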

@msqr1
Author

msqr1 commented Mar 5, 2024

We can even use base122 for only 19% embedding overhead, at the cost of a worse compression ratio after embedding (what servers would do). It seems that if we use a more space-efficient embedding method, we get worse compression on the server; if we use a less space-efficient one, more gets done on the server. I don't really know where the sweet spot is.

@juj
Collaborator

juj commented Mar 5, 2024

Here's a quick prototype of such binary encoding: main...juj:emscripten:binary_encode

emcc test\hello_world.c -o base64.html -sSINGLE_FILE -sMINIMAL_RUNTIME -Oz

yields

4,452 base64.html
2,324 base64.html.gz
2,273 base64.html.br

whereas

emcc test\hello_world.c -o binenc.html -sSINGLE_FILE -sMINIMAL_RUNTIME -sSINGLE_FILE_BINARY_ENCODE -Oz

yields

3,650 binenc.html
1,982 binenc.html.gz
1,927 binenc.html.br

Synchronously decoding the binary encoding should be faster than synchronously decoding base64 at least in browser context, since the code to decode (in browser context) is a strict subset of computations that base64 decoding performs.

I wonder if there would be something here that would improve your use case @msqr1?

@sbc100
Collaborator

sbc100 commented Mar 5, 2024

It seems like there are 3 main concerns that you have about shipping/using the fflate method, @juj. Please correct me if I'm wrong:

  1. Licensing concerns
  2. Concerns that webserver compression would make this not an overall win for bytes-over-wire
  3. Concerns about the perf cost of double decompressing on the client

Regarding (1), it's MIT licensed so I don't think that will be an issue, right? We already include a bunch of MIT-licensed third party JS code, don't we? e.g. the polyfills?

Regarding (2), it looks like you have shown yourself that it can be a win, right?

Regarding (3), isn't it the case that -Oz and -Os are a signal that the user would prefer slower code if it means smaller code? I'm happy to see this change land without enabling it by default if this concern persists.

Regarding your alternative approach using a custom encoding: that sounds even better, but I also don't see a need to spend too much time on it if we already have a clear size win from using fflate (even when combined with server-side brotli). We can always follow up with even better improvements if somebody wants to work on this, right?

@msqr1
Author

msqr1 commented Mar 5, 2024

I care about size; I think download speed can impact startup more than decoding speed. So I think we'd have to test all of base16/32/64/85; it's already a win for fflate, and this would just be a small improvement over base64. Ngl, I think we patch some stuff up and merge this PR; improvements can be done in another one.

@kripken
Member

kripken commented Mar 5, 2024

Thanks for doing those experiments @juj!

This is my summary of them:

|                         | server has gzip | server lacks gzip |
| ----------------------- | --------------- | ----------------- |
| the old base64 encoding | large           | large             |
| this PR's gzip encoding | tiny            | tiny              |
| simple juj encoding     | tiny            | large             |

(tiny/large is always relative to a column, not a row)

If we care about servers without gzip then this PR is the best option: it's always the tiniest, because if there is server gzip then another gzip on top is almost a no-op, and if there is no server gzip then this is a big win.

But if we do not care about servers without gzip then @juj's simple encoding is just as small, because gzipping it is efficient.

There are also factors like compression time that we could measure, but just regarding binary size the tradeoffs seem clear.

Do we care about servers without gzip? I'd personally guess "no" - we should encourage servers to use gzip, and not add complexity on our side as a workaround for them.

@juj
Collaborator

juj commented Mar 5, 2024

Thanks @sbc100 , that's a good summarization.

  1. Licensing concerns
    Regarding (1), it's MIT licensed so I don't think that will be an issue, right? We already include a bunch of MIT-licensed third party JS code, don't we? e.g. the polyfills?

For Unity the choice of license itself is not the whole problem; it's the fact that any new body of third-party code entering the distributable requires a license audit (Unity legal maintains a network graph of licenses of all code it ships).

From the Emscripten repository's perspective the licensing issue was solved by adding the license file in the third_party/fflate/ directory (thanks @msqr1), so it is tightly bundled with the code. As long as all that third party code neatly lives in its own subdirectory so it is easily identified (so that I have an easy time removing it in our redistributions if I need to), the licensing concern is all clear.

  2. Concerns that webserver compression would make this not an overall win for bytes-over-wire
    Regarding (2), it looks like you have shown yourself that it can be a win, right?

Indeed, after my previous comment, I have now seen the light that gzip->base64->gzip/br can produce a considerably smaller file than base64->gzip/br alone (~26.8% smaller with gzip, vs ~10.3% smaller with br).

  3. Concerns about the perf cost of double decompressing on the client
    Regarding (3), isn't it the case that -Oz and -Os are a signal that the user would prefer slower code if it means smaller code?

This is the part that I have been pondering about. In my view, overall I can find the following reasons to target small code size at the expense of a tradeoff somewhere else:

  1. We want the page startup time to be lower. A big part of page startup is the download step.
  2. We want there to be less code to make it easier to navigate through and debug or reason/learn about by a human reader.
  3. We want the disk size to be smaller, due to available hard disk size limitation or due to some CDN file size limitations, or e.g. (demo) competition entry size limitations.

Point 2 generally doesn't apply here in either positive or negative direction before or after, so can be ignored in this context.

This leaves points 1 and 3. When I think more about this, it is quite rare that we would have a competition polarizing these two points, i.e. that optimizing for disk size could pessimize startup times. But this type of extra compression step certainly would be one such situation that places 1 and 3 against each other, instead of optimizing for both simultaneously.

When developers choose -Oz, I think most of them do so in order to pursue 1. (i.e. choose -Oz so that their build will start up quicker, even though it might run slower) In specific cases they could be pursuing 3 instead. (i.e. choose -Oz so that their build will fit in their CDN, even though that might make it start up slower, and also run slower)

Of course there is a possibility that -25% smaller pre-gzipped file size could still win on both 1 and 3 simultaneously, but that comes down to benchmarking a competition between user's network download speed and their CPU decoding speed, so it is not immediately trivial which one wins.

I am ok with landing this PR as it currently stands, since it comes with an option to let users configure it.

I think we have to test all base16,32,64,85, it is already a win for fflate

I don't think any of the baseX methods will be good for this use; they will all likely have similarly poor compressibility, due to the same effect as base64.

The binary encoding feature I mention above, is smaller for download size than the two layers of gzip, and only requires one layer of gzip (or br). So it will be both faster to download and faster to start up after download.

this is just some mini improvement over base64

If "this" here refers to the binary encoding I mentioned above, then it is not at all just a mini improvement, but a further -14.3% reduction compared to the gzip->base64->br step, i.e. a bigger improvement than what the base64->br -> gzip->base64->br optimization in this PR itself provides (-10.3%).

Or if compared to how far these encodings are from the optimal "just brotli compressed" code:

brotlied: 8,163 KB
gzip->base64->br: 9,865 KB (+20.85% bigger than just brotli)
binenc->br: 8,450 KB (only +3.52% bigger than just brotli, i.e. -14.34% smaller than gzip->base64->br)

There are two potential downsides however:
a) it optimizes for the scenario where web server compression is used, and not for CDN setups that aren't configured for compression.
b) the binary encoding method has a unique quirk that text editors might get confused by such UTF-8 codepoints inside strings. It stresses the editor's strict adherence to UTF-8 encoding. For example, in Sublime Text I now see I am unable to copy-paste such compressed strings to another file, since Sublime Text wants to reformat the horizontal tab (0x09) characters * inside * the strings into spaces (I'd call that a bug in tabs-vs-spaces, since it attempts to mutate inside strings). Overall, these types of files would likely not be good for heavy manual editing in text editors. (I find Sublime Text has no problems saving such file though, as long as I don't poke the binary data in that file)

So if you're looking to get the smallest download with fastest startup, binary encoding will be a clear winner over gzip+base64. But if you need to see the small file size number on disk, then this kind of binary encoding would not help that. (there would be a way to get to small files by manually pre-compressing your files to foo.html.br or foo.html.gz in advance for the web server, but that scheme will need server configuration to utilize serving such pre-compressed files)

@msqr1 Would this kind of binary encoding method obsolete the need for -sGZIP_EMBEDDINGS for you? Or would that mean that we'd be looking at maintaining two features side by side, if we landed this PR, and later added a -sBINENC_EMBEDDINGS feature?

@juj
Collaborator

juj commented Mar 5, 2024

Thanks for doing those experiments @juj!

This is my summary of them:
|                         | server has gzip | server lacks gzip |
| ----------------------- | --------------- | ----------------- |
| the old base64 encoding | large           | large             |
| this PR's gzip encoding | tiny            | tiny              |
| simple juj encoding     | tiny            | large             |

Collecting the actual numbers from Unity's Boat Attack wasm file into a table of this form, the data looks like this:

|                         | server has brotli | server has gzip | server lacks any compression |
| ----------------------- | ----------------- | --------------- | ---------------------------- |
| the old base64 encoding | 12.9MB            | 15.5MB          | 48.9MB                       |
| this PR's gzip encoding | 9.9MB             | 10.0MB          | 13.4MB                       |
| simple juj encoding     | 8.5MB             | 10.2MB          | 41.6MB                       |
| standalone .wasm        | 8.1MB             | 9.7MB           | 35.6MB                       |

If the server has brotli compression set up, then the binary encoding will be superior to gzip+base64+gzip (-14% smaller). And with the server having gzip it is practically equal (+2% bigger).

If feasible for the use case, I think this binary encoding will be the best for bytes transferred over the wire, and its startup-time decoding step will take a fraction of the time a base64 decode does.

@msqr1
Author

msqr1 commented Mar 5, 2024

Well, I just want a small size with a good startup time, juj's approach seems superior, and without the need for a third-party library, I'd take juj's approach.

@sbc100
Collaborator

sbc100 commented Mar 6, 2024

If we called the option something generic like -sCOMPRESS_EMBEDDINGS then we could land this as-is while leaving the door open for future improvements. However, if somebody wants to work on a version of this PR with @juj's encoding right away I'm also fine with that.

@msqr1
Author

msqr1 commented Mar 6, 2024

@kripken Just curious, do you happen to still have that direct binary embedding into JS source code? I want to see how that is even possible. Also, I wouldn't call this new setting -sCOMPRESS_EMBEDDINGS; following juj, this would be a direct change to -sSINGLE_FILE, as this is a new encoding method.

@msqr1
Author

msqr1 commented Mar 6, 2024

So we maintain both base64 and juj encoding?

@kripken
Member

kripken commented Mar 6, 2024

@msqr1

Just curious, do you happen to keep the direct binary embedding to JS source code, I want to see how is that even possible.

Sorry, I'm not sure what you're asking here? But if you mean @juj's idea for encoding then the code is in #21478.

So we maintain both base64 and juj encoding?

That's worth discussing, good question. Let's see how #21478 looks. But in general it would be nice to have just a single encoding here, if there are no serious downsides.

@msqr1
Author

msqr1 commented Mar 6, 2024

  1. experiment with embedding binary data directly - we used to have that long ago; Alon implemented such an embedding, where only certain bad byte sequences would be escaped on demand, but all other characters would come through directly. IIRC that was pretty good in terms of size, but due to discrepancies between browsers, it was a constant source of bugs.

@kripken juj said you had this; I just want to see how on earth this is possible, and what it means to have discrepancies between browsers.

@curiousdannii
Contributor

One downside with @juj's encoding is that bundlers often default to ASCII output. If someone isn't careful they could massively increase the file size. But at least it would still HTTP compress effectively...

@kripken
Member

kripken commented Mar 6, 2024

@msqr1

[..] Alon implemented such an embedding, where only certain bad byte sequences would be escaped on demand, but all other characters would come through directly. [..]

Hmm... interesting, I don't remember that 😄 It's been too long maybe. @juj do you remember any more details?

@juj
Collaborator

juj commented Mar 7, 2024

Hmm... interesting, I don't remember that 😄 It's been too long maybe. @juj do you remember any more details?

I dug up the git history now. Looks like it was not Alon who did it, but another contributor @gagern (hi if you're still around).

Originally the feature was added as a linker option -sMEM_INIT_METHOD=2 back in the asm.js days when the program static/global data section needed to be embedded into the asm.js file.

The feature was introduced in this PR in June 2015. I am not sure which PR removed it (I see a trail of #5401 and #5296). PR #12325 (Sep 2020) finally removed the code, but it seems the feature had already been obsoleted before that.

Re-reading the implementation of that method, it was a bit different than what I remembered, as it used ASCII encoding and escaped bytes >= 0x80 as \x80 - \xff. Reading back the comments, it looks like the root cause of problems back then was the presence of a null byte inside a string, and issues with Closure (which are no longer a problem).

The method I propose in #21478 does things a bit differently:

  • the null byte is never emitted,
  • regular UTF-8 characters (which encode to two bytes) are used instead of four-byte \x?? ASCII sequences, for smaller code size,
  • apart from that, the characters \ ' \n \r are escaped in both schemes,
  • I also escape ", since we have an optimizer pass that mass-converts ''-quoted strings into ""-quoted strings without escaping the "s inside them, which broke the compression in tests.
  • Interestingly, in the old PR fix_string_meminit_windows #3854 I had observed and fixed that Windows could not cope with the character 0x1A in MEM_INIT_METHOD=2, which caused tests to fail. Now in Embed wasm by encoding binary data as UTF-8 code points. #21478 I no longer observe that in the Emscripten test suite even when 0x1A bytes are present in the encoding, so it should be safe.
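For concreteness, here is a rough sketch of such an escape-based string embedding (my own illustration, not the actual #21478 implementation — the exact escape set there may differ): every byte passes through as a single character (bytes >= 0x80 become two-byte UTF-8 sequences when written to disk), except the null byte and the few characters that would break the JS string literal.

```python
# Sketch only: escape set is illustrative, not the exact one from #21478.
ESCAPES = {
    0x00: "\\x00",   # never emit a raw null byte
    0x0A: "\\n",
    0x0D: "\\r",
    0x22: '\\"',
    0x27: "\\'",
    0x5C: "\\\\",
}

def encode(data: bytes) -> str:
    # Bytes 0x80-0xFF pass through as chr(b), which serializes to a
    # two-byte UTF-8 sequence on disk (vs four bytes for \x?? escapes).
    return "".join(ESCAPES.get(b, chr(b)) for b in data)

def decode(s: str) -> bytes:
    # Inverse of encode(), to verify the round trip.
    out, i = bytearray(), 0
    unescape = {"n": 0x0A, "r": 0x0D, '"': 0x22, "'": 0x27, "\\": 0x5C}
    while i < len(s):
        if s[i] == "\\":
            nxt = s[i + 1]
            if nxt == "x":
                out.append(int(s[i + 2:i + 4], 16))
                i += 4
            else:
                out.append(unescape[nxt])
                i += 2
        else:
            out.append(ord(s[i]))
            i += 1
    return bytes(out)

payload = bytes(range(256))
assert decode(encode(payload)) == payload
assert "\x00" not in encode(payload)  # no raw nulls in the output
```

The round trip over all 256 byte values shows the scheme is lossless while keeping the emitted string free of nulls and unescaped quote characters.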

@sbc100
Collaborator

sbc100 commented Mar 7, 2024

Out of interest, is this embedding/encoding only ever used for the wasm binary file? I seem to remember that wasm2js uses its own internal base64 encoding for data segments, so it won't be affected by this change. Are there any other places where we support embedding binary data?

Would we expect the difference between base64 + gzip vs gzip + base64 to be equally significant for all binary data (e.g. static data segments), or is it fairly specific to wasm binaries?

@msqr1
Author

msqr1 commented Mar 7, 2024

Would we expect the difference between base64 + gzip vs gzip + base64 to be equally significant for all binary data (e.g. static data segments), or is it fairly specific to wasm binaries?

It should be equally significant for all data, not just wasm binaries

@msqr1 msqr1 closed this Mar 24, 2024
@msqr1
Author

msqr1 commented Apr 22, 2024

@sbc100 Could you please elaborate on why the =2 variants are hard to maintain? I don't have much experience with Emscripten, sorry. Honestly, I find adding a new option harder in this case, because we would have to do all the checks against both -sSINGLE_FILE and -sWASM_ASYNC_COMPILATION. This feature only works in tandem with -sSINGLE_FILE, so with a =2 variant we would only need to look for #if SINGLE_FILE and could ignore the rest.

I've always regretted adding the =2 flavors such as -sWASM=2, -sMAIN_MODULE=2, and -sSIDE_MODULE=2. There are many different reasons, but maybe I can articulate one or two of them here. For one, it's not obvious to the user what they mean. For example, -sSINGLE_FILE=2 doesn't say as much to the reader as -sGZIP_EMBEDDED_DATA. It also makes it hard to find all the places in the codebase where the =2 version is used: rather than just grepping for SINGLE_FILE, I now have to look for things like SINGLE_FILE == 2 or SINGLE_FILE != 1, etc.

I think if you add a new GZIP_DATA_URLS or GZIP_EMBEDDED_DATA option then it will most likely nest within a SINGLE_FILE block.

e.g.

#if SINGLE_FILE
#if GZIP_EMBEDDED_DATA
callA();
#else
callB();
#endif
#endif

I find this easier to follow than an == 2 check nested inside the #if for the same option:

#if SINGLE_FILE
#if SINGLE_FILE == 2
callA();
#else
callB();
#endif
#endif

Honestly, I still like the flavor-of-a-setting idea: it lets users know that this is a variant without typing too much, while staying grep-able and keeping the same meaning. Instead of MAIN_MODULE=2 we could do it like SUPPORT_LONGJMP, where it would be MAIN_MODULE=dce. What do you think @sbc100?
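To make the named-variant idea concrete, here is a hypothetical sketch (the function name, accepted values, and aliases are made up for illustration; this is not how Emscripten's real settings parser works) of a string-valued setting in the spirit of SUPPORT_LONGJMP=emscripten/wasm:

```python
# Hypothetical sketch: a named variant for MAIN_MODULE instead of the
# opaque =2 spelling. Names and legacy aliases are illustrative only.
def parse_main_module(value: str) -> str:
    aliases = {"1": "all", "2": "dce"}  # keep the old numeric spellings working
    value = aliases.get(value, value)
    if value not in ("all", "dce"):
        raise ValueError(f"invalid -sMAIN_MODULE value: {value}")
    return value

# The named form and the legacy numeric form mean the same thing:
assert parse_main_module("dce") == parse_main_module("2") == "dce"
```

The alias table keeps old command lines working while new ones can spell out the intent, which also keeps a single grep-able setting name in the codebase.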

@sbc100
Collaborator

sbc100 commented Apr 23, 2024

So for this setting you are proposing something like -sSINGLE_FILE=gzip? That sounds reasonable to me, yes.

Although one potential downside is that it wouldn't then cover other types of data embedding that we might have… but perhaps this is the only remaining place we do data embedding in JS, so it might be fine.

@msqr1
Author

msqr1 commented Apr 24, 2024

No, I kinda gave up on this; I was talking about your general concern over =2 variants.
