fix_string_meminit_windows #3854

juj · 2015-10-15T13:42:37Z

Fix memory string initializer to generate format that is safe for text files. Perform Emscripten optimizer cleanup pass in binary file read and write mode to not get confused by the special characters in the string initializer. Skip closure compiler when memory string initializer is used on Windows, since closure trips over when presented with such input. Related to #3326. cc @gagern .

juj · 2015-10-15T13:43:02Z

This fixes Windows bot tests browser.test_meminit_pairs and browser.test_meminit_big.

kripken · 2015-10-15T17:33:21Z

tools/js_optimizer.py

-      f.write(open(out_file).read())
-  f.write('\n')
+      f.write(open(out_file, 'rb').read())
+  nl = '\r\n' if WINDOWS else '\n' # Writing in binary mode, so make sure to generate correct newline output


can just use os.linesep

kripken · 2015-10-15T17:35:47Z

I wonder if after removing those special characters, this compression is still efficient?

cc @gagern

juj · 2015-10-16T14:35:54Z

That's a good question. Also, I think we need to research to some specs (which?) to figure out what are the guaranteed safe bytes in .js files. My impression is that it is outright incorrect to have the 00 byte in a text file, but if I'm wrong, we should build some confidence from the relevant encoding specifications to be sure.

One thing I notice is that blindly replacing text mode file reads and writes to binary mode reads and writes will lead to trouble on Windows. On Windows we have had the convention of generating \r\n output .js files, and I'd like to preserve that, however, if not careful, I see that those line endings start becoming mixed \n with \r\n, and I even saw outright broken \r\r\n bytes in a file, so one must be very careful and precise when writing line separators correctly when transforming from text mode to binary mode.

gagern · 2015-10-16T20:50:49Z

tools/shared.py

-    return re.sub('[\x80-\xff]', escape, s)
+    s = re.sub('\x00(?![0-9])', '\\0', s) # Don't allow a null byte in the generated output file.
+    s = re.sub('[\x00-\x1f\x80-\xff]', escape, s) # Don't allow ASCII chars 0-31 or 128-255 in the generated output file.
+    return s


You might want to apply similar changes to the uglify to_ascii function.

Hmm, that is likely true, but I'm a bit wary to make that in blind. It could be done "for good measure" but I think it's better to defer until there is a problem/test to demonstrate the need.

I remember that in #3326 (comment) I detected someone postprocessing the strings, and in #3326 (comment) I identified uglify as the culprit. So there was some setup where I could enable that processing step without even trying. Likely tied to some optimization level. Not sure whether that applies anymore; I've not been updating my emscripten toolbox in some time. If the code still gets run, it should affect Windows builds the same way as the original problem here.

Thanks, reading that, I notice that we also have that asm2m tests harness that's commented out, I'm running through that now to make sure it's all good. Hopefully I'll see that same issue.

gagern · 2015-10-16T21:10:34Z

Regarding the question whether control bytes in general and null bytes in particular are permissible in JavaScript, see #3326 (comment) referencing section 7.8.4 of the ECMA 262 spec. But just because the spec allows something doesn't mean all the tools along the way can actually deal with it. In particular since the spec talks about characters, without regard for encoding, while some file format interpretations might treat control bytes as non-characters.

Regarding the efficiency of the compression, testing beats philosophizing. But it depends a lot on what kind of code you're dealing with, in particular the split between text and numbers, and whether those numbers tend to be small or random. For comparison, we could reactivate the idea from #3326 (comment) of XORing every byte with 32, hopefully turning more control bytes into digits and other printable characters than the other way round.

On the whole, I very much like the gist of this change. It's something I intended to do for quite a while but never find the time to. Thanks a lot!

kripken · 2015-10-26T23:04:57Z

I suppose we should just merge this, and do further measurements/testing later?

kripken · 2016-01-23T00:57:47Z

What's the status here?

juj · 2016-01-25T17:40:49Z

Rebased this on top of latest incoming to avoid very long branches in git history, since there's been a lot of commits in between. No functional changes. Retested that browser.test_meminit_big and browser.test_meminit_pairs pass after this PR.

This PR does make the escaping a bit more conservative that raw bytes 00-31 (decimal) are not allowed in the generated output, but I don't think we have tests to measure how large the effect is. At least the feature will work on Windows, and gives a method to improve from there. I'd be ok merging this in.

kripken · 2016-01-25T18:05:44Z

tests/test_browser.py

@@ -2943,7 +2943,8 @@ def test_meminit_pairs(self):
    args = ["-O2", "--memory-init-file", "0", "-s", "MEM_INIT_METHOD=2", "-s", "ASSERTIONS=1"]
    self.btest(d, expected='0', args=args + ["--closure", "0"])
    self.btest(d, expected='0', args=args + ["--closure", "0", "-g"])
-    self.btest(d, expected='0', args=args + ["--closure", "1"])
+    if not WINDOWS: # Closure compiler fails on Windows when it receives the unexpected characters in memory string initializer.


is this and the similar thing below still needed, even after this pr?

I think so, I'll test and reconfirm.

if it still is, that seems like either a bug we should fix here, or a bug in closure if it behaves differently on different OSes?

My suspect at the time was that it would be a Closure bug, but I did not dig deep enough to understand and confirm. I'm now running through the asm2m suite to see what comes up, so I'll see after that if this is still an issue.

…t files on Windows.

…hed line endings.

juj · 2016-01-26T00:49:21Z

Looking closer at the asm2m suite gave better test cases to work with, and realized now with playing more at the call site in uglify.js that @gagern pointed out that as long as we are being consistent with ascii/binary file opens (which we already were before the previous version of this PR!), it looks like it's only the ^Z character that is causing trouble in Windows builds. So this PR changed form quite a bit, and in the end only the hex code 0x1a needed to be banned, in the two call sites shown.

The PR also now piggybacks a slightly related helper commit that improves error message prints when our line endings mismatch. @gagern, @kripken: how does this look now?

kripken · 2016-01-26T00:59:12Z

Oh, I see the last commit does that, nevermind that comment comment then.

kripken · 2016-01-26T02:06:33Z

lgtm

kripken · 2016-01-26T02:12:53Z

I can't merge to master today anyhow, so let's merge this now.

fix_string_meminit_windows

…e#3326" This reverts commit b256af6. Since emscripten-core#3854 started escaping \x1a, this should no longer be needed. Conflicts due to code reformatting had to be resolved manually.

juj added the windows label Oct 15, 2015

kripken reviewed Oct 15, 2015
View reviewed changes

juj force-pushed the fix_string_meminit_windows branch from b3e0074 to c8b9d6d Compare October 16, 2015 14:25

gagern reviewed Oct 16, 2015
View reviewed changes

juj force-pushed the fix_string_meminit_windows branch from c8b9d6d to df96b45 Compare January 25, 2016 17:27

kripken reviewed Jan 25, 2016
View reviewed changes

juj added 2 commits January 26, 2016 02:40

Fix memory string initializer to generate format that is safe for tex…

d2a7e64

…t files on Windows.

Improve tools/line_endings.py to show counts and locations of mismatc…

a761768

…hed line endings.

juj force-pushed the fix_string_meminit_windows branch from df96b45 to a761768 Compare January 26, 2016 00:41

Clarify Emscripten specific localmod comment.

17f8dd9

kripken added a commit that referenced this pull request Jan 26, 2016

Merge pull request #3854 from juj/fix_string_meminit_windows

df17a47

fix_string_meminit_windows

kripken merged commit df17a47 into emscripten-core:incoming Jan 26, 2016

juj mentioned this pull request Jun 17, 2017

Add SINGLE_FILE option to embed all subresources into emitted JS #5296

Merged

gagern mentioned this pull request Jul 28, 2017

Deprecate MEM_INIT_METHOD 2 #5401

Closed

juj mentioned this pull request Mar 7, 2024

Add new option -sGZIP_EMBEDDINGS that gzip the wasm output #21426

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix_string_meminit_windows #3854

fix_string_meminit_windows #3854

juj commented Oct 15, 2015

juj commented Oct 15, 2015

kripken Oct 15, 2015

juj Oct 16, 2015

kripken commented Oct 15, 2015

juj commented Oct 16, 2015

gagern Oct 16, 2015

juj Jan 25, 2016

gagern Jan 25, 2016

juj Jan 25, 2016

gagern commented Oct 16, 2015

kripken commented Oct 26, 2015

kripken commented Jan 23, 2016

juj commented Jan 25, 2016

kripken Jan 25, 2016

juj Jan 25, 2016

kripken Jan 25, 2016

juj Jan 25, 2016

juj commented Jan 26, 2016

kripken commented Jan 26, 2016

kripken commented Jan 26, 2016

kripken commented Jan 26, 2016

fix_string_meminit_windows #3854

fix_string_meminit_windows #3854

Conversation

juj commented Oct 15, 2015

juj commented Oct 15, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kripken commented Oct 15, 2015

juj commented Oct 16, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gagern commented Oct 16, 2015

kripken commented Oct 26, 2015

kripken commented Jan 23, 2016

juj commented Jan 25, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

juj commented Jan 26, 2016

kripken commented Jan 26, 2016

kripken commented Jan 26, 2016

kripken commented Jan 26, 2016