Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix_string_meminit_windows #3854

Merged
merged 3 commits into from
Jan 26, 2016

Conversation

juj
Copy link
Collaborator

@juj juj commented Oct 15, 2015

Fix memory string initializer to generate format that is safe for text files. Perform Emscripten optimizer cleanup pass in binary file read and write mode to not get confused by the special characters in the string initializer. Skip closure compiler when memory string initializer is used on Windows, since closure trips over when presented with such input. Related to #3326. cc @gagern .

@juj juj added the windows label Oct 15, 2015
@juj
Copy link
Collaborator Author

juj commented Oct 15, 2015

This fixes Windows bot tests browser.test_meminit_pairs and browser.test_meminit_big.

f.write(open(out_file).read())
f.write('\n')
f.write(open(out_file, 'rb').read())
nl = '\r\n' if WINDOWS else '\n' # Writing in binary mode, so make sure to generate correct newline output
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can just use os.linesep

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check.

@kripken
Copy link
Member

kripken commented Oct 15, 2015

I wonder if after removing those special characters, this compression is still efficient?

cc @gagern

@juj juj force-pushed the fix_string_meminit_windows branch from b3e0074 to c8b9d6d Compare October 16, 2015 14:25
@juj
Copy link
Collaborator Author

juj commented Oct 16, 2015

That's a good question. Also, I think we need to research to some specs (which?) to figure out what are the guaranteed safe bytes in .js files. My impression is that it is outright incorrect to have the 00 byte in a text file, but if I'm wrong, we should build some confidence from the relevant encoding specifications to be sure.

One thing I notice is that blindly replacing text mode file reads and writes to binary mode reads and writes will lead to trouble on Windows. On Windows we have had the convention of generating \r\n output .js files, and I'd like to preserve that, however, if not careful, I see that those line endings start becoming mixed \n with \r\n, and I even saw outright broken \r\r\n bytes in a file, so one must be very careful and precise when writing line separators correctly when transforming from text mode to binary mode.

return re.sub('[\x80-\xff]', escape, s)
s = re.sub('\x00(?![0-9])', '\\0', s) # Don't allow a null byte in the generated output file.
s = re.sub('[\x00-\x1f\x80-\xff]', escape, s) # Don't allow ASCII chars 0-31 or 128-255 in the generated output file.
return s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to apply similar changes to the uglify to_ascii function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that is likely true, but I'm a bit wary to make that in blind. It could be done "for good measure" but I think it's better to defer until there is a problem/test to demonstrate the need.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember that in #3326 (comment) I detected someone postprocessing the strings, and in #3326 (comment) I identified uglify as the culprit. So there was some setup where I could enable that processing step without even trying. Likely tied to some optimization level. Not sure whether that applies anymore; I've not been updating my emscripten toolbox in some time. If the code still gets run, it should affect Windows builds the same way as the original problem here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, reading that, I notice that we also have that asm2m tests harness that's commented out, I'm running through that now to make sure it's all good. Hopefully I'll see that same issue.

@gagern
Copy link
Contributor

gagern commented Oct 16, 2015

Regarding the question whether control bytes in general and null bytes in particular are permissible in JavaScript, see #3326 (comment) referencing section 7.8.4 of the ECMA 262 spec. But just because the spec allows something doesn't mean all the tools along the way can actually deal with it. In particular since the spec talks about characters, without regard for encoding, while some file format interpretations might treat control bytes as non-characters.

Regarding the efficiency of the compression, testing beats philosophizing. But it depends a lot on what kind of code you're dealing with, in particular the split between text and numbers, and whether those numbers tend to be small or random. For comparison, we could reactivate the idea from #3326 (comment) of XORing every byte with 32, hopefully turning more control bytes into digits and other printable characters than the other way round.

On the whole, I very much like the gist of this change. It's something I intended to do for quite a while but never find the time to. Thanks a lot!

@kripken
Copy link
Member

kripken commented Oct 26, 2015

I suppose we should just merge this, and do further measurements/testing later?

@kripken
Copy link
Member

kripken commented Jan 23, 2016

What's the status here?

@juj juj force-pushed the fix_string_meminit_windows branch from c8b9d6d to df96b45 Compare January 25, 2016 17:27
@juj
Copy link
Collaborator Author

juj commented Jan 25, 2016

Rebased this on top of latest incoming to avoid very long branches in git history, since there's been a lot of commits in between. No functional changes. Retested that browser.test_meminit_big and browser.test_meminit_pairs pass after this PR.

This PR does make the escaping a bit more conservative that raw bytes 00-31 (decimal) are not allowed in the generated output, but I don't think we have tests to measure how large the effect is. At least the feature will work on Windows, and gives a method to improve from there. I'd be ok merging this in.

@@ -2943,7 +2943,8 @@ def test_meminit_pairs(self):
args = ["-O2", "--memory-init-file", "0", "-s", "MEM_INIT_METHOD=2", "-s", "ASSERTIONS=1"]
self.btest(d, expected='0', args=args + ["--closure", "0"])
self.btest(d, expected='0', args=args + ["--closure", "0", "-g"])
self.btest(d, expected='0', args=args + ["--closure", "1"])
if not WINDOWS: # Closure compiler fails on Windows when it receives the unexpected characters in memory string initializer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this and the similar thing below still needed, even after this pr?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, I'll test and reconfirm.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it still is, that seems like either a bug we should fix here, or a bug in closure if it behaves differently on different OSes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suspect at the time was that it would be a Closure bug, but I did not dig deep enough to understand and confirm. I'm now running through the asm2m suite to see what comes up, so I'll see after that if this is still an issue.

@juj juj force-pushed the fix_string_meminit_windows branch from df96b45 to a761768 Compare January 26, 2016 00:41
@juj
Copy link
Collaborator Author

juj commented Jan 26, 2016

Looking closer at the asm2m suite gave better test cases to work with, and realized now with playing more at the call site in uglify.js that @gagern pointed out that as long as we are being consistent with ascii/binary file opens (which we already were before the previous version of this PR!), it looks like it's only the ^Z character that is causing trouble in Windows builds. So this PR changed form quite a bit, and in the end only the hex code 0x1a needed to be banned, in the two call sites shown.

The PR also now piggybacks a slightly related helper commit that improves error message prints when our line endings mismatch. @gagern, @kripken: how does this look now?

@kripken
Copy link
Member

kripken commented Jan 26, 2016

Oh, I see the last commit does that, nevermind that comment comment then.

@kripken
Copy link
Member

kripken commented Jan 26, 2016

lgtm

@kripken
Copy link
Member

kripken commented Jan 26, 2016

I can't merge to master today anyhow, so let's merge this now.

kripken added a commit that referenced this pull request Jan 26, 2016
@kripken kripken merged commit df17a47 into emscripten-core:incoming Jan 26, 2016
gagern added a commit to gagern/emscripten that referenced this pull request Oct 13, 2016
…e#3326"

This reverts commit b256af6.
Since emscripten-core#3854 started escaping \x1a, this should no longer be needed.
Conflicts due to code reformatting had to be resolved manually.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants