Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SINGLE_FILE option to embed all subresources into emitted JS #5296

Merged
merged 86 commits into from
Oct 16, 2017

Conversation

buu700
Copy link
Contributor

@buu700 buu700 commented Jun 13, 2017

Resolves #5279.

I haven't tested this (don't currently have a local development environment set up), but wanted to submit it early for initial feedback. (Alternatively, if anyone else who's already set up to do so has time to test it, that would be really helpful.)

Concerns:

  • Is there anything I'm missing that could potentially end up requested by the emitted JS at run-time? In this commit I covered the memory initialiser, wasm, wast, and asm.js.

  • Are there any other os.path.basename references (or any other places in general) that I should have updated to use the new utility function, and/or is there any reference I changed that was incorrect? (I just ctrl+F'd for os.path.basename and changed anything that looked like it would probably be inserted into the JS somewhere along the line.)

  • Am I expected to add an automated test before this is merged, and if so are there any guidelines for that?

As discussed in emscripten-core#5279, subresource paths are converted into base64 data URIs.
Copy link
Member

@kripken kripken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks like it's in a good direction, nice.

Some thoughts:

  • We had code to embed the mem init file as a JS array. It seems like that could be replaced with calling this new utility method?
  • Same for --embed-file which embeds data. However, maybe that's a separate issue because of how the file packager stuff works, not sure.
  • A more serious issue is URI size. First, we should confirm that the new method is at least as good as the old on the mem init. (If not, maybe there's something we can do.)
  • Second, and this is my biggest concern here, in the old method we split up the array into chunks because JS engines had issues with very large arrays - parsing them was horribly slow. I hope that isn't an issue with very large strings, but we should confirm.
  • Finally, these data URIs will only work in a browser environment, I think? We do need things to work in node.js and other shell environments as well. (Perhaps it's possible to add a utility function for such environments that converts the URI into a typed array?)

As for testing, yes, we will need to add tests for this. We should test both browser and shell, so tests/test_browser.py and tests/test_core.py are the relevant files. Look at some of the tests there to get an idea of how to write a new one, and let me know if something isn't clear.

emcc.py Outdated
@@ -513,6 +513,16 @@ def filter_emscripten_options(argv):

# ---------------- Utilities ---------------

# Returns the subresource location for run-time access
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving this to tools/shared.py, perhaps a static method on the JS class?

src/settings.js Outdated
@@ -856,6 +856,13 @@ var FETCH = 0; // If nonzero, enables emscripten_fetch API.

var ASMFS = 0; // If set to 1, uses the multithreaded filesystem that is implemented within the asm.js module, using emscripten_fetch. Implies -s FETCH=1.

var SINGLE_FILE = 0; // If set to 1, embeds all subresources in the emitted JS file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add examples of those files (data, asm.js, wasm etc.), i think it would help clarify things

@buu700
Copy link
Contributor Author

buu700 commented Jun 13, 2017

We had code to embed the mem init file as a JS array. It seems like that could be replaced with calling this new utility method?

Well, it does accomplish basically the same thing as --memory-init-file 0, but it also slightly breaks backwards compatibility (the CSP connect-src thing).

Same for --embed-file which embeds data. However, maybe that's a separate issue because of how the file packager stuff works, not sure.

Not 100% sure, but based on the ctrl+F results for embed_files it looks like this is already just being concatenated along with pre.js and post.js?

A more serious issue is URI size. First, we should confirm that the new method is at least as good as the old on the mem init. (If not, maybe there's something we can do.)

From a quick test on a binary file I had lying around, this seems like the much more efficient encoding:

> fs.readFileSync('ntru.wasm').join(',').length
153825
> require('datauri').sync('ntru.wasm').length
70385

I also made the suggestion to Google recently to specifically optimise for base64-encoded strings in a future release of Brotli, which wasn't immediately shot down, so maybe it'll work out and even further improve this situation long-term.

Second, and this is my biggest concern here, in the old method we split up the array into chunks because JS engines had issues with very large arrays - parsing them was horribly slow. I hope that isn't an issue with very large strings, but we should confirm.

Huh, that's interesting. Yeah, we should definitely test that and maybe split up the strings if necessary.

Finally, these data URIs will only work in a browser environment, I think? We do need things to work in node.js and other shell environments as well. (Perhaps it's possible to add a utility function for such environments that converts the URI into a typed array?)

Ah, I hadn't thought of that. Would updating Module.read to detect and parse data URIs be a good way to handle that?

Edit: What if we also updated the web version of Module.read and its variants (which currently always use XHR) to do the same? That way the whole CSP issue could be avoided, and users of --memory-init-file 0 could get the added space efficiency without any breaking changes.

re: your comments on the code, sure, I'll take care of those.

emcc.py Outdated
shared.Settings.WASM_TEXT_FILE = os.path.basename(wasm_text_target)
shared.Settings.WASM_BINARY_FILE = os.path.basename(wasm_binary_target)
shared.Settings.ASMJS_CODE_FILE = os.path.basename(asm_target)
shared.Settings.WASM_TEXT_FILE = get_subresource_location(wasm_text_target)
Copy link
Contributor Author

@buu700 buu700 Jun 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just went ahead and tested my change. (I was expecting a complicated and/or long setup process based on past experience installing from source with the SDK, but in retrospect I think it was just compiling clang that would take a long time.)

Anyway, it fails right here, apparently because the wasm, wast, and asm.js don't exist yet. How should I handle this?

@buu700
Copy link
Contributor Author

buu700 commented Jun 14, 2017

@kripken, the changes from your comments and as discussed in the main thread are done.

Remaining issues:

  • From my outdated comment: "It fails right here [emcc.py line 1119], apparently because the wasm, wast, and asm.js don't exist yet. How should I handle this?". (The new --memory-init-file 0 is working for me; it just fails when building wasm.)

  • I didn't touch the --memory-init-file 2-related logic, but if you want I can go ahead and remove it since there's really no reason for it anymore now that option 0 does what 2 was meant to do.

  • I went with the solution of opportunistically parsing data URIs in the shell.js read methods before hitting the networking or filesystem layers, which I confirmed is working as expected in both the browser and Node.js. However, I wasn't sure of the best way to handle this, so there are some things you may want done differently:

    • I introduced a new library, sodiumutil, to handle base64 to binary and binary to UTF-8 encoding/decoding. (I have some naive dependency-free implementations I can pull out of my JS console history that are a just a few lines, but they won't handle all the same corner cases.)

    • Right now I'm injecting sodiumutil in jsifier.js, which may or may not be the best place to handle that.

    • sodiumutil is currently included in all cases, but we could save 3 KB by including it only when needed. That said, even with it included, one of the two asm.js libs of mine that I tested rebuilding (mceliece.js) dropped 26 KB from the mem init encoding change (before compression) and the other one (rsasign.js) dropped 114 KB.

    • Licensing: sodiumutil is Simplified BSD, and depends on code from libsodium.js which is ISC. I'm pretty certain that neither one of these licences is a problem to redistribute as MIT, but IANAL. If it is something that you'd be concerned about, I can add a dual MIT license to sodiumutil and ask the libsodium maintainers if they'd mind doing the same for the code that sodiumutil borrows.

As far as string parsing performance, from a quick test in my JS consoles, setting one long (~1 MB) string to a variable seems to be noticeably faster than concatenating it together in 1 KB chunks in Chrome (single-to-double-digit µs vs single-digit ms) and about the same in Firefox (single-to-double-digit µs in either case).

buu700 added a commit to buu700/emscripten that referenced this pull request Jun 14, 2017
@curiousdannii
Copy link
Contributor

curiousdannii commented Jun 14, 2017

Rather than embedding Data URIs, why not just embed the base64 data as a normal string and decode it to an array in all environments, rather than just Node?

Another array that is already embedded is the emterpreter data. I couldn't tell if you were handling that yet or not.

@buu700
Copy link
Contributor Author

buu700 commented Jun 14, 2017

That is what's being done now. The data URI prefix at this point is only used to tell Module.read* to treat the data as embedded base64.

It seems a bit arbitrary now with this being the case, but I think it probably makes sense to continue using this format since it forces us to to avoid collisions with actual file names (with no prefix or some other prefixes, we could easily parse an actual file name as base64 when it isn't meant to be). To save a few bytes, we could use a shorter mime type than application/octet-stream, or even drop the data URI standard entirely and just pick a custom prefix and throw an error if an actual file name ever starts with it.

As far as emterpreter data, thanks, I'll look around for that and see if it can also be easily migrated to this pattern.

src/jsifier.js Outdated
@@ -45,7 +47,8 @@ function JSify(data, functionsOnly) {
// else. This lets us not hold any strings in memory, we simply print
// things out as they are ready.

var shellParts = read(shellFile).split('{{BODY}}');
var thirdParty = thirdPartyFiles.map(function(f) { return read(f) }).join('\n');
var shellParts = read(shellFile).replace('{{THIRD_PARTY}}', thirdParty).split('{{BODY}}');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to add a third party library sodium to all generated builds? That is awkward for build sizes, plus also it will trigger a legal review for a number of partner companies, which we would prefer to avoid. It looks like sodium is used to base64 decode? Could https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding be used instead?

This looks like a good feature, but it should definitely be implemented as a build option so that builds which don't use it don't need to compromise on generated build sizes. That is, gate the generated code on SINGLE_FILE option to not impact builds which don't utilize SINGLE_FILE?

Copy link
Contributor Author

@buu700 buu700 Jun 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to add a third party library sodium to all generated builds? That is awkward for build sizes ... gate the generated code on SINGLE_FILE option to not impact builds which don't utilize SINGLE_FILE

Yep, as mentioned above, that was exactly how I wanted to handle this in another commit. It's only 3 KB, which isn't much compared to the size of typical emscripten-generated code, but there's still no reason to include it when it isn't used.

Could https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding be used instead?

As mentioned, I have simple implementations working (which basically do this, although the base64 one is a little different because because it's converting to bytes, not a string):

function base64ToBytes (s) {
    var decoded = atob(s);
    var bytes = new Uint8Array(decoded.length);
    for (i = 0 ; i < decoded.length ; ++i) {
        bytes[i] = decoded.charCodeAt(i);
    }
    return bytes;
}

function bytesToString (bytes) {
    if (typeof TextDecoder !== 'undefined') {
        return new TextDecoder('utf-8', {fatal: true}).decode(bytes);
    }
    var s = '';
    for (var i = 0 ; i < bytes.length ; ++i) {
        s += String.fromCharCode(bytes[i]);
    }
    return s;
}

I just don't know offhand how well these handle corner cases. Maybe there's a test suite out there that we could use.

plus also it will trigger a legal review for a number of partner companies, which we would prefer to avoid

There's always my earlier proposed solution of asking the libsodium maintainers to provide from_base64 and to_string as MIT, or we could go further and ask them to fully donate them to emscripten.

(The base64 implementation as I understand is nothing special aside from being constant time, which doesn't matter here, but I believe their string decoding has been extended to support relevant edge cases like very large byte arrays.)

Copy link
Member

@kripken kripken Jun 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @juj, we should avoid adding a dependency here if we can. It should be easy enough to write code for this, the code you quoted looks like it could be a basis. For testing, I think there is a test for MEM_INIT_METHOD (can search under tests/) which checks a large variety of patterns, which we can maybe repurpose. If not, we should add something like that.

Copy link
Contributor Author

@buu700 buu700 Jun 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, makes sense. Well, as an alternative, if we wanted to just steal the specific functions instead of using a dependency, this is what that would look like.

The original version of libsodium.js's to_string function was written by me and would've been similar to the above, but based on jedisct1/libsodium.js#57 it seems that some sort of issue with handling large byte arrays was discovered and fixed, which would probably also become a problem for some emscripten users if we were to start from what I have above.

I don't mind working on making those implementations more robust and/or searching around for well tested public domain implementations if you'd prefer not to do this, but so we have the option: @jedisct1, would it be possible for those two functions from libsodium.js to be gifted to emscripten?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juj and @kripken, I just committed my functions from above in place of sodiumutil, which I think we'll be okay with (particularly if we get that MEM_INIT_METHOD test working here).

As far as the large byte array problem with decoding UTF-8: upon further investigation, apparently the issue was that libsodium's original to_string used decodeURIComponent(escape(String.fromCharCode.apply(null, bytes))) instead of a String.fromCharCode(bytes[i]) for loop as I had here. Using the former method, an exception would be thrown with byte arrays larger than a few hundred KB, but the latter is limited only by the heap size as far as I can tell.

The reason it was implemented that way was for better error handling; the latter method will happily return weird strings if you give it an array of bytes that aren't valid UTF-8. That being said, as far as I can tell, that really doesn't matter here as much as it does in a crypto library, so there's really no reason not to just use the latter method.

@juj
Copy link
Collaborator

juj commented Jun 14, 2017

A more serious issue is URI size. First, we should confirm that the new method is at least as good as the old on the mem init. (If not, maybe there's something we can do.)

Some of these files can be hundreds of megabytes in size (asm.js, wasm). This should make sure that these files are not loaded in to memory when SINGLE_FILE is not enabled, and it would be good to benchmark that compilation performance is not regressed. For example the Emscripten toolchain profiler is ideal for this, see http://kripken.github.io/emscripten-site/docs/optimizing/Profiling-Toolchain.html

@buu700
Copy link
Contributor Author

buu700 commented Jun 14, 2017

Nice, glad to know there's a test suite that can properly validate the performance difference. Even if there are a lot of good reasons to expect this to be faster (on top of the improved space-efficiency), I'd hate for something to be rolled into production across the web based on a few minutes of unscientific testing in my JS console.

buu700 added a commit to buu700/emscripten that referenced this pull request Jun 14, 2017
@@ -2394,7 +2386,7 @@ def generate_html(target, options, js_target, target_basename,
%s
};
emterpretXHR.send(null);
''' % (shared.Settings.EMTERPRETIFY_FILE, script.inline)
''' % (shared.JS.get_subresource_location(shared.Settings.EMTERPRETIFY_FILE), script.inline)
Copy link
Contributor Author

@buu700 buu700 Jun 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took care of this as suggested by @curiousdannii. However, while doing this, I noticed that this part of emcc.py injects a number of one-off implementations of basically the same logic from Module.read*. Is there a reason that I shouldn't change them to use Module.read* where possible instead of reimplementing the data URI parsing in each one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think those places are emitting code for the HTML file, where we normally don't depend on JS contents. However, we do create Module in the HTML, and we define Module.read there - so this might work. But if the JS is required to run to set up things, it might not. The safest thing might be to leave it as it is, or to create a utility function in the HTML to avoid duplication in there.

Copy link
Contributor Author

@buu700 buu700 Jun 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, got it. In that case, it sounds like the best solution right now would be to just copy and paste my tryParseAsDataURI function from shell.js into emcc.py to be inserted into the HTML? (Another candidate for future refactoring, but it would be consistent with the Module.read* logic already being duplicated here.)

Edit: Alternatively, we could either:

  1. Not support SINGLE_FILE with HTML output.

  2. Document CSP requirements for combining either SINGLE_FILE or --memory-init-file 0 with HTML output. Slightly longer-term, maybe go back and resolve this requirement after the duplication of the Module.read* logic is factored out.

I like option 2, if that's all right with you.

I'm biased since I don't personally use the HTML output for anything, but my thinking is that:

  • Whereas someone could easily rebuild and publish a JS library without knowing that they'll have just broken some of their users' applications, it's pretty obvious and easy to fix when your own page blows up on load after throwing a CSP error

  • Whereas a JS lib needs to work in every environment (web, node, shell), an HTML page only needs to run in a browser, where we know that loading a data URI through XHR works (CSP issues aside)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. If we emit html, then we are on the web and should be able to load the data uri directly? So there is no need for special code in the html to handle parsing it, that's just a problem for shell environments?

Copy link
Contributor Author

@buu700 buu700 Jun 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, requesting a data URI through XHR/fetch works as expected; it's just in the node and shell environments where it wouldn't work since they would try to read it from the filesystem.

The drawback to relying on XHR for this in a web page is the CSP thing — that a Content-Security-Policy header would need to whitelist data: — which I documented as a limitation of combining SINGLE_FILE with HTML output. This would eventually fix itself when/if the HTML generation is refactored to use the same implementations from shell.js, and until then probably wouldn't be a common issue.

As a demonstration, you can compare running (async () => console.log(await (await fetch('data:text/plain;base64,YmFsbHMK')).text()))() in a JS console on GitHub (which has a strict CSP) and almost any other site.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks. Seems fine to document the CSP issue. The one concern I have is does a situation exist where a user needs to use a single file, but can't set the CSP (like maybe they are an embedded game in a game website). For that case, could we detect that the xhr fails and use the parsing code we use for shell environments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, well, in that case, I can just copy and paste the new functions into the HTML output and update all the one-off Module.read*-like to behave like Module.read* (parse data URIs without relying on XHR), in which case there'd be no reason to document the CSP connect-src concern. This seemed like more of a hassle when there was a whole other library involved, but not as much now.

That said, script-src will be an issue with SINGLE_FILE in that kind of scenario no matter how we handle it. Right now, script-src (if defined at all) must whitelist data:, which I've also documented, but if we parse the data URI to a string then we'll still need either unsafe-inline or unsafe-eval to be able to execute it. I believe script-src is most commonly used to whitelist just 'self' (the current origin) and maybe specific CDNs, which is fine for regular non-SINGLE_FILE usage but would be a problem here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if this has any significant risk at all of not working, then we shouldn't do it in -O0, as that's the first thing people try and we should work our hardest to not break. So falling back to parsing the string seems like the right thing do do. As for other concerns, for -O0 we just use the mem init file, which doesn't require eval or such, so it should be ok (with the string parsing)?

Yeah, if we do that, seems like we need the code in the html too. Might put those methods in src/ and include them in both places.

Copy link
Contributor Author

@buu700 buu700 Jun 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, sounds good; I just pushed another commit to handle all the parsing in the HTML output and update the documentation. The only remaining CSP quirks are:

  • Embedding asm.js requires script-src 'unsafe-inline' (out of data:, 'unsafe-eval', and 'unsafe-inline', that one seemed like the least unusual to require)

  • Using Workers requires child-src blob: (something new that I caught — passing a data URI into the Worker constructor wouldn't have worked in most browsers, so I had to parse it and convert it into a Blob URL)

As far as mem init 0, yep, there's no risk of CSP breaking that now that it's always directly parsed without XHR.

@kripken
Copy link
Member

kripken commented Jun 14, 2017

it does accomplish basically the same thing as --memory-init-file 0, but it also slightly breaks backwards compatibility (the CSP connect-src thing).

I don't fully understand the CSP issue, can you please elaborate?

About the wasm etc. not existing on line 1119 in emcc.py, yeah, that looks like a problem. We only generate those files later on, all we know is their path at that point. So we can only create a data uri for them later down. This might require refactoring, or perhaps you can do something like embed a special string (like {{{ SINGLE_FILE_WASM }}} and replace it later down when the data is available.

@buu700
Copy link
Contributor Author

buu700 commented Jun 15, 2017

I don't fully understand the CSP issue, can you please elaborate?

Originally, the data URI was going to be requested via XHR, which would fail on any page with a Content-Security-Policy header containing a connect-src directive that didn't explicitly whitelist data:, which would have potentially broken some existing uses of --memory-init-file 0.

It's no longer an issue now, though, since the base64 data is being parsed directly in JS.

About the wasm etc. not existing on line 1119 in emcc.py, yeah, that looks like a problem. We only generate those files later on, all we know is their path at that point. So we can only create a data uri for them later down. This might require refactoring, or perhaps you can do something like embed a special string (like {{{ SINGLE_FILE_WASM }}} and replace it later down when the data is available.

Perfect, thanks, that works for me. Sounds like a good candidate for future refactoring, but right now I'll move forward with the special string solution.

buu700 added a commit to buu700/emscripten that referenced this pull request Jun 15, 2017
buu700 added a commit to buu700/emscripten that referenced this pull request Jun 15, 2017
buu700 added a commit to buu700/emscripten that referenced this pull request Jun 15, 2017
buu700 added a commit to buu700/emscripten that referenced this pull request Jun 15, 2017
buu700 added a commit to buu700/emscripten that referenced this pull request Jun 15, 2017
src/settings.js Outdated
var WASM_TEXT_FILE = ''; // name of the file containing wasm text, if relevant
var WASM_BINARY_FILE = ''; // name of the file containing wasm binary, if relevant
var ASMJS_CODE_FILE = ''; // name of the file containing asm.js, if relevant

var INCLUDE_THIRD_PARTY_SODIUMUTIL = 0; // If set to 1, sodiumutil will be included in the bundle.
Copy link
Contributor Author

@buu700 buu700 Jun 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the best way to communicate between shared.JS.get_subresource_location and jsifier.js?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, yes.

buu700 added a commit to buu700/emscripten that referenced this pull request Jun 15, 2017
@buu700
Copy link
Contributor Author

buu700 commented Jun 15, 2017

Remaining issues after my latest push:

  • HTML output CSP concern — I went with option 2, but can always do something different if necessary

  • Base64/UTF-8 implementation concern (#1, #2)

  • Remove --memory-init-file 2?

  • Running the toolchain profiler

  • Edit: Not sure whether this is a regression from my changes, but it looks like the tests are failing:
    emscriptentestlog.txt

buu700 added a commit to buu700/emscripten that referenced this pull request Jun 16, 2017
@buu700
Copy link
Contributor Author

buu700 commented Oct 12, 2017

Thanks @kripken, the function scope issue was the biggest issue that I saw as well, and that last option is what I had in mind; just committed it. Earlier you had brought up a concern about using typeof in this way, which is why I wanted to confirm with you before actually doing it.

Anyway, the SINGLE_FILE tests are all still passing, and I can't imagine this last change would've broken anything else, so I think we're in a pretty good state here now.

@erikdubbelboer
Copy link
Contributor

If you're going to give it a separate name you might as well do what I said before:

function decodeBase64(input) {
  if (typeof atob === 'function') {
    return atob(input);
  }

  ...

This way it doesn't matter at all where in the file the function is defined and it makes it impossible that decodeBase64 is still undefined as the issue I had before.

@buu700
Copy link
Contributor Author

buu700 commented Oct 13, 2017

Good catch. However, we're also already using that pattern in arrayUtils.js, which isn't avoidable until Python preprocessor support is implemented, so we're going to be depending on the ordering of at least one of these functions either way (for the time being). There's also the (trivial) performance implication of checking typeof atob every time decodeBase64 is run, as opposed to just once.

Might make more sense to just ensure that arrayUtils and base64Utils are both emitted as early as they possibly can be without breaking anything, but it doesn't matter much to me. I'll leave this up to @kripken before touching anything.

@juj
Copy link
Collaborator

juj commented Oct 13, 2017

Merged this PR locally on top of latest incoming + PR #5664 for testing, and noticed that the test browser.test_modularize fails after the merge.

My merge point is at https://github.com/juj/emscripten/tree/buu_incoming (it did merge cleanly without conflicts)

The error message is

test on [] ['-s', 'EXPORT_NAME="HelloWorld"'] 
          if (typeof Module !== "undefined") throw "what?!"; // do not pollute the global scope, we are modularized!
          HelloWorld.noInitialRun = true; // errorneous module capture will load this and cause timeout
          HelloWorld();
        
[browser launch: a.html ]
[server response: /report_result?0 ]
test on [] ['-s', 'EXPORT_NAME="HelloWorld"'] 
          var hello = HelloWorld({ noInitialRun: true, onRuntimeInitialized: function() {
            setTimeout(function() { hello._main(); }); // must be async, because onRuntimeInitialized may be called synchronously, so |hello| is not yet set!
          } });
        
[browser launch: a.html ]
[server response: /report_result?0 ]
test on [] ['-s', 'EXPORT_NAME="HelloWorld"', '--memory-init-file', '0'] 
          var hello = HelloWorld({ noInitialRun: true});
          hello._main();
        
[browser launch: a.html ]
JavaScript error: , line 0: uncaught exception: abort("Assertion failed: you need to wait for the runtime to be ready (e.g. wait for main() to be called)") at jsStackTrace@http://localhost:8888/a.out.js:1164:13
stackTrace@http://localhost:8888/a.out.js:1181:12
abort@http://localhost:8888/a.out.js:9042:44
assert@http://localhost:8888/a.out.js:461:5
HelloWorld/asm._main@http://localhost:8888/a.out.js:8606:3
@http://localhost:8888/a.html:6:11

buu700 added a commit to buu700/emscripten that referenced this pull request Oct 13, 2017
@buu700
Copy link
Contributor Author

buu700 commented Oct 13, 2017

Nice catch @juj, thanks! It looks like the issue is that we were preferring to parse the embedded mem init base64 asynchronously with fetch before falling back to tryParseAsDataURI — which is fine for other cases which expect file data to arrive asynchronously, but in this case broke the assumption that asm.js module initialization would always be synchronous. Just committed a fix.

@kripken
Copy link
Member

kripken commented Oct 16, 2017

Ok, assuming tests pass (running locally now), I think this is good to merge. Thanks for all the work here @buu700! And for your patience, I know this was a long process.

I think we can improve this further after it lands. There are a few places where code is not behind an ifdef where it could be. For example when building and testing ammo.js now, I see that removeRunDependency is included, when it wasn't before. Looks like it's because we used to have #if MEM_INIT_METHOD == 1 around the mem init loading code, but have now removed that, and it includes calls to add/removeRunDependency, so the optimizer can't get rid of them any more. Also, I think we can ifdef more code in the generated HTML when not SINGLE_FILE.

Might be a good idea to compare code outputs from before and after this PR lands, in various optimization modes, and look for differences.

Also in the longer term we should improve things with a better preprocessor approach, which would let us remove the duplicated code this PR introduces.

@buu700
Copy link
Contributor Author

buu700 commented Oct 16, 2017

That sounds great, thanks! I definitely wasn't expecting it to take over four months to land, but ultimately this approach was still much faster and easier than building my own separate bundling implementation and maintaining it long-term, so thanks for your patience working with me to shake out the bugs and inefficiencies in this.

As far as the post-merge improvements, sounds like a good plan to me, and will probably be faster not to have to test all those things by proxy through me.

One other thing we may want to consider doing before merging, as suggested by @erikdubbelboer: change var decodeBase64 = typeof atob === 'function' ? atob : function (input) { ... to function decodeBase64(input) { if (typeof atob === 'function') return atob(input); ... to ensure that the current lack of hoisting doesn't cause future bugs. Alternatively, for a similar effect without the minor performance implication, ensure that base64Utils.js is emitted as early as it can possibly be.

@kripken
Copy link
Member

kripken commented Oct 16, 2017

Ok, tests passed over here (except for a failure that seems to be an existing unrelated regression).

I think it's ok to merge this before the atob improvements, I'll take a look at that later today.

@kripken kripken merged commit 9b32d3e into emscripten-core:incoming Oct 16, 2017
@kripken
Copy link
Member

kripken commented Oct 16, 2017

Hmm, actually, I'm not sure what the problem is with the current decodeBase64 being a var instead of a function? Is the concern that atob might not exist at that time, but might exist later?

Anyhow, a PR with improvements there is welcome, maybe I'm just not seeing the problem yet.

@buu700
Copy link
Contributor Author

buu700 commented Oct 16, 2017

Awesome!

As far as the var vs function thing: there was one instance of something relying on base64Utils.js and breaking because it was emitted before base64Utils.js, which was fixed by rearranging the print statements in jsifier.js (eae3ed5).

@erikdubbelboer's point was that this entire class of bug could be avoided going forward if the functions were hoisted. I decided to leave this up to you because of 1) the minor performance hit we'd take by running typeof atob inside the function body, and 2) the fact that a temporary workaround for the lack of Python preprocessor support is forcing us to use var for a different function in arrayUtils.js anyway.

@buu700 buu700 deleted the incoming branch October 16, 2017 22:56
@buu700 buu700 restored the incoming branch October 16, 2017 22:57
@kripken
Copy link
Member

kripken commented Oct 18, 2017

I see, thanks. Yeah, I think it's ok to leave things as they are, but I don't feel strongly either way.

@nazar-pc
Copy link
Contributor

nazar-pc commented Nov 2, 2017

Is atob polyfill really needed here? This function is available in IE10+ and also supported by any other modern browser. And IE10- are not supported by Microsoft anymore.
Do you still think it is a good idea to have it at all?

@kripken
Copy link
Member

kripken commented Nov 2, 2017

I think it might not exist in web worker and shell contexts.

@buu700
Copy link
Contributor Author

buu700 commented Nov 2, 2017

Specifically, it doesn't exist in versions of WebKit from before last summer or the SpiderMonkey shell.

@nazar-pc
Copy link
Contributor

nazar-pc commented Nov 2, 2017

According to MDN it is supported in WebKit since about forever (also Safari is evergreen browser, so doesn't really matter): https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/atob#Browser_compatibility
And yes, I've just compiled Spidermonkey shell and it doesn't have atob unfortunately:(

@buu700
Copy link
Contributor Author

buu700 commented Nov 2, 2017

Oops, sorry, that was a typo. I meant that it isn't supported in Workers in older versions of WebKit: https://bugs.webkit.org/show_bug.cgi?id=158576

sbc100 added a commit to sbc100/emscripten that referenced this pull request Oct 2, 2023
It seems like the error is strictly worse than just letting the
exception propagate out.  For example, we have a report of someone
hitting this error but I can't tell what the root cause was:
emscripten-core#20349

This try/catch and this error message was part of the original
function back emscripten-core#5296 but I don't see any reason for them.
sbc100 added a commit to sbc100/emscripten that referenced this pull request Oct 3, 2023
It seems like the error is strictly worse than just letting the
exception propagate out.  For example, we have a report of someone
hitting this error but I can't tell what the root cause was:
emscripten-core#20349

This try/catch and this error message was part of the original
function back emscripten-core#5296 but I don't see any reason for them.
sbc100 added a commit that referenced this pull request Oct 3, 2023
It seems like the error is strictly worse than just letting the
exception propagate out.  For example, we have a report of someone
hitting this error but I can't tell what the root cause was:
#20349

This try/catch and this error message was part of the original
function back #5296 but I don't see any reason for them.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants