
Per the spec, ES modules should always be decoded lossily as UTF-8 #128

Open

andreubotella opened this issue Dec 15, 2021 · 1 comment

Labels: bug (Something isn't working)

Comments

andreubotella (Contributor) commented Dec 15, 2021

In the spec, the "fetch a single module script" algorithm handles fetching and parsing an individual ES module, and in step 13 it uses the encoding spec's "UTF-8 decode" algorithm to decode the response body into a string, which it then passes to the relevant parser. That algorithm decodes UTF-8 lossily (i.e., substituting invalid byte sequences with replacement characters) and strips a leading UTF-8 BOM if present. It never switches the encoding based on the MIME type or on the presence of a UTF-16 BOM.
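As a minimal sketch, the spec's "UTF-8 decode" semantics can be reproduced with the standard TextDecoder, whose defaults (fatal: false, ignoreBOM: false) give the same behavior: invalid byte sequences become U+FFFD, a leading UTF-8 BOM is stripped, and neither the MIME type nor a UTF-16 BOM is ever consulted. The function name here is illustrative, not anything from the spec or from Deno.

```ts
// Spec-style decoding: lossy, BOM-stripping, always UTF-8.
function specDecodeModuleSource(bytes: Uint8Array): string {
  return new TextDecoder("utf-8").decode(bytes);
}

// "//\uFFFD" – the invalid 0xFF byte is replaced, nothing throws.
specDecodeModuleSource(new Uint8Array([0x2f, 0x2f, 0xff]));
// "hi" – the leading UTF-8 BOM (EF BB BF) is stripped.
specDecodeModuleSource(new Uint8Array([0xef, 0xbb, 0xbf, 0x68, 0x69]));
```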

Deno's behavior depends on whether the module specifier has a file:// scheme. For non-file:// schemes, the MIME type's charset parameter is used to pick an encoding if present; otherwise UTF-8 is used, and if the charset parameter doesn't correspond to a valid encoding, the load throws. The BOM is ignored when choosing the encoding. For local files, the BOM is used to pick the encoding, and if there isn't one, UTF-8 is used. In either case, if decoding hits an invalid byte sequence in the chosen encoding at any point, the module load throws, and the BOM is never stripped, regardless of whether it was used to choose the encoding.
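The following is a rough, self-contained TypeScript sketch of the behavior just described, for illustration only: Deno's actual implementation is in Rust, and the names here (sniffBom, currentDenoDecode) are made up for this example.

```ts
// Detect a BOM at the start of the body; returns an encoding label or undefined.
function sniffBom(bytes: Uint8Array): string | undefined {
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) return "utf-8";
  if (bytes[0] === 0xff && bytes[1] === 0xfe) return "utf-16le";
  if (bytes[0] === 0xfe && bytes[1] === 0xff) return "utf-16be";
  return undefined;
}

function currentDenoDecode(
  specifier: URL,
  mimeCharset: string | undefined,
  bytes: Uint8Array,
): string {
  let encoding = "utf-8";
  if (specifier.protocol === "file:") {
    // Local files: a BOM picks the encoding, otherwise UTF-8.
    encoding = sniffBom(bytes) ?? "utf-8";
  } else if (mimeCharset !== undefined) {
    // Remote modules: the charset parameter wins, even over a BOM.
    // The TextDecoder constructor below throws if it isn't a valid label.
    encoding = mimeCharset;
  }
  // fatal: true makes invalid byte sequences throw; ignoreBOM: true
  // leaves any BOM in the decoded output instead of stripping it.
  return new TextDecoder(encoding, { fatal: true, ignoreBOM: true }).decode(bytes);
}
```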

Note that Deno's behavior of letting the MIME charset parameter override the BOM goes against not only browsers' handling of modules, but also the way browsers handle classic scripts and page loads, which always prioritizes the BOM.

This was not noticed before now because most of the WPT tests covering module loading are HTML-only (they typically use <script type="module">), which our test runner doesn't support. The only JS tests for module loading deal with import assertions and JSON modules, so this was only noticed when implementing them in denoland/deno#12866.

Since the behavior with local and remote files is so different, and this doesn't seem to have been noticed until now, switching to UTF-8 only doesn't seem like it would be very breaking in the Deno ecosystem. The one case where this might be breaking is UTF-16 files with a BOM, since servers would likely detect them and serve them with the correct encoding – but I doubt this is something to lose sleep over, since UTF-16 is very rarely used. Another potential concern might be CDNs that transpile Node.js code, since the served code might never have been run locally, but the most used ones seem to always serve either with charset=utf-8 or without the charset parameter.

kitsonk added the bug (Something isn't working) label on Dec 15, 2021
andreubotella (Contributor, Author) commented

I've been looking at this again, and it seems unlikely that this change would be accepted: despite the arguments above, there's no knowing what would break. Depending on how we feel about it, we might enable warnings for non-UTF-8 JS/TS modules and see the user response. But UTF-8 should be required for JSON modules, right?
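For the warning idea, a minimal sketch (not tied to Deno's actual code paths, and the function and parameters are hypothetical): decode fatally first to detect non-UTF-8 sources, throw for JSON modules, and warn before falling back to lossy decoding for JS/TS modules.

```ts
function decodeWithUtf8Warning(bytes: Uint8Array, isJsonModule: boolean): string {
  try {
    // fatal: true throws a TypeError on any invalid byte sequence.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    if (isJsonModule) {
      throw new TypeError("JSON modules must be valid UTF-8");
    }
    console.warn("Module is not valid UTF-8; decoding lossily per the spec");
    // Fall back to the spec's lossy "UTF-8 decode".
    return new TextDecoder("utf-8").decode(bytes);
  }
}
```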
