
Per the spec, ES modules should always be decoded lossily as UTF-8 #128

Open

andreubotella opened this issue Dec 15, 2021 · 1 comment

Labels: bug (Something isn't working)

Comments

andreubotella (Contributor) commented Dec 15, 2021

In the spec, the "fetch a single module script" algorithm handles fetching and parsing an individual ES module, and in step 13 it uses the encoding spec's "UTF-8 decode" algorithm to decode the response body into a string, which it then passes to the relevant parser. That algorithm decodes UTF-8 lossily (i.e., substituting invalid byte sequences with replacement characters) and strips a leading UTF-8 BOM if present. It never switches the encoding based on the MIME type or on the presence of a UTF-16 BOM.
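As a minimal sketch, the spec's "UTF-8 decode" semantics can be reproduced with the standard TextDecoder, whose defaults (fatal: false, ignoreBOM: false) give the same behavior: invalid byte sequences become U+FFFD, a leading UTF-8 BOM is stripped, and neither the MIME type nor a UTF-16 BOM is ever consulted. The function name here is illustrative, not anything from the spec or from Deno.

```ts
// Spec-style decoding: lossy, BOM-stripping, always UTF-8.
function specDecodeModuleSource(bytes: Uint8Array): string {
  return new TextDecoder("utf-8").decode(bytes);
}

// "//\uFFFD" – the invalid 0xFF byte is replaced, nothing throws.
specDecodeModuleSource(new Uint8Array([0x2f, 0x2f, 0xff]));
// "hi" – the leading UTF-8 BOM (EF BB BF) is stripped.
specDecodeModuleSource(new Uint8Array([0xef, 0xbb, 0xbf, 0x68, 0x69]));
```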

Deno's behavior depends on whether the module specifier has a file:// scheme. For non-file:// schemes, the MIME type's charset parameter is used to pick an encoding if present; otherwise UTF-8 is used, and if the charset parameter doesn't correspond to a valid encoding, the load throws. The BOM is ignored when choosing the encoding. For local files, the BOM is used to pick the encoding, and if there isn't one, UTF-8 is used. In either case, if decoding hits an invalid byte sequence in the chosen encoding at any point, the module load throws, and the BOM is never stripped, regardless of whether it was used to choose the encoding.
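The following is a rough, self-contained TypeScript sketch of the behavior just described, for illustration only: Deno's actual implementation is in Rust, and the names here (sniffBom, currentDenoDecode) are made up for this example.

```ts
// Detect a BOM at the start of the body; returns an encoding label or undefined.
function sniffBom(bytes: Uint8Array): string | undefined {
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) return "utf-8";
  if (bytes[0] === 0xff && bytes[1] === 0xfe) return "utf-16le";
  if (bytes[0] === 0xfe && bytes[1] === 0xff) return "utf-16be";
  return undefined;
}

function currentDenoDecode(
  specifier: URL,
  mimeCharset: string | undefined,
  bytes: Uint8Array,
): string {
  let encoding = "utf-8";
  if (specifier.protocol === "file:") {
    // Local files: a BOM picks the encoding, otherwise UTF-8.
    encoding = sniffBom(bytes) ?? "utf-8";
  } else if (mimeCharset !== undefined) {
    // Remote modules: the charset parameter wins, even over a BOM.
    // The TextDecoder constructor below throws if it isn't a valid label.
    encoding = mimeCharset;
  }
  // fatal: true makes invalid byte sequences throw; ignoreBOM: true
  // leaves any BOM in the decoded output instead of stripping it.
  return new TextDecoder(encoding, { fatal: true, ignoreBOM: true }).decode(bytes);
}
```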

Note that Deno's behavior of letting the MIME charset parameter override the BOM goes against not only browsers' handling of modules, but also the way browsers handle classic scripts and page loads, which always prioritizes the BOM.

This was not noticed before now because most of the WPT tests covering module loading are HTML-only (they typically use <script type="module">), which our test runner doesn't support. The only JS tests for module loading deal with import assertions and JSON modules, so this was only noticed when implementing them in denoland/deno#12866.

Since the behavior with local and remote files is so different, and this doesn't seem to have been noticed until now, switching to UTF-8 only doesn't seem like it would be very breaking in the Deno ecosystem. The one case where this might be breaking is UTF-16 files with a BOM, since servers would likely detect them and serve them with the correct encoding – but I doubt this is something to lose sleep over, since UTF-16 is very rarely used. Another potential concern might be CDNs that transpile Node.js code, since the served code might never have been run locally, but the most used ones seem to always serve either with charset=utf-8 or without the charset parameter.

kitsonk added the bug (Something isn't working) label on Dec 15, 2021
andreubotella (Contributor, Author) commented

I've been looking at this again, and it seems unlikely that this change would be accepted: despite the arguments above, there's no knowing what would break. Depending on how we feel about it, we might enable warnings for non-UTF-8 JS/TS modules and see the user response. But UTF-8 should be required for JSON modules, right?
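For the warning idea, a minimal sketch (not tied to Deno's actual code paths, and the function and parameters are hypothetical): decode fatally first to detect non-UTF-8 sources, throw for JSON modules, and warn before falling back to lossy decoding for JS/TS modules.

```ts
function decodeWithUtf8Warning(bytes: Uint8Array, isJsonModule: boolean): string {
  try {
    // fatal: true throws a TypeError on any invalid byte sequence.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    if (isJsonModule) {
      throw new TypeError("JSON modules must be valid UTF-8");
    }
    console.warn("Module is not valid UTF-8; decoding lossily per the spec");
    // Fall back to the spec's lossy "UTF-8 decode".
    return new TextDecoder("utf-8").decode(bytes);
  }
}
```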
