Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update proposal #14

Merged
merged 6 commits into from
Jan 16, 2024
Merged

Update proposal #14

merged 6 commits into from
Jan 16, 2024

Conversation

eqrion
Copy link
Collaborator

@eqrion eqrion commented Dec 29, 2023

Several big changes here, hoping to resolve the remaining outstanding issues and TODOs. Apologies for the single PR.

The major changes are:

  • Many editorial changes to tighten up ambiguities in the definition of builtins
  • Reworked logic in builtins to handle edge cases discovered through implementation
  • Reworked some builtin names to better match JS naming
  • Removed WTF-8 support and replaced it with UTF-8 support through TextEncoder/TextDecoder (probably the controversial one, reasoning is in the commit message).

Fixes #12, touches on #11, #10, and #9.

@jakobkummerow PTAL. @wingo, curious your thoughts on the encodings part. No rush anyone as I know it's the holidays.

  - Eliminate usage of 'builtin module' in description. This is not essential to the proposal and
    causes confusion around a similarly named JS proposal, which had different goals.
  - Clarify some minor points.
  - Make JS-API changes to WebIDL comprehensive.
  - Reword feature detection section to actually propose change to WebAssembly.validate method
  - Function builtin behaviors is defined using 'create a host function'
  - Clarify behavior around monkey patching using standard language
  - Clarify edge cases around nullability
  - Clarify edge cases around unsigned/signed integers
  - Restrict 'substring' behavior to normal cases
  - Use wasm helpers for when wasm instructions are needed
The existing WTF-8 operation in this proposal violated one of the goals of the
proposal: "don't create substantial new functionality" by introducing WTF-8
transcoding support to the web platform without prior precedent. The WTF-8
operation is removed because of this.

The naming for WTF-16 operations is reworked to refer to 'charCodes' instead
as that is what the JS String interface uses.

We could support UTF-8 transcoding by referring to the TextEncoder/TextDecoder
interfaces, so this commit adds support for that.
proposals/js-string-builtins/Overview.md Outdated Show resolved Hide resolved
proposals/js-string-builtins/Overview.md Show resolved Hide resolved

TODO: formalize these better.
There is however, the Encoding API for `TextEncoder`/`TextDecoder` which can be used for UTF-8 support. However, this is technically a separate spec from JS and may not be available on all JS engines (in practice it's available widely). This proposal exposes UTF-8 data conversions using this API under separate `wasm:text-encoder` `wasm:text-decoder` interfaces which are available when the host implements these interfaces.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Limiting this proposal to UTF-8 (as opposed to WTF-8) sounds fine for newly written code, but I'm worried about legacy Java/Kotlin/Dart/etc code that contains unpaired surrogates in string constants. Being able to compile such string constants to i8 data segments and creatings strings from them via array-from-data + string-from-array was the original motivation for supporting WTF-8. (Using 16-bit code units in data segments is wasteful for latin-based languages and non-localized/application-internal ASCII strings, and also a bit of a headache to implement because data segments have no alignment requirements.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point. The tricky part about that use-case for WTF-8 is:

  1. There's no prior API's that operate on WTF-8 in the Web (AFAIK), it's a new encoding that would need to be adopted. It's also not really a web standard yet either, it's just a repo on github (http://simonsapin.github.io/wtf-8/). Adding it to this proposal would be a major new addition to the web platform, and would need to get consensus from other stakeholders that it's okay. If we got that, I'd feel better about it.
  2. The WTF-8 spec says 'WTF-8 must not be used to represent text in a file format or for transmission over the Internet' and so supporting it for embedding string constants in a wasm binary seems counter to the intent of the proposal.

Is there some other way we could support this use-case? The TextDecoder interface does support ASCII/Latin-1 and so we could add support for that encoding on JS string creation path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re 1, I'm not so worried about that part. WTF-8 is a small and straightforward relaxation of UTF-8. It's just a way to represent the JS strings that are already prevalent on the web in a 1-byte-codeunit format. Also, it's not like we're adding WTF-8 strings as a general-purpose primitive, nor are we encouraging anyone to use it as a data interchange format. I see this closer to e.g. TextDecoder supporting a (huge!) bunch of legacy encodings that aren't otherwise used on the web (note that we didn't have a string-to-wtf8 function, only string-from-wtf8).

Re 2, yes, that's an unresolved wrinkle that I'm also unhappy about. Wasm satisfies the "used internally in self-contained systems with components that need to support potentially ill-formed UTF-16 for legacy reasons" part, it just unfortunately also has a "transmission over the Internet" step between the module producer and the browser executing the module.

All that said, it's probably acceptable, at least for a first version, to have only UTF-8 and WTF-16 (aka "charCode") support. Since the relevant use case is string constants, toolchains can check the string for UTF-8 validity, and use the UTF-8 encoding most of the time, and fall back to WTF-16 in the rare cases where UTF-8 is too strict. We can reopen the WTF-8 discussion if and when evidence arises that the UTF-8+WTF-16 combo is insufficient.

I think having a way to create strings from ASCII data would be nice in general (for performance: one use case where it came up is number-to-string conversion). I'd like to see it added right away, but I admittedly don't have a strong reason to argue that it is necessary for an MVP.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having a way to create strings from ASCII data would be nice in general (for performance: one use case where it came up is number-to-string conversion). I'd like to see it added right away, but I admittedly don't have a strong reason to argue that it is necessary for an MVP.

It's probably worth adding, I won't do it in this PR but in a future one.

"start" is the index in the array where the first codeunit of the string will be written.

Returns the number of codeunits written. Traps if the string doesn't fit into the array.
### "wasm:js-string" "copyToCharCodeArray"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this have "copy" as part of the name? Wouldn't fromCharCodeArray/toCharCodeArray be a more consistent pair of names?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is that it takes a pre-allocated array to copy into, instead of creating a new array. I figured that if we had support for a string -> (array i8) operation that creates an immutable array from a JS string, then we'd want to call that fromCharCodeArray and need some way to distinguish.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would intoCharChodeArray work for you? It matches the TextDecoder interface which has to and into versions for allocating/re-using existing buffers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that's fine. (It's also just a name, not very important either way.)

If we do want to unify names, fromCharCodeArray and decodeStringFromUTF8Array could also be made more similar to each other, but again it's just a name. (Some might argue "yeah, the names are different because one is a JS String operation and the other a TextDecoder thing", some might reply "what do I care? They're utf8 and wtf16 versions of the same operation, so they should have similar names".)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, indeed my reasoning was 'the names are different because one is a JS String operation and the other a TextDecoder thing'. I think it's more compelling when you consider the full import specifier:

"wasm:js-string" "fromCharCodeArray"

  1. "wasm:text-decoder" "fromUTF8Array" vs
  2. "wasm:text-decoder" "decodeStringFromUTF8Array"

(1) appears to me to be creating a text decoder from a UTF-8 array.

proposals/js-string-builtins/Overview.md Show resolved Hide resolved
proposals/js-string-builtins/Overview.md Outdated Show resolved Hide resolved
proposals/js-string-builtins/Overview.md Show resolved Hide resolved
proposals/js-string-builtins/Overview.md Outdated Show resolved Hide resolved
proposals/js-string-builtins/Overview.md Outdated Show resolved Hide resolved
proposals/js-string-builtins/Overview.md Show resolved Hide resolved
end > string.length)
// Ensure the range is within bounds to avoid the complex behavior that
// `substring` performs when that is not the case.
if (start > string.length)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think checking for start <= end (i.e. taking the return "" path when start > end) was useful, let's keep that.
(I also didn't mean to express strong opposition to what you had here before. If you'd like to return "" for end > string.length, I'm fine with that. I was just pointing out that handling the start < string.length < end case by capping end is rather simple, if we want to allow it.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted the change.


TODO: formalize these better.
There is however, the Encoding API for `TextEncoder`/`TextDecoder` which can be used for UTF-8 support. However, this is technically a separate spec from JS and may not be available on all JS engines (in practice it's available widely). This proposal exposes UTF-8 data conversions using this API under separate `wasm:text-encoder` `wasm:text-decoder` interfaces which are available when the host implements these interfaces.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re 1, I'm not so worried about that part. WTF-8 is a small and straightforward relaxation of UTF-8. It's just a way to represent the JS strings that are already prevalent on the web in a 1-byte-codeunit format. Also, it's not like we're adding WTF-8 strings as a general-purpose primitive, nor are we encouraging anyone to use it as a data interchange format. I see this closer to e.g. TextDecoder supporting a (huge!) bunch of legacy encodings that aren't otherwise used on the web (note that we didn't have a string-to-wtf8 function, only string-from-wtf8).

Re 2, yes, that's an unresolved wrinkle that I'm also unhappy about. Wasm satisfies the "used internally in self-contained systems with components that need to support potentially ill-formed UTF-16 for legacy reasons" part, it just unfortunately also has a "transmission over the Internet" step between the module producer and the browser executing the module.

All that said, it's probably acceptable, at least for a first version, to have only UTF-8 and WTF-16 (aka "charCode") support. Since the relevant use case is string constants, toolchains can check the string for UTF-8 validity, and use the UTF-8 encoding most of the time, and fall back to WTF-16 in the rare cases where UTF-8 is too strict. We can reopen the WTF-8 discussion if and when evidence arises that the UTF-8+WTF-16 combo is insufficient.

I think having a way to create strings from ASCII data would be nice in general (for performance: one use case where it came up is number-to-string conversion). I'd like to see it added right away, but I admittedly don't have a strong reason to argue that it is necessary for an MVP.

"start" is the index in the array where the first codeunit of the string will be written.

Returns the number of codeunits written. Traps if the string doesn't fit into the array.
### "wasm:js-string" "copyToCharCodeArray"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that's fine. (It's also just a name, not very important either way.)

If we do want to unify names, fromCharCodeArray and decodeStringFromUTF8Array could also be made more similar to each other, but again it's just a name. (Some might argue "yeah, the names are different because one is a JS String operation and the other a TextDecoder thing", some might reply "what do I care? They're utf8 and wtf16 versions of the same operation, so they should have similar names".)

@eqrion
Copy link
Collaborator Author

eqrion commented Jan 16, 2024

I think at this point all comments have been addressed, minus some naming stuff that we can continue in issues if important. The WTF-8 stuff has some open issues we can continue discussing there too.

@eqrion eqrion merged commit 206a3a0 into main Jan 16, 2024
@eqrion eqrion deleted the refactor-2 branch January 16, 2024 14:43
eqrion pushed a commit that referenced this pull request Aug 6, 2024
It seems like a wrong copy-and-paste from Firefox.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Naming inconsistency: "concat" vs. "concatenate"
2 participants