Clarity on units of string #65
FWIW, Haskell strings also consist of code points:
Swift strings are an encoding-independent array of Unicode code points: https://docs.swift.org/swift-book/documentation/the-swift-programming-language/stringsandcharacters/ But the point of this proposal is that every language benefits from using stringref at the boundaries, even if they don't use stringref internally.
Not specifically, unless they want to interop with a language with JS-like strings at the boundary. We should assume that embedder languages are as diverse as embedded languages, and consequently, stringref semantics generally doesn't match either side. Imagine a Python3 embedding API for Wasm.
A stringref is conceptually an array of Unicode scalar values. It does not specify a particular encoding. As the overview says:
Internally the Wasm engine can represent a stringref however it wants (ropes, unions, breadcrumbs, etc.); the stringref proposal does not require a particular implementation strategy. stringref is encoding-agnostic and generic, but it supports operations to convert into / from a variety of different encodings. Currently the spec supports conversion for WTF-8 and WTF-16 encodings, but it would be easy to add more. stringref is intended to be a general-purpose string type which can be used at the boundary of every language, regardless of the language's internal string encoding.
The stringref proposal does not require both sides to have the same string encoding. The point is that stringref is an intermediary: languages can convert their internal string type into a stringref, and then pass that stringref to another Wasm module. That Wasm module can then convert the stringref into its own internal string type. Why is that better than using say...
As far as I can tell, Swift strings are sequences of Unicode scalar values (as mentioned in your link), not code points, so they're comparable to safe Rust strings, but the API hides away the encoding. eg, it doesn't seem possible to construct a string in Swift containing the code point
The point I'm trying to raise in this issue is that as described, the proposal seems to require a new string representation, but it also claims "No new string implementations on the web: allow re-use of JS engine's strings" as one of the requirements. It's possible that the intention is for
The proposal says that it can also have "isolated surrogates", which requires special encoding. As explained above, "isolated surrogates" and "Unicode scalar values" as a base unit of strings is not compatible with UTF-16/WTF-16 (or even strict WTF-8), since there are sequences of isolated surrogates where the obvious encoding is already used by UTF-16 for Unicode scalar values.
No, that is not the intention. The intention is for stringref to be an encoding-agnostic string representation that can be used by every language for interop at the boundaries, while also allowing efficient transferring of host strings (such as JS strings). This can be accomplished in many ways; it is not necessary to hardcode stringref as WTF-16.

JS engines do not use WTF-16 for JS strings; instead they use multiple different internal representations (Latin1, WTF-16, ropes, etc.). Therefore, hardcoding WTF-16 for stringref would NOT allow for zero-copy transferring of JS strings. In order to fulfill the goal of zero-copy transferring of JS strings, the internal representation of stringref must not be hardcoded.

Similar to JS engines, it is expected that Wasm engines might choose to have multiple different internal representations for stringref. The Wasm engine will transparently switch between those internal representations as needed. And when creating views on a stringref, there are many different implementation strategies (such as Swift breadcrumbs). Making it encoding-agnostic means that the Wasm engine can choose whatever strategy works best.

Hardcoding WTF-16 forces copies of JS strings, makes stringref less useful for interop, and prevents those sorts of internal optimizations. That's why stringref has value beyond just
Yes, which is why the views are defined as WTF-8 / WTF-16, and not UTF-8 / UTF-16. The views give bytes, not Unicode code points. This is already covered in the overview. The stringref proposal also provides operators that make it easy for languages to validate that the bytes are valid UTF-8 / UTF-16. Those operators can be made very efficient (which isn't possible with
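As a rough illustration of what such a validation means for WTF-16 data (a Python sketch of the idea, not the proposal's actual operator): a sequence of 16-bit code units is valid UTF-16 exactly when it contains no unpaired surrogates.

```python
# Sketch only: checks that a list of 16-bit code units contains no unpaired
# surrogates, i.e. that the WTF-16 data is also valid UTF-16.
def is_valid_utf16(units):
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:          # a high surrogate must be followed...
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False               # ...by a low surrogate
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:        # stray low surrogate
            return False
        else:
            i += 1
    return True

print(is_valid_utf16([0x0048, 0xD801, 0xDC37]))  # True: "H" + U+10437
print(is_valid_utf16([0xDCA9]))                  # False: lone low surrogate
```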
Correct me if I'm wrong, but with
I don't see why this is the case. The contents of the

Looking more closely at the Concatenation section of the proposal, the WTF-16/WTF-8 semantics are already hardcoded into
Right. JS implementations already do it, so I'd expect a Wasm implementation could do it too, just as long as it pretends that the string is made up of UTF-16 code units. This doesn't imply that the user needs to see the UTF-16 code units or UTF-16 string offsets, but it has implications around what the set of possible strings is.
Yes, if you want to do string operations on the JS string, then you would need to somehow import the JS string operations. Calls into JS are fast, so that's not a big deal.
Yes, which is why stringref is nice, because it normalizes the behavior and operations across all hosts, instead of being only for JS.
If the spec says that the encoding is WTF-16 (and thus provides constant-time access to WTF-16 bytes), then it would have to convert the JS string (which might not be WTF-16) into WTF-16. If the string representation is left unspecified, then the Wasm engine can do whatever it wants, including just sending over the JS string directly without any conversion.
No, it is not hardcoded; the implementation does not need to use WTF-16 bytes for the internal representation. That note is simply saying that when concatenation joins an isolated high surrogate with an isolated low surrogate, it needs to combine the two into a surrogate pair. That does not force the Wasm engine to use a WTF-16 implementation. You seem to be mixing up the internal representation of stringref with the observable behavior of the stringref operations. They are not the same thing.
stringref does not pretend to be any encoding; the encoding is always completely unspecified. It is the stringref views that pretend to be a particular encoding (such as WTF-8 or WTF-16). You can have multiple views on the same stringref (e.g. a WTF-8 and a WTF-16 view of the same stringref). When you convert from a stringref into a stringref view, this does have a (potential) performance cost. But after that conversion is done, operations on the stringref view should be fast.

So it is expected that languages will accept a stringref, convert it into a view which matches the language's internal string representation, and then do operations on that view. e.g. Rust would accept a stringref, convert it into a WTF-8 view, then do operations on that view. But Java would accept a stringref, convert it into a WTF-16 view, and then do operations on that view. The views are what make stringref work for every language, because different languages can have different views of the same stringref.
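To make the representation/view distinction concrete, here is a toy model in Python (my own sketch; the function names are made up and this is not the proposal's instruction set): treat the stringref as an abstract sequence of code points and derive both a WTF-16 view and a WTF-8 view from the same value.

```python
# Toy model only: a "stringref" here is just a tuple of code points; the
# engine's real internal representation is deliberately unspecified.

def wtf16_view(code_points):
    """WTF-16 code units: supplementary code points become surrogate pairs,
    lone surrogates pass through unchanged."""
    units = []
    for cp in code_points:
        if cp > 0xFFFF:
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
        else:
            units.append(cp)
    return units

def wtf8_view(code_points):
    """WTF-8 bytes: UTF-8, generalized so that lone surrogates are encodable."""
    out = bytearray()
    for cp in code_points:
        out += chr(cp).encode('utf-8', 'surrogatepass')
    return bytes(out)

s = (0x68, 0x10437, 0xD800)              # 'h', U+10437, and a lone high surrogate
print([hex(u) for u in wtf16_view(s)])   # ['0x68', '0xd801', '0xdc37', '0xd800']
print(wtf8_view(s).hex(' '))             # 68 f0 90 90 b7 ed a0 80
```

In this model the underlying sequence can be stored however the engine likes; only the derived views commit to an encoding.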
I've assumed that when you've talked about "hardcoding", you're still only talking about the specification. The implementation is obviously allowed to do whatever it wants as long as it matches the behaviour described by the specification. This includes using alternative underlying string representations, as already happens in JavaScript implementations, where the ECMAScript specification describes strings as being sequences of 16-bit integers.

It seems that semantically, this proposal is suggesting that

I think it should be fairly obvious that a

Note that well-formed WTF-8 strings correspond directly to potentially ill-formed UTF-16 strings (aka, WTF-16 strings). WTF-16 and well-formed WTF-8 both represent an array of 16-bit integers, and it seems like it should be up to an implementation to choose which one (if any) to use for the underlying string representation, but presumably it would be simpler if the specification clearly stated that the string is a sequence of 16-bit integers.
Yes, and the point of the spec is that the encoding of stringref is not specified. If it was specified, then that would be hardcoding. Because WebAssembly is very low level (at a similar level as CPU bytecode or VM bytecode), WebAssembly specs make certain guarantees about behavior, performance, and the exact layout of memory. The stringref proposal intentionally does not make guarantees about the exact memory layout of a stringref, instead it makes guarantees about the behavior and performance of stringref views.
Internally, no, it does not require that. It only requires that when concatenation joins an isolated high surrogate with an isolated low surrogate, they are combined into a surrogate pair. That does not impose any restrictions on the internal implementation of stringref, because that behavior can be trivially accomplished with any internal encoding.

For example, let's suppose that a WebAssembly engine chooses to implement stringref with WTF-8, and that the WebAssembly program concatenates these two stringrefs together: "\uD801" and "\uDC37". Because it uses WTF-8 internally, those two stringrefs are represented with the byte sequences ED A0 81 and ED B0 B7. And when concatenating, the new stringref will have the byte sequence F0 90 90 B7, the WTF-8 encoding of the single code point U+10437 formed by that surrogate pair.

This behavior is unambiguous and it follows naturally from the behavior of WTF-8 and surrogate pairs. But this is different from WTF-16, which would simply concatenate the two code unit sequences directly, producing D801 DC37 (which in WTF-16 already denotes U+10437).
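Those byte sequences can be double-checked with Python's codecs; the surrogatepass error handler is used here only to let the standard UTF-8/UTF-16 codecs handle lone surrogates (a sketch of the arithmetic, not of how an engine would implement concatenation):

```python
a, b = '\ud801', '\udc37'
print(a.encode('utf-8', 'surrogatepass').hex(' '))   # ed a0 81
print(b.encode('utf-8', 'surrogatepass').hex(' '))   # ed b0 b7

# Concatenation must recognize the high+low pair and yield the single
# code point U+10437, whose UTF-8/WTF-8 encoding is four bytes:
combined = (a + b).encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
print(hex(ord(combined)))                            # 0x10437
print(combined.encode('utf-8').hex(' '))             # f0 90 90 b7
```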
Correct, which is why the specification leaves the encoding of stringref unspecified, to allow those sorts of implementations.
It corresponds conceptually, but not in terms of actual memory layout. WebAssembly is low level: WebAssembly programs can inspect and modify individual bytes, and both WebAssembly programs and WebAssembly engines care about memory layout. That's why the stringref spec does not specify a particular memory layout; instead the engine has the flexibility to implement stringref however it wants (as long as it fulfills the requirements in the spec).
Yes, it is up to the implementation to decide its internal representation, and no, it would not be simpler or clearer if the spec said that a string is a sequence of 16-bit integers, because that would restrict the internal implementation. It is the views that provide guarantees about encoding, and there are multiple views; there is not a single "canonical" view.
The aforementioned languages cannot, because as we saw from the counter examples, they distinguish string values that stringref can't distinguish. The only canonical choice of intermediary would be well-formed Unicode. But stringref relaxes to WTF, which specifically covers additional string values that JS and friends can represent, but not those that some other languages have. AFAICS, that is a language-biased choice, not a neutral one.
Which does not fit the semantics of concatenation demanded by Python3 or the other languages mentioned. So again that is a biased semantics.
That's not correct. An implementation is always free to do whatever it wants internally as long as its observable behaviour matches the specification. And the prescribed observable behaviour arguably is most directly specified in terms of WTF-16 code units.
Really good catch, @Maxdamantus! So if we were to define an abstract string data type in WebAssembly, we have a few options. We could say that a string is:

1. a sequence of Unicode scalar values (well-formed Unicode, no surrogates at all);
2. a sequence of Unicode scalar values and isolated surrogates, where concatenation joins an adjacent high/low pair of isolated surrogates into a single scalar value (the JS / WTF-16-era behaviour);
3. a sequence of arbitrary Unicode code points, where isolated surrogates are never combined (the Python 3 behaviour);
4. a sequence of raw 32-bit values.
I made a thinko in e.g. https://wingolog.org/archives/2023/10/19/requiem-for-a-stringref, considering that (3) was the same as (2). My mistake.

Backing up a bit then, is there even a reasonable choice to make here? Maybe not -- ultimately this is a question of taste and consensus and tradeoffs -- but I think that if we were forced to choose, (2) is the right option.

In my humble opinion (emphasis on the H part here; happy to be corrected), Python's choice of (3) is actually a bug which could (should?) be fixed over time, and which does not need to be accommodated by a standard Wasm string data type. My reasoning is that surrogate codepoints proceed from two places only (happy to be corrected here too):

Surrogates don't come across the wire, because they are not part of a Unicode encoding scheme: they can't be expressed in UTF-16 or UTF-8. So you can only create them in your program, and there are only these two ways that it happens. In the case of Python 3, neither case applies: there are no UTF-16 surrogate pairs to split, and you wouldn't use a Python 3 string as raw data storage.
@wingo, as far as I am concerned, isolated surrogates are a bug in practice no matter how they appear in a text string. The difference between (2) and (3) is that the former covers the class of bugs arising in languages of the 16-bit Unicode era, while the latter covers the cases arising in languages from the 32-bit Unicode era. Both come from exposing the respective underlying encoding and giving read/write access to raw code units. Neither is more likely to be "fixed" in future versions of the corresponding languages. Moreover, down the road, new languages are more likely to be in camp (3) than in camp (2). (Hopefully, they'll all be in camp (1) instead.)

From my perspective, there are two defendable design choices for a "universal" Unicode string semantics (ignoring (4), which could be raw 32-bit values):

a. Support the intersection of possible values – that would be (1), maximising utility for well-defined language interop.

In contrast, (2) is neither here nor there, and combines the semantical disadvantages of both.
I've been wondering about another approach which seems simpler interfacially, not sure if it's already been discussed: Just add three string types,

I feel like it should be feasible for an implementation to essentially create zero-copy

If something like this were added, I would expect for example that converting from an ill-formed

Also note that for strings containing only ASCII data, this conversion is likely to be a no-op, since ASCII strings use the same code units and same length in UTF-8, UTF-16 and UTF-32.

The main implementation issues I see with this approach for non-ASCII strings are computing length and indexing of the zero-copy converted strings, since those would potentially be linear operations. This might be solvable by simply computing length and/or breadcrumbs lazily.

I think it would still make sense to have

Also regarding Python's use of code points as the unit of strings, I don't think it's possible for them to fix this[0], especially given that they make use of these ill-formed strings for encoding ill-formed UTF-8 data (note that this is similar in principle to how WTF-8 uses ill-formed UTF-8 to encode ill-formed UTF-16 data).

[0] At least not without a major Python version. I've long considered the change between Python 2 and Python 3 to be a mistake (I think this opinion is becoming less and less controversial), and PEP-0383/surrogateescape is one piece of evidence for this (my understanding is that they added this some time after Python 3 was released when they realised that actually you often do just need to pass the bytes along in a string, but this raises the question of why one ill-formed string is better than the other ill-formed string).
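For the lazy-length idea above, a minimal sketch (the class and method names are assumed for illustration; nothing here comes from the proposal): keep the original UTF-8 buffer untouched and compute the UTF-16 length only the first time it is asked for.

```python
# Rough sketch of "convert zero-copy, compute lazily": the wrapper shares the
# original UTF-8 bytes and derives the UTF-16 code unit count on demand.
class LazyUtf16Length:
    def __init__(self, utf8_bytes: bytes):
        self._utf8 = utf8_bytes      # the original buffer, shared as-is
        self._len16 = None           # cached UTF-16 code unit count

    def utf16_length(self) -> int:
        if self._len16 is None:
            # Code points above U+FFFF need a surrogate pair (2 units) in UTF-16.
            self._len16 = sum(2 if ord(c) > 0xFFFF else 1
                              for c in self._utf8.decode('utf-8'))
        return self._len16

v = LazyUtf16Length('a💩'.encode('utf-8'))
print(v.utf16_length())   # 3 (1 unit for 'a', 2 for U+1F4A9)
```

Breadcrumbs (periodic offset checkpoints) would presumably extend the same idea so that indexing becomes sub-linear after the first pass.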
Just a note on https://peps.python.org/pep-0383/: this strategy appears to only ever produce isolated low (trailing) surrogate codepoints, in the range U+DC80–U+DCFF. These strings are representable in option (2).
Note that they can feasibly come across the wire in the form of JSON.
I'll start by saying that I'm not well-versed in WebAssembly specifications or proposals, but I happened to come across this proposal, and I'm quite interested in Unicode strings and how they're represented in different programming languages and VMs.
This might be just a wording issue, but I think the current description of "What's a string?" could be problematic:
If this is taken to mean that a string is a sequence of Unicode code points (ie, "Unicode scalar values" and "isolated surrogates", basically any integer from 0x0 to 0x10FFFF), this does not correspond with "WTF-16" or JavaScript strings, since there are sequences of isolated surrogates that can't be distinctly encoded in WTF-16.

eg, the sequence [U+D83D, U+DCA9] (high surrogate, low surrogate) doesn't have an encoding form in WTF-16. Interpreting the WTF-16/UTF-16 code unit sequence <D83D DCA9> produces the sequence [U+1F4A9] (a Unicode scalar value, 💩). There are 1048576 occurrences of such sequences (one for every code point outside of the BMP), where the "obvious" encoding is already used to encode a USV.

Note that this is different to how strings work in Python 3, where a string is indeed a sequence of any Unicode code points:
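For instance, a short Python 3 illustration of both halves of the point (the hard-coded bytes below are just <D83D DCA9> in little-endian order):

```python
# Python 3 keeps the two isolated surrogates as two distinct code points;
# concatenation never pairs them up:
s = '\ud83d' + '\udca9'
print(len(s), [hex(ord(c)) for c in s])   # 2 ['0xd83d', '0xdca9']

# But the "obvious" WTF-16 code unit sequence <D83D DCA9> already denotes the
# single scalar value U+1F4A9, so this string has no distinct WTF-16 encoding:
decoded = b'\x3d\xd8\xa9\xdc'.decode('utf-16-le')
print(len(decoded), hex(ord(decoded)))    # 1 0x1f4a9
```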
If this proposal is suggesting that strings work the same way as in Python 3, I think implementations will likely[0] resort to using UTF-32 in some cases, as I believe Python implementations do (I think Python implementations usually use UTF-32[1] essentially for all strings, though they will switch between using 8-bit, 16-bit and 32-bit arrays depending on the range of code points used). Other than Python, I'm not actually sure what language implementations would benefit from such a string representation.
As a side note, the section in question also lumps together Python (presumably 3) and Rust, though this might result from a misunderstanding that should hopefully be explained above. Rust strings are meant to be[2] valid UTF-8, hence they correspond to sequences of Unicode scalar values, but as explained above, Python strings can be any distinct sequence of Unicode code points.
[0] Another alternative would be using a variation of WTF-8 that preserves UTF-16 surrogates instead of normalising them to USVs on concatenation, though this seems a bit crazy.
[1] Technically this is an extension of UTF-32, since UTF-32 itself doesn't allow encoding of code points in the surrogate range.
[2] This is at least true at some API level, though technically the representation of strings in Rust is allowed to contain arbitrary bytes, where it is up to libraries to avoid emitting invalid UTF-8 to safe code: rust-lang/rust#71033