String constants #13
Comments
Sure, that could work. I find it hard to estimate whether it would be more efficient than using globals directly, but it would certainly have a higher limit on the number of strings it can support. Speaking of which: the relevant limit for the status quo is 100K imports, not 1M globals. I think requiring a bunch of JS code to run before the Wasm module can be instantiated is Not Great. That's the bullet that all imports-based approaches to string constants have to bite. I think making …
Ah yes, I felt like it was lower than 1M but couldn't find a reason why.
From some discussion with @rossberg, I realized this would require us to have … A different option to avoid any significant core wasm change would be to: …
I'd really prefer to avoid adding an optimization like this to our baseline compiler (and I assume interpreter, if we had one too). We'd need to add some way of tracking a deferred …
Sure, we don't have it in our baseline compiler either. It's a functionally unobservable and hence strictly optional internal optimization, appropriate mostly for optimizing compilers.
Just so I understand correctly, are you saying that your recent toolchain experience indicates we'll need something more than globals? If so, should I push on the immutable array idea, or do you have another idea?
I'm saying that toolchains want to be able to use strings in globals. It's less about the character sequences themselves, it's more about things that transitively depend on them, such as … Having a constant … In the absence of a …

The other way is to import the strings themselves as immutable globals. That maintains the ability to use global initializers for string-using globals and have them be immutable+nonnullable. The drawbacks are that (1) all strings must be eagerly created in JS before the Wasm module can be instantiated (so, again, no laziness; and to make it even worse, that eager code on the critical path is one-shot JavaScript), (2) the number of imports to process increases from hundreds to tens of thousands (and we may even have to raise the limit on allowed imports when sufficiently large apps start running into it), (3) toolchain support to create/optimize/manipulate these JS+Wasm bundles is yet to be implemented (and affects everything: e.g. merging modules will also have to merge their JS-provided string constants; dead-code elimination has to rewrite both files; etc.).

I suspect that we will have to prototype both of these approaches in toolchains in order to determine which one is more viable (or even: whether either of them is viable at all). I don't see how tables or immutable arrays would help much, as they don't change the fundamental equation; the only clear benefit they bring is reducing the number of individual imports (addressing drawback (2) above); but it sounds to me like the price they pay for that is that they require even more eager work to be done up front (the respective …)
Adding …
So if I read this correctly, this option is to make all (or most) globals mutable, initialize them at …
Agreed, importing each string as an immutable global seems like the per-global import cost would be prohibitive. That's why I was trying to think through an 'Option 3' where a module only imports a single immutable array which contains all the strings in the module, and then the module's other global initializers could refer to that string array using a constant `array.get` operation.

Regarding 'laziness' of creating strings, what are you imagining? At least in SM in JS, we create the actual string objects when parsing the source (as far as I can tell). Does V8 do something differently?
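To make the two import shapes concrete, here's a hedged JS sketch of what the host side would construct for each approach. The module and field names (`str`, `""`, `strings`) are illustrative, and in the real 'Option 3' the import would be a Wasm GC array created from an iterable, not a plain JS array:

```javascript
// Stand-in for the ~100K string constants a real module would carry.
const constants = ["hello", "world", "foo"];

// Per-string immutable globals: the engine performs one import lookup
// (a full JS property lookup) per constant at instantiation time.
const perGlobalImports = { str: {} };
constants.forEach((s, i) => { perGlobalImports.str[i] = s; });

// 'Option 3': a single import carrying all the strings; the module's
// other global initializers would then use constant array.get to pull
// out individual entries. (A real engine would want a GC array here.)
const arrayImports = { "": { strings: constants } };
```

The point of the contrast is that the second shape keeps the import count constant no matter how many strings the module uses.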
It's a tiny addition compared to GC. I don't expect any system that doesn't implement GC to implement GC'ed strings.
Agreed. All options aside from a …
V8 does that too: for string constants in JS source, it eagerly creates interned strings. So there's certainly a chance that we may, at least for now, get away with Wasm being like JS in this regard. I generally think it's desirable to design Wasm such that it can perform better than JS; otherwise what's the point of compiling to Wasm instead of compiling to JS?
Compared to GC, sure; compared to this proposal, it's a big addition to cross into core wasm with a new valtype/section, and it raises the bar of general usefulness quite a bit. I would like to see this proposal move as fast as possible and avoid getting stuck in big core wasm discussions. Again, not ruling it out, but it seems expedient to look at other extensions first.
My goal with this proposal is for wasm to have parity (or very very close) with JS when it comes to JS strings. It'd be nice to have it perform better on JS strings, but that's not a hard requirement.
I finally got around to prototyping and benchmarking this (source here). I tested out four strategies: imported globals, array from iterable, data segments, and a string section.
All of these used a list of 100,000 random (generally short) strings. Testing on my machines I got the following: …

These numbers are for the full script to parse and execute, and include synchronous wasm compilation and instantiation (using our baseline compiler). For uncompressed file sizes, I get: …

The data segments one is an outlier in runtime, and that's mostly due to the compile time for the big 'init' function which does … When it comes to the string section, it doesn't seem to buy anything for runtime. I took the strategy of allocating the strings and storing them on the instance upon instantiation. It does seem like the string section is more compact, but I haven't measured compressed sizes so I don't know. As of right now, it seems like array_from_iterable is the sweet spot of minimal spec complexity and performance.
Thanks for the data @eqrion! I think we may want to add something to the code size and startup time for "array from iterable" to make the comparison fully apples-to-apples. Atm that binary is

```wat
(module
  (type $strings (array externref))
  (import "" "strings" (global $gimport$0 (ref $strings)))
)
```

which means we have an array of strings. To access an individual string we'd need … In comparison, the imported-globals binary is

```wat
(import "str" "0" (global $gimport$0 externref))
(import "str" "1" (global $gimport$1 externref))
(import "str" "2" (global $gimport$2 externref))
...
```

which means the strings are already separated, and accessing an individual string costs only a `global.get`. Apologies if I have misunderstood something in the code or explanation, which is definitely possible!
Yeah, this is where comparing things gets hard. I was expecting that a toolchain using the 'array from iterable' approach would skip creating a global per string and do the array access scheme you mentioned. It's at the cost of a bounds check and indirection per use of the constant, but it avoids the cost of creating/initializing the globals.
The extra code size for 'imported globals' (compared to 'array from iterable') is from the import statements (otherwise they're pretty similar), and the extra startup time comes from imports not scaling well (requires a full JS property lookup per import). I don't think those are things that 'array from iterable' will necessarily need to do (although I may be wrong, and accessing from a global instead of the array may be required).

I guess I could add a separately timed stage to the benchmarks that accesses each string constant once to capture the cost that each scheme implies? e.g. `global.get` vs `array.get` vs `string.const`. Would that be what you're interested in?
Definitely, yeah, there are multiple possible tradeoffs here, and comparisons get hard. Though my concern about "array from iterable" isn't just runtime speed but also size - each use of a string gets larger. Enough appearances of a particular string constant and the binary could be bigger, maybe even big enough to offset any startup benefit to not creating those globals. But fwiw I did a measurement on a large Java testcase I have and most strings appear only once, so at least there "array from iterable" would be smaller.
Yes, I think that's worth doing. Though maybe be careful not to put it all in a single big function - my worry is that this is one of those costs that gets "spread" around an entire codebase, while in a single big function the VM might do things like store the array in a local after the first load, which would hide the true cost.

Also an option is to involve Java/Dart/Kotlin people in this discussion to get a more realistic picture of how they use strings atm. But just to try to present the point of view I've seen from the Java side in my work with them, we've worked very hard together (J2Wasm+Binaryen) to avoid runtime overheads that get spread around the codebase, by focusing on tweaking things like how itable accesses are done and trying to avoid every possible load there. So even an extra …

Btw, another option might be "struct from iterable", just replacing the array with a struct. We do know the number of fields statically per compilation, so that is possible, and while it would add some type section size, it would then avoid an …
That would be a nice solution, but I think we have an implementation limit of 10k struct fields, which from the above discussion seems too small for string constants. Part of the motivation for using arrays instead of globals was the 100k import limit. I'll try to get some data on the cost of accessing each string constant once for all of the different approaches.
I suppose if toolchains were open to splitting their string data into multiple structs as needed, it would scale well. It would be some more complexity for toolchains though. We also probably could increase the struct field size limit to 100k with a concrete use-case. |
Here's some quick data on adding a single use for every string constant to compare overheads. I did put all of the uses into a single function. However, I'm only testing with our baseline compiler, so there will not be any optimizations that re-use array pointers, bounds checks, dead-code elimination, etc. It should be basically equivalent to spreading them out across different functions.
The before/after numbers show the effect of compiling and running a function that loads every string for that strategy. I don't have more data for the string section approach, as I haven't had time to implement a `string.const` instruction as well. I would guess it would be roughly the same cost as a single …
From looking at some profiles, I do believe we could speed up the imported globals approach. Most of the extra time there is spent in parsing the imports in the binary, then doing slow generic JS property lookups on the imports object. I think we could add a special case for using numeric indices on a dense JS array that bypasses a lot of that generic code. However, that approach is still limited to 100_000 imports as said above. I'm warming up to the idea of a …
@kripken For the J2CL use-case, how many strings are you seeing right now, and what would be a reasonable maximum limit? Would supporting up to 100k strings in a struct work for you? If you needed more, would splitting into several structs be a large burden?

@bashor Would you have a concern about having a 100k string constant limit?
@eqrion I see 73K strings in a large Java testcase. But I can't say if that is a reasonable limit, other codebases may well need more...
The first and last lines seem to say that the imported array method has less startup overhead but a little more runtime overhead. The middle one though confuses me: how can …
I was confused by that as well. I spent a good amount of time verifying that I didn't mess that one up originally and couldn't find anything. I'll try to swing back and look at that again.
We'll check it on some projects and come back to you.
@eqrion so far, the biggest number of string literals we've seen in real applications is ~14K.
It's hard to predict if 100K is enough, and for how long. I think the current number could be easily doubled within a year. |
I'm wondering how strict these limits are, and if they could be increased in the future. |
It depends on which approach is taken: (1) importing each string as its own global, or (2) importing a single struct containing all the strings.
With (2) I think it would be reasonable for implementations to increase their max struct size to 100,000 fields to support this use case. Going above that may get tricky, but 200,000 may be possible. I don't think engines want to dramatically increase the limit on number of imports. Even if we had a limit of 100,000 fields for structs, it would still be possible for toolchains to split their string constants into multiple structs, and import multiple of them. It just is less convenient.
Yes, I think toolchains can fairly easily emit multiple structs here to stay under the 10,000 field limit. (This does require a whole-program decision, but even emitting a single struct requires that anyhow, so I don't see it as adding much work.)
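The splitting described above is straightforward to sketch. A hedged JS helper (names mine, field limit taken from the discussion) showing how a toolchain could partition its string constants into struct-sized chunks:

```javascript
// Partition string constants into chunks small enough to fit one struct
// each, given a per-struct field limit (10,000 in the discussion above).
function chunkConstants(names, maxFields = 10000) {
  const chunks = [];
  for (let i = 0; i < names.length; i += maxFields) {
    chunks.push(names.slice(i, i + maxFields));
  }
  return chunks;
}

// 25,000 constants -> three structs: 10,000 + 10,000 + 5,000 fields.
const chunks = chunkConstants(
  Array.from({ length: 25000 }, (_, i) => "s" + i)
);
```

Each chunk would then become one imported struct, so the import count grows only with the number of chunks, not the number of strings.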
For Google Sheets, when we switched to the string-builtins spec, our production binaries got 442KB larger for a total of around 14k string constants - that's an extra ~5%.

re. offloading string constants to be part of the JavaScript load: …

re. number of constants: …
@gkdn, can you share how the compressed size changes as well? |
When compressed, the regression is 38kb / ~3.5%.
For dart2wasm's use case of js-strings, @jakobkummerow's suggestion sounds really good to me. As an example of how much the .js shim grows with strings: currently in this app with a lot of strings, the generated JS shim (with string literals) is 546,872 bytes, of which 518,837 bytes are for the strings (94%). @jakobkummerow's suggestion solves this problem, and I don't see any downsides for dart2wasm as long as the number of imports and the name size limit are large enough.
@osa1 Does Dart disallow invalid UTF-16 (like unmatched surrogate pairs)? |
The Scala compiler allows unpaired surrogate chars in string literals. However, from my experience, actual occurrences of such strings are exceedingly rare in practice (some tools surrounding the compiler, such as the code formatter, even choke on them). As long as there is some workaround available, even if not very performant, I believe it would be fine to restrict string constants to valid Unicode scalar value sequences.
(I deleted my original comment as I was wrong.) Dart also allows unpaired surrogates in string literals. Example:

```dart
void main() {
  final s1 = "\u{D83D}";
  final s2 = String.fromCharCode(0xDE00);
  print(s1 + s2); // prints 😄
}
```

I also think these cases should be extremely rare, and we have the workaround of importing such strings in the .mjs file.
Implementation feedback:

I. Design question: …

Reasons in favor of …

Reasons in favor of …

Personally I have no strong opinion on this, the question just occurred to me when I started implementing these constants. Both options are implementable.

II. Choice of the magic module name: …

III. Implementation effort: …
I think I have a slight preference for the flag being part of the compile phase. The reason would be that it allows us to know during eager compilation that while the global is 'externref' it will only ever be a non-null string and we could use that to eliminate some casts when we see the global used. But also, it's not a strong preference. I do have a strong preference for this being a separate kind of flag though.
I think that if we can't get any agreement on a compact import section format that using a single character might be okay, but I think I'd want to measure it first. I think a compact import section could let us optimize string imports very well by switching to a tight loop that just allocates strings and puts them into their final storage. |
I'm landing my patch in V8, to make it easier for our toolchain partners to experiment with this approach. Of course it's still part of the off-by-default … For now, it uses …
Cool, I hope to have a quick PR up for this soon too. I'll just use the same module name as well for now. |
Per the discussion at WebAssembly/js-string-builtins#13, this patch implements imported externref globals that can be requested to be automatically populated with string constants, with the string's value equaling the respective import's name.

Bug: v8:14179
Change-Id: Ie86dc28b5a337c64ef03cb879e45adf340001a00
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/5438375
Commit-Queue: Jakob Kummerow <[email protected]>
Reviewed-by: Matthias Liedtke <[email protected]>
Cr-Commit-Position: refs/heads/main@{#93301}
I've now adopted the imported string constants approach described above in the overview. I will leave this issue open for a while in case this approach needs refinement or we want to look at another approach. |
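The imported string constants scheme can be approximated from the host side in plain JS: supply, under the magic module name, each requested import's own name as its value (the engine-side feature does this automatically; the `"'"` module name follows the V8 prototype mentioned above, and the helper name is mine):

```javascript
// Polyfill-style sketch: for each requested constant, the import's
// *name* is the string, so the value supplied is just the name itself.
// JS strings coerce to externref for immutable global imports.
const MAGIC_MODULE = "'"; // magic module name used by the V8 prototype

function addStringConstants(importObject, names) {
  const mod =
    importObject[MAGIC_MODULE] ?? (importObject[MAGIC_MODULE] = {});
  for (const name of names) {
    mod[name] = name; // with the real feature, the engine does this itself
  }
  return importObject;
}

const imports = addStringConstants({}, ["Hello", "World"]);
// imports["'"]["Hello"] === "Hello"
```

A module would then declare `(import "'" "Hello" (global externref))` and read the constant with a plain `global.get`.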
Collected some data from Google Sheets: …

The regression over `string.const` is likely coming from: …
I'm going to close this as I believe we have an acceptable solution now. |
I'd like to reopen the discussion a little bit - remember in the in-person meeting, I asked whether a modified scheme like the following would be possible: …
Now I realise that there's a specific advantage of this alternative scheme: the custom section can use UTF16/WTF16 encoding. In the current scheme where strings come from the import names directly, a reencoding step seems needed to go from UTF8 -> WTF16, which would need a copy. Also, since UTF8 can't represent all WTF16 strings, producers will need a fallback. If the custom section in the alternative scheme directly encodes UTF16/WTF16 strings, there's no reencoding step or producer gap (with WTF16), and engines could even in principle directly use the allocated space of the custom section as the backing store for the relevant string. I realise that custom section based designs have been discussed above, but they seem to involve host code parsing the custom section and feeding it back into the module's imports, rather than combining with the "compile time flag" idea that's been newly introduced. |
@conrad-watt nice idea, two comments: (1) Wire bytes size will be larger. The current approach needs (on top of the import's "module name", whatever that'll end up being) the length of the string and the string payload itself (as utf-8). Your idea needs the length of the index string, the index string, the length of the actual string (assuming a vector-of-vectors encoding for the custom section like existing similar sections), and the string payload itself; so depending on how exactly the indices are encoded, that'll probably add around 3-6 bytes per string constant. (2) I once tried to implement string constants in wire bytes to be used as backing stores for the strings created from them, but had to give up on the attempt because sections aren't two-byte aligned, so for the time being I consider this an entirely hypothetical benefit that won't actually materialize in practice. (In many common languages, utf-8 is also much shorter than wtf-16, so module producers may well opt against wtf-16 for that reason if given a choice.) |
IIUC, I think we could cut this down to just … for each string constant, if the custom section is a concatenation of all the strings rather than a vector of strings. This is still one more LEB128 than the current encoding, and therefore would have that additional code size overhead, but maybe that is worth it to avoid the re-encoding issues and the WTF16 compatibility?
Would this allow string global constants to reuse each others' bytes by overlapping their declared ranges in the custom section (e.g. substrings)? I don't know how much this would come up in practice, but it seems an interesting compression "trick". |
Yes, good point. If you had both of the strings …
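The overlap trick can be illustrated with a small JS sketch (the (offset, length) encoding here is invented for illustration): every constant is a slice of one concatenated payload, so a substring can reuse the bytes of a longer string:

```javascript
// One concatenated payload; each constant is an (offset, length) slice.
// Overlapping slices share bytes: "bcd" reuses part of "abcde".
const payload = "abcde";
const ranges = [
  [0, 5], // "abcde"
  [1, 3], // "bcd" - fully contained inside the first entry
];

const strings = ranges.map(([off, len]) => payload.slice(off, off + len));
// strings[0] === "abcde", strings[1] === "bcd"
```

As noted below in the thread, a UTF-8 payload would additionally need validation that each offset lands on a scalar-value boundary.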
On our main test suite, encoding strings as WTF-8 instead of WTF-16 (in the data segment) reduces the overall code size (not just size of the strings) by almost 6% (400 KB out of 6800 KB). That's a test suite that contains a non-trivial amount of non-ASCII code points in constant strings too, since it contains a number of test cases for string algorithms. I'm fairly confident the amount of non-ASCII code points in that codebase is above the average of production builds. I don't think we want an encoding that mandates U/WTF-16. That would be a very significant code size regression by design. |
Following on @sjrd's point, I believe it's uncommon (and not recommended) for JavaScript source to be served over the web in UTF-16 format; instead, UTF-8 is preferred. See, for example, Mozilla engineer & character encoding expert Henri Sivonen's post Always Use UTF-8[...], or this comment in Chromium's code for streaming JS parsing (which doesn't even seem to support UTF-16!). While this doesn't imply that Wasm must do the same, it shows that UTF-8 is a very common way to send JS strings over the wire, and it's reasonable to expect JS engines to be adept at dealing with them in that encoding. |
FWIW, the concatenated/overlapping trick could be used with utf-8 too, although you'd have to validate that indices fell on valid USVs. |
How would you encode that in an import, and how many bytes would that take? |
If not extending core Wasm import encoding, you'd need an encoding scheme that is valid utf-8. This is going to be less dense and/or have a more restricted range than LEB128s. Just brainstorming, haven't thought deeply about this, but: I imagine you could use …
This is basically using the USV as the integer value directly when possible, but because USVs are not contiguous, we subtract the size of the non-contiguous hole when we have a USV larger than the hole. This has the advantage of a pretty dense encoding, at most 4 bytes per integer, that is also pretty easy and fast to encode and decode. The downside is that it cannot represent values larger than …

Regardless of the exact encoding scheme for an integer chosen, we could further compress them by using a delta encoding, where we transform this series of … into …

This delta encoding would perhaps alleviate some of the constraints imposed by the max-encodable-integer limitation, but more importantly would reduce most encoded integers to a single byte, since it would be rare for consecutive integers to have a larger delta than what can be encoded in a single byte, especially if the producer toolchain sorts them. Other ideas for posterity: …
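The delta idea can be sketched in a few lines of JS (encoding details invented; a real producer would emit these deltas in whatever byte encoding was chosen above): keep the first offset, then store only differences, and reconstruct with a running sum:

```javascript
// Delta-encode a sorted list of integers: the first value is kept as-is,
// every subsequent entry stores only the difference to its predecessor.
function deltaEncode(sorted) {
  return sorted.map((v, i) => (i === 0 ? v : v - sorted[i - 1]));
}

// Reconstruct the original values with a running sum.
function deltaDecode(deltas) {
  let acc = 0;
  return deltas.map((d) => (acc += d));
}

const offsets = [100, 103, 110, 111];
const deltas = deltaEncode(offsets); // [100, 3, 7, 1]
const decoded = deltaDecode(deltas); // round-trips to the original offsets
```

With the offsets sorted, most deltas are tiny, which is what lets them fit in one encoded byte each.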
Anyways, I realize this is all getting a bit out there. I think it could work and has some pretty nice properties, but I also understand if it feels like too much complexity. I definitely don't want to side track and bog down this conversation or proposal if people don't think these ideas are worth considering.
Agree with @jakobkummerow, SM is in a similar situation and can't really use the wire bytes as the backing store for the strings. We also likely wouldn't want to, because that would tie the lifetime of the strings to the bytecode and keep it alive longer than it needs to. Also agree with @ajklein and @sjrd that UTF-8 is going to be size-superior to UTF-16 in almost all use-cases.

Regarding the ability to subslice into the custom section for compression, I think that's also possible with the current import field scheme using the `String.substring` builtin. I also am not sure how beneficial it is in practice.

It seems like the only advantage left would be the ability to encode unpaired surrogates directly (using WTF-8/16). This would mean we'd have to take a JS-API spec dependency on the WTF spec, which is not standards track anywhere AFAIK. It also seems like an easy thing to work around if you need it, and I haven't heard too many complaints about it.

I know there was a complaint at the in-person meeting about how the import field names are now 'semantic' with this proposal, but I don't think using indices into a custom section makes this better. So overall, I'm inclined to keep the current string constant feature, but would like to hear more if folks disagree.
I think one other point to bring up would be that the custom section approach might be able to handle longer strings, if a large import name would blow some implementation-specific limit. I've made my peace with the current approach so I won't push hard if the custom section-based approach isn't preferred. |
Would it make sense to have a way to offset the indices of some imports, leaving a hole for regular definitions, to avoid having imports consume all the smallest indices? In the text format, this could correspond to allowing interleaving of imports and regular definitions.
If we run into the implementation-defined limits for import names, I would support increasing them and don't think we'd run into any issue with that. Although, sort of related to the length limits issue you raised, it would be nice if these string payloads were at the end of the binary to allow compilation of the function bodies to start sooner. I don't think though that's a good enough reason to start defining a custom section for this (given the complexity concerns above).
At least for spidermonkey, we take advantage of the fact that imports come first very heavily. I'd prefer to avoid changing that unless we need to. |
In the event that happens, I believe we could easily expand the implementation-specific limit. I'm going to close this again as I think our current approach is preferable to the alternatives that were brought up after the in person CG meeting. If folks think of another idea, let's talk about it in a new issue. |
Here's another idea for doing string constants that uses a table instead of globals. This can avoid the implementation limit of 1,000,000 globals we have, and possibly be more efficient. The module uses `table.get` to get the strings it wants (a minimum table size can avoid bounds checks). `Table.copyFrom()` would be a new JS-API method on tables that takes a JS iterable object and copies its values into itself starting at a specific index.

A variation of this could also be that the Table JS-API constructor takes an iterable and uses that for the initial values of objects. This could possibly allow the table type to be `(ref extern)`.

One problem is that toolchains also want to refer to string constants in global initializers when constructing static objects. Another extension could be to allow `table.get` as a constant expression (the table section comes before the global section). This might be a little tricky, because `table.get` can trap if the index is out of bounds. At least for SM, we already have to handle instantiation failure due to OOMs in initializer expressions, so this may not be that problematic?

@jakobkummerow WDYT? cc @rossberg if they have any opinions about allowing `table.get` as a constant expression in global initializers.
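`Table.copyFrom()` does not exist in the JS-API today; a rough approximation using the existing `WebAssembly.Table` constructor and `set` (assuming externref tables, which current engines support; the helper is a stand-in, not the proposed API) might look like:

```javascript
// Sketch of the proposed Table.copyFrom(iterable, start) as a helper
// built on the existing externref table JS-API.
function copyFrom(table, iterable, start = 0) {
  let i = start;
  for (const value of iterable) {
    // Does not grow the table; it must already be large enough.
    table.set(i++, value);
  }
  return table;
}

// Host fills the string table once; the module then reads entries
// with table.get instead of needing one global per string.
const table = new WebAssembly.Table({ element: "externref", initial: 3 });
copyFrom(table, ["foo", "bar", "baz"]);
// table.get(1) === "bar"
```

A real `copyFrom` in the engine could skip the per-element JS calls and bulk-copy into the table's backing store, which is where the efficiency win over per-import globals would come from.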