
String constants #13

Closed · eqrion opened this issue Dec 6, 2023 · 63 comments

@eqrion (Collaborator) commented Dec 6, 2023

Here's another idea for doing string constants that uses a table instead of globals. This can avoid the implementation limit of 1,000,000 globals we have, and possibly be more efficient.

  1. Wasm imports a table of externref with a minimum size equal to the total number of string constants
  2. JS code creates the table using JS-API constructor
  3. JS code creates an array of JS strings (from an encoded JSON array if beneficial)
  4. JS code uses a new Table.copyFrom(iterable, startIndex) method to initialize the table, passing it the array it loaded
  5. JS code instantiates the module and passes it the table
  6. Wasm code uses a table.get to get the strings it wants. (minimum size can avoid bounds checks)

Table.copyFrom() would be a new JS-API method on tables that takes a JS iterable object and copies its values into the table starting at a specific index.
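A minimal sketch of that flow in JS, assuming the proposed Table.copyFrom method and a JSON payload of the constants (stringsJson and moduleBytes are illustrative inputs):

// Hypothetical embedder code for the table-based scheme; Table.copyFrom
// is the method proposed above, not an existing JS-API method.
const strings = JSON.parse(stringsJson);             // step 3: decode the constants
const table = new WebAssembly.Table({                // step 2: create the table
  element: "externref",
  initial: strings.length,
});
table.copyFrom(strings, 0);                          // step 4: bulk-initialize (proposed)
const { instance } = await WebAssembly.instantiate(  // step 5: pass the table in
  moduleBytes,
  { env: { strtab: table } },
);
// step 6: wasm code reads string i with (table.get $strtab (i32.const i))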

A variation of this could be for the Table JS-API constructor to take an iterable and use it for the table's initial values. This could possibly allow the table type to be (ref extern).

One problem is that toolchains also want to refer to string constants in global initializers when constructing static objects. Another extension could be to allow table.get as a constant expression (table section comes before global section). This might be a little tricky, because table.get can trap if the index is out of bounds. At least for SM, we already have to handle instantiation failure due to OOMs in initializer expressions, so this may not be that problematic?
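For illustration, a global initializer under that extension might look like the following (hypothetical: table.get is not a constant expression today):

(module
  (import "env" "strtab" (table $strtab 1000 externref))
  ;; hypothetical constant expression: pull string 42 out of the imported table
  (global $greeting externref (table.get $strtab (i32.const 42))))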

@jakobkummerow WDYT? cc @rossberg if they have any opinions about allowing table.get as a constant expression in global initializers.

@jakobkummerow (Contributor)

Sure, that could work. I find it hard to estimate whether it would be more efficient than using globals directly, but it would certainly have a higher limit on the number of strings it can support.

Speaking of which: the relevant limit for the status quo is 100K imports, not 1M globals.

I think requiring a bunch of JS code to run before the Wasm module can be instantiated is Not Great. That's the bullet that all imports-based approaches to string constants have to bite.

I think making table.get (for immutable imported tables) a constant instruction is probably a hard requirement for making this approach viable. If a module doesn't need strings in constant initializers, it can simply use array.new_data + String.fromWtf8Array (which engines can shortcut to an internal "string from data segment" operation, and which toolchains can support easily because everything is in the Wasm module).
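For illustration, that non-constant path might look like the sketch below; the import module and name are placeholders for whatever the string-from-array builtin ends up being called (the thread refers to it as String.fromWtf8Array):

(module
  (type $i8arr (array i8))
  ;; placeholder import for the string-creation builtin discussed above
  (import "wasm:js-string" "fromWtf8Array"
    (func $from_wtf8 (param (ref null $i8arr) i32 i32) (result externref)))
  (data $bytes "Hello, world!")
  (func $make_string (result externref)
    ;; build an i8 array from the data segment, then convert it to a string
    (call $from_wtf8
      (array.new_data $i8arr $bytes (i32.const 0) (i32.const 13))
      (i32.const 0)
      (i32.const 13))))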

@eqrion (Collaborator, Author) commented Jan 19, 2024

Speaking of which: the relevant limit for the status quo is 100K imports, not 1M globals.

Ah yes, I felt like it was lower than 1M but couldn't find a reason why.

I think making table.get (for immutable imported tables) a constant instruction is probably a hard requirement for making this approach viable.

From some discussion with @rossberg, I realized this would require us to have immutable (and probably also non-growable) tables, which would be substantial additions. I'm not sure if they're worth adding to the wasm language.

A different option to avoid any significant core wasm change would be to:

  1. Import a (global $strtab (ref (array (ref extern))))
  2. Add array.get for immutable arrays as a constant expression
  3. Add a WebAssembly.Array.fromIterable(heapType, iterable) which could be used to take a JS array of strings and turn it into an immutable wasm array (see the sketch after this list)
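A sketch of the embedder side under this option; WebAssembly.Array.fromIterable and the shape of its heap-type argument are hypothetical:

// Hypothetical: pack all string constants into one immutable wasm array.
const strings = JSON.parse(stringsJson);
const strtab = WebAssembly.Array.fromIterable(
  { element: "externref", mutable: false },  // illustrative heap-type descriptor
  strings,
);
const { instance } = await WebAssembly.instantiate(moduleBytes, {
  env: { strtab },  // satisfies (import "env" "strtab" (global $strtab (ref $strings)))
});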

If a module doesn't need strings in constant initializers, it can simply use array.new_data + String.fromWtf8Array (which engines can shortcut to an internal "string from data segment" operation, and which toolchains can support easily because everything is in the Wasm module).

I'd really prefer to avoid adding an optimization like this to our baseline compiler (and, I assume, to an interpreter if we had one). We'd need to add some way of tracking a deferred array.new_data computation in our abstract evaluation stack, and that leaks pretty far into the compiler.

@jakobkummerow (Contributor)

I'd really prefer to avoid adding an optimization like this to our baseline compiler

Sure, we don't have it in our baseline compiler either. It's a functionally unobservable and hence strictly optional internal optimization, appropriate mostly for optimizing compilers.
The biggest issue with it is that recent toolchain evolution suggests that most or even all modules will need most or even all of their strings to be available in constant initializers, so there doesn't seem to be much problem left for this solution to solve.

@eqrion (Collaborator, Author) commented Jan 23, 2024

I'd really prefer to avoid adding an optimization like this to our baseline compiler

Sure, we don't have it in our baseline compiler either. It's a functionally unobservable and hence strictly optional internal optimization, appropriate mostly for optimizing compilers. The biggest issue with it is that recent toolchain evolution suggests that most or even all modules will need most or even all of their strings to be available in constant initializers, so there doesn't seem to be much problem left for this solution to solve.

Just so I understand correctly, are you saying that your recent toolchain experience indicates we'll need something more than globals? If so, should I push on the immutable array idea or do you have another idea?

@jakobkummerow (Contributor)

I'm saying that toolchains want to be able to use strings in globals. It's less about the character sequences themselves, it's more about things that transitively depend on them, such as java.lang.String wrapper objects, and ultimately class metadata and whatnot that has properties with (wrapped) string values.

Having a constant string.const instruction makes that easy and efficient. Globals that refer to strings can be initialized with initializers, so they can be immutable and non-nullable, which is great for optimizing functions that use them. When J2Wasm switched to this scheme (from their previous approach of initializing globals lazily, which sprinkled "(if (global.get $g == null) (global.set $g (...)))" checks all over the module), they saw massive improvements to module size and performance (double-digit percentage, IIRC, but I don't have exact numbers on hand).
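Spelled out, that lazy pattern looks roughly like this (illustrative; $compute_g stands in for whatever code builds the string):

(module
  (import "env" "computeG" (func $compute_g (result externref)))
  (global $g (mut externref) (ref.null extern))
  ;; every access pays a null check, and $g must stay mutable and nullable
  (func $get_g (result externref)
    (if (ref.is_null (global.get $g))
      (then (global.set $g (call $compute_g))))
    (global.get $g)))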
Since the actual act of executing constant global initialization is unobservable, this approach also gives engines freedom to potentially find ways to hide some of the work (on another thread, or in an idle time slot, or by being (partially?) lazy). FWIW, we aren't doing any of that in V8 yet, but I think having the option was nice.

In the absence of a string.const instruction, one way to get string constants is to use data segments and array.new_data and the string-from-array functions. Since the latter are imported, they aren't "constant", and spec'ing constant imported functions is probably infeasible, so global initializers are out with this approach. A possible workaround is to initialize all globals via the start function (or, equivalently, some other function that's called before anything else happens). The entire transitive hull of globals that indirectly refer to strings (which probably means: approximately all globals) then needs to be mutable and nullable though. (It's possible that this mutability/nullability could be limited to certain properties; it's unclear to me how feasible that is and to what extent it would even help.)
Replacing an eager start function with user-space lazy initialization is of course technically possible, but practically unattractive due to its code size and performance cost (see above).

The other way is to import the strings themselves as immutable globals. That maintains the ability to use global initializers for string-using globals and have them be immutable+nonnullable. The drawbacks are that (1) all strings must be eagerly created in JS before the Wasm module can be instantiated (so, again, no laziness; and to make it even worse, that eager code on the critical path is one-shot JavaScript), (2) the number of imports to process increases from hundreds to tens of thousands (and we may even have to raise the limit on allowed imports when sufficiently large apps start running into it), (3) toolchain support to create/optimize/manipulate these JS+Wasm bundles is yet to be implemented (and affects everything: e.g. merging modules will also have to merge their JS-provided string constants; dead-code elimination has to rewrite both files; etc).

I suspect that we will have to prototype both of these approaches in toolchains in order to determine which one is more viable (or even: whether either of them is viable at all).

I don't see how tables or immutable arrays would help much, as they don't change the fundamental equation; the only clear benefit they bring is reducing the number of individual imports (addressing drawback (2) above); but it sounds to me like the price they pay for that is that they require even more eager work to be done up front (the respective fromIterable operation). If your intuition says otherwise, I encourage you to prove it with an A/B benchmarkable pair of prototypes :-)

@eqrion (Collaborator, Author) commented Jan 31, 2024

Having a constant string.const instruction makes that easy and efficient. Globals that refer to strings can be initialized with initializers, so they can be immutable and non-nullable, which is great for optimizing functions that use them

Adding stringref, string.const and a string-constant-section to core wasm to improve string creation performance is definitely a possibility. However, it's a big addition that impacts systems like embedded that may not even have a host string type. It would be very convenient to the advancement of this proposal to find a way to avoid this. That's why I'm trying to find other smaller tweaks we could pursue. If nothing works here, then we could consider something like this.

In the absence of a string.const instruction, one way to get string constants is to use data segments and array.new_data and the string-from-array functions. Since the latter are imported, they aren't "constant", and spec'ing constant imported functions is probably infeasible, so global initializers are out with this approach. A possible workaround is to initialize all globals via the start function (or, equivalently, some other function that's called before anything else happens). The entire transitive hull of globals that indirectly refer to strings (which probably means: approximately all globals) then needs to be mutable and nullable though. (It's possible that this mutability/nullability could be limited to certain properties; it's unclear to me how feasible that is and to what extent it would even help.)

So if I read this correctly, this option is to make all (or most) globals mutable, initialize them at start, and use the 'string from an array from a data segment' initialization path. This definitely has the advantage of reducing the global import cost, but I'm unsure about the cost of the string creation. Like I said above, we really don't want to add that macro optimization to our baseline compiler, and the start function is one-shot Wasm, a great candidate for lazy compilation that skips the optimizer. The extra copy that would then happen seems unfortunate.

The other way is to import the strings themselves as immutable globals. That maintains the ability to use global initializers for string-using globals and have them be immutable+nonnullable. The drawbacks are that (1) all strings must be eagerly created in JS before the Wasm module can be instantiated (so, again, no laziness; and to make it even worse, that eager code on the critical path is one-shot JavaScript), (2) the number of imports to process increases from hundreds to tens of thousands (and we may even have to raise the limit on allowed imports when sufficiently large apps start running into it), (3) toolchain support to create/optimize/manipulate these JS+Wasm bundles is yet to be implemented (and affects everything: e.g. merging modules will also have to merge their JS-provided string constants; dead-code elimination has to rewrite both files; etc).

Agreed, importing each string as an immutable global seems like the per-global import cost would be prohibitive. That's why I was trying to think through an 'Option 3' where a module only imports a single immutable array which contains all the strings in the module, then the module's other global initializers could refer to that string array using a constant array.get operation.

Regarding 'laziness' of creating strings, what are you imagining? At least in SM in JS, we create the actual string objects when parsing the source (as far as I can tell). Does V8 do something differently?

@jakobkummerow (Contributor)

it's a big addition

It's a tiny addition compared to GC. I don't expect any system that doesn't implement GC to implement GC'ed strings.

The extra copy that would then happen seems unfortunate.

Agreed. All options aside from a string.const instruction incur unfortunate extra work.

we create the actual string objects when parsing the source (as far as I can tell). Does V8 do something differently?

V8 does that too: for string constants in JS source, it eagerly creates interned strings. So there's certainly a chance that we may, at least for now, get away with Wasm being like JS in this regard. I generally think it's desirable to design Wasm such that it can perform better than JS; otherwise what's the point of compiling to Wasm instead of compiling to JS?

@eqrion (Collaborator, Author) commented Feb 1, 2024

it's a big addition

It's a tiny addition compared to GC. I don't expect any system that doesn't implement GC to implement GC'ed strings.

Compared to GC, sure; but compared to this proposal it's a big addition that crosses into core wasm with a new valtype/section and raises the bar of general usefulness quite a bit. I would like to see this proposal move as fast as possible and avoid getting stuck in big core wasm discussions. Again, not ruling it out, but it seems expedient to look at other extensions first.

we create the actual string objects when parsing the source (as far as I can tell). Does V8 do something differently?

V8 does that too: for string constants in JS source, it eagerly creates interned strings. So there's certainly a chance that we may, at least for now, get away with Wasm being like JS in this regard. I generally think it's desirable to design Wasm such that it can perform better than JS; otherwise what's the point of compiling to Wasm instead of compiling to JS?

My goal with this proposal is for wasm to have parity (or very very close) with JS when it comes to JS strings. It'd be nice to have it perform better on JS strings, but that's not a hard requirement.

@eqrion (Collaborator, Author) commented Feb 29, 2024

I finally got around to prototyping and benchmarking this (source here).

I tested out four strategies:

  1. "imported globals" - Define strings in JSON embedded in JS, import through globals
  2. "data segments" - Define strings in a data segment, use array.new_data and JS string builtin to create the string and store in a private global
  3. "array from iterable" - Define strings in JSON embedded in JS, convert to immutable wasm array through new JS-API method, import the array
  4. "string section" - add a binary string section, using UTF-8

All of these were tested with a list of 100,000 random (generally short) strings.

Testing on my machines I got the following:

import_global = 45.41 ms (variance = 448.47)
data_segment = 232.70 ms (variance = 9.95)
array_from_iterable = 13.76 ms (variance = 0.02)
string_section = 13.28 ms (variance = 0.07)

These numbers are for the full script to parse and execute and include synchronous wasm compilation and instantiation (using our baseline compiler).

For uncompressed file sizes, I get:

import_global = 2.1 MiB
data_segment = 5.5 MiB
array_from_iterable = 0.8 MiB
string_section = 0.6 MiB

The data segments one is an outlier in runtime, and that's mostly due to the compile time for the big 'init' function, which does an array.new_data -> stringFromCharCodeArray -> global.set sequence for each string. That could be mitigated with asynchronous compilation and caching, but even with our baseline compiler it took a long time. The file size is also pretty bad. I could possibly optimize it further by storing the final strings in a private table instead of globals.

When it comes to the string section, it doesn't seem to buy anything for runtime. I took the strategy of allocating the strings and storing them on the instance upon instantiation. It does seem like the string section is more compact, but I haven't measured compressed sizes so I don't know.

As of right now, it seems like array_from_iterable is the sweet spot of minimal spec complexity and performance.

@kripken (Member) commented Feb 29, 2024

Thanks for the data @eqrion !

I think we may want to add something to the code size and startup time for "array from iterable" to make the comparison fully apples-to-apples. Atm that binary is

(module
 (type $strings (array externref))
 (import "" "strings" (global $gimport$0 (ref $strings)))
)

which means we have an array of strings. To access an individual string we'd need global.get + array.get + i32.const. In comparison, "imported globals" has

 (import "str" "0" (global $gimport$0 externref))
 (import "str" "1" (global $gimport$1 externref))
 (import "str" "2" (global $gimport$2 externref))
...

which means the strings are already separated, and accessing an individual string costs only a global.get. That is, some of the extra code size and startup time in "imported globals" is work that "array from iterable" will need to do as well (in order to actually use individual strings), but does not seem to be accounted for here. I'd guess at least the size of array.get + i32.const times the number of strings, but perhaps a little more, since if a string is used more than once we'd likely want to cache it in a global.

Apologies if I have misunderstood something in the code or explanation, which is definitely possible!

@eqrion (Collaborator, Author) commented Mar 1, 2024

Thanks for the data @eqrion !

I think we may want to add something to the code size and startup time for "array from iterable" to make the comparison fully apples-to-apples. Atm that binary is

Yeah, this is where comparing things gets hard. I was expecting that a toolchain using the 'array from iterable' approach would skip creating a global per string and do the array access scheme you mentioned. It's at the cost of a bounds check and indirection per use of the constant, but it avoids the cost of creating/initializing the globals.

That is, some of the extra code size and startup time in "imported globals" is work that "array from iterable" will need to do as well (in order to actually use individual strings), but does not seem to be accounted for here

The extra code size for 'imported globals' (compared to 'array from iterable') is from the import statements (otherwise they're pretty similar), and the extra startup time comes from imports not scaling well (requiring a full JS property lookup per import). I don't think those are things that 'array from iterable' will necessarily need to do (although I may be wrong, and accessing from a global instead of an array may be required).

I guess I could add a separately timed stage to the benchmarks that accesses each string constant once to capture the cost that each scheme implies? e.g. global.get vs array.get vs string.const. Would that be what you're interested in?

@kripken (Member) commented Mar 1, 2024

Yeah, this is where comparing things gets hard. I was expecting that a toolchain using the 'array from iterable' approach would skip creating a global per string and do the array access scheme you mentioned. It's at the cost of a bounds check and indirection per use of the constant, but it avoids the cost of creating/initializing the globals.

Definitely, yeah, there are multiple possible tradeoffs here, and comparisons get hard.

Though my concern about "array from iterable" isn't just runtime speed but also size - each use of a string gets larger. Enough appearances of a particular string constant and the binary could be bigger, maybe even big enough to offset any startup benefit to not creating those globals. But fwiw I did a measurement on a large Java testcase I have and most strings appear only once, so at least there "array from iterable" would be smaller.

I guess I could add a separately timed stage to the benchmarks that accesses each string constant once to capture the cost that each scheme implies? e.g. global.get vs array.get vs string.const. Would that be what you're interested in?

Yes, I think that's worth doing. Though maybe be careful not to put it all in a single big function - my worry is that this is one of those costs that gets "spread" around an entire codebase, while in a single big function the VM might do things like store the array in a local after the first load which would hide the true global.get cost on real-world code.

Also an option is to involve Java/Dart/Kotlin people in this discussion to get a more realistic picture of how they use strings atm. But just to try to present the point of view I've seen from the Java side in my work with them, we've worked very hard together (J2Wasm+Binaryen) to avoid runtime overheads that get spread around the codebase, by focusing on tweaking things like how itable accesses are done and trying to avoid every possible load there. So even an extra array.get per string access sounds risky to me. As background, in the Sheets use case mentioned in the V8 WasmGC blogpost, startup definitely matters, but I believe performance afterwards matters a lot more, so one global per string is likely the way to go there at least.

Btw, another option might be "struct from iterable", just replacing the array with a struct. We do know the number of fields statically per compilation, so that is possible, and while it would add some type section size, it would then avoid an i32.const byte by using an immediate, and also there would be no bounds check.
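A sketch of the module side of that idea; WebAssembly.structFromIterable itself would be a new, hypothetical JS-API function:

(module
  (type $strings (struct (field externref) (field externref) (field externref)))
  (import "" "strings" (global $strtab (ref $strings)))
  (func $use_string_1 (result externref)
    ;; the field index is an immediate: no i32.const and no bounds check
    (struct.get $strings 1 (global.get $strtab))))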

@eqrion (Collaborator, Author) commented Mar 6, 2024

Btw, another option might be "struct from iterable", just replacing the array with a struct. We do know the number of fields statically per compilation, so that is possible, and while it would add some type section size, it would then avoid an i32.const byte by using an immediate, and also there would be no bounds check.

That would be a nice solution, but I think we have an implementation limit of 10k struct fields which from the above discussion seems too small for string constants. Part of the motivation for using arrays instead of globals was the 100k import limit.

I'll try to get some data on the cost of accessing each string constant once for all of the different approaches.

@eqrion (Collaborator, Author) commented Mar 6, 2024

Btw, another option might be "struct from iterable", just replacing the array with a struct. We do know the number of fields statically per compilation, so that is possible, and while it would add some type section size, it would then avoid an i32.const byte by using an immediate, and also there would be no bounds check.

That would be a nice solution, but I think we have an implementation limit of 10k struct fields which from the above discussion seems too small for string constants. Part of the motivation for using arrays instead of globals was the 100k import limit.

I suppose if toolchains were open to splitting their string data into multiple structs as needed, it would scale well. It would be some more complexity for toolchains though. We also probably could increase the struct field size limit to 100k with a concrete use-case.

@eqrion (Collaborator, Author) commented Mar 13, 2024

Here's some quick data on adding a single use for every string constant, to compare overheads. I did put all of the uses into a single function. However, I'm only testing with our baseline compiler, so there will not be any optimizations that re-use array pointers, bounds checks, dead-code elimination, etc. It should be basically equivalent to the uses being spread out across different functions.

import_globals: 60.2ms -> 77.9ms
data_segments: 225ms -> 223.4ms (within noise during 200 iterations)
import_array: 13.7ms -> 33.9ms

The before/after numbers show the effect of compiling and running a function that loads every string for that strategy.

I don't have more data for the string section approach, as I haven't had time to implement a string.const instruction as well. I would guess it would be roughly the same cost as a single global.get as it would be implemented almost identically. So somewhere between the import_globals and data_segments seems like a likely cost.

@eqrion (Collaborator, Author) commented Mar 13, 2024

From looking at some profiles, I do believe we could speed up the imported globals approach. Most of the extra time there is spent in parsing the imports in the binary, then doing slow generic JS property lookups on the imports object. I think we could add a special case for using numeric indices on a dense JS array that bypasses a lot of that generic code.

However, that approach is still limited to 100_000 imports, as noted above.

I'm warming up to the idea of a structFromIterable operation because that seems like it could be faster than the imported array approach, and also be easier to use in constant expressions (array.get has bounds checks that can trap).

@eqrion (Collaborator, Author) commented Mar 13, 2024

@kripken For the J2CL use-case, how many strings are you seeing right now, and what would be a reasonable maximum limit? Would supporting up to 100k strings in a struct work for you? If you needed more, would splitting into several structs be a large burden?

@bashor Would you have a concern about having a 100k string constant limit?

@kripken (Member) commented Mar 14, 2024

@eqrion I see 73K strings in a large Java testcase. But I can't say if that is a reasonable limit, other codebases may well need more...

import_globals: 60.2ms -> 77.9ms
data_segments: 225ms -> 223.4ms (within noise during 200 iterations)
import_array: 13.7ms -> 33.9ms

The first and last lines seem to say that the imported array method has less startup overhead but a little more runtime overhead. The middle one though confuses me: how can data_segments have ~0 (within noise) runtime overhead for string accesses? Am I misreading that line?

@eqrion (Collaborator, Author) commented Mar 15, 2024

The first and last lines seem to say that the imported array method has less startup overhead but a little more runtime overhead. The middle one though confuses me: how can data_segments have ~0 (within noise) runtime overhead for string accesses? Am I misreading that line?

I was confused by that as well. I spent a good amount of time verifying that I didn't mess that one up originally and couldn't find anything. I'll try to swing back and look at it again.

@bashor commented Mar 18, 2024

@bashor Would you have a concern about having a 100k string constant limit?

We'll check it on some projects and come back to you.

@bashor commented Mar 20, 2024

@eqrion so far, the biggest number of string literals we've seen in real applications is ~14K.

@bashor commented Mar 20, 2024

It's hard to predict if 100K is enough, and for how long. I think the current number could be easily doubled within a year.

@bashor commented Mar 20, 2024

I'm wondering how strict these limits are, and if they could be increased in the future.

@eqrion (Collaborator, Author) commented Mar 22, 2024

I'm wondering how strict these limits are, and if they could be increased in the future.

It depends on which approach is taken:

  1. If we expect strings are imported as globals, there's a shared limit of 100,000 imports that all engines use
  2. If we expect strings are stored in a struct that's imported as a global (WebAssembly.structFromIterable idea from above), there's a shared limit of 10,000 fields for structs

With (2) I think it would be reasonable for implementations to increase their max struct size to 100,000 fields to support this use case. Going above that may get tricky, but 200,000 may be possible. I don't think engines want to dramatically increase the limit on number of imports.

Even if we had a limit of 100,000 fields for structs, it would still be possible for toolchains to split their string constants into multiple structs, and import multiple of them. It's just less convenient.

@kripken (Member) commented Mar 22, 2024

Yes, I think toolchains can fairly easily emit multiple structs here to stay under the 10,000 field limit. (This does require a whole-program decision, but even emitting a single struct requires that anyhow, so I don't see it as adding much work.)

@gkdn commented Mar 31, 2024

For Google Sheets, when we switched to the string-builtins spec, our production binaries got 442 KB larger for a total of around 14K string constants - that's an extra ~5%.
I'm estimating around 80-90 KB of the regression is due to the less efficient encoding of the constants as JSON. Every single constant also adds an extra ~25 bytes on average to declare the imports in the wasm module.

re. offload string constants to be part of JavaScript load:
IMO it should be part of the design goal here to make Wasm binaries self-contained. It is not good to add an extra large chunk of code to be downloaded/processed in the initial JS load; it is also not good to try to late-load those JS constants and add an extra roundtrip around Wasm initialization.

re. number of constants:
For Google Sheets, I have seen tests with 112K string constants.

@tlively (Member) commented Apr 1, 2024

@gkdn, can you share how the compressed size changes as well?

@gkdn commented Apr 1, 2024

When compressed, the regression is 38 KB / ~3.5%.

@osa1 (Contributor) commented Apr 5, 2024

For dart2wasm's use case of js-strings, @jakobkummerow's suggestion sounds really good to me.

As an example of how much the .js shim grows with strings, currently in this app with a lot of strings, the generated JS shim (with string literals) is 546,872 bytes, of which 518,837 bytes are for the strings (94%).

@jakobkummerow's suggestion solves this problem, and I don't see any downsides for dart2wasm as long as the number of imports and the name size limits are large enough.

@gkdn commented Apr 5, 2024

@osa1 Does Dart disallow invalid UTF-16 (like unmatched surrogate pairs)?

@sjrd commented Apr 8, 2024

The Scala compiler allows unpaired surrogate chars in string literals. However, from my experience, actual occurrences of such strings are exceedingly rare in practice (some tools surrounding the compiler, such as the code formatter, even choke on them). As long as there is some workaround available, even if not very performant, I believe it would be fine to restrict string constants to valid Unicode scalar value sequences.

@osa1 (Contributor) commented Apr 8, 2024

(I deleted my original comment as I was wrong)

Dart also allows unpaired surrogates in string literals. Example:

void main() {
  final s1 = "\u{D83D}";
  final s2 = String.fromCharCode(0xDE00);
  print(s1 + s2); // prints 😄
}

I also think these cases should be extremely rare, and we have the workaround of importing such strings in the .mjs file.

@jakobkummerow (Contributor)

Implementation feedback:

I. Design question: compile or instantiate?
There is some design wiggle room around the time when the constants are dealt with. Specifically, this could conceptually be part of WebAssembly.compile (like the functions in this proposal), or part of WebAssembly.instantiate.

Reasons in favor of WebAssembly.compile:

  • Consistency with the imported string functions.
  • Maybe some engine actually wants to embed string constants into generated code (and wants to create such code eagerly as part of the compile step).
  • Modules that get instantiated several times per compilation only need to do the validation once (imports in the magic " module must be globals of type immutable externref). Not sure how many such modules there are.

Reasons in favor of WebAssembly.instantiate:

  • Globals and functions are fundamentally different, so any attempt to treat them "consistently" might cause more confusion/contortion than simplification (which is typically the goal of consistent handling). (In a similar vein, @eqrion suggested above to have an opt-in flag that's not part of the builtins list but something separate -- it could also be consumed by a different phase.)
  • Engines likely need to delay allocation of string constants until instantiation anyway, because strings are heap objects and a heap on which to allocate is typically not available at compile time.
  • Contrary to string builtin functions, which are expected/required to get special compiler treatment for performance, the obvious implementation for string constant globals is no special handling in the compiler, so there's no need to have them available at compile time.

Personally I have no strong opinion on this, the question just occurred to me when I started implementing these constants. Both options are implementable.

II. Choice of the magic module name
Having a single-character name for the recognized imported module isn't just binary-size efficient, it is also convenient for implementations, because it's fast and simple to check (something like module_name.length() == 1 && module_name.chars()[0] == kMagicChar). If we had a compact import section format, this might be alleviated by engines switching to two-level internal storage as well, which would eliminate the need to check every single import's module name. I'm not sure how likely either of these things are (a new binary format might not find CG agreement, and implementations might find a flat list of imports more convenient).
We could pick ' instead of " to avoid the unsightly backslash in the text format. Personally, I don't think (import "string-constant" "Hello, world!") is tangibly better than (import "'" "Hello, world!"), so I wouldn't want to pay any non-negligible price for the more verbose form.

III. Implementation effort
Even easier than I realized previously, because import names are already converted to JS strings anyway, in order to perform the property lookups on the imports object, so replacing those lookups is trivially easy. Plumbing the flag from compile (see I.) to instantiate adds some boilerplate, but overall my patch is still at just ~70 lines.
There's some room for performance improvements now that we expect to have tens of thousands of imports, which leads to more effort, but that would apply to any of the approaches that define one global per string constant.

@eqrion (Collaborator, Author) commented Apr 9, 2024

Implementation feedback:

I. Design question: compile or instantiate? There is some design wiggle room around the time when the constants are dealt with. Specifically, this could conceptually be part of WebAssembly.compile (like the functions in this proposal), or part of WebAssembly.instantiate.

I think I have a slight preference for the flag being part of the compile phase. The reason is that it lets us know during eager compilation that, while the global is 'externref', it will only ever be a non-null string, and we could use that to eliminate some casts where the global is used.

But also, it's not a strong preference. I do have a strong preference for this being a separate kind of flag though.

II. Choice of the magic module name Having a single-character name for the recognized imported module isn't just binary-size efficient, it is also convenient for implementations, because it's fast and simple to check (something like module_name.length() == 1 && module_name.chars()[0] == kMagicChar). If we had a compact import section format, this might be alleviated by engines switching to two-level internal storage as well, which would eliminate the need to check every single import's module name. I'm not sure how likely either of these things are (a new binary format might not find CG agreement, and implementations might find a flat list of imports more convenient). We could pick ' instead of " to avoid the unsightly backslash in the text format. Personally, I don't think (import "string-constant" "Hello, world!") is tangibly better than (import "'" "Hello, world!"), so I wouldn't want to pay any non-negligible price for the more verbose form.

I think that if we can't get agreement on a compact import section format, using a single character might be okay, but I'd want to measure it first. I think a compact import section could let us optimize string imports very well by switching to a tight loop that just allocates strings and puts them into their final storage.

@jakobkummerow (Contributor)

I'm landing my patch in V8, to make it easier for our toolchain partners to experiment with this approach. Of course it's still part of the off-by-default --experimental-wasm-imported-strings feature.

For now, it uses {builtins: [...], importedStringConstants: true} as the opt-in flag, and requires the "module name" for string constants to be ' (i.e. a single-quote character). It's easy to change these details later if necessary.
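A usage sketch against the prototype as described in this comment (flag shape and magic module name are subject to change, as noted):

// On the wasm side, each constant is an import from the magic module "'",
// with the string's value as the import name:
//   (import "'" "Hello, world!" (global $hello externref))
const module = await WebAssembly.compile(moduleBytes, {
  builtins: ["js-string"],        // the imported string functions
  importedStringConstants: true,  // opt-in flag as landed in the V8 prototype
});
// no JS-side string table needed: the engine populates these globals itself
const instance = await WebAssembly.instantiate(module, /* other imports */ {});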

@eqrion (Collaborator, Author) commented Apr 10, 2024

I'm landing my patch in V8, to make it easier for our toolchain partners to experiment with this approach. Of course it's still part of the off-by-default --experimental-wasm-imported-strings feature.

For now, it uses {builtins: [...], importedStringConstants: true} as the opt-in flag, and requires the "module name" for string constants to be ' (i.e. a single-quote character). It's easy to change these details later if necessary.

Cool, I hope to have a quick PR up for this soon too. I'll just use the same module name as well for now.

hubot pushed a commit to v8/v8 that referenced this issue Apr 10, 2024
Per the discussion at:
WebAssembly/js-string-builtins#13
this patch implements imported externref globals that can be
requested to be automatically populated with string constants,
with the string's value equaling the respective import's name.

Bug: v8:14179
Change-Id: Ie86dc28b5a337c64ef03cb879e45adf340001a00
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/5438375
Commit-Queue: Jakob Kummerow <[email protected]>
Reviewed-by: Matthias Liedtke <[email protected]>
Cr-Commit-Position: refs/heads/main@{#93301}
@eqrion (Collaborator, Author) commented Apr 10, 2024

I've now adopted the imported string constants approach described above in the overview. I will leave this issue open for a while in case this approach needs refinement or we want to look at another approach.

@gkdn commented May 16, 2024

Collected some data from Google Sheets:

Custom section:  4326288 (compressed 1137269)
Magic-import:    4048402 (compressed 1113046)
string.const:    3895540 (compressed 1103698)

The regression over string.const is likely coming from:

  • ~65K due to constant definitions (13K constants, 5 bytes per definition)
  • ~85K due to the impact on global indices (13K imports taking over the smallest indices, as imported globals need to come first)

@eqrion (Collaborator, Author) commented Jun 5, 2024

I'm going to close this as I believe we have an acceptable solution now.

@conrad-watt (Contributor) commented Jun 13, 2024

I'd like to reopen the discussion a little bit - recall that in the in-person meeting I asked whether a modified scheme like the following would be possible:

  • instead of directly encoding each string as an import name, interpret each import name as an index into a custom section which holds a table of string bytes
  • the compile-time flag remains the same - if a module is compiled with this flag set, the "global imports" turn into globals containing externrefs representing the relevant strings encoded in the custom section

Now I realise that there's a specific advantage of this alternative scheme: the custom section can use UTF16/WTF16 encoding. In the current scheme where strings come from the import names directly, a reencoding step seems needed to go from UTF8 -> WTF16, which would need a copy. Also, since UTF8 can't represent all WTF16 strings, producers will need a fallback. If the custom section in the alternative scheme directly encodes UTF16/WTF16 strings, there's no reencoding step or producer gap (with WTF16), and engines could even in principle directly use the allocated space of the custom section as the backing store for the relevant string.
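Concretely, the difference between the two encodings might look like this (hypothetical index encoding and section layout):

;; alternative scheme: import names are indices into a custom section
;; that holds the WTF-16 string payloads
(import "'" "0" (global $s0 externref))  ;; string 0 of the custom section
(import "'" "1" (global $s1 externref))  ;; string 1 of the custom section
;; current scheme: the import name is the payload itself
(import "'" "Hello, world!" (global $hello externref))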

I realise that custom section based designs have been discussed above, but they seem to involve host code parsing the custom section and feeding it back into the module's imports, rather than combining with the "compile time flag" idea that's been newly introduced.

@jakobkummerow (Contributor)

@conrad-watt nice idea, two comments:

(1) Wire bytes size will be larger. The current approach needs (on top of the import's "module name", whatever that'll end up being) the length of the string and the string payload itself (as utf-8). Your idea needs the length of the index string, the index string, the length of the actual string (assuming a vector-of-vectors encoding for the custom section like existing similar sections), and the string payload itself; so depending on how exactly the indices are encoded, that'll probably add around 3-6 bytes per string constant.

(2) I once tried to implement string constants in wire bytes to be used as backing stores for the strings created from them, but had to give up on the attempt because sections aren't two-byte aligned, so for the time being I consider this an entirely hypothetical benefit that won't actually materialize in practice. (In many common languages, utf-8 is also much shorter than wtf-16, so module producers may well opt against wtf-16 for that reason if given a choice.)

@fitzgen (Contributor) commented Jun 13, 2024

Your idea needs the length of the index string, the index string, the length of the actual string (assuming a vector-of-vectors encoding for the custom section like existing similar sections), and the string payload itself; so depending on how exactly the indices are encoded, that'll probably add around 3-6 bytes per string constant.

IIUC, I think we could cut this down to just

  • index into the custom section
  • length of this string

for each string constant if the custom section is a concatenation of all the strings, rather than a vector of strings.

This is still one more LEB128 than the current encoding, and therefore would have that additional code size overhead, but maybe that is worth it to avoid the re-encoding issues and to gain WTF-16 compatibility?

@conrad-watt (Contributor) commented Jun 13, 2024

IIUC, I think we could cut this down to just

  • index into the custom section
  • length of this string

for each string constant if the custom section is a concatenation of all the strings, rather than a vector of strings.

Would this allow string global constants to reuse each others' bytes by overlapping their declared ranges in the custom section (e.g. substrings)? I don't know how much this would come up in practice, but it seems an interesting compression "trick".

@fitzgen (Contributor) commented Jun 13, 2024

Would this allow one string global constant to be defined as a substring of another by overlapping their declared ranges in the custom section? I don't know how much this would come up in practice, but it seems an interesting compression "trick".

Yes, good point. If you had both of the strings wu-tang and tangible, you'd be able to use the custom section wu-tangible and slice into it as appropriate to define each string constant.

@sjrd commented Jun 13, 2024

On our main test suite, encoding strings as WTF-8 instead of WTF-16 (in the data segment) reduces the overall code size (not just size of the strings) by almost 6% (400 KB out of 6800 KB). That's a test suite that contains a non-trivial amount of non-ASCII code points in constant strings too, since it contains a number of test cases for string algorithms. I'm fairly confident the amount of non-ASCII code points in that codebase is above the average of production builds.

I don't think we want an encoding that mandates U/WTF-16. That would be a very significant code size regression by design.

@ajklein (Contributor) commented Jun 13, 2024

Following on @sjrd's point, I believe it's uncommon (and not recommended) for JavaScript source to be served over the web in UTF-16 format; instead, UTF-8 is preferred. See, for example, Mozilla engineer & character encoding expert Henri Sivonen's post Always Use UTF-8[...], or this comment in Chromium's code for streaming JS parsing (which doesn't even seem to support UTF-16!).

While this doesn't imply that Wasm must do the same, it shows that UTF-8 is a very common way to send JS strings over the wire, and it's reasonable to expect JS engines to be adept at dealing with them in that encoding.

@fitzgen (Contributor) commented Jun 13, 2024

FWIW, the concatenated/overlapping trick could be used with utf-8 too, although you'd have to validate that the indices fall on USV boundaries.

@jakobkummerow (Contributor)

IIUC, I think we could cut this down to just

  • index into the custom section
  • length of this string

How would you encode that in an import, and how many bytes would that take?

@fitzgen (Contributor) commented Jun 14, 2024

How would you encode that in an import, and how many bytes would that take?

If not extending core Wasm import encoding, you'd need an encoding scheme that is valid utf-8. This is going to be less dense and/or have a more restricted range than LEB128s.

Just brainstorming, haven't thought deeply about this, but: I imagine you could use ' as the import module (same as the current scheme) and then have the import name be two scalar values, one for index and one for length, where

  1. if the USV is in the range 0x0..0xd7ff then the decoded integer value is the USV itself
  2. if the USV is in the range 0xe000..0x10ffff then the decoded integer value is USV - 0x800 (the width of the surrogate gap 0xd800..0xdfff)
  3. (the above covers all cases, since that covers all valid USVs and we know that import names are always valid utf-8)

This is basically using the USV as the integer value directly when possible, but because USVs are not contiguous, we subtract the size of the non-contiguous hole when we have a USV larger than the hole.

This has the advantage of a pretty dense encoding, at most 4 bytes per integer, that is also pretty easy and fast to encode and decode. The downside is that it cannot represent values larger than 1_112_063¹. But that limitation seems like it might be okay? How often do we expect to have string literals or a custom string constant section longer than that? Especially given that the producer toolchain can do some kind of prefix- and suffix-sharing compression scheme on the custom section?
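A sketch of that mapping as code (assuming a surrogate gap of 0x800 code points, per the formula above):

// Map a non-negative integer to a single Unicode scalar value and back,
// skipping the surrogate range 0xD800..0xDFFF (0x800 code points wide).
const GAP = 0x800;
const MAX = 0x10FFFF - GAP; // 1_112_063, the largest encodable integer

function intToUsv(n) {
  if (n < 0 || n > MAX) throw new RangeError("integer out of encodable range");
  return n < 0xD800 ? n : n + GAP; // shift past the surrogate hole
}

function usvToInt(usv) {
  return usv < 0xD800 ? usv : usv - GAP;
}

// e.g. an (index, length) pair becomes a two-character import name:
const importName = String.fromCodePoint(intToUsv(123456), intToUsv(42));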

Regardless of the exact encoding scheme for an integer chosen, we could further compress them by using a delta encoding, where we transform this series of (index, length) pairs

(0, 10)
(7, 5)
(10, 3)

into

(0, 10)
(7, -5)
(3, -2)

This delta encoding would perhaps alleviate some of the constraints imposed by the max-encodable-integer limitation, but more importantly would reduce most encoded integers into a single byte since it would be rare that consecutive integers would have a larger delta than what can be encoded in a single byte, especially if the producer toolchain sorts them.
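A sketch of that transform, matching the example above (negative deltas would additionally need a signed variant of the integer encoding, e.g. a zigzag scheme):

function deltaEncode(pairs) {
  let prevIndex = 0, prevLength = 0;
  return pairs.map(([index, length]) => {
    const delta = [index - prevIndex, length - prevLength];
    prevIndex = index;
    prevLength = length;
    return delta;
  });
}

deltaEncode([[0, 10], [7, 5], [10, 3]]);
// => [[0, 10], [7, -5], [3, -2]]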

Other ideas for posterity:

  • If the max-encodable-integer limit is too onerous, we could reserve the maximum USV value as a continuation bit, saying that this USV should be shifted and the next USV then added to this one (same as LEB128). This makes encoding/decoding more complicated but lifts that limitation.

  • We could always use something like base64 VLQs that source maps use. That has the advantage of being an existing, understood encoding, but has the disadvantages of being a not-very-dense encoding and also pretty expensive to encode and decode.

Anyways, I realize this is all getting a bit out there. I think it could work and has some pretty nice properties, but I also understand if it feels like too much complexity. I definitely don't want to side track and bog down this conversation or proposal if people don't think these ideas are worth considering.

Footnotes

  1. If I didn't introduce any off-by-one errors somewhere along the way.

@eqrion (Collaborator, Author) commented Jun 14, 2024

Agree with @jakobkummerow, SM is in a similar situation and can't really use the wire-bytes as the backing store for the strings. We also likely wouldn't want to, because that would tie the lifetime of the strings to the bytecode and keep it alive longer than it needs to.

Also agree with @ajklein and @sjrd, that UTF-8 is going to be size superior to UTF-16 in almost all use-cases.

Regarding the ability to subslice into the custom section for compression, I think that's also possible with the current import field scheme using the String.substring builtin. I also am not sure how beneficial it is in practice.

It seems like the only advantage left would be the ability to encode unpaired-surrogates directly (using WTF-8/16). This would mean we'd have to take a JS-API spec dependency on the WTF spec, which is not standards track anywhere AFAIK. It also seems like an easy thing to workaround if you need it, and I haven't heard too many complaints about it.

I know there was a complaint at the in-person meeting about how the import field names are now 'semantic' with this proposal, but I don't think using indices into a custom section make this better.

So overall, I'm inclined to keep the current string constant feature, but I would like to hear more if folks want to discuss further.

@conrad-watt (Contributor)

I think one other point to bring up would be that the custom section approach might be able to handle longer strings, if a large import name would blow some implementation-specific limit.

I've made my peace with the current approach so I won't push hard if the custom section-based approach isn't preferred.

@vouillon (Contributor)

The regression over string.const is likely coming from:

* ~65K due to constant definitions (13K constants, 5 byte per definition)

* ~85K due to impact on global indices (13K imports taking over smallest indices as imported globals needs to be first)

Would it make sense to have a way to offset the indices of some imports, leaving a hole for regular definitions, so that imports do not consume all the smallest indices? In the text format, this could correspond to allowing imports and regular definitions to be interleaved.

@eqrion (Collaborator, Author) commented Jun 17, 2024

I think one other point to bring up would be that the custom section approach might be able to handle longer strings, if a large import name would blow some implementation-specific limit.

If we run into the implementation-defined limits for import names, I would support increasing them and don't think we'd run into any issue with that.

Although, sort of related to the length limits issue you raised, it would be nice if these string payloads were at the end of the binary to allow compilation of the function bodies to start sooner. I don't think that's a good enough reason, though, to start defining a custom section for this (given the complexity concerns above).

@vouillon

Would it make sense to have a way to offset the indices of some imports, leaving a hole for regular definitions, to avoid that imports consume all the smallest indices? In the text format, this could correspond to allow interleaving imports and regular definitions.

At least for spidermonkey, we take advantage of the fact that imports come first very heavily. I'd prefer to avoid changing that unless we need to.

@eqrion (Collaborator, Author) commented Jul 11, 2024

I think one other point to bring up would be that the custom section approach might be able to handle longer strings, if a large import name would blow some implementation-specific limit.

In the event that happens, I believe we could easily expand the implementation-specific limit.

I'm going to close this again as I think our current approach is preferable to the alternatives that were brought up after the in person CG meeting. If folks think of another idea, let's talk about it in a new issue.
