Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Substrings: slices vs. copies #1

Open
jakobkummerow opened this issue Aug 4, 2023 · 1 comment
Open

Substrings: slices vs. copies #1

jakobkummerow opened this issue Aug 4, 2023 · 1 comment

Comments

@jakobkummerow
Copy link
Contributor

jakobkummerow commented Aug 4, 2023

The WebAssembly.String.substring operation in the draft spec doc creates a string S2 that's a substring of another string S1. V8 originally implemented this by reusing its existing machinery for creating string slices, i.e. S2 would only store a pointer to S1, a start offset, and a length, making the substring operation itself very cheap because it doesn't need to copy any characters, and making S2 very memory-efficient (especially when it's long, so that many characters are shared with S1).

Like almost any engineering technique, this approach is a tradeoff with pros and cons. In this case, its drawback is that a relatively short string (S2) can be keeping an otherwise-unreachable other string (S1) alive, which might be much bigger. We have just encountered a real-world use case where this happens a lot and has detrimental effects on memory consumption: user-space JSON parsing. Any string values in the decoded data keep the entire JSON source string in memory when they are represented as string slices. (Note that the specifics of the JSON format are irrelevant, the same would happen for any other string-based serialization format where string values occur as substrings of the serialized data, JSON just happens to be a very common case of this.)
For comparison, JavaScript doesn't run into this problem, because the implementation of JSON.parse never creates sliced strings, because it can make a very reasonable assumption that the JSON source string is meant to be thrown away when the parsing operation is complete.

As a short-term mitigation, we have just changed V8's implementation of WebAssembly.String.substring to make full copies when the substring is at most half as long as the original string.
A possible longer-term engine-side mitigation might be to build a tracking mechanism across the entire managed heap that figures out which long strings are being used by which slices, and can then decide whether converting these slices to copies would be overall beneficial or not. That's fairly heavyweight machinery for solving a rather limited problem though, so while I'd be excited to see this prototyped, I'm not convinced that it would actually be a reasonable choice for engines to spend the engineering hours to build such machinery and, more importantly, the runtime CPU cycles to operate it.

As a proper solution, it would seem reasonable to me to add a way to let Wasm modules explicitly control the desired behavior:

  • One option for this would be to have two variants of the substring operation: substring_allow_shallow and substring_force_copy (illustrative names, to be bikeshedded).
  • Another option would be to have an explicit "make this a simple/flat/direct string" operation. That could be more generic/flexible (e.g. it could also be used to flatten ropes resulting from concatenation), but also seems harder to spec/name, because it would only affect implementation internals, with no change to observable values. It might also be somewhat harder to implement, because for maximum efficiency engines would have to fuse it with the preceding operation (to avoid allocating a short-lived intermediate string). For optimizing compilers that's not a big deal, but not every engine wants to implement an optimizing compiler, and not every function always runs in optimized mode.
  • There may well be additional options to accomplish the same goal.

I don't feel strongly about how this will be solved, I'm just saying that I think it would make sense to have some improvement over the initial design sketch in this regard.

Without new operations, copies of string slices can be forced by converting the string to an array and then back to a string (using W.S.toWtf16Array + W.S.fromWtf16Array); engines could conceivably learn to recognize this pattern and shortcut it to avoid the double copy (but not shortcut it too much: if they avoided both copies, that would defeat the point). IMHO this would not be a pretty solution though; it feels similar to dubious JavaScript performance tricks like "temporarily install this object as the prototype of some other dummy object just to make the JS engine to switch its internal representation into a certain mode".

(Same discussion for stringref: WebAssembly/stringref#63)

@eqrion
Copy link
Collaborator

eqrion commented Dec 5, 2023

Sorry for the delay in responding.

I am unsure about whether it makes sense to allow hinting for 'slice' vs 'copy'. The optimization of forcing an eager copy when copying a small percentage of a large string seems fairly easy to do, so I'd lean towards just expecting engines to do that. It seems like a nice refinement of existing optimizations.

If we had a hint, I would wonder if engines would still do this check (as it's cheap) to see if they want to override the hint for cases where toolchains get it wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants