Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Names section extensible #984

Merged
merged 4 commits into from
Feb 17, 2017
Merged

Make Names section extensible #984

merged 4 commits into from
Feb 17, 2017

Conversation

lukewagner
Copy link
Member

This PR is intended as a first step to addressing #750 which is to make the names section extensible so that more kinds of names can be added over time. While in theory new types of names could be added as new top-level custom sections, that seems a bit messy. So instead, this PR breaks the names section into subsections, each of which has a code, and byte length.

To start simple (and allow a quick, targeted update to tools/engines), this PR doesn't add any new types of names; just reformats function and local names. Since locals aren't the only thing that could be named inside a function (blocks being the other obvious thing), locals are made their own subsection.

| Name Type | Type | Description |
| --------- | ---- | ----------- |
| name_len | `varuint32` | number of bytes in name_str |
| name_str | `bytes` | UTF8 encoding of the name |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these have to UTF8 -> UTF16 properly? Be valid JavaScript properties?

Or should we just have a byte array like elsewhere, and restrict the name section in JS.md like everywhere else? That seems better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, I guess we could just leave them plain old byte arrays, but there is a difference. For import/export names, they might be strings, they might be some custom data structure. In this case, I think they're really supposed to be strings, so I think it's beneficial (and consistent with the state before this PR) to say they are UTF8 strings. Also, they don't end up being reflected in JS as property names.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It kinda breaks the "embedder makes sense of byte arrays" approach we had elsewhere. It's weird to mention UTF anywhere but in JS.md.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's more like "embedder makes sense of imports and exports"; in this case they're actually supposed to be names (so strings).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC what we had before was a hacked-up names section that got us limping along, with no real intent of keeping it around. Maybe this would help me understand better: what's the intent of this refactoring? I thought you were going for something that'll stand some test of time, so without designing DWARF / TROLL I'm trying to see what we can minimally get "right". Is that out of scope? Happy to back away if so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the scope of the names section is just assigning (string) names to various index spaces for the benefit of binary->text rendering. If we want to do something more in the direction of source-level debugging, I think that will require a new section with a lot more structure than a collection of arrays of names; I can't imagine that being able to encode arbitrary binary contents in the names of the names section will be sufficient.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'm not proposing we broaden the scope to more than strings. What I was trying to get to is: are we trying to get this string mapping thing right? Or is this a throwaway thing to tie implementations over?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have a slight preference for not mentioning encodings here either (keep that out of the core spec altogether). Rather say in the JS API spec that names are assumed to be UTF-8 encoded and handled analogously to import/export names.

| name_payload_len | `varuint32` | size of this subsection in bytes |
| name_payload_data | `bytes` | content of this section, of length `name_payload_len` |

Since name subsections have a given length, unknown or unwanted names can be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"unknown or unwanted name types" ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

The function names subsection is a `name_list` which assign a name to each
function by [function index](Modules.md#function-index-space). (Note: this
assigns names to both imported and module-defined functions.) The count
may differ from the actual number of functions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A count that's bigger or smaller doesn't seem useful?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just being consistent with what it was before; I don't have a really strong opinion here.

| local_name_str | `bytes` | valid utf8 encoding |

| count | `varuint32` | count of `name_list` in func_locals |
| func_locals | `name_list*` | sequence of `name_list`, one per function |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you can't skip a function? Or to do so you need an empty name_list?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, you'd set count=0 (similar to what you had to do before when the locals were embedded inside each function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems pretty brittle when combined with "you can avoid specifying functions if you want to". At this point why not use a map of { function_number => name } ?

I agree forcing all of them to be present is suboptimal (mixing debug and release is nice, especially if we consider transfer size). I just don't want us to end up with some super weird encoding, even if name sections aren't really standard right now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A map is more general, but more complicated too; I think the normal case here is all-or-nothing (so either no local subsection at all or a full one).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I like the shift in index space. And in fact, there is a use for naming locals for imports: parameter names.

More generally, I wonder whether it's optimal to require dense lists for the name section. Once we extend this to other names we might encounter lots of anonymous entries (consider the type section). And even for functions I can imagine scenarios where you have large modules but only partial name information, e.g., when merging several modules into one for deployment (what I called "static linking" the other day). An index->info map seems more flexible and potentially more compact, while not being all that complex.

@lukewagner
Copy link
Member Author

@jfbastien @rossberg-chromium Ok, fair enough. How about this? (Sorted by index to attempt to keep things somewhat simple.)

@sunfishcode
Copy link
Member

There is value to mentioning UTF-8. The purpose of the name section is display, but how would a general-purpose tool display a name if it doesn't know the encoding? Specifying that it's UTF-8 for JS isn't sufficient for fully general tools, since WebAssembly may be used in non-JS contexts too.

@jfbastien If we wanted to get this string mapping thing right, what are the other concerns?

@jfbastien
Copy link
Member

Sure, this looks good.

@sunfishcode that argument applies for all strings in WebAssembly. I don't think this is a worst situation, and I like that it's consistent. With this we don't need to check UTF-8 encoding validity, we don't need to specify Unicode version, normalization, replacement character matching, etc. We just defer to the embedder to behave the same.

@ghost
Copy link

ghost commented Feb 14, 2017

I don't think naming locals will help enough to bother, not enough to bake this into a spec. Assuming that locals are even retained, they are reused to hold different definitions so might hold value x for some span and y for another span etc. There is not going to be a nice mapping from C variables to wasm locals, not without placing constraints on the use of the locals that would be unfortunate. From what I see it works better to name the definitions and similarly if their type were annotated (e.g. char *src).

Also I think we should be dog-fooding wasm as a pluggable module to do the source code presentation and that would include the use of the names section. We should be defining a sandbox in which a pluggable wasm module can safely run to present the code. That decouples this from requiring agreement here, or if there is some proposal here and agreement then that could be just one solution and does not bind other people or future work and it could be a simple throwaway solution to demonstrate the use. An added advantage would be that it would save duplicated work across the web browser teams, only one would need to develop this if they can agree or they could all pitch in to contract out the work, or just leave it to the web community.

@sunfishcode
Copy link
Member

sunfishcode commented Feb 14, 2017

@jfbastien My specific concern is ending up with a situation where tools (which aren't JS-specific) end up looking at bytes and guessing their encoding using heuristics in order to convert them to display, an activity which can be error-prone. I'm open to other approaches that address this concern, or to reasons why this shouldn't be a concern.

I'm not a Unicode expert, but it seems like the nature of WebAssembly's ~~~text~~~name section is such that specifying UTF-8 for it addresses this concern without complexity. Validation errors are ignored in the name section, so it doesn't matter which Unicode version producers or consumers use. Consumers can just ignore names that they themselves can't handle. Similarly, consumers could normalize, or not, do things with replacement characters, or not, or many other things, according to their own needs. This is quite different from import/export names, which are involved in program semantics.

@jfbastien
Copy link
Member

@sunfishcode I agree that this is a concern for external tooling. My point is simply that it's the same problem for import / export: external tooling also has to play "guess the encoding", and here different consumers can decide to fail validation if, say, they only support ASCII.

I therefore think the name section should do the same as the rest of wasm and say "byte string". Leave it up to the embedder to give it meaning. We could also specify UTF-8 everywhere, and that change could come now-ish or later.

@RyanLamansky
Copy link

Not having a text encoding in the name section seems to defeat the purpose of having it. More importantly, I don't think the binary encoding document should pre-define any custom sections. That sort of thing belongs in the embedder's spec, where useful information like how to actually read the names won't be criticized.

I remain bothered by the team's fear of UTF8. It's 2017. I was able to connect my Yamaha AV receiver to a wifi network that uses the key emoji as its SSID. I think any platform that's sophisticated enough to read a WASM file can at least drop unknown characters from a UTF8 string.

@ghost
Copy link

ghost commented Feb 14, 2017

From some points made by @tabatkins on another w3c process, there appears to be some focus on extensibility via custom code, and some interesting examples were given such as:
https://www.smashingmagazine.com/2016/03/houdini-maybe-the-most-exciting-development-in-css-youve-never-heard-of/
https://developers.google.com/web/fundamentals/getting-started/primers/customelements

With the web going in that direction, and wasm potentially offering solutions on the code side, surely the wasm text format and the names used should be customizable. Could we do this from the start, or admit that whatever is proposed now is a throwaway.


The sequence of `local_name` assigns names to the corresponding local index. The
count may differ from the actual number of locals.
where a `name` is encoded as:
Copy link
Member

@rossberg rossberg Feb 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: How about calling such a pair a naming for more clarity, in contrast to a plain name which intuitively is just a string.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, done.

Copy link
Member

@rossberg rossberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm with suggestion

@lukewagner
Copy link
Member Author

@jfbastien @sunfishcode I do see how the toolchain would care about import/export strings in the same way they care about the name strings; both are things the toolchain would want to programmatically manipulate. So that would make the encoding of all kinds of strings part of the ABI (but still not necessarily part of core wasm).

@rossberg
Copy link
Member

@Kardax, there is value in keeping the dependencies for a standard minimal, especially for a low-level standard like Wasm. There also is value in avoiding restrictions that are immaterial to the standard itself, especially if we want to make it as universally useful as possible. Certain environments are free to impose additional assumptions, but I'm unconvinced that there is enough reason to lower those into the core spec.

If we officially required UTF-8 for things then it would imply the need to check. That would be a cost you impose on everybody every time, even when and where it doesn't matter.

@tabatkins
Copy link

Not having a specified encoding is also a cost you impose on everybody every time. You have to roll the dice every time that you're guessing the encoding correctly, and people will screw it up - the experience of every single example of this kind of thing ever in the entire world shows that.

The world has standardized on a single text encoding. The entire web utilizes and enforces it, except where we have legacy content preventing that. Bite the bullet and enforce the encoding; future users won't thank you, because they won't run into any problems that cause them to realize there was ever a possibility of an issue, but isn't that the greatest thanks of all?

@sunfishcode
Copy link
Member

@lukewagner It's desirable that tools like wabt's wasmdump be able to dump any module, without having to be aware of what ABIs the module may be participating in. Display is the entire purpose of the name section, and display requires an encoding.

I understand the reasons why wasm doesn't want to require UTF-8 validation in import/export names, though it it's definitely a tradeoff. Assuming the current direction, I suggest wasm at least consider adding a non-normative recommendation that, when import/export names represent strings, they be encoded as UTF-8.

Specifying UTF-8 for the name section, and suggesting it for imports/exports, do not add costly dependencies to wasm. They do not force wasm to say anything else about Unicode or think about any of the scary parts of Unicode. They do have the potential to reduce the scary unknown-encoding and mixed-encoding scenarios in the wasm tooling ecosystem.

@tabatkins
Copy link

@jfbastien said:

@sunfishcode that argument applies for all strings in WebAssembly. I don't think this is a worst situation, and I like that it's consistent. With this we don't need to check UTF-8 encoding validity, we don't need to specify Unicode version, normalization, replacement character matching, etc. We just defer to the embedder to behave the same.

Note that none of the list of concerns here (specifying "Unicode version, normalization, replacement character matching, etc.") have any relevance to requiring names be encoded in UTF-8. Just look at how web standards use strings - they just do codepoint-by-codepoint matching, and it's up to the authors to ensure different users are writing strings with the same normalization, etc. We're not looking for maximum human-friendliness here, just basic usability.

The only thing that needs to be done is verify that the bytes form a valid UTF-8 byte sequence, which is extremely cheap and simple to do, and then handle errors in whatever way you find most appropriate.

@jfbastien
Copy link
Member

@tabatkins easy enough to implement in JS quickly :)

My argument is purely one of consistency: UTF-8 everywhere, or byte strings everywhere. We're already mostly consistent with varint everywhere, I like UTF-8, but I don't think that change should be in this patch since the current focus is changing the names section.

@sunfishcode
Copy link
Member

@jfbastien UTF-8 is already present in the document. As it now stands, this PR is now proposing to remove the mention of UTF-8. As long as this is the case, it's appropriate to discuss it here.

@jfbastien
Copy link
Member

@sunfishcode github hides this discussion, which addresses your point: #984 (comment)

To recap:

If this isn't a throwaway, as the names section was before, then let's Do It Right. Either we UTF-8 everywhere, or we make this consistent with the rest of the spec. UTF-8 snuck in outside of JS.md because this was throwaway, so precedent in names section is moot because it wasn't designed. I'm fine with either approach, but I want to keep changes focused.

@sunfishcode
Copy link
Member

@jfbastien I myself don't recall anyone saying the name section was a throwaway at any point in its design. That certainly wasn't my understanding, at least, so I perceive this PR, which is otherwise unrelated to character encodings, as sneaking in a character encoding change, which I find surprising and unfocused.

@rossberg
Copy link
Member

rossberg commented Feb 15, 2017 via email

@RyanLamansky
Copy link

@rossberg-chromium

Unicode libraries are notorious for causing bloat which we may not want to impose on all engines in all environments.

So don't use one. The UTF-8 algorithm is extremely simple. A a few lines of code can parse the code points and throw out anything > 127 in a single pass. "Universal" WASMs will have to limit themselves to the supported characters, but at least this way a single file can run everywhere.

If we don't specify encoding, the industry has to coalesce around a de-facto standard (which will probably be UTF8 or ASCII anyway) or WASM sources need to make multiple builds to accommodate every embedder's specific encoding decision, which means some embedders will never get some WASMs.

@rossberg
Copy link
Member

rossberg commented Feb 15, 2017 via email

@jfbastien
Copy link
Member

@rossberg-chromium UTF-8 != Unicode. UTF-8 is super simple (encoder and decoder).

But you're all making my point very strongly. Let's focus this issue on names section, and move the UTF-8 discussion to its own issue. I've filed #989 for the discussion.

@tabatkins
Copy link

Not having a specified encoding is also a cost you impose on everybody every time.
@tabatkins, Wasm itself never interprets any strings whatsoever. The only
time the encoding matters is when an external tool wants to display them
to the user in some given environment. And those (e.g. JS embeddings)
are free to impose additional encoding restrictions/assumptions as they see
fit. This does not imply the need to guess anything, they can still choose
to reject a module that does not meet their requirements. Or degrade
gracefully, as with any other error, especially in the name section.

Right, communication is the only time encoding matters. If you're just passing data around inside your program, you can do whatever you want.

And the world has settled on a single correct encoding for strings - UTF-8. Anything else you can possibly use is a mistake (but it's still done, and it causes tons of confusion and damage). Mandating its use (heck, even just as an informative note - I don't know precisely what y'all do with these names) is helpful to prevent people from making bad assumptions.

Trust everyone who's ever had to deal with encodings - having to guess at an encoding is not a situation you want to end up in. You'll get it wrong, you'll get mojibake, it's a bad time for everyone. New web technology mandates UTF-8 for a reason.

But WebAssembly is not limited to the web or other rich user-facing
platforms.

I know, I was pointing to examples of places doing it right. Web standards have learned the lesson - if you're communicating text between two parties for any reason, mandate UTF-8.

Unicode libraries are notorious for causing bloat which we may
not want to impose on all engines in all environments. This already is a
problem for JavaScript, e.g. V8 has some embedders that have a hard time
swallowing the unicode library, and we used to provide options to omit it
(creating a technically incorrect JS implementation). And with Wasm being a
much slimmer language, the relative overhead will be much more significant.

And I think I've found the confusion - you're confusing "encode your strings in UTF-8" with "include an internationalization library". The latter is as relevant to this topic as discussing a trig library would be when deciding on how to encode a float into bytes - internationalization (canonicalization, region-dependence, casing, etc) is something you do with string values when you're manipulating them for humans. It has nothing to do with how you encode strings, which is what we're talking about here; I'm just saying that any exposed "strings" need to have a single, well-known encoding (and that encoding is UTF-8). It's solely a way of encoding numbers between 0x0 and 0x10ffff into a byte sequence, that everyone agrees on and can reliably reverse.

Whether you encode your é as a single U+00e9 or as a U+0065 followed by a U+0301 is irrelevant; the point is that your byte sequence will say either c3a9 or 65cc81, and arbitrary consumers at the other end will be able to to interpret that back into the characters you intended, without needing to guess or have out-of-band knowledge of your encoding, because it was specced/required in the language spec.

@ghost
Copy link

ghost commented Feb 16, 2017

Another interesting link from another w3c discussion see https://www.w3.org/2001/tag/doc/polyfills/ With advice that might support the names being definable by a polyfill. From the W3C TAG:

"7. Advice for spec editors. If you are developing a new feature for the web, polyfills can be hugely beneficial in helping to roll out that feature.

7.1 Don't be constrained by what is 'polyfillable'. New web features need not be constrained by the need to enable polyfill implementations. However, all other things being equal, choosing a design that is more polyfill friendly is desirable."

So how can web developers define their own names for their own encoding, and can issues such as this one be deferred to a polyfill for now and extensible by a polyfill?

@lukewagner lukewagner merged commit cd2af4d into master Feb 17, 2017
@lukewagner lukewagner deleted the extensible-names-section branch February 17, 2017 00:41
@sunfishcode
Copy link
Member

This PR was merged removing the reference to UTF-8 for the name section, despite the fact that I had specifically objected to that.

@lukewagner
Copy link
Member Author

Sorry about that; discussion for UTF-8 in core wasm spec is now in #989.

hubot pushed a commit to WebKit/WebKit-http that referenced this pull request May 10, 2017
JSTests:

https://bugs.webkit.org/show_bug.cgi?id=171263

Reviewed by Keith Miller.

* wasm/function-tests/nameSection.js: Added.
(const.compile):
* wasm/function-tests/nameSection.wasm: Added.
* wasm/function-tests/stack-trace.js: Update format

Source/JavaScriptCore:

https://bugs.webkit.org/show_bug.cgi?id=171263

Reviewed by Keith Miller.

The name section is an optional custom section in the WebAssembly
spec. At least when debugging, developers expect to be able to use
this section to obtain intelligible stack traces, otherwise we
just number the wasm functions which is somewhat painful.

This patch parses this section, dropping its content eagerly on
error, and if there is a name section then backtraces use their
value instead of numbers. Otherwise we stick to numbers as before.

Note that the format of name sections changed in mid-February:
  WebAssembly/design#984
And binaryen was only updated in early March:
  WebAssembly/binaryen#933

* CMakeLists.txt:
* JavaScriptCore.xcodeproj/project.pbxproj:
* interpreter/Interpreter.cpp:
(JSC::GetStackTraceFunctor::operator()):
* interpreter/StackVisitor.cpp:
(JSC::StackVisitor::readNonInlinedFrame):
(JSC::StackVisitor::Frame::functionName):
* interpreter/StackVisitor.h:
(JSC::StackVisitor::Frame::wasmFunctionIndexOrName):
* runtime/StackFrame.cpp:
(JSC::StackFrame::functionName):
* runtime/StackFrame.h:
(JSC::StackFrame::StackFrame):
(JSC::StackFrame::wasm):
* wasm/WasmBBQPlanInlines.h:
(JSC::Wasm::BBQPlan::initializeCallees):
* wasm/WasmCallee.cpp:
(JSC::Wasm::Callee::Callee):
* wasm/WasmCallee.h:
(JSC::Wasm::Callee::create):
(JSC::Wasm::Callee::indexOrName):
* wasm/WasmFormat.cpp:
(JSC::Wasm::makeString):
* wasm/WasmFormat.h:
(JSC::Wasm::isValidExternalKind):
(JSC::Wasm::isValidNameType):
(JSC::Wasm::NameSection::get):
* wasm/WasmIndexOrName.cpp: Copied from Source/JavaScriptCore/wasm/WasmCallee.cpp.
(JSC::Wasm::IndexOrName::IndexOrName):
(JSC::Wasm::makeString):
* wasm/WasmIndexOrName.h: Copied from Source/JavaScriptCore/wasm/WasmFormat.cpp.
* wasm/WasmModuleInformation.h:
* wasm/WasmModuleParser.cpp:
* wasm/WasmName.h: Copied from Source/JavaScriptCore/wasm/WasmCallee.cpp.
* wasm/WasmNameSectionParser.cpp: Added.
* wasm/WasmNameSectionParser.h: Copied from Source/JavaScriptCore/wasm/WasmCallee.cpp.
(JSC::Wasm::NameSectionParser::NameSectionParser):
* wasm/WasmOMGPlan.cpp:
(JSC::Wasm::OMGPlan::work):
* wasm/WasmParser.h:
(JSC::Wasm::Parser<SuccessType>::consumeUTF8String):


git-svn-id: http://svn.webkit.org/repository/webkit/trunk@216597 268f45cc-cd09-0410-ab3c-d52691b4dbfc
@juj
Copy link

juj commented Jun 10, 2017

@flagxor, @jgravelle-google et al., can you tell which version of Chrome will ship with the new Names format?

@jfbastien: How about WebKit/Safari? I presume it will ship with the new Names format immediately starting from the initial public release?

@jfbastien
Copy link
Member

@juj only the new format is implemented in WebKit, as of at least two STPs ago, I don't quite remember when.

@binji
Copy link
Member

binji commented Jun 10, 2017

@juj The new names format landed in v8 here. Looks like that made it into 59 stable.

@jfbastien
Copy link
Member

Oh and we don't have the module name yet: #1055
I should do it, it's super easy.

@lukewagner
Copy link
Member Author

FWIW, we don't have the module or any other names subsection other than function. However, we do have logic that skips over unknown subsections so that the toolchain can start using new subsections as soon as it wants without the risk of invalidating function names.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants