RFC: Rust Symbol Mangling (v0) #2603

michaelwoerister · 2018-11-27T13:29:21Z

Rendered
Tracking issue
Reference Implementation
Pre-RFC

Summary

This RFC proposes a new mangling scheme that describes what the symbol names generated by the Rust compiler look like. This new scheme has a number of advantages over the existing one, which has grown over time without a clear direction. The new scheme is consistent, does not depend on compiler internals, and the information it stores in symbol names can be decoded again, which provides an improved experience for users of external tools that work with Rust symbol names. The new scheme is based on the name mangling scheme from the [Itanium C++ ABI][itanium-mangling].

Motivation

Due to its ad-hoc nature, the compiler's current name mangling scheme has a
number of drawbacks:

It depends on compiler internals and its results cannot be replicated by another compiler implementation or external tool.
Information about generic parameters and other things is lost in the mangling process. One cannot extract the type arguments of a monomorphized function from its symbol name.
The current scheme is inconsistent: most paths use Itanium style encoding, but some don't.
The symbol names it generates can contain . characters, which are not supported on all platforms. [1][2][3]

The proposed scheme solves these problems:

It is defined in terms of the language, not in terms of compiler data-structures that can change at any given point in time.
It encodes information about generic parameters in a reversible way.
It has a consistent definition that does not rely on pretty-printing certain language constructs.
It generates symbols that only consist of the characters A-Z, a-z, 0-9, and _.

This should make it easier for third party tools to work with Rust binaries.

eddyb · 2018-11-27T15:09:44Z

For the record, I'll be starting the compiler implementation/integration work ASAP, to get this RFC in rustc nightly, and later on, in other tools (such as GDB, LLDB, etc.).

Doing this at the same time as the RFC will give us the ability to collect data at scale, and figure out edge cases and performance tradeoffs we might miss otherwise.

eddyb · 2018-11-27T18:32:38Z

text/0000-symbol-name-mangling-v2.md

+
+### Methods
+
+Methods are nested within `impl` or `trait` items. As such it would be possible to construct their symbol names as paths like `my_crate::foo::{{impl}}::some_method` where `{{impl}}` somehow identifies the `impl` in question. Since `impl`s don't have names, we'd have to use an indexing scheme like the one used for closures (and indeed, this is what the compiler does internally). Adding in generic arguments too, this would lead to symbol names looking like `my_crate::foo::impl'17::<u32, char>::some_method`.


In the interest of keeping this RFC sufficiently detached from current implementation details, can we use some more general placeholder notation, such as <impl>, instead of {{impl}}?

It's just an example of how not to do it. The {{xyz}} notation is meant to be reminiscent of what some templating engines use, not of what the compiler did at some point. But I can change it to <impl> if you prefer that.

eddyb · 2018-11-27T18:37:54Z

text/0000-symbol-name-mangling-v2.md

+
+ - Identifiers and trait impl path roots can have a numeric disambiguator (the `<disambiguator>` production). The syntactic version of the numeric disambiguator maps to a numeric index. If the disambiguator is not present, this index is 0. If it is of the form `s_` then the index is 1. If it is of the form `s<base-62-digit>_` then the index is `<base-62-digit> + 2`. The suggested demangling of a disambiguator is `[<index>]`. However, for better readability, these disambiguators should usually be omitted in the demangling altogether. Disambiguators with index zero can always be omitted.
+
+ The exception here are closures. Since these do not have a name, the disambiguator is the only thing identifying them. The suggested demangling for closures is thus `{closure}[<index>]`.
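The index mapping described in the quoted text can be sketched as a small parser. This is a minimal illustration, not the reference implementation: the names `base62_digit` and `parse_disambiguator` are hypothetical, the digit order 0-9, a-z, A-Z is an assumption (the quoted text does not spell it out), and accepting more than one digit is also an assumption that goes beyond the single-digit form quoted here.

```rust
/// Value of one base-62 digit.
/// ASSUMPTION: digits are ordered 0-9, then a-z, then A-Z.
fn base62_digit(b: u8) -> Option<u64> {
    match b {
        b'0'..=b'9' => Some((b - b'0') as u64),
        b'a'..=b'z' => Some((b - b'a') as u64 + 10),
        b'A'..=b'Z' => Some((b - b'A') as u64 + 36),
        _ => None,
    }
}

/// Parses an optional disambiguator at the start of `input`.
/// Returns `(index, bytes_consumed)`, or `None` if the input is malformed.
fn parse_disambiguator(input: &str) -> Option<(u64, usize)> {
    let bytes = input.as_bytes();
    if bytes.first() != Some(&b's') {
        // No disambiguator present: the index is 0.
        return Some((0, 0));
    }
    let mut value: u64 = 0;
    let mut digits = 0usize;
    for (i, &b) in bytes.iter().enumerate().skip(1) {
        if b == b'_' {
            // `s_` maps to 1; `s<digits>_` maps to the digit value plus 2.
            let index = if digits == 0 { 1 } else { value.checked_add(2)? };
            return Some((index, i + 1));
        }
        value = value.checked_mul(62)?.checked_add(base62_digit(b)?)?;
        digits += 1;
    }
    None // the terminating `_` is missing
}
```

Under these assumptions, `parse_disambiguator("s_")` yields index 1 and `parse_disambiguator("s0_")` yields index 2. The sketch only looks at the prefix it is given; a real demangler would call it at the grammar position where a disambiguator is allowed.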


Similarly here, we should avoid braces. What does C++ do for its lambdas?

GCC uses something with braces and indices too:

int square(int num) { auto foo = [num]() -> int { return num * num; }; return foo(); }

The closure is demangled as square(int)::{lambda()#1}::operator()() const
(see https://godbolt.org/z/TaXWCe)

Do debuggers work well with it? If so, how? Can we do some tests to see what works and what doesn't?

I assume that debuggers treat lambdas as regular operator() methods. What kind of tests did you have in mind?

I'm referring to the problems @m4b mentions in #2603 (comment), regarding debuggers not being able to let you refer to symbol names that contain { (or perhaps only {{?).

If we replace {{closure}} in the compiler with some other notation, we can see how well GDB and LLDB interact with the symbol names.

Although it's possible debuggers only handle such symbol names when they come from a mangling, which would mean debuggers should just pick a demangling that works for them, right?

eddyb · 2018-11-27T18:40:57Z

text/0000-symbol-name-mangling-v2.md

+
+ The exception here are closures. Since these do not have a name, the disambiguator is the only thing identifying them. The suggested demangling for closures is thus `{closure}[<index>]`.
+
+ - In a lossless demangling, identifiers from the value namespace should be marked with a `'` suffix in order to avoid conflicts with identifiers from the type namespace. In a user-facing demangling, where such conflicts are acceptable, the suffix can be omitted.


Wouldn't that include all the statics and functions? Seems a bit excessive.

It does, but I don't think there's a way around it. Otherwise you get conflicts for examples like:

fn foo() { fn bar() {} } mod foo { fn bar() {} }

Note though that this is only for "lossless" demanglings. For most user-facing demanglings, like in debuggers or backtraces, the suffix can just be omitted. I suggest that demanglers support a lossless or verbose option that defaults to false.
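To make the conflict concrete, here is a minimal sketch of a demangler output stage with such an option. The `Namespace`, `Component`, and `render` names are hypothetical and the path model is deliberately simplified; it only illustrates how the `'` suffix keeps the two `bar` functions apart in lossless mode.

```rust
/// Namespace of a path component (simplified, hypothetical model).
#[derive(Clone, Copy, PartialEq)]
enum Namespace {
    Type,  // modules, structs, traits, ...
    Value, // functions, statics, ...
}

/// One segment of a demangled path.
struct Component<'a> {
    name: &'a str,
    ns: Namespace,
}

/// Renders a path. In lossless mode, components from the value
/// namespace get a `'` suffix; otherwise the suffix is omitted.
fn render(path: &[Component<'_>], lossless: bool) -> String {
    path.iter()
        .map(|c| {
            if lossless && c.ns == Namespace::Value {
                format!("{}'", c.name)
            } else {
                c.name.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("::")
}
```

With this model, `fn foo() { fn bar() {} }` renders as `foo'::bar'` and `mod foo { fn bar() {} }` as `foo::bar'` in lossless mode, while both collapse to `foo::bar` in the user-facing mode.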

eddyb · 2018-11-27T18:42:08Z

text/0000-symbol-name-mangling-v2.md

+struct Foo<T>(T);
+
+impl<T> Clone for Foo<T> {
+  fn clone<U>(_: U) {


Clone::clone can't take type parameters.

Yeah, I'll come up with a better example.

Borrow<T> could work well

EDIT: Wait, no, missed you wanted the type param on the method.

eddyb · 2018-11-27T18:42:48Z

text/0000-symbol-name-mangling-v2.md

+}
+```
+- unmangled: `mycrate::Foo::bar::QUUX`
+- mangled: `_RNMN11mycrate_xxx3FooE3barV4QUUXVE`


Shouldn't these mention type parameters?

Sure enough. You want to be able to distinguish between these 2 cases (this code compiles today):

struct Foo<U,V>(U,V); impl<U: Fn()> Foo<U, u32> { fn foo() {} } impl<U: Fn()> Foo<u32, U> { fn foo() {} }

OK, we'll have to take care of that then.

I had a little chat with @nikomatsakis about this yesterday and the outcome was that:

we always should encode type parameters in paths like this one and

we should also always encode parameter bounds in some form because there is no way to find out if they are needed for disambiguation without looking at other impls -- which we want to avoid. The bounds could be encoded in a numeric disambiguator though.

The consequences this has on symbol syntax should be small. We just have to find the best spot for adding parameter bounds.

> we should also always encode parameter bounds

I still think that's not ideal, and I'd prefer having a disambiguated path to the impl and/or to the type parameters (either of which would be hidden in the non-verbose mode).

Can you give your reasons why having the path to the impl is better than encoding the bounds? I assume because it's less complicated?

eddyb · 2018-11-27T18:43:09Z

text/0000-symbol-name-mangling-v2.md

+struct Foo<T>(T);
+
+impl<T> Clone for Foo<T> {
+  default fn clone<U>(_: U) {


Similarly here, with the extraneous <U>.

rpjohnst · 2018-11-27T19:38:41Z

text/0000-symbol-name-mangling-v2.md

+
+<path-root> := <crate-id>
+            |  M <type>
+            |  X <type> <abs-path>


Bringing this comment up again: https://internals.rust-lang.org/t/pre-rfc-a-new-symbol-mangling-scheme/8501/4?u=rpjohnst. Would it make sense to move the trait's self type into its argument list? It reorders things from how they are displayed in error messages, but simplifies the grammar a bit.

I agree, we already treat <X as Trait<Y, Z>> as sugar for Trait applied with [X, Y, Z], there's no real reason to have it separate here.

I didn't forget about the suggestion but unfortunately, while implementing it, it turned out that it makes the demangler a lot more complicated -- at least if we want to stick to the <X as Trait> demangling. If we mangle trait methods as foo::bar::Trait<SelfType, X, Y, Z>::method, the demangler cannot know that it is dealing with a trait method when it starts demangling the path at foo. It could only discover that when it gets to Trait and would then have to rewind and store the already generated output (foo::bar::Trait) on the heap, demangle the self-type, then copy back the trait path and continue demangling the trait's type arguments. It can only know that Trait is a trait if we put a special marker on the identifier, so traits would again be special cased. As a consequence, I thought, if we have to special case traits one way or the other, we might as well do it in a way that allows for efficient demangling and doesn't need the extra kind of logic.

The situation would be different if we actually wanted to demangle trait methods to foo::bar::Trait<SelfType, X, Y, Z>::method. But I don't think we want to do that, right?

Oh, wait, an on-the-fly demangler needs to have everything in demangled order, right.
Is this only needed for <X as Trait<Y, Z>>, or are there other "out of order" constructs?

As far as I can tell <X as Trait<Y, Z>> is the only case.

arielb1 · 2018-11-27T20:27:31Z

text/0000-symbol-name-mangling-v2.md

+```
+
+
+### Items Within Specialized Trait Impls


Theoretically, you could also have stuff like this:

struct Foo<T>(T); impl<T> Foo<T> where T: FnOnce() -> u32 { fn foo() { static ABC: u32 = 0; } } impl<T> Foo<T> where T: FnOnce() -> f32 { fn foo() { static ABC: u32 = 0; } }

It is not supported by today's coherence, but it might be supported someday in the future.

I suppose that for now it is enough to also let this case use the <Foo<T>>'N format for either or both impls.

Yes, the RFC proposes to use a numeric disambiguator for keeping the two impls apart -- until specialization is finalized, at which point the disambiguator would be replaced with something more human-readable, which probably amounts to an encoding of the where clauses.

That code does not depend on specialization, as there is no overlap.

michaelwoerister · 2018-11-28T09:39:42Z

@m4b Let's discuss closures a bit. I want to get them fixed. The RFC proposes to demangle them as some::function::{closure}[3] where [3] means that it is the fourth closure within some::function. Am I correct in assuming that you don't find this readable enough and would prefer something like some::function::{closure at line 77}?

michaelwoerister · 2018-11-28T13:44:40Z

@bstrie Yes, I'm aware of these problems. I personally prefer the indexing approach (I find C++'s {lambda()#1} notation quite appealing, actually). I'm still interested in hearing alternative suggestions.

zackw · 2018-11-28T16:14:27Z

I’m an end-user interested in cross-language interop, and I have some experience with implementation of the Itanium C++ ABI. I would like to provide a few notes on this RFC.

1. As I said in the pre-RFC discussion, I think the Rust mangling should be intentionally incompatible with all known C++ ABIs, because the linker should not resolve a call to extern void foo(int, int) from C++ as targeting a Rust function with the signature pub fn foo(i32, i32) -> () unless that function was specifically declared as extern "C++".

   The current proposal achieves this by using _R as the prefix for mangled names, and that’s plenty good enough, but I think it should be written down as a concrete reason not to go for C++ ABI compatibility.
2. I think functions’ mangled names should always encode the full type signature of the function, even though Rust does not have function overloading. This would be a safety feature. It would trap cross-crate type mismatches between caller and callee at link time. (I have the impression this is supposed to be one of the purposes of crate disambiguators, but I do not trust them to do the job completely. Also, putting the type signatures into the mangled names would enable the linker to give better error messages.)
3. The Unicode handling is underspecified. RFC 2457 and its stabilization issue indicate that issues like normalization, the subset of characters allowed in identifiers, etc. are being handled at the language level, but a demangler needs to know what to do with arbitrary nonsense produced by tools other than a correctly-implemented Rust compiler: e.g. a buggy Rust compiler, the extern "Rust" support in some other compiler (also arguably buggy, but still), and people writing assembly language by hand.

   I think this RFC should say that demanglers should check whether the result of decoding the punycode is a valid Rust identifier according to the rules that end up getting stabilized in the RFC 2457 process, and display the punycode string if it doesn’t qualify. For instance, _RN15mycrate_4a3b56d9godel_fgdu6escher4bachVE, which uses the NFD encoding of gödel, should be decoded as mycrate[4a3b56d]::[godel_fgd]::escher::bach.

   Also, please add a cross-reference to RFC 2457.
4. In my opinion, the possibility of Unicode sequences that are not valid Rust identifiers being shoved into a mangled name by a tool that doesn’t follow all of the appropriate rules is a strong reason not to allow the use of raw UTF-8 in mangled names.
5. ABI markers are simultaneously under- and overspecified. There’s what appears to be an exhaustive list of codes for calling conventions that are currently supported on any architecture, including several mutually exclusive groups, e.g. you’ll never need "m" and "i" on the same computer, I hope. And at the same time it doesn’t say what it means if the ABI marker is not present at all. I would suggest cutting this down drastically, e.g.
   ```
   // If the <abi> is not present, the function uses the usual Rust calling
   // convention for this architecture and OS.
   <abi> = "K" (
       "c" |     // Usual C calling convention for this arch and OS
       "j" |     // Rust intrinsic calling convention
       "i" |     // Interrupt handler for this architecture
       // Other single-lowercase-letter codes may be defined by each
       // architecture and OS; for instance, "s" could mean the Win32
       // "stdcall" convention.
   )
   ```
   Also, please add cross-references to the exact definitions of each of the calling conventions that are referenced by name.
6. It appears to me that the N prefix and E suffix on <abs-path> are unnecessary, and the E suffix may also be unnecessary in several of the other places where it’s used. The Itanium C++ ABI only uses N ... E notation when it’s necessary to disambiguate “nested names,” e.g. when a name needs to be encoded as part of a template parameter.

Thanks for listening.

eddyb · 2018-11-28T16:33:12Z

> It would trap cross-crate type mismatches between caller and callee at link time.

This can't handle all possible ABI-impacting details, only shallow ones, and on top of that, rustc already has better detection of incompatible crates, solving most, if not all, linking concerns.
C++ has this issue because of header files, but Rust doesn't have any equivalent features.

So to me, it seems like this would just increase symbol name size, without many (any?) benefits.

Wilfred · 2018-11-28T18:24:02Z

text/0000-symbol-name-mangling-v2.md

+
+ - A mangled symbol should be *decodable* to some degree. That is, it is desirable to be able to tell which exact concrete instance of e.g. a polymorphic function a given symbol identifies. This is true for external tools, backtraces, or just people only having the binary representation of some piece of code available to them. With the current scheme, this kind of information gets lost in the magical hash-suffix.
+
+ - It should be possible to predict the symbol name for a given source-level construct. For example, given the definition `fn foo<T>() { ... }`, the scheme should allow to construct, by hand, the symbol names for e.g. `foo<u32>` or `foo<extern fn(i32, &mut SomeStruct<(char, &str)>, ...) -> !>()`. Since the current scheme generates its hash from the values of various compiler internal data structures, not even an alternative compiler implementation could predicate the symbol name, even for simple cases.


Spelling pedantry: I think this should be predict

rocallahan · 2018-11-28T20:29:33Z

What justifies the additional complexity of the "does not itself add any new information" rule for node equivalence? Is this a microoptimization or does it make things easier to implement?

jsgf · 2018-11-29T00:13:15Z

text/0000-symbol-name-mangling-v2.md

+           "j" |   // RustIntrinsic
+           "p" |   // PlatformIntrinsic
+           "u"     // Unadjusted
+        )


This all seems a bit arbitrary. Given that in principle there could be an unbounded number of ABIs, it seems like we should splurge on using a real string here rather than a single character. I'm also going to guess that these will be relatively rare, so space isn't a consideration?

I'd also favor encoding the ABI "string" (ideally as an identifier, replacing - with _, etc.)
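As a rough sketch of what encoding the ABI string as an identifier could look like: the function below replaces `-` with `_` and emits the result with a decimal length prefix, in the same `<len><bytes>` style the identifiers in the RFC's examples use. The function name `encode_abi`, the reuse of the `K` tag, and the rejection rules are assumptions made for illustration, not something the RFC text specifies.

```rust
/// Encodes an ABI string such as "C" or "x86-interrupt" as a tagged,
/// length-prefixed identifier.
/// ASSUMPTION: the "K" tag and the `<len><bytes>` layout are illustrative only.
fn encode_abi(abi: &str) -> Option<String> {
    // Make the ABI name identifier-like.
    let ident = abi.replace('-', "_");
    // Reject names that would not fit the [a-zA-Z0-9_] alphabet, and names
    // starting with a digit, which would be ambiguous after the length prefix.
    let valid = !ident.is_empty()
        && !ident.starts_with(|c: char| c.is_ascii_digit())
        && ident.chars().all(|c| c.is_ascii_alphanumeric() || c == '_');
    if !valid {
        return None;
    }
    Some(format!("K{}{}", ident.len(), ident))
}
```

Under these assumptions `encode_abi("x86-interrupt")` produces `K13x86_interrupt`. The cost compared to a single-character code is a handful of bytes on symbols that are expected to be rare.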

This makes me wonder if Rust should've used extern(C) fn syntax instead of extern "C" fn, but it's too late now.

I responded in #2603 (comment).

jsgf · 2018-11-29T00:14:02Z

text/0000-symbol-name-mangling-v2.md

+           "u"     // Unadjusted
+        )
+
+<disambiguator> = "s" [<base-62-digit>] "_"


Is this only a single digit? What if more than 62 things need disambiguation? I can imagine such things arising in generated code.
I'd propose {<base-62-digit>}.

Meta-nit: in a post-regex world, I find EBNF somewhat unintuitive: it took me a while to even notice that by {...} you meant "replace ? with *", initially I thought you were talking about "{" ... "}".

cc @Centril (who started using "lyg" syntax instead)

Whoops, that's just a mistake in the grammar. It should be {<base-62-digit>} indeed.

Yeah, I don't really care which notation is used :P

jsgf · 2018-11-29T00:16:44Z

text/0000-symbol-name-mangling-v2.md

+
+<generic-arguments> = "I" {<type>} "E"
+
+<substitution> = "S" [<base-62-digit>] "_"


Likewise seems safer to make this {...} (notwithstanding other comments about compression).

jsgf · 2018-11-29T00:23:06Z

text/0000-symbol-name-mangling-v2.md

+With this post-processing in place the Punycode strings can be treated like regular identifiers and need no further special handling.
+
+
+## Compression


I'd be very tempted to omit this kind of ad-hoc compression scheme in favour of using a standard algorithm like zstd (say, zstd with a well-defined domain-specific dictionary). (cc @Cyan4973 - are there any examples of using zstd for per-symbol compression?)

Historically C++ demanglers have been very fragile, and I suspect a big part of that is due to the implementation complexity of the Itanium ABI compression mechanism.

Aside from the implementation issues, because this is so coupled with the definition of the mangling scheme itself, it means that any future evolution of mangling needs to also take compression into account. Using a completely separate compression layer makes this a non-issue. The other nice thing about making compression largely isolated from the rest of the encoding is that it means it can be added in a second pass as an extension once we have some experience with uncompressed mangling - maybe it wouldn't be so bad?

The main problem with using an external library is that any Rust demangler introduces another dependency. This is particularly worth considering when integrating Rust demangler support into other tools like binutils/llvm/perf/valgrind/etc.

I believe we can come up with a simple compression scheme (i.e. refer back to the byte position where the first occurrence of something was encoded).

This would allow a demangler implementation to have 0 external dependencies, and the specification would also not implicitly depend on another standard.

It might also behave better than zstd given the limitation of [a-zA-Z0-9_] (which makes bit streams less appealing), and it has the advantage that any path component name is guaranteed to show up in clear.
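A toy version of such a byte-position scheme might look as follows. It is only a sketch under stated assumptions: the `B<offset>_` back-reference syntax, the decimal offsets, and the function names are made up for illustration, and identifiers are assumed to be ASCII and not to start with a digit.

```rust
use std::collections::HashMap;

/// Encodes path components as `<len><name>`. A repeated component becomes
/// `B<offset>_`, where `<offset>` is the byte position of its first encoding.
fn encode_with_backrefs(components: &[&str]) -> String {
    let mut out = String::new();
    let mut first_seen: HashMap<&str, usize> = HashMap::new();
    for &name in components {
        if let Some(&pos) = first_seen.get(name) {
            out.push_str(&format!("B{}_", pos));
        } else {
            first_seen.insert(name, out.len());
            out.push_str(&format!("{}{}", name.len(), name));
        }
    }
    out
}

/// Reads one `<len><name>` identifier starting at byte `start`.
/// Returns the name and the position just past it.
fn read_ident(s: &str, start: usize) -> Option<(&str, usize)> {
    let rest = s.get(start..)?;
    let digits = rest.bytes().take_while(|b| b.is_ascii_digit()).count();
    if digits == 0 {
        return None;
    }
    let len: usize = rest[..digits].parse().ok()?;
    let begin = start + digits;
    let end = begin.checked_add(len)?;
    Some((s.get(begin..end)?, end))
}

/// Decodes the output of `encode_with_backrefs`; `None` on malformed input.
fn decode_with_backrefs(encoded: &str) -> Option<Vec<String>> {
    let bytes = encoded.as_bytes();
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < bytes.len() {
        if bytes[pos] == b'B' {
            // Back-reference: jump to the recorded offset and re-read the name.
            let end = pos + 1 + encoded[pos + 1..].find('_')?;
            let target: usize = encoded[pos + 1..end].parse().ok()?;
            let (name, _) = read_ident(encoded, target)?;
            out.push(name.to_string());
            pos = end + 1;
        } else {
            let (name, next) = read_ident(encoded, pos)?;
            out.push(name.to_string());
            pos = next;
        }
    }
    Some(out)
}
```

For the components `foo`, `bar`, `foo` this produces `3foo3barB0_`. The decoder needs no state beyond the input string itself, and every component name still appears in clear at least once.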

However, I would not be opposed to at least having a compiler mangling option which disregards the [a-zA-Z0-9_] limitation and which does the best compression it can, for use in situations where that might be advantageous (although at that point, you might be better off with symbol names just, say, hashes, and keep everything else in split debuginfo, and/or an ad-hoc mapping from hashes to symbol names).

Oh, and we should definitely gather data on all compression schemes we can think of (that are not too painful to implement), before we accept the RFC!

I've listed the zstd option as point 5 in Rationale and Alternatives. I would be interested in seeing how zstd performs. But it does come with some real downsides:

Every demangler would have to support zstd. That's another dependency that not everyone might want to pull in.

The specification of the mangling scheme would depend on the specification of zstd. I see that there's an IETF RFC for it. That's good. But it's still rather heavyweight.

Mangled symbols would not retain any human-readability at all.

I think one of the next steps would be to collect a body of symbol names for testing different compression schemes.

@jsgf I do acknowledge, btw, that an AST-independent compression scheme is clearly beneficial when it comes to evolving the grammar.

@michaelwoerister Yes, I think extra dependencies and non-readability are reasonable counter-arguments to using zstd.

(waving hands wildly) I'm assuming that we wouldn't bother compressing small symbols, so they would remain directly readable, and large symbols with any compression scheme would be such a soup that even if the compression scheme leaves some parts "in the clear" they're still not directly readable in any practical sense.

zstd and similar bit-compression algorithms are not very suitable for compressing short strings at all, and encoding the compressed bit-stream into [a-zA-Z0-9] (required by toolchains) would lose most of the compression again.
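The size of that loss can be estimated with a little arithmetic: a character drawn from a 62-symbol alphabet carries log2(62) ≈ 5.95 bits but still occupies a full byte in the symbol name. The hypothetical helper below (its name is not from the RFC) computes the resulting growth factor, ignoring any framing or padding overhead.

```rust
/// Factor by which a compressed bit stream grows when it is re-encoded
/// using an alphabet of `alphabet_size` symbols, one byte per symbol.
fn expansion_factor(alphabet_size: u32) -> f64 {
    // Each output byte carries log2(alphabet_size) bits of payload.
    8.0 / f64::from(alphabet_size).log2()
}
```

For 62 symbols the factor is roughly 1.34, so about a third of whatever the bit-level compressor saved is given back by the text encoding.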

jsgf · 2018-11-29T00:28:37Z

text/0000-symbol-name-mangling-v2.md

+### Punycode vs UTF-8
+During the pre-RFC phase, it has been suggested that Unicode identifiers should be encoded as UTF-8 instead of Punycode on platforms that allow it. GCC, Clang, and MSVC seem to do this. The author of the RFC has a hard time making up their mind about this issue. Here are some interesting points that might influence the final decision:
+
+- Using UTF-8 instead of Punycode would make mangled strings containing non-ASCII identifiers a bit more human-readable. For demangled strings, there would be no difference.


This is moot if compression (of any kind) is applied as well.

You mean because compressed names are unreadable in any case?

jsgf · 2018-11-29T00:35:11Z

text/0000-symbol-name-mangling-v2.md

+
+### Methods
+
+Methods are nested within `impl` or `trait` items. As such it would be possible to construct their symbol names as paths like `my_crate::foo::{{impl}}::some_method` where `{{impl}}` somehow identifies the `impl` in question. Since `impl`s don't have names, we'd have to use an indexing scheme like the one used for closures (and indeed, this is what the compiler does internally). Adding in generic arguments too, this would lead to symbol names looking like `my_crate::foo::impl'17::<u32, char>::some_method`.


Given that impls can appear anywhere within the crate, would the path be to the impl itself, or to the type being impled?

Do we need to distinguish between different impls, or just impls with different constraints?

Given these questions, I think the proposal below to ignore impls themselves makes sense.

The type being impled doesn't have to be a path, it can be e.g. [u8], so I think the safest thing to do would be to have both a path to the impl and the full type (and optionally trait) the impl is for.

The PR proposes to not include the path of the impl at all. @eddyb, you would rather demangle symbols to something like my_crate::foo::impl<u32, char>::some_method?

I don't understand what Self and the Trait are in that example.
What I'm thinking is mangling the equivalent information of e.g. my_crate::foo::impl'17<my_crate::foo::S as my_crate::Trait>::some_method, demangling back to that only in verbose mode, but only showing <my_crate::foo::S as my_crate::Trait>::some_method in the "user-friendly" mode.

OK, yes that's what I thought you meant.

jsgf · 2018-11-29T00:42:49Z

@zackw +1 on most of your points, but I think for 2. to matter it would mean that Rust compilation would have to change a lot. In practice with Rust code, one never sees linker errors for Rust symbols.

Centril · 2019-05-10T16:04:38Z

Huzzah! This RFC has been merged!

Tracking issue: rust-lang/rust#60705

michaelwoerister · 2019-05-14T08:40:50Z

@vitiral, thanks a lot! :)
Lots of credit also goes to @eddyb who put a ton of work into this and really improved upon my initial proposal.

@michaelwoerister

Introduce Rust symbol mangling scheme.

This is an implementation of a "feature-complete" Rust mangling scheme, in the vein of rust-lang/rfcs#2603 ~~- but with some differences, see rust-lang/rfcs#2603 (comment) for details~~ (@michaelwoerister integrated my proposed changes into the RFC itself).

On nightly, you can now control the mangling scheme with `-Z symbol-mangling-version`, which can be:
* `legacy`: the older mangling version, still the default currently
* `v0`: the new RFC mangling version, as implemented by this PR

To test the new mangling, set `RUSTFLAGS=-Zsymbol-mangling-version=v0` (or change [`rustflags` in `.cargo/config.toml`](https://doc.rust-lang.org/cargo/reference/config.html#configuration-keys)). Please note that only symbols from crates built with that flag will use the new mangling, and that tool support (e.g. debuggers) will be limited initially, and it may take a while for everything to be upstreamed. However, `RUST_BACKTRACE` should work out of the box with either mangling version.

<hr/>

The demangling implementation PR is rust-lang/rustc-demangle#23 ~~(this PR already uses it via a git dependency, to allow testing)~~. Discussion of the *design* of the mangling scheme should still happen on the RFC, but this PR's specific implementation details can be reviewed in parallel.

*Notes for reviewers*:
* ~~only the last 6 commits are specific to this branch, if necessary I can open a separate PR for everything else (it was meant to be its own small refactoring, but it got a bit out of hand)~~ ~~based on #58140~~
* the "harness" commit is only there because it does some extra validation (comparing the demangling from `rustc-demangle` to the compiler's pretty-printing, adjusted slightly to produce the same output), that I would like to try on crater
* ~~there is the question of whether we should turn on the new mangling now, wait for tools to support it (I'm working on that), and/or have it under a `-Z` flag for now~~ (we're gating this on `-Z symbol-mangling-version=v0`, see above)

r? @nikomatsakis / @michaelwoerister
cc @rust-lang/compiler

Serentty · 2019-12-14T19:09:34Z

So, related to the ABI discussion above, I have a question. Using this name mangling scheme, should it be possible to link Rust code to other Rust code while going through the existing C ABI, while preserving modules and other niceties?

eddyb · 2019-12-17T11:21:31Z

@Serentty Nope, and the mangling doesn't matter at all there - if you have a working module system, that means you need extra information (as found in .rmeta / .rlib) in the first place, so any mangling will work (as it will have everything it needs today).

If you want to link plain object files and still retain high-level features... well, that's both not planned and technically implausible (without coming up with some special cut-down version of those features).

Serentty · 2019-12-17T20:30:53Z

So the RLIB files serve as Rust headers, essentially? That sounds reasonable.

eddyb · 2020-03-13T18:34:39Z

Updating the RFC's "v2" to "v0" as per rust-lang/rust#57967 (comment) (should've done it back then).

Replace #2603's (Rust Symbol Mangling) "v2" with "v0".

sivizius · 2020-05-12T14:23:10Z

Change very first ›Rendered‹-Link to https://github.com/rust-lang/rfcs/blob/master/text/2603-rust-symbol-name-mangling-v0.md

This is the libiberty (mainly for binutils/gdb) counterpart of rust-lang/rustc-demangle#23.

Relevant links for the new Rust mangling scheme (aka "v0"):
* Rust RFC: rust-lang/rfcs#2603
* tracking issue: rust-lang/rust#60705
* implementation: rust-lang/rust#57967

This implementation includes full support for UTF-8 identifiers via punycode, so I've included a testcase for that as well.

libiberty/ChangeLog
2021-01-16 Eduard-Mihai Burtescu <[email protected]>

* rust-demangle.c (struct rust_demangler): Add skipping_printing and bound_lifetime_depth fields.
(eat): Add (v0-only).
(parse_integer_62): Add (v0-only).
(parse_opt_integer_62): Add (v0-only).
(parse_disambiguator): Add (v0-only).
(struct rust_mangled_ident): Add punycode{,_len} fields.
(parse_ident): Support v0 identifiers.
(print_str): Respect skipping_printing.
(print_uint64): Add (v0-only).
(print_uint64_hex): Add (v0-only).
(print_ident): Respect skipping_printing, Support v0 identifiers.
(print_lifetime_from_index): Add (v0-only).
(demangle_binder): Add (v0-only).
(demangle_path): Add (v0-only).
(demangle_generic_arg): Add (v0-only).
(demangle_type): Add (v0-only).
(demangle_path_maybe_open_generics): Add (v0-only).
(demangle_dyn_trait): Add (v0-only).
(demangle_const): Add (v0-only).
(demangle_const_uint): Add (v0-only).
(basic_type): Add (v0-only).
(rust_demangle_callback): Support v0 symbols.
* testsuite/rust-demangle-expected: Add v0 testcases.

Since 2016-11-16, c++filt [1] has recognized the legacy Rust mangling, and it will support the newer one (v0 name mangling) [2] [3], so it would be a better choice. [1]: https://sourceware.org/binutils/docs/binutils/c_002b_002bfilt.html [2]: rust-lang/rfcs#2603 [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=27194

michaelwoerister added 9 commits October 1, 2018 14:37

wip

567bad8

First update after reference implementation is done.

6104509

Add sections on Punycode identifiers, compression, and suggested dema…

0d632ca

…ngling format.

Update symbol syntax.

bad4c90

Update description of compression algorithm.

40037cf

Switch to from hex to base-62 as the non-decimal encoding.

836fc86

Add decompression and Rust entity to AST mapping.

61c0317

Clean up and Appendix B.

faeca26

Corrections after some proof-reading.

0fc3d01

michaelwoerister changed the title ~~Symbol Mangling v2~~ RFC: Symbol Mangling v2 Nov 27, 2018

Centril added the T-compiler (relevant to the compiler team, which will review and decide on the RFC) and A-linkage (proposals relating to the linking step) labels Nov 27, 2018

Centril reviewed Nov 27, 2018

text/0000-symbol-name-mangling-v2.md (outdated)

eddyb reviewed Nov 27, 2018

rpjohnst reviewed Nov 27, 2018

arielb1 reviewed Nov 27, 2018

Wilfred reviewed Nov 28, 2018

jsgf reviewed Nov 29, 2018

Centril mentioned this pull request May 10, 2019

Tracking issue for RFC 2603, "Rust Symbol Mangling (v0)" rust-lang/rust#60705

Open

16 tasks

RFC 2603

644025d

Centril merged commit 3e08228 into rust-lang:master May 10, 2019

Centril removed the I-nominated label May 10, 2019

pnkfelix mentioned this pull request May 23, 2019

[LLVM] Segmentation fault on MacOS with > 1 codegen units and optimization > 0 rust-lang/rust#60925

Closed

eddyb mentioned this pull request May 29, 2019

Amend RFC2603 to allow mangled identifiers to start with a digit. #2705

Merged

mzabaluev mentioned this pull request Jun 6, 2019

Define a Rust ABI #600

Closed

eddyb mentioned this pull request Nov 13, 2019

Macro that expands into string holding function name (or module path, etc) #1743

Open

eddyb changed the title ~~RFC: Symbol Mangling v2~~ RFC: Rust Symbol Mangling (v0) Mar 13, 2020

eddyb added a commit that referenced this pull request Mar 13, 2020

Replace #2603's "v2" with "v0".

22836e5

eddyb mentioned this pull request Mar 13, 2020

Replace #2603's (Rust Symbol Mangling) "v2" with "v0". #2879

Merged

eddyb mentioned this pull request Mar 23, 2020

Move from {{closure}}#0 syntax to {closure#0} for (def) path components. rust-lang/rust#70334

Open

nikomatsakis added a commit that referenced this pull request Apr 8, 2020

Merge pull request #2879 from rust-lang/eddyb-rename-2603-v2-v0

d89de6a

Replace #2603's (Rust Symbol Mangling) "v2" with "v0".

serak mentioned this pull request May 12, 2020

RFC: Rust Symbol Mangling (v0) #2603 Page 404 #2925

Closed

Centril mentioned this pull request Jun 17, 2020

Tracking issue for RFC 2457, "Allow non-ASCII identifiers" rust-lang/rust#55467

Closed

23 tasks

Xuanwo mentioned this pull request Jun 18, 2021

Idea: update rfc format beyondstorage/specs#110

Closed

kubo39 mentioned this pull request Jun 27, 2021

Recommend to use c++filt denoland/deno_lint#755

Merged

bstrie mentioned this pull request Feb 27, 2022

Add vendor-specific suffixes to v0 mangling RFC 2603 #3224

Merged


		### Methods

		Methods are nested within `impl` or `trait` items. As such, it would be possible to construct their symbol names as paths like `my_crate::foo::{{impl}}::some_method`, where `{{impl}}` somehow identifies the `impl` in question. Since `impl`s don't have names, we'd have to use an indexing scheme like the one used for closures (and indeed, this is what the compiler does internally). Adding in generic arguments too, this would lead to symbol names looking like `my_crate::foo::impl'17::<u32, char>::some_method`.


		- Identifiers and trait impl path roots can have a numeric disambiguator (the `<disambiguator>` production). The syntactic version of the numeric disambiguator maps to a numeric index. If the disambiguator is not present, this index is 0. If it is of the form `s_` then the index is 1. If it is of the form `s<base-62-digit>_` then the index is `<base-62-digit> + 2`. The suggested demangling of a disambiguator is `[<index>]`. However, for better readability, these disambiguators should usually be omitted in the demangling altogether. Disambiguators with index zero can always be omitted.

		Closures are the exception here. Since they do not have a name, the disambiguator is the only thing identifying them. The suggested demangling for closures is thus `{closure}[<index>]`.


		- In a lossless demangling, identifiers from the value namespace should be marked with a `'` suffix in order to avoid conflicts with identifiers from the type namespace. In a user-facing demangling, where such conflicts are acceptable, the suffix can be omitted.
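The disambiguator rule quoted above (absent means index 0, `s_` means index 1, `s<base-62-digit>_` means digit value plus 2) can be sketched in Rust. This is only an illustration under stated assumptions: the digit alphabet is taken to be `0-9`, `a-z`, `A-Z` in that order, more than one digit is accepted, and the function names `base62_digit`, `parse_disambiguator`, and `render_closure` are hypothetical, not taken from rustc or libiberty.

```rust
/// Value of one base-62 digit.
/// Assumption: the alphabet is 0-9, then a-z, then A-Z.
fn base62_digit(c: char) -> Option<u64> {
    match c {
        '0'..='9' => Some(c as u64 - '0' as u64),
        'a'..='z' => Some(c as u64 - 'a' as u64 + 10),
        'A'..='Z' => Some(c as u64 - 'A' as u64 + 36),
        _ => None,
    }
}

/// Decodes an optional disambiguator at the start of `input`.
/// Returns the numeric index and the unparsed remainder, or `None`
/// if an `s` is present but what follows is malformed.
fn parse_disambiguator(input: &str) -> Option<(u64, &str)> {
    // No leading `s`: the disambiguator is absent, so the index is 0.
    let body = match input.strip_prefix('s') {
        Some(body) => body,
        None => return Some((0, input)),
    };
    let end = body.find('_')?;
    let digits = &body[..end];
    let rest = &body[end + 1..];
    // `s_` with no digits maps to index 1.
    if digits.is_empty() {
        return Some((1, rest));
    }
    // `s<digits>_` maps to the base-62 value plus 2.
    let mut value: u64 = 0;
    for c in digits.chars() {
        value = value.checked_mul(62)?.checked_add(base62_digit(c)?)?;
    }
    Some((value.checked_add(2)?, rest))
}

/// Suggested demangling for a closure, `{closure}[<index>]`.
fn render_closure(index: u64) -> String {
    format!("{{closure}}[{}]", index)
}
```

Under these assumptions, `parse_disambiguator("s_3foo")` yields index 1 with `3foo` left to parse, and `parse_disambiguator("s0_3foo")` yields index 2. For ordinary identifiers a demangler would typically drop the index, while for closures it would keep it via something like `render_closure`.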


		- A mangled symbol should be decodable to some degree. That is, it is desirable to be able to tell which exact concrete instance of e.g. a polymorphic function a given symbol identifies. This matters for external tools, backtraces, and people who only have the binary representation of some piece of code available to them. With the current scheme, this kind of information gets lost in the magical hash-suffix.

		- It should be possible to predict the symbol name for a given source-level construct. For example, given the definition `fn foo<T>() { ... }`, the scheme should make it possible to construct, by hand, the symbol names for e.g. `foo<u32>` or `foo<extern fn(i32, &mut SomeStruct<(char, &str)>, ...) -> !>()`. Since the current scheme generates its hash from the values of various compiler-internal data structures, not even an alternative compiler implementation could predict the symbol name, even for simple cases.


		<generic-arguments> = "I" {<type>} "E"

		<substitution> = "S" [<base-62-digit>] "_"
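As a rough illustration of the `<substitution>` production quoted above, the following Rust sketch recognizes an `S`, an optional single base-62 digit, and a closing `_`. It assumes the square brackets in the grammar mean "optional" and that a base-62 digit is any ASCII letter or digit. The function name `parse_substitution` is hypothetical, and the sketch only checks the syntax; it does not resolve what the substitution refers to.

```rust
/// Recognizes `"S" [<base-62-digit>] "_"` at the start of `input`.
/// Returns the optional digit and the unparsed remainder, or `None`
/// if `input` does not start with a well-formed substitution.
fn parse_substitution(input: &str) -> Option<(Option<char>, &str)> {
    let body = input.strip_prefix('S')?;
    // Form without a digit: `S_`.
    if let Some(rest) = body.strip_prefix('_') {
        return Some((None, rest));
    }
    // Form with exactly one digit: `S<digit>_`.
    let mut chars = body.chars();
    let digit = chars.next().filter(|c| c.is_ascii_alphanumeric())?;
    let rest = chars.as_str().strip_prefix('_')?;
    Some((Some(digit), rest))
}
```

With these assumptions, `parse_substitution("S_E")` returns no digit and leaves `E` to parse, while `parse_substitution("S3_x")` returns the digit `3` and leaves `x`.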

		With this post-processing in place the Punycode strings can be treated like regular identifiers and need no further special handling.


		## Compression
