Skip to content

nikomatsakis/rust-name-resolution-algorithm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

52 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This repository contains a prototype for the new name resolution algorithm I would like to propose (and eventually implement) for Rust. It is a very flexible algorithm: globs are fully supported; macros can be used and impored like anything else; references to macros do not have to come "after" the definition of the macro in any sense; macros can define items (even other macro definitions) that can then be imported and used (even via globs).

Here are the basic guidelines, ignoring macros for a moment:

  1. Explicit imports and structs take precedence over globs.
    • So if I have use foo::*; and use bar::X;, then the name X always resolves to bar::X, even if foo contains a member named X.
  2. It is always legal to import the same item through multiple paths (e.g., use foo::X and use bar::X are legal, so long as they both ultimately refer to the same X).
  3. It is legal for two globs to conflict so long as the conflicting item is not referenced (e.g., a module can have both use foo::* and use bar::* even if both foo and bar define a struct X, so long as the module never refers to X).

When we add macros into the mix, things get interesting. This is because macro expansion generates new items, which can in turn affect how paths are resolved. The current algorithm takes a kind of greedy expansion strategy: we repeatedly iterate and expand, in any given round, all macros whose names we can successfully resolve.

When we expand macros, we remember the path that we expanded the macro definition that we used, and then at the end we double check that the path still leads to the same macro definition. If not, an error is reported. The goal of this check is to avoid "time travel" violations, where expanding a macro created a new item which shadowed a name found via a glob, causing the macro expansion path to point at the wrong place.

CAVEAT: The following text is out of date.

Compatibility with Rust

The core algorithm is (I think) backwards compatible with Rust today (modulo bugs in today's code that accept things they should not). But there are a few places that the algorithm as implemented here diverge from Rust which are probably not strictly backwards compatible (and may not be desirable, depending on who you ask):

  1. use statements define links that can be referenced from other modules. In other words, if a parent module has use foo::Bar, then it is legal for one of its children to do use super::Bar. This makes use and pub use statements completely analogous, except for the privacy of the resulting item.
  2. Globs from an ancestor module import both public and private items. In other words, use super::* will bring in private items defined in your parent as well as public items. Note that this combines with point 1 to mean that use super::* effectively recreates your parent's naming environment completely.

Caveats with the current state

There are also some caveats (as of the time of this writing, anyway, this summary may get out of date):

  1. I don't model multiple namespaces yet. I'll get to that.
  2. Privacy is not yet implemented. I may get to that.
  3. Hygiene and macro arguments in general are not modeled. I probably won't get to that.

I don't foresee any horrible problems with any of these things. However, multiple namespaces have surprised me in the past in terms of their complexity, so I do intend to actually implement THOSE. The others seem simple enough, I may do it if the whim strikes me.

How the algorithm works at a high level

The basic idea is that a fixed point expansion followed by a verification step. We basically progressively process the code, resolving macros, use statements, and globs until we reach a stable state. We then check that the final result was self-consistent and free of duplicate names and so forth.

Output of the algorithm

The output of the algorithm is a either an error or, upon success, a new AST tree (actually, the implementation mutates the existing one in place) and a set of bindings {Name -> ItemId} for each module. Here ItemId is an unambiguous identifier for everything in the AST. So if I have a module like mod foo { struct Bar; struct Baz; }, then the bindings for this module would be something like {Bar -> 22, Baz -> 23}, assuming that the ids assigned to the two structs Bar and Baz were 22 and 23 respectively. Note that there is nothing which says that this set of bindings must be unambigious. For example, we might have two bindings for the same name, like {Bar -> 44, Bar -> 66}.

Side note: The actual implementation gives identifies a bit more structure; they carry tags identifying the sort of item they refer to, so the id of Bar would be something like ItemId::Struct(22). This is defined in ast.rs, which is also a kind of experimental playground for other ways to struct rustc's own internal AST.

Intermediate state

The computation itself is a fixed-point computation that incrementally builds up these binding sets. During this process, the binding sets include tuples with slightly more information: (Name, Precedence, ItemId), where Precedence is either high (explicit declarations) or low (glob). It also carries along a set called "fully expanded". This is just a set of modules whose contents are fully expanded, meaning that they contain no macro references. It is important to know this because it affects when we can start processing globs (read on).

Algorithm pseudo-code

Here is the algorithm in pseudo-code. Note that this definition maps quite cleanly to the function resolve_and_expand in the actual implementation.

loop {
    expand macros where possible;
    seed names for all modules;
    glob names for all modules;
} until fixed point is reached;
verify all paths;

To see how the algorithm works, let's consider this example:

mod foo {
    use bar::*;
    use std::collections::HashMap;
    
    some_macro! { ... };
    
    struct MyStruct { }
}
mod bar {
    macro_rules! some_macro {
        ...
    }
}

Now let's go over each of the steps above. However, I'm going to go over them in a slightly

Expand macros where possible. The first we do is to try and expand macros. For every macro reference, e.g. some_macro! { .. } in our example above, we try to resolve the path (here, some_macro, resolved relative to the current module). This may or may not succeed. If it fails, it could be for a variety of reasons: there might be no binding for some_macro (yet); there might be multiple bindings for b (which indicates an error); or, the path might not lead to a macro. Whatever the cause, if path resolution fails, we don't report any errors just now, we just ignore it.

If however we can resolve the path, and it leads to a macro, then we can do macro expansion here. However, and this is important, when we expand the macro, we don't just replace the macro reference completely. Instead, we leave behind a "husk" that contains the path of the expanded macro. You can think of this as being similar to the "expansion trace" that rustc tracks: for each bit of code that resulted from macro expansion, we also remember the name of the macro that was used to create it. We'll need this path in a later step. (Note that in an actual implementation, you might choose to just remember a set of paths for expanded macros, rather than leaving a marker in the AST itself.)

This path also updates the fully_expanded set.

In the code, this is the function expand_macros.

Seed names for all modules. The seed step walk creates bindings for everything except globs. It works by traversing all the modules. For declarations, it will add a simple binding from the declared name to the declaration. So, in our example above, if the struct MyStruct declaration has id X, then we would add (MyStruct, High, StructId(X)) to the binding set for the module foo. Similarly, in the module bar, we would also add (some_macro, High, ModuleId(Y)), where Y is the id of the macro rules declaration for some_macro.

Explicit import statements deserve special mention. If we see a non-glob use, such as std::collections::HashMap, then we first try to resolve the path. If that succeeds, it will yield an item id Z, and we can add an entry like (HashMap, High, Z). However, if path resolution fails for whatever reason, then we still add a binding. But because we don't know the target id, this is a special "placeholder" binding (HashMap, High, 0), where 0 is a special placeholder id. This placeholder binding doesn't count as a normal binding and will be ignored for path resolution: it's purpose is to ensure that we don't add glob bindings for HashMap in the next step.

In the code, this is the function seed_names.

Process globs. Finally, we process globs. In the example, we see the module foo that contains a glob use bar::*. To process this glob, we would first try to resolve the path bar to a single, unambigious item (ignoring errors, as usual). Presuming this succeeds, and resolves to a module with id M, we then go over the bindings {N -> ID} found within this module M as follows:

  • Macro definitions. If ID refers to a macro definition, then we will add it to the containing module (e.g., foo) as a high-priority binding.
  • Non-macros. For all other kinds of bindings, we add the binding to the importing module as a low-priority binding, but only if the following conditions are met:
    • The macro M that we are importing from has no unexpanded macros.
    • There is no high-priority binding for N within the importing module.

In the code, this is the function glob_names.

Verify all paths. Ok, once we've fully completed expanding macros and propagating names, we proceed to the last step, which is verifying paths. This is the place where we actually report some errors. Basically in this step we iterate down all the items and check that all the paths we find can be resolved correctly. We also check that definitions are unambigious. For example, continuing with our example function above:

  • For the use a::b::c::* import, we would verify that a::b::c resolves to a module.
  • For use std::collections::HashMap, we would verify that this resolves to a single item (when multiple namespaces are supported, this would have to resolve to a single item in at least one namespace, and possibly zero items in the other).
  • For struct MyStruct, we would verify that self::MyStruct leads to this struct definition. This ensures that there is no other struct MyStruct (for example).
  • The macro reference some_macro! would presumably have been expanded in a prior step into some code. If it has not, that would be because the path some_macro cannot be resolved, and we would report an error. If it HAS been expanded, however, there will still be a "macro husk" lying around, as described in the previous step, containing the path some_macro. So we would resolve that husk and check that the path some_macro leads unambiguously to a single macro definition. (If this is the case, it must be the same definition that we used to expand.)

In the code, this is the function verify_paths.

FAQ

Is this algorithm correct?

Well, correct is naturally a bit hard to define. I believe the algorithm meets the expectations of Rust users and does logical things. But I think we can also say something a bit stronger: the result of the algorithm is coherent. Let me try to define what I mean by that (I intend to make this more formal, but for now it's rather hand-wavy). Let's consider an expanded form of the AST in which all links are made explicit:

  • Every path has a link describing the item it refers to
  • Globs are expanded to include a list of explicit items that they bring into scope
  • All macros are expanded, and code derived from expanding a macro includes some means to determine the macro it was expanded from (the "husk" in the algorithm above)

Note that this expanded form roughly corresponds to what the "semantic analysis" step of a compiler normally does (or "resolve", as rustc calls it). In this expanded, explicit form, we can define a fairly simple notion of correctness. Basically, every path should in fact lead to the destination that is specified (without encountering ambigiuities along the way) and so forth. (Here I am assuming as "obviously correct" a kind of naive path resolution algorithm that walks to each module, inspects all the bindings that are within, and selects the only matching one.)

I think of the algorithm as analogous to type inference: it infers all this extra information that is elided from the original AST and creates an explicit form. I believe that, if this inference succeeds, then the final resulting program is guaranteed to be "resolve correctly" (in the sense I described in the previous paragraph). You can see that the "verify" step comes close to checking most of the correctness requirements (and reporting an error if any of them fail). For example, it checks that every path resolves to something unambiguously. But it doesn't check the full set of properties, because it doesn't check that the result of this resolution is in fact the item we expected. This is because I believe that it is not necessary to check what the path resolves to: if it resolves to just one thing, it MUST be the thing we want. But it'd be nice to prove that (or at least argue it in a somewhat formal fashion).

It would also be nice to make an argument that the "inference" step preserves the user's intent. For example, we'd like to show that if a solution is found, there is no other possible solution. Ideally, we'd also show that if an error results, then there is no possible solution to be found that respects the criteria.

Why do explicit items shadow globs?

Because it's better for forwards compatibility. It makes it less likely that adding a public item will accidentally break a downstream client. In other words, if someone does this:

use crate_x::*;
use crate_y::Foo;
struct Bar(Foo);

And in the meantime crate_x adds a new entry Foo, the downstream create does not break. Still, it's not a guarantee. For example, someone might do:

use crate_x::*;
use crate_y::*;
struct Bar(Foo);

and, in that case, they will get an error if crate_x later adds a Foo binding. (Note though that if they were not referencing Foo, there would be no problem.)

Why don't explicit macros take precedence over globs?

This is to avoid time-travel-like paradoxes that can otherwise occur when expanding macros. The problem case is when we have a macro found with a path like a::b!. Suppose that the containing module has a use foo::*; which was the origin of the module a. But then the macro generates a module a as well. Now we are in a quandry: to load the macro in the first place, we had to use the a from the glob, but that a was shadowed by code that the macro generated. This is bad because that shadowed item should have taken precedence over the a we got from the glob, but we can't go back and undo the macro expansion we already did (and anyway, if we did so, we wouldn't have the new definition for a anymore)! So we call it an error.

Now, there are other ways to resolve this. One would be to take a "mutable view" on the set of bindings. That is, when a module expands, we only consider the names that were in scope at the time of expansion, which might later change. I wanted to avoid this because it introduces ordering between macro expansions -- that is, suppose I have to two macro expansions in a row, like foo!(); bar!();. Now if I expand foo first, it may generate names that would conflict with expanding bar -- but it would have been fine if I had just expanded bar first! I want to preserve the Rust-y notion that the order of declaration for items doesn't matter.

Another way of looking at is in terms of the expanded form I described in the question about correctness: a mutable view means that the "correctness" property of this expanded form would be a lot more complex, because we would have to reason not about all the bindings we see, but the bindings that were visible at the time of expansion.

Can we just remove macros from globs instead?

It has been proposed that we could keep the full inference rules if we just removed macros from globs. But I don't think that works. Consider this example:

mod a {
    use b::c::n
    n! { }
}
mod b {
    use c::*;
    macro_rules! m {
        () => {
            mod d {
                macro_rules! n { ... }
            }
        }
    }
    m! { }
}
mod c {
    mod d {
        macro_rules! n {
            ...
        }
    }
}

The problem here has to do with the macro reference n! invoked by the module a. It is imported with the path b::d::n. We can resolve this via the glob import from c but, once we expand the macro m in module b, we see that in fact this module generates a higher-precedence module that should have taken precedence over the glob. So we are stuck in that we cannot freely expand macros in any order. (Note: this is the test case banning_macro_globs_is_not_enough; it results in a compilation error with the current system.)

Does the user ever have to care about this algorithm?

There are one scenario where the user is sort of exposed to the limitations of the algorithm, in a sense -- that is, where there is a fully-expanded form that has no conflicts or errors, but the algorithm fails to find it. If you have a glob that brings in a module, but that module is needed to resolve a macro reference somewhere else, than the expansion algorithm gets "stuck". Here is an example:

mod a {
    pub use b::*; // (2) ...requires expanding this glob
    c::m! { } // (1) resolving this path...
}
mod b {
    mod c {
        pub macro_rules! m { }
    }
}

This is the test case unexpandable_macro. (The fear of course is that the macro m may define a module c that would invalidate the path we committed to.)

I think it should be possible to write a kind of diagnostic that walks the code and informs the user "please add pub use b::c to allow name expansion to proceed. But I have to think about it a bit. It also seems like "well, if you can write that, can we incorporate it into the algorithm?"

My rough intution for how this would work is that once a fixed point is reached, we would start expanding globs, but if a macro should shadow one of those globs, we'd report the error to the user as a kind of "causality violation". Have to think about how to handle that.

About

A prototype of my proposed name resolution algorithm for Rust.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages