Update the-parser.md #1933

Merged
merged 3 commits into from
Sep 24, 2024
Changes from 2 commits
84 changes: 45 additions & 39 deletions src/the-parser.md
@@ -1,68 +1,74 @@
# Lexing and Parsing

The very first thing the compiler does is take the program (in Unicode
characters) and turn it into something the compiler can work with more
conveniently than strings. This happens in two stages: Lexing and Parsing.
The very first thing the compiler does is take the program (in Unicode) and
transmute it into a data format the compiler can work with more conveniently
than strings. This happens in two stages: Lexing and Parsing.

Lexing takes strings and turns them into streams of [tokens]. For example,
`foo.bar + buz` would be turned into the tokens `foo`, `.`,
`bar`, `+`, and `buz`. The lexer lives in [`rustc_lexer`][lexer].
1. _Lexing_ takes strings and turns them into streams of [tokens]. For
example, `foo.bar + buz` would be turned into the tokens `foo`, `.`, `bar`,
`+`, and `buz`.

[tokens]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/token/index.html
[lexer]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/index.html

Parsing then takes streams of tokens and turns them into a structured
form which is easier for the compiler to work with, usually called an [*Abstract
Syntax Tree*][ast] (AST). An AST mirrors the structure of a Rust program in memory,
using a `Span` to link a particular AST node back to its source text.
2. _Parsing_ takes streams of tokens and turns them into a structured form
which is easier for the compiler to work with, usually called an [*Abstract
Syntax Tree* (`AST`)][ast].


An `AST` mirrors the structure of a Rust program in memory, using a `Span` to
link a particular `AST` node back to its source text. The `AST` is defined in
[`rustc_ast`][rustc_ast], along with some definitions for tokens and token
streams, data structures/`trait`s for mutating `AST`s, and shared definitions for
other `AST`-related parts of the compiler (like the lexer and
`macro`-expansion).

The AST is defined in [`rustc_ast`][rustc_ast], along with some definitions for
tokens and token streams, data structures/traits for mutating ASTs, and shared
definitions for other AST-related parts of the compiler (like the lexer and
macro-expansion).
The lexer is developed in [`rustc_lexer`][lexer].
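
To make the role of a `Span` concrete, here is a minimal sketch of a tree for `foo.bar + buz` in which every node records the byte range it came from. The `Span`, `Expr`, and `ExprKind` types and the `snippet`, `build_example`, and `dump` helpers are simplified, hypothetical stand-ins written for this illustration; they are not the definitions in `rustc_ast` or `rustc_span`, which carry considerably more information.

```rust
// Hypothetical, simplified types for illustration only.
// The real definitions live in `rustc_ast` and `rustc_span`.

/// Half-open byte range `lo..hi` into the source text.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    lo: usize,
    hi: usize,
}

#[derive(Debug)]
enum ExprKind {
    /// A bare identifier such as `foo`.
    Ident(String),
    /// A field access such as `foo.bar`.
    Field(Box<Expr>, String),
    /// An addition such as `a + b`.
    Add(Box<Expr>, Box<Expr>),
}

/// Every node pairs its structure with the span it was parsed from.
#[derive(Debug)]
struct Expr {
    kind: ExprKind,
    span: Span,
}

/// Recover the source text that a node was built from.
fn snippet(src: &str, span: Span) -> &str {
    &src[span.lo..span.hi]
}

/// Build the tree for `foo.bar + buz` by hand (no parser involved).
fn build_example() -> (&'static str, Expr) {
    let src = "foo.bar + buz";
    let foo = Expr {
        kind: ExprKind::Ident("foo".to_string()),
        span: Span { lo: 0, hi: 3 },
    };
    let field = Expr {
        kind: ExprKind::Field(Box::new(foo), "bar".to_string()),
        span: Span { lo: 0, hi: 7 },
    };
    let buz = Expr {
        kind: ExprKind::Ident("buz".to_string()),
        span: Span { lo: 10, hi: 13 },
    };
    let add = Expr {
        kind: ExprKind::Add(Box::new(field), Box::new(buz)),
        span: Span { lo: 0, hi: 13 },
    };
    (src, add)
}

/// Print each node together with the source text its span points at.
fn dump(src: &str, expr: &Expr, depth: usize) {
    let label = match &expr.kind {
        ExprKind::Ident(name) => format!("Ident({name})"),
        ExprKind::Field(_, field) => format!("Field(.{field})"),
        ExprKind::Add(_, _) => "Add".to_string(),
    };
    println!("{}{} <- {:?}", "  ".repeat(depth), label, snippet(src, expr.span));
    match &expr.kind {
        ExprKind::Ident(_) => {}
        ExprKind::Field(base, _) => dump(src, base, depth + 1),
        ExprKind::Add(lhs, rhs) => {
            dump(src, lhs, depth + 1);
            dump(src, rhs, depth + 1);
        }
    }
}

fn main() {
    let (src, root) = build_example();
    dump(src, &root, 0);
}
```

Storing a byte range rather than a copy of the text keeps nodes small, and diagnostics can still slice the original source to show exactly what a node refers to.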

The parser is defined in [`rustc_parse`][rustc_parse], along with a
high-level interface to the lexer and some validation routines that run after
macro expansion. In particular, the [`rustc_parse::parser`][parser] contains
`macro` expansion. In particular, the [`rustc_parse::parser`][parser] contains
the parser implementation.

The main entrypoint to the parser is via the various `parse_*` functions and others in the
[parser crate][parser_lib]. They let you do things like turn a [`SourceFile`][sourcefile]
The main entrypoint to the parser is via the various `parse_*` functions and others in
[`rustc_parse`][rustc_parse]. They let you do things like turn a [`SourceFile`][sourcefile]
(e.g. the source in a single file) into a token stream, create a parser from
the token stream, and then execute the parser to get a `Crate` (the root AST
the token stream, and then execute the parser to get a [`Crate`] (the root `AST`
node).

To minimize the amount of copying that is done,
both [`StringReader`] and [`Parser`] have lifetimes which bind them to the parent `ParseSess`.
This contains all the information needed while parsing,
as well as the [`SourceMap`] itself.
To minimize the amount of copying that is done, both [`StringReader`] and
[`Parser`] have lifetimes which bind them to the parent [`ParseSess`]. This
contains all the information needed while parsing, as well as the [`SourceMap`]
itself.
Review comment (Member):
please don't convert nice semantic line breaks into hard line breaks like this, it makes it harder to read and harder to diff


Note that while parsing, we may encounter macro definitions or invocations. We
set these aside to be expanded (see [this chapter](./macro-expansion.md)).
Expansion may itself require parsing the output of the macro, which may reveal
more macros to be expanded, and so on.
Note that while parsing, we may encounter `macro` definitions or invocations. We
set these aside to be expanded (see [Macro Expansion](./macro-expansion.md)).
Expansion itself may require parsing the output of a `macro`, which may reveal
more `macro`s to be expanded, and so on.

## More on Lexical Analysis

Code for lexical analysis is split between two crates:

- `rustc_lexer` crate is responsible for breaking a `&str` into chunks
- [`rustc_lexer`] crate is responsible for breaking a `&str` into chunks
constituting tokens. Although it is popular to implement lexers as generated
finite state machines, the lexer in `rustc_lexer` is hand-written.
finite state machines, the lexer in [`rustc_lexer`] is hand-written.

- [`StringReader`] integrates `rustc_lexer` with data structures specific to `rustc`.
Specifically,
it adds `Span` information to tokens returned by `rustc_lexer` and interns identifiers.
- [`StringReader`] integrates [`rustc_lexer`] with data structures specific to
`rustc`. Specifically, it adds `Span` information to tokens returned by
[`rustc_lexer`] and interns identifiers.
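
The split described above can be sketched as two small layers. The first is a hand-written function that only looks at a `&str` and reports a token kind and length. The second walks the source, turns lengths into spans, skips whitespace, and interns identifiers. All names here (`RawKind`, `RawToken`, `first_token`, `Interner`, `Symbol`, `tokenize`) are hypothetical and chosen for this illustration; this is a toy under those assumptions, not the actual `rustc_lexer` or `StringReader` code.

```rust
use std::collections::HashMap;

// Layer 1: a hand-written low-level lexer. It knows nothing about
// spans or interning; it only classifies a prefix of the input.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RawKind { Ident, Dot, Plus, Whitespace, Unknown }

#[derive(Debug, Clone, Copy, PartialEq)]
struct RawToken { kind: RawKind, len: usize }

/// Classify the first token of `input`, or return `None` at end of input.
fn first_token(input: &str) -> Option<RawToken> {
    let first = input.chars().next()?;
    let (kind, len) = match first {
        '.' => (RawKind::Dot, 1),
        '+' => (RawKind::Plus, 1),
        c if c.is_whitespace() => {
            let len: usize = input.chars()
                .take_while(|c| c.is_whitespace())
                .map(char::len_utf8)
                .sum();
            (RawKind::Whitespace, len)
        }
        c if c.is_alphabetic() || c == '_' => {
            let len: usize = input.chars()
                .take_while(|c| c.is_alphanumeric() || *c == '_')
                .map(char::len_utf8)
                .sum();
            (RawKind::Ident, len)
        }
        c => (RawKind::Unknown, c.len_utf8()),
    };
    Some(RawToken { kind, len })
}

// Layer 2: compiler-specific wrapper that adds spans and interning.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span { lo: usize, hi: usize }

/// An interned string: a small integer standing in for the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Symbol(u32);

#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind { Ident(Symbol), Dot, Plus, Unknown }

#[derive(Debug, Clone, Copy, PartialEq)]
struct Token { kind: TokenKind, span: Span }

#[derive(Default)]
struct Interner { ids: HashMap<String, Symbol>, names: Vec<String> }

impl Interner {
    /// Return the existing symbol for `s`, or allocate a new one.
    fn intern(&mut self, s: &str) -> Symbol {
        if let Some(&sym) = self.ids.get(s) {
            return sym;
        }
        let sym = Symbol(self.names.len() as u32);
        self.ids.insert(s.to_owned(), sym);
        self.names.push(s.to_owned());
        sym
    }

    fn resolve(&self, sym: Symbol) -> &str {
        &self.names[sym.0 as usize]
    }
}

/// Drive the low-level lexer over `src`, attaching a span to every
/// token, dropping whitespace, and interning identifiers.
fn tokenize(src: &str, interner: &mut Interner) -> Vec<Token> {
    let mut pos = 0;
    let mut out = Vec::new();
    while let Some(raw) = first_token(&src[pos..]) {
        let span = Span { lo: pos, hi: pos + raw.len };
        pos = span.hi;
        let kind = match raw.kind {
            RawKind::Whitespace => continue,
            RawKind::Ident => TokenKind::Ident(interner.intern(&src[span.lo..span.hi])),
            RawKind::Dot => TokenKind::Dot,
            RawKind::Plus => TokenKind::Plus,
            RawKind::Unknown => TokenKind::Unknown,
        };
        out.push(Token { kind, span });
    }
    out
}

fn main() {
    let src = "foo.bar + buz";
    let mut interner = Interner::default();
    for tok in tokenize(src, &mut interner) {
        match tok.kind {
            TokenKind::Ident(sym) => println!("ident {} at {:?}", interner.resolve(sym), tok.span),
            other => println!("{:?} at {:?}", other, tok.span),
        }
    }
}
```

Keeping the first layer free of compiler data structures means it can be reused and tested on plain strings, while span and interning logic stays in one place in the wrapper.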

[rustc_ast]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/index.html
[rustc_errors]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_errors/index.html
[ast]: https://en.wikipedia.org/wiki/Abstract_syntax_tree
[`Crate`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/ast/struct.Crate.html
[`Parser`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/parser/struct.Parser.html
[`ParseSess`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_session/parse/struct.ParseSess.html
[`rustc_lexer`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/index.html
[`SourceMap`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_span/source_map/struct.SourceMap.html
[`StringReader`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/lexer/struct.StringReader.html
[ast module]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/ast/index.html
[rustc_parse]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html
[parser_lib]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html
[ast]: ./ast-validation.md
[parser]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/parser/index.html
[`Parser`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/parser/struct.Parser.html
[`StringReader`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/lexer/struct.StringReader.html
[visit module]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/visit/index.html
[rustc_ast]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/index.html
[rustc_errors]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_errors/index.html
[rustc_parse]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html
[sourcefile]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_span/struct.SourceFile.html
[visit module]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/visit/index.html