Semantic IR #1812

jamiebuilds · 2021-11-23T02:03:01Z

Semantic IR

For Rome to fill the role of a complete language toolchain for JavaScript/TypeScript, from compiler to IDE, we'll likely need several semantic models of the language, each used in different places.

Examples: Scope Analysis, Call Hierarchy, Type Analysis, etc.

Right now we only have lexical (tokens) and syntactic (CST) intermediate representations of JS/TS code. But there are a number of additional intermediate representations that we'll likely want to explore:

Semantic IR
Dependency Graph
Control-Flow Graph
Type Graph
Optimizer IR

For now, I just want to focus on the Semantic IR (since it is the next thing in front of us to build).

Motivation

Because we don't have a semantic model today, if we attempted to build all the different IDE/compiler features at once, we'd end up repeating a lot of logic of translating syntax into semantics. That opens us up to more bugs and maintenance work.

It would also mean that some of our semantic queries would be harder to cache, because they'd depend directly on our CST/AST, which gets invalidated on every keystroke because it contains lexical information (tokens and source locations).

By introducing a Semantic IR, we can build our semantic queries without being concerned with syntax. This will simplify their implementations, and make them more cacheable across keystrokes.
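To make the caching argument concrete, here is a minimal sketch, not a proposed design. The `Token`, `Import`, `tokenize`, `lower`, and `fingerprint` names are all hypothetical, and the "language" is reduced to `import NAME from SOURCE`. It only shows that a location-free representation hashes to the same value when nothing but whitespace changes, while the token offsets do change.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Syntax-level token: carries lexical information (a byte offset).
#[derive(Debug, Clone)]
struct Token {
    text: String,
    offset: usize,
}

/// Location-free semantic fact: "this file imports `name` from `source`".
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Import {
    name: String,
    source: String,
}

/// Hypothetical tokenizer: splits on whitespace and records byte offsets.
fn tokenize(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut start = None;
    for (i, ch) in src.char_indices() {
        if ch.is_whitespace() {
            if let Some(s) = start.take() {
                tokens.push(Token { text: src[s..i].to_string(), offset: s });
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    if let Some(s) = start {
        tokens.push(Token { text: src[s..].to_string(), offset: s });
    }
    tokens
}

/// Hypothetical lowering: keeps the meaning, drops every offset.
fn lower(tokens: &[Token]) -> Vec<Import> {
    tokens
        .windows(4)
        .filter(|w| w[0].text == "import" && w[2].text == "from")
        .map(|w| Import { name: w[1].text.clone(), source: w[3].text.clone() })
        .collect()
}

/// Cache key for everything derived from the semantic facts.
fn fingerprint(imports: &[Import]) -> u64 {
    let mut hasher = DefaultHasher::new();
    imports.hash(&mut hasher);
    hasher.finish()
}
```

A query keyed on `fingerprint` would not need to rerun after a reformat, whereas anything keyed on the tokens themselves would.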

Requirements

I'm not going to jump into a design right away. Instead, I'd like to consider the requirements we want to set for the Semantic IR.

These requirements are largely just intuition, and likely have tradeoffs with one another, so please poke any holes in this by considering how other parts of the tooling would create/make use of the Semantic IR.

That said, these are the goals I've come up with:

It should be trivial and fast to construct semantic information on top of it (e.g. scopes, imports/exports, etc.).
- This includes being able to construct the other IRs mentioned above on top of it.
It should be highly cacheable beyond the caching of our CST.
- This likely means that it does not contain any lexical information including source locations.
It should be incrementally/partially built so that queries can compute/recompute the minimal amount of information needed.
- This likely means that it will use files as boundaries, and possibly functions/blocks/branches.
It should enable highly parallel systems to calculate semantic information at the same time.
- This means that the data structure itself needs to be thread-safe, and likely means it's immutable.
It should be able to represent invalid programs so that we can continue to answer semantic queries while the program is being worked on.
- This likely means it will work similarly to the AST in terms of returning results in many places.
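As a rough illustration of the thread-safety goal above, the sketch below shares one immutable per-file IR between two queries running on separate threads. `FileIr`, `count_imports`, `exported_names`, and `run_queries_in_parallel` are made-up names for illustration only. The point is that an immutable value behind an `Arc` needs no locking.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical per-file IR. It is never mutated after construction,
/// so it can be shared across threads without any locks.
#[derive(Debug)]
struct FileIr {
    imports: Vec<String>,
    exports: Vec<String>,
}

/// Query 1: how many imports does the file have?
fn count_imports(ir: &FileIr) -> usize {
    ir.imports.len()
}

/// Query 2: the exported names, sorted for stable output.
fn exported_names(ir: &FileIr) -> Vec<String> {
    let mut names = ir.exports.clone();
    names.sort();
    names
}

/// Runs both queries at the same time against the same IR.
fn run_queries_in_parallel(ir: &Arc<FileIr>) -> (usize, Vec<String>) {
    let for_imports = Arc::clone(ir);
    let imports = thread::spawn(move || count_imports(&for_imports));

    let for_exports = Arc::clone(ir);
    let exports = thread::spawn(move || exported_names(&for_exports));

    (
        imports.join().expect("import query panicked"),
        exports.join().expect("export query panicked"),
    )
}
```

If files are the unit of invalidation, an edit would replace one `Arc<FileIr>` with a new one, and queries still running against the old value would stay consistent.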

Some explicit non-goals:

It shouldn't try to do too much. It's there as a convenient intermediate step between syntax and other semantic information/IRs, not as a direct source of answers.
It shouldn't try to be language independent. By necessity, that would require us to work in a really abstract, lowered semantic model, which I'm not convinced would translate to HTML/templating languages or CSS/preprocessors in any useful way.

Other details:

It should prioritize fast translation of syntax into semantics over translation of semantics back into syntax (although the latter should still be possible).
- Everything will be dependent on being able to quickly compute semantic information. Only some of the time will we need to update syntax (that we don't already have a reference to) based on semantics.

Inspiration

Rust Analyzer HIR

Rust Analyzer has a High-Level Intermediate Representation (HIR) that aims to fulfill similar needs. However, a significant part of it is designed around expanding Rust macros which we don't need for JS/TS.

At its core, RA's HIR breaks down a source file into an ItemTree with all of the semantic elements broken out into indexed ECS-like (Entity-Component-System) arenas. The program is further broken down into function bodies, blocks, statements, and expressions, all using the same indexed arenas.

fn identity<T>(value: T) -> T {
  return value;
}

fn test() {
  let result = identity(42);
}

// HIR expressions in the body of `test`:
Idx::<Expr>(0): Path(Path { type_anchor: None, mod_path: ModPath { kind: Plain, segments: [Name(Text("identity"))] }, generic_args: [None] })
Idx::<Expr>(1): Literal(Uint(42, None))
Idx::<Expr>(2): Call { callee: Idx::<Expr>(0), args: [Idx::<Expr>(1)] }
Idx::<Expr>(3): Block { id: BlockId(82), statements: [Let { pat: Idx::<Pat>(0), type_ref: None, initializer: Some(Idx::<Expr>(2)), else_branch: None }], tail: None, label: None }

// HIR expressions in the body of `identity`:
Idx::<Expr>(0): Path(Path { type_anchor: None, mod_path: ModPath { kind: Plain, segments: [Name(Text("value"))] }, generic_args: [None] })
Idx::<Expr>(1): Return { expr: Some(Idx::<Expr>(0)) }
Idx::<Expr>(2): Block { id: BlockId(100), statements: [Expr { expr: Idx::<Expr>(1), has_semi: true }], tail: None, label: None }
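The `Idx::<Expr>(n)` entries in the dump above are indexes into an arena. Below is a minimal sketch of that layout using only the standard library. It is not Rust Analyzer's actual arena crate: `Idx`, `Arena`, the cut-down `Expr`, and `lower_identity_call` are simplified stand-ins written for illustration.

```rust
use std::fmt;
use std::marker::PhantomData;

/// Typed index into an `Arena<T>`. `PhantomData` ties the index to the
/// element type without storing a `T`.
struct Idx<T> {
    raw: u32,
    _ty: PhantomData<fn() -> T>,
}

// Manual impls so that `Idx<T>` is Copy/Eq/Debug for any `T`.
impl<T> Clone for Idx<T> {
    fn clone(&self) -> Self {
        *self
    }
}
impl<T> Copy for Idx<T> {}
impl<T> PartialEq for Idx<T> {
    fn eq(&self, other: &Self) -> bool {
        self.raw == other.raw
    }
}
impl<T> fmt::Debug for Idx<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Idx({})", self.raw)
    }
}

/// Append-only storage: nodes refer to each other by index, not pointer.
struct Arena<T> {
    items: Vec<T>,
}

impl<T> Arena<T> {
    fn new() -> Self {
        Arena { items: Vec::new() }
    }

    fn alloc(&mut self, value: T) -> Idx<T> {
        let raw = self.items.len() as u32;
        self.items.push(value);
        Idx { raw, _ty: PhantomData }
    }

    fn get(&self, idx: Idx<T>) -> &T {
        &self.items[idx.raw as usize]
    }
}

/// Heavily reduced expression type, just enough for `identity(42)`.
#[derive(Debug, PartialEq)]
enum Expr {
    Path(String),
    Literal(u64),
    Call { callee: Idx<Expr>, args: Vec<Idx<Expr>> },
}

/// Builds the three expressions of `identity(42)` in dump order.
fn lower_identity_call() -> (Arena<Expr>, Idx<Expr>) {
    let mut exprs = Arena::new();
    let callee = exprs.alloc(Expr::Path("identity".to_string()));
    let arg = exprs.alloc(Expr::Literal(42));
    let call = exprs.alloc(Expr::Call { callee, args: vec![arg] });
    (exprs, call)
}
```

Because children are plain integers, the arena is a flat `Vec` that is cheap to compare, hash, and share, which lines up with the caching and parallelism requirements.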

Some relevant bits of code:

ItemTreeData

[Source]

struct ItemTreeData {
    imports: Arena<Import>,
    extern_crates: Arena<ExternCrate>,
    extern_blocks: Arena<ExternBlock>,
    functions: Arena<Function>,
    params: Arena<Param>,
    structs: Arena<Struct>,
    fields: Arena<Field>,
    unions: Arena<Union>,
    enums: Arena<Enum>,
    variants: Arena<Variant>,
    consts: Arena<Const>,
    statics: Arena<Static>,
    traits: Arena<Trait>,
    impls: Arena<Impl>,
    type_aliases: Arena<TypeAlias>,
    mods: Arena<Mod>,
    macro_calls: Arena<MacroCall>,
    macro_rules: Arena<MacroRules>,
    macro_defs: Arena<MacroDef>,
    vis: ItemVisibilities,
    inner_items: FxHashMap<FileAstId<ast::BlockExpr>, SmallVec<[ModItem; 1]>>,
}

Function

[Source]

struct Function {
    pub name: Name,
    pub visibility: RawVisibilityId,
    pub explicit_generic_params: Interned<GenericParams>,
    pub abi: Option<Interned<str>>,
    pub params: IdxRange<Param>,
    pub ret_type: Interned<TypeRef>,
    pub async_ret_type: Option<Interned<TypeRef>>,
    pub ast_id: FileAstId<ast::Fn>,
    pub(crate) flags: FnFlags,
}

Body

[Source]

/// The body of an item (function, const etc.).
struct Body {
    pub exprs: Arena<Expr>,
    pub pats: Arena<Pat>,
    pub labels: Arena<Label>,
    /// The patterns for the function's parameters. While the parameter types are
    /// part of the function signature, the patterns are not (they don't change
    /// the external type of the function).
    ///
    /// If this `Body` is for the body of a constant, this will just be
    /// empty.
    pub params: Vec<PatId>,
    /// The `ExprId` of the actual body expression.
    pub body_expr: ExprId,
    /// Block expressions in this body that may contain inner items.
    block_scopes: Vec<BlockId>,
    _c: Count<Self>,
}

Expr

[Source]

pub enum Expr {
    /// This is produced if the syntax tree does not have a required expression piece.
    Missing,
    Path(Path),
    If {
        condition: ExprId,
        then_branch: ExprId,
        else_branch: Option<ExprId>,
    },
    Block {
        id: BlockId,
        statements: Box<[Statement]>,
        tail: Option<ExprId>,
        label: Option<LabelId>,
    },
    Loop {
        body: ExprId,
        label: Option<LabelId>,
    },
    While {
        condition: ExprId,
        body: ExprId,
        label: Option<LabelId>,
    },
    For {
        iterable: ExprId,
        pat: PatId,
        body: ExprId,
        label: Option<LabelId>,
    },
    Call {
        callee: ExprId,
        args: Box<[ExprId]>,
    },
    MethodCall {
        receiver: ExprId,
        method_name: Name,
        args: Box<[ExprId]>,
        generic_args: Option<Box<GenericArgs>>,
    },
    Match {
        expr: ExprId,
        arms: Box<[MatchArm]>,
    },
    Continue {
        label: Option<Name>,
    },
    Break {
        expr: Option<ExprId>,
        label: Option<Name>,
    },
    Return {
        expr: Option<ExprId>,
    },
    Yield {
        expr: Option<ExprId>,
    },
    RecordLit {
        path: Option<Box<Path>>,
        fields: Box<[RecordLitField]>,
        spread: Option<ExprId>,
    },
    Field {
        expr: ExprId,
        name: Name,
    },
    Await {
        expr: ExprId,
    },
    Try {
        expr: ExprId,
    },
    TryBlock {
        body: ExprId,
    },
    Async {
        body: ExprId,
    },
    Const {
        body: ExprId,
    },
    Cast {
        expr: ExprId,
        type_ref: Interned<TypeRef>,
    },
    Ref {
        expr: ExprId,
        rawness: Rawness,
        mutability: Mutability,
    },
    Box {
        expr: ExprId,
    },
    UnaryOp {
        expr: ExprId,
        op: UnaryOp,
    },
    BinaryOp {
        lhs: ExprId,
        rhs: ExprId,
        op: Option<BinaryOp>,
    },
    Range {
        lhs: Option<ExprId>,
        rhs: Option<ExprId>,
        range_type: RangeOp,
    },
    Index {
        base: ExprId,
        index: ExprId,
    },
    Lambda {
        args: Box<[PatId]>,
        arg_types: Box<[Option<Interned<TypeRef>>]>,
        ret_type: Option<Interned<TypeRef>>,
        body: ExprId,
    },
    Tuple {
        exprs: Box<[ExprId]>,
    },
    Unsafe {
        body: ExprId,
    },
    MacroStmts {
        tail: ExprId,
    },
    Array(Array),
    Literal(Literal),
}

Statement

[Source]

enum Statement {
    Let {
        pat: PatId,
        type_ref: Option<Interned<TypeRef>>,
        initializer: Option<ExprId>,
        else_branch: Option<ExprId>,
    },
    Expr {
        expr: ExprId,
        has_semi: bool,
    },
}

I think there's a lot to like about this approach. The entity-component-system model allows for very rich data structures, and breaking the indexes down seems to make it easier to incrementally build.

The one major concern I have is that looking up syntax from one of the semantic items is too slow. It requires iterating over all the HIR expressions and comparing their positions. However, we might be able to do something more optimized, since we don't have to do macro expansion or nearly as much lowering as Rust Analyzer does.
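One possible way to avoid that scan, sketched here purely as an assumption about what we could build, is a side table recorded during lowering that maps each expression ID to a syntax pointer and back. `ExprId`, `SyntaxPtr`, and `SourceMap` below are hypothetical names. The IR itself stays free of locations, and only this table would be invalidated when offsets shift.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ExprId(u32);

/// Hypothetical pointer back into the CST: a byte range.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct SyntaxPtr {
    start: u32,
    end: u32,
}

/// Side table kept next to the IR, filled in while lowering.
#[derive(Default)]
struct SourceMap {
    /// Indexed by `ExprId`, so semantic -> syntax is a direct lookup.
    expr_to_syntax: Vec<SyntaxPtr>,
    /// Reverse direction, for "what expression is under the cursor?".
    syntax_to_expr: HashMap<SyntaxPtr, ExprId>,
}

impl SourceMap {
    /// Called once per lowered expression; returns the ID assigned to it.
    fn record(&mut self, ptr: SyntaxPtr) -> ExprId {
        let id = ExprId(self.expr_to_syntax.len() as u32);
        self.expr_to_syntax.push(ptr);
        self.syntax_to_expr.insert(ptr, id);
        id
    }

    fn syntax_of(&self, id: ExprId) -> Option<SyntaxPtr> {
        self.expr_to_syntax.get(id.0 as usize).copied()
    }

    fn expr_at(&self, ptr: SyntaxPtr) -> Option<ExprId> {
        self.syntax_to_expr.get(&ptr).copied()
    }
}
```

The tradeoff is extra memory per file plus a second structure to keep in sync during lowering, in exchange for constant-time lookups in both directions.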

One nice bit of this system is that you can implement fairly universal logic to traverse the data structure and get the control flow (and other semantic information) for free.
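Here is a small sketch of what such universal traversal logic could look like over a flat expression list. The `Expr` below is a tiny hypothetical subset, not the real enum, and `children`, `walk`, and `preorder` are illustrative names. This yields a pre-order visit sequence, not a full control-flow graph, and it keeps working when the program is invalid because `Missing` is just another leaf.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ExprId(usize);

#[derive(Debug)]
enum Expr {
    /// Placeholder for a piece the syntax tree did not contain.
    Missing,
    Literal(i64),
    If { condition: ExprId, then_branch: ExprId, else_branch: Option<ExprId> },
    Call { callee: ExprId, args: Vec<ExprId> },
    Return { expr: Option<ExprId> },
}

/// The only function that needs to know the shape of every variant.
fn children(expr: &Expr) -> Vec<ExprId> {
    match expr {
        Expr::Missing | Expr::Literal(_) => Vec::new(),
        Expr::If { condition, then_branch, else_branch } => {
            let mut out = vec![*condition, *then_branch];
            out.extend(*else_branch);
            out
        }
        Expr::Call { callee, args } => {
            let mut out = vec![*callee];
            out.extend(args.iter().copied());
            out
        }
        Expr::Return { expr } => expr.iter().copied().collect(),
    }
}

/// Generic pre-order walk; analyses plug in through `visit`.
fn walk(exprs: &[Expr], root: ExprId, visit: &mut dyn FnMut(ExprId, &Expr)) {
    let mut stack = vec![root];
    while let Some(id) = stack.pop() {
        let expr = &exprs[id.0];
        visit(id, expr);
        // Push in reverse so children are visited left to right.
        stack.extend(children(expr).into_iter().rev());
    }
}

/// Example analysis built on `walk`: the order expressions are visited in.
fn preorder(exprs: &[Expr], root: ExprId) -> Vec<ExprId> {
    let mut order = Vec::new();
    walk(exprs, root, &mut |id, _expr| order.push(id));
    order
}
```

Other analyses, such as collecting every `Return` or flagging `Missing` nodes, would reuse `walk` without touching `children`.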

The ECS model is also known for enabling highly parallel systems.

What now?

I'd like for us all to keep adding to this exploration and refine the requirements. After that we can start doing design and experimentation work.

MichaReiser · 2021-11-24T16:10:39Z

Thanks, @jamiebuilds for this excellent write-up.

Regarding the goals:

Do we have some more specific analysis in mind that would run on top of the HIR?
Do you think we should have a HIR for every language or should it support different languages?

2 replies

ematipico Nov 24, 2021

I think we should have a HIR for every language. Semantics, terminologies, etc. differ from language to language. CSS has concepts like rules, selectors, ancestors, etc. while JS has others such as parameters, functions, scopes, etc.

I think trying to unify everything in a single HIR is not the correct choice, although we can "share" some concepts between languages, such as "imports". CSS and JS both have them and, if feasible, we might consider sharing the same logic/information. This might help us too in cases where we want to import a CSS file from a JS file.

jamiebuilds Nov 28, 2021
Author

I think it may make sense to have some common foundation and patterns to how we build HIRs, but then each language defines its own unique semantic concepts. This works well with the "component" in an ECS pattern. But having shared characteristics may make it easier for us to build analysis on top that crosses language boundaries.

For other types of IRs, specifically the ones that cross module/file boundaries (or don't care about module boundaries at all), it may make more sense to define them in ways that are "language agnostic" or in a way that spans all of our supported languages. But we're not there yet, so let's ignore it for now.

Regarding "Do we have some more specific analysis in mind that would run on top of the HIR?":

Here's an incomplete list of things I'd like it to help with:

Scopes: This can identify where bindings are created and referenced. It can't solve problems like "Find all references", but it helps with code transformations (lowering syntax features, minifying, code refactorings, etc).
Control Flow: Linearizing evaluation steps and representing forks neatly will make it easier to build a control flow graph, and will enable lint rules and refactorings that only need local control flow (see ESLint's code path analysis).
Modules: Imports (static and dynamic), exports, URLs, file references, etc all need to be found and understood. Finding an export is easy, finding what was exported requires some understanding of references and control flow. A smarter module graph will make tree shaking a lot easier.

All of these together should also make it significantly easier to build other IRs on top of the Semantic IR. Building a module graph just means merging the Semantic IR's imports/exports, building a control flow graph just means merging the Semantic IR's control flow, etc.
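To illustrate the "just means merging" idea for modules, here is a hedged sketch that assumes each file's Semantic IR can be summarized as a list of already-resolved imports and exported names. `Import`, `FileSummary`, `build_module_graph`, and `unresolved_imports` are hypothetical names, and real module resolution (re-exports, dynamic imports, URLs) is ignored.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical import record: `name` imported from the file at `from`.
struct Import {
    from: String,
    name: String,
}

/// Hypothetical per-file summary extracted from the Semantic IR.
struct FileSummary {
    path: String,
    imports: Vec<Import>,
    exports: Vec<String>,
}

/// Merges per-file imports into a graph: file -> files it depends on.
fn build_module_graph(files: &[FileSummary]) -> BTreeMap<String, BTreeSet<String>> {
    let mut graph = BTreeMap::new();
    for file in files {
        let deps: BTreeSet<String> =
            file.imports.iter().map(|import| import.from.clone()).collect();
        graph.insert(file.path.clone(), deps);
    }
    graph
}

/// Cross-checks imports against exports and returns `(importer, name)`
/// pairs where the target file does not export the requested name.
fn unresolved_imports(files: &[FileSummary]) -> Vec<(String, String)> {
    let exports: BTreeMap<&str, BTreeSet<&str>> = files
        .iter()
        .map(|f| {
            let names = f.exports.iter().map(String::as_str).collect::<BTreeSet<&str>>();
            (f.path.as_str(), names)
        })
        .collect();

    let mut missing = Vec::new();
    for file in files {
        for import in &file.imports {
            let found = exports
                .get(import.from.as_str())
                .map_or(false, |names| names.contains(import.name.as_str()));
            if !found {
                missing.push((file.path.clone(), import.name.clone()));
            }
        }
    }
    missing
}
```

Neither function looks at syntax or at any other file's internals, so a change inside one file would only require recomputing that file's summary before merging again.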
