Semantic Model Architecture #2614

xunilrj · 2022-05-25T10:03:33Z

xunilrj
May 25, 2022

We have a discussion on how we want to use the semantic model inside linters: #2603
This will be implemented by #2488

This discussion is more specifically about the API around all the semantic data.

We have two options, which can coexist.

1 - We offer a façade with specific methods that give access to the whole semantic model.
2 - We offer a semantic tree in which each node gives access to the common semantic functions for that specific node, for example SemanticJsFunctionDeclaration.

xunilrj · 2022-05-25T10:03:48Z

xunilrj
May 25, 2022
Author

Semantic Façade

This is what Roslyn does:

var tree = CSharpSyntaxTree.ParseText(@"
	public class MyClass 
	{
		int MyMethod() { return 0; }
	}");
var compilation = CSharpCompilation.Create("MyCompilation", ...);
var model = compilation.GetSemanticModel(tree);

Here the model variable is a SemanticModel (https://docs.microsoft.com/en-us/dotnet/api/microsoft.codeanalysis.semanticmodel), which is where all the (semantic) fun is.

So to implement the "go to definition" that we implemented in the LSP, we would do:

var node = GetSyntaxNodeAt(tree, lineNumber, columnNumber);
var symbol = model.GetSymbolInfo(node).Symbol;
var declaredAt = symbol.OriginalDefinition.Locations.Single();
Console.WriteLine(declaredAt);

In the code above we used: https://docs.microsoft.com/en-us/dotnet/api/microsoft.codeanalysis.modelextensions.getsymbolinfo
In our case, we could have:

let result = parse(...);
let model: SemanticModel = result.root().semantic_model(...);

let node = get_syntaxnode_at(result.root(), line_number, column_number);
let semantic_info = model.get(&node);
dbg!(semantic_info.declared_at());
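
To make the shape of this façade a bit more concrete, here is a minimal sketch with toy types. NodeId, TextRange, SemanticInfo, the HashMap-backed storage and example_model are all assumptions made up for illustration; none of them exist in the codebase today.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a syntax node identity (not the real SyntaxNode).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct NodeId(u32);

/// Hypothetical text range of a declaration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TextRange {
    start: u32,
    end: u32,
}

/// What the façade hands back for a node.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SemanticInfo {
    declared_at: TextRange,
}

impl SemanticInfo {
    fn declared_at(&self) -> TextRange {
        self.declared_at
    }
}

/// The façade: a single object that answers semantic questions about nodes.
#[derive(Default)]
struct SemanticModel {
    // Maps a reference node to the range of its declaration.
    declarations: HashMap<NodeId, TextRange>,
}

impl SemanticModel {
    fn record(&mut self, reference: NodeId, declared_at: TextRange) {
        self.declarations.insert(reference, declared_at);
    }

    /// Returns None for nodes that have no semantic information.
    fn get(&self, node: &NodeId) -> Option<SemanticInfo> {
        self.declarations
            .get(node)
            .map(|&declared_at| SemanticInfo { declared_at })
    }
}

/// Builds a tiny model in which node 7 refers to a declaration at 4..5.
fn example_model() -> SemanticModel {
    let mut model = SemanticModel::default();
    model.record(NodeId(7), TextRange { start: 4, end: 5 });
    model
}
```

Returning an Option from get is just one possible choice for nodes that carry no semantic information.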

Roslyn also uses the semantic model as the façade for control/data flow. We can find unused variables doing this:

// Get a list of statements to run the dataflow analysis

var someMethod = tree.GetRoot().DescendantNodes().OfType<MethodDeclarationSyntax>().First();
var start = someMethod.Body.Statements.First();
var end = someMethod.Body.Statements.Last();

// Run the data flow
var df = model.AnalyzeDataFlow(start, end);

// Find which variables are not used
var unused = new System.Collections.Generic.HashSet<string>();

Console.WriteLine("Variables Declared");
Console.WriteLine("-----------");
foreach (var symbol in df.VariablesDeclared)
{
    unused.Add(symbol.Name);
    Console.WriteLine(symbol.Name);
}

Console.WriteLine();
Console.WriteLine("Read Inside");
Console.WriteLine("-----------");
foreach (var symbol in df.ReadInside)
{
    unused.Remove(symbol.Name);
    Console.WriteLine(symbol.Name);
}

Console.WriteLine();
Console.WriteLine("Unused Variables");
Console.WriteLine("-----------");
foreach (var symbol in unused)
{
    Console.WriteLine(symbol);
}

In our case, this could be

let result = parse(...);
let decl = get_function_syntax_node(result.root()).cast::<JsFunctionDeclaration>();
// The iterator must be mutable because next() advances it
let mut statements = decl.body().statements().into_iter();
let first = statements.next();
let last = statements.last();

let model: SemanticModel = result.root().semantic_model(...);
let df = model.dataflow(first, last);

// same strategy here
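
The "same strategy" is the declared-minus-read set difference from the C# example. A minimal sketch, assuming a hypothetical DataFlow result that exposes the two lists by name (the real result type is not designed yet):

```rust
use std::collections::BTreeSet;

/// Hypothetical result of a data flow run over a range of statements.
struct DataFlow {
    variables_declared: Vec<String>,
    read_inside: Vec<String>,
}

/// Variables that are declared in the range but never read inside it.
fn unused_variables(df: &DataFlow) -> BTreeSet<String> {
    // Start by assuming every declared variable is unused...
    let mut unused: BTreeSet<String> = df.variables_declared.iter().cloned().collect();
    // ...then drop each one that is read at least once.
    for name in &df.read_inside {
        unused.remove(name);
    }
    unused
}

/// Toy input: `a` and `b` are declared, only `a` is read.
fn example() -> BTreeSet<String> {
    let df = DataFlow {
        variables_declared: vec!["a".to_string(), "b".to_string()],
        read_inside: vec!["a".to_string()],
    };
    unused_variables(&df)
}
```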

Of course we can split this into multiple façades if needed:

let analyzer: DataFlowAnalyzer = result.root().data_flow_analyzer(...);
let df = analyzer.run(first, last);

One important detail is that in the examples above I used ... as the parameter. This is because we may need to pass "something" into these methods to give the SemanticModel and/or other structs access to other services (such as queries, caches, etc.).
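
One possible shape for that "something" is a shared handle to a bag of services. The sketch below is only an assumption about how it could look: Services, its cache field and name_len are invented for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical bag of shared services (only a cache here; queries could sit next to it).
#[derive(Default)]
struct Services {
    cache: Mutex<HashMap<String, usize>>,
}

/// The model keeps a cheap, shared handle to the services it was created with.
struct SemanticModel {
    services: Arc<Services>,
}

impl SemanticModel {
    fn new(services: Arc<Services>) -> Self {
        Self { services }
    }

    /// Toy "query": counts the characters of a name, memoized in the shared cache.
    fn name_len(&self, name: &str) -> usize {
        let mut cache = self.services.cache.lock().unwrap();
        let len = *cache
            .entry(name.to_string())
            .or_insert_with(|| name.chars().count());
        len
    }
}

/// Two models built from the same services share one cache entry.
fn example() -> (usize, usize) {
    let services = Arc::new(Services::default());
    let model_a = SemanticModel::new(Arc::clone(&services));
    let model_b = SemanticModel::new(Arc::clone(&services));
    let len = model_a.name_len("foo");
    // model_b hits the entry that was cached through model_a.
    let _ = model_b.name_len("foo");
    let cached_entries = services.cache.lock().unwrap().len();
    (len, cached_entries)
}
```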

3 replies

MichaReiser May 25, 2022

My understanding is that Roslyn has something similar to a Semantic Tree (at least as I understand the term), in the sense that there are different ISymbol implementations. For example, there's a type symbol that allows querying all of its members (and the interfaces it implements, etc.). link.

xunilrj May 25, 2022
Author

True. I expanded on how we could model them in the "Semantic Façade" here #2614 (comment)

ematipico May 26, 2022

Using a façade is very familiar, like what we do with the actual green/red trees. Still, I need to understand what the difference is between this option and the other. I would tend to say that the semantic tree explained in your comment makes more sense to me.

xunilrj · 2022-05-25T10:04:06Z

xunilrj
May 25, 2022
Author

Semantic Tree

The semantic tree is probably the most obvious step because we already have SyntaxNodes and AstNodes. In this case, we are talking about:

let result = parse(...);

let node = get_syntaxnode_at(result.root(), line_number, column_number);
let semantic_info = node.semantic(...);
dbg!(semantic_info.declared_at());

and the unused case would be:

let result = parse(...);
let decl = get_function_syntax_node(result.root()).cast::<JsFunctionDeclaration>();
// The iterator must be mutable because next() advances it
let mut statements = decl.body().statements().into_iter();
let first = statements.next();
let last = statements.last();

let start = first.semantic(...);
let df = start.dataflow_until(last);

Nothing stops us from having both. They would just be different ways to access the semantic data.
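
A minimal sketch of the two access styles coexisting, where the tree-style node simply delegates to the façade. The SyntaxNode stand-in, the HashMap storage and the use of Rc are assumptions for illustration only.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical stand-in for the real SyntaxNode.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct SyntaxNode(u32);

#[derive(Default)]
struct SemanticModel {
    // node -> offset of its declaration
    declared_at: HashMap<SyntaxNode, u32>,
}

impl SemanticModel {
    /// Façade-style access: ask the model about a node.
    fn declared_at(&self, node: SyntaxNode) -> Option<u32> {
        self.declared_at.get(&node).copied()
    }
}

/// Tree-style access: a node paired with the model it belongs to.
struct SemanticNode {
    model: Rc<SemanticModel>,
    node: SyntaxNode,
}

impl SemanticNode {
    fn declared_at(&self) -> Option<u32> {
        // Delegates to the façade, so both styles return the same answer.
        self.model.declared_at(self.node)
    }
}

impl SyntaxNode {
    fn semantic(self, model: Rc<SemanticModel>) -> SemanticNode {
        SemanticNode { model, node: self }
    }
}

/// Looks up the same node through both styles.
fn example() -> (Option<u32>, Option<u32>) {
    let mut model = SemanticModel::default();
    model.declared_at.insert(SyntaxNode(1), 10);
    let model = Rc::new(model);
    let via_facade = model.declared_at(SyntaxNode(1));
    let via_tree = SyntaxNode(1).semantic(Rc::clone(&model)).declared_at();
    (via_facade, via_tree)
}
```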

The "catch" here is that we also need to pass "something" into these functions. This happens because there is no easy way to make SyntaxNodes or the whole tree hold context data.

Today this is how our tree is stored in memory:

print-type-size type: `cursor::node::SyntaxNode`: 8 bytes, alignment: 8 bytes
print-type-size     field `.ptr`: 8 bytes // Rc<NodeData>

print-type-size type: `std::rc::Rc<cursor::NodeData>`: 8 bytes, alignment: 8 bytes
print-type-size     field `.phantom`: 0 bytes
print-type-size     field `.ptr`: 8 bytes // Indirect pointer to NodeData

This means that SyntaxNodes are just pointers to NodeData, which is much more complex.

print-type-size type: `cursor::NodeData`: 40 bytes, alignment: 8 bytes
print-type-size     field `._c`: 0 bytes
print-type-size     field `.kind`: 32 bytes //NodeKind
print-type-size     field `.slot`: 4 bytes
print-type-size     field `.offset`: 4 bytes

print-type-size type: `cursor::NodeKind`: 32 bytes, alignment: 8 bytes
print-type-size     discriminant: 8 bytes
print-type-size     variant `Child`: 24 bytes
print-type-size         field `.green`: 16 bytes  //WeakGreen
print-type-size         field `.parent`: 8 bytes
print-type-size     variant `Root`: 16 bytes
print-type-size         field `.green`: 16 bytes  // NodeOrToken

print-type-size type: `utility_types::NodeOrToken<&green::node::GreenNode, &green::token::GreenToken>`: 16 bytes, alignment: 8 bytes
print-type-size     discriminant: 8 bytes
print-type-size     variant `Node`: 8 bytes
print-type-size         field `.0`: 8 bytes // GreenNode is transparent to ThinArc<GreenNodeHead, Slot>
print-type-size     variant `Token`: 8 bytes
print-type-size         field `.0`: 8 bytes // GreenToken

print-type-size type: `arc::ThinArc<green::node::GreenNodeHead, green::node::Slot>`: 8 bytes, alignment: 8 bytes
print-type-size     field `.phantom`: 0 bytes
print-type-size     field `.ptr`: 8 bytes // Indirect pointer to HeaderSlice<GreenNodeHead, [Slot]>

print-type-size type: `arc::ArcInner<arc::HeaderSlice<green::node::GreenNodeHead, [green::node::Slot]>>`: 24 bytes, alignment: 8 bytes
print-type-size     field `.count`: 8 bytes
print-type-size     field `.data`: 16 bytes // Indirect pointer to HeaderSlice<GreenNodeHead, [Slot]>

So we have something like this:


                                                                     │
  Red World                                                          │ Green World
                                                                     │
                                                                     │
                                                                     │
  ┌──────────────────────┐    ┌───────────────────────────────────┐  │
  │ SyntaxNode    8bytes │    │ NodeData                  40bytes │  │
  │                      │    │                                   │  │   ┌───────────────────────────────────┐
  │   Rc<NodeData>───────┼───►│  ...                              │  │   │                                   │
  │                      │    │  ThinArc<GreenNodeHead,[Slot]>────┼──┼──►│ HeaderSlice<GreenNodeHead,[Slot]> │◄────┐
  └──────────────────────┘    │  ...                              │  │   │                                   │     │
                              │                                   │  │   └───────────────────────────────────┘     │
                              └───────────────────────────────────┘  │                                             │
                                                                     │     ┌──────────────────────────────────┐    │
                                                                     │     │ Cache                            │    │
                                                                     │     │                                  │    │
                                                                     │     │ ┌───┐ ┌─────┐ ┌──────┐ ┌───────┐ │    │
                                                                     │     │ │   │ │     │ │      │ │       │ │    │
                                                                     │     │ │   │ │     │ │      │ │ Node  ├─┼────┘
                                                                     │     │ │   │ │     │ │      │ │       │ │
                                                                     │     │ └───┘ └─────┘ └──────┘ └───────┘ │
                                                                     │     │                                  │
                                                                     │     └──────────────────────────────────┘
                                                                     │

The "green world" is cached; and thus, if we insert a "service pointer" there, we would kill the ability to cache nodes from different workspaces. In the red world, we have two options:

SyntaxNode
NodeData

SyntaxNode is a very simple struct that is cheap to clone. This means that we can have many more instances of SyntaxNode than of NodeData, which suggests that the best place to store a "service" pointer is the NodeData struct.

#[derive(Clone)]
pub(crate) struct SyntaxNode {
    pub(super) ptr: Rc<NodeData>,
}

#[derive(Debug)]
struct NodeData {
    _c: Count<_SyntaxElement>,
    kind: NodeKind,
    slot: u32,
    offset: TextSize,
}

Including an Arc<...> here would increase the NodeData to 48 bytes, but this would not affect all the SyntaxNode clones that we have.

print-type-size type: `cursor::NodeData`: 48 bytes, alignment: 8 bytes
print-type-size     field `._c`: 0 bytes
print-type-size     field `.kind`: 32 bytes
print-type-size     field `.services`: 8 bytes // <-------------- New
print-type-size     field `.slot`: 4 bytes
print-type-size     field `.offset`: 4 bytes
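
The 8 bytes shown for .services match a handle to a concrete, sized type. If the field ended up being a trait object instead (for example a hypothetical Arc<dyn Any>), the handle would be a fat pointer and take two words. A small check, where Services is a made-up placeholder type:

```rust
use std::any::Any;
use std::mem::size_of;
use std::sync::Arc;

/// Hypothetical concrete services type.
struct Services;

/// Size in bytes of a handle to a concrete, sized type (a thin pointer).
fn thin_handle_size() -> usize {
    size_of::<Arc<Services>>()
}

/// Size in bytes of a type-erased handle (data pointer plus vtable pointer).
fn erased_handle_size() -> usize {
    size_of::<Arc<dyn Any>>()
}
```

On a 64-bit target that is 8 versus 16 bytes, so the type-erased variant would likely make NodeData larger than the 48 bytes printed above.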

Challenges for this solution would be:
1 - We can detach a node. We need to decide what to do. Probably what makes more sense is to remove the service, and add it back when we attach the node back. In this way, we guarantee all nodes in a tree point to the same services.

2 - How to type this data. We can make NodeData generic in T, but this will infect everyone using it: cursor::SyntaxNode, syntax::SyntaxNode, AstNode and all generated nodes. This seems annoying.

We could type-erase this context by storing it as dyn Any and give it a generic name like tag.
See https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=fe395fa674a9129c14585803e5d5a0cb

This gives us a "cheap clone" and no mutation: either the object allows mutation through &self, or we store a Mutex (or something similar) inside it. But I think NodeData can ignore this as a detail.

use std::sync::*;
use std::any::*;

#[derive(Debug)]
struct ServiceContext  {
    
}

impl ServiceContext {
    pub fn new() -> Self {
        Self {}
    }
    
    pub fn to_arc(self) -> Arc<dyn Any> {
        Arc::new(self)
    }
}

struct NodeData{
    tag: Arc<dyn Any>
}

impl NodeData {
    pub fn tag<T: 'static>(&self) -> Option<&T> {
        self.tag.downcast_ref()
    }
}


fn main() {
    let ctx = ServiceContext::new().to_arc();
    let node = NodeData {
        tag: ctx
    };
    println!("{:?}", node.tag::<ServiceContext>());
}
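
As a follow-up to the "mutation through &self" remark, here is a variant of the snippet above in which the context keeps its state behind a Mutex. The lookups counter and record_lookup are invented for illustration; NodeData still knows nothing about them.

```rust
use std::any::Any;
use std::sync::{Arc, Mutex};

/// Hypothetical service context whose state is mutated through `&self`.
#[derive(Debug, Default)]
struct ServiceContext {
    lookups: Mutex<u32>,
}

impl ServiceContext {
    /// Increments the counter and returns the new value.
    fn record_lookup(&self) -> u32 {
        let mut lookups = self.lookups.lock().unwrap();
        *lookups += 1;
        *lookups
    }
}

struct NodeData {
    tag: Arc<dyn Any>,
}

impl NodeData {
    /// Returns the tag if it has type T, None otherwise.
    fn tag<T: 'static>(&self) -> Option<&T> {
        self.tag.downcast_ref()
    }
}

/// Two nodes share one context; a mutation made via one is visible via the other.
fn example() -> (u32, bool) {
    let ctx: Arc<dyn Any> = Arc::new(ServiceContext::default());
    let a = NodeData { tag: Arc::clone(&ctx) };
    let b = NodeData { tag: ctx };
    a.tag::<ServiceContext>().unwrap().record_lookup();
    let count = b.tag::<ServiceContext>().unwrap().record_lookup();
    // Asking for the wrong type yields None instead of panicking.
    let wrong_type_is_none = b.tag::<String>().is_none();
    (count, wrong_type_is_none)
}
```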

4 replies

MichaReiser May 25, 2022

Thanks for writing this up.

I don't think I'm entirely following the approach described here. Are you proposing that each node has a "semantic" sibling node that provides access to its semantic information? What information would such a node store?

xunilrj May 25, 2022
Author

I am not sure how useful it is for every node to have a semantic sibling. What would be the reason for a SemanticJsArrayHole, for example? Maybe there is one. But I would ignore these less obvious cases for now.

I think that SyntaxNode::semantic() would return an untyped SemanticNode. This contains access to the semantic model.
And as we can cast SyntaxNode to JsFunctionDeclaration, we would be able to cast SemanticNode to SemanticJsFunctionDeclaration.

This could be:

pub struct SemanticJsFunctionDeclaration {
    node: SemanticNode
}

pub enum AnyCallable {
    FunctionDeclaration(SemanticJsFunctionDeclaration),
    ...
}

pub struct SemanticJsCallExpression {
   node: SemanticNode
}

impl SemanticJsCallExpression {
    // Resolves the callee of this call to whatever callable it refers to
    pub fn callee(&self) -> Option<AnyCallable> {
        let node = self.node.cast::<JsCallExpression>()?;
        let syntax = self.node.model.solve_reference(node.callee()?);
        Some(AnyCallable::FunctionDeclaration(SemanticJsFunctionDeclaration {
            node: syntax.semantic(self.node.model.clone()),
        }))
    }
}

pub struct SemanticNode {
    model: Arc<SemanticModel>,
    node: SyntaxNode
}

and we would use this like:

let model = ...;
let result = parse(...);
let call = result.syntax().descendants().find(|node| node.kind() == JsSyntaxKind::JS_CALL_EXPRESSION)?.semantic(model);
let function = call.callee()?.as_function_declaration()?;
dbg!(function); // do whatever you need here
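
The cast from the untyped SemanticNode to a typed wrapper could mirror what AstNode does for syntax. A minimal sketch with toy types, where Kind and the fields of SemanticNode are stand-ins invented for this example:

```rust
/// Hypothetical node kinds; the real code would use JsSyntaxKind.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    FunctionDeclaration,
    CallExpression,
}

/// Untyped semantic node (toy version: just a kind and a name).
#[derive(Debug, Clone, PartialEq, Eq)]
struct SemanticNode {
    kind: Kind,
    name: String,
}

/// Typed wrapper that can only be built from a node of the right kind.
#[derive(Debug, PartialEq, Eq)]
struct SemanticJsFunctionDeclaration {
    node: SemanticNode,
}

impl SemanticJsFunctionDeclaration {
    /// Returns None when the node is not a function declaration.
    fn cast(node: SemanticNode) -> Option<Self> {
        if node.kind == Kind::FunctionDeclaration {
            Some(Self { node })
        } else {
            None
        }
    }

    fn name(&self) -> &str {
        &self.node.name
    }
}

/// Convenience constructor used to build toy nodes.
fn node(kind: Kind, name: &str) -> SemanticNode {
    SemanticNode { kind, name: name.to_string() }
}
```

Methods that only make sense for functions then live on the typed wrapper, so they cannot be called on, say, a call expression.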

MichaReiser May 25, 2022

Is the SemanticJsFunctionDeclaration then the equivalent of Roslyn's Function symbol?

xunilrj May 25, 2022
Author

Yes. For the "Semantic Tree" proposal, yes.

MichaReiser · 2022-05-25T12:03:34Z

MichaReiser
May 25, 2022

How would things like Symbols and Scope be represented in either of those approaches?

0 replies

xunilrj · 2022-05-25T14:35:18Z

xunilrj
May 25, 2022
Author

If we use the Roslyn definition for symbols:

Represents a symbol (namespace, class, method, parameter, etc.) exposed by the compiler.
https://docs.microsoft.com/en-us/dotnet/api/microsoft.codeanalysis.isymbol?view=roslyn-dotnet-4.2.0

In the "Semantic Façade" we would do something similar to Roslyn.

pub enum JsSymbol {
    Module(...),
    Class(...),
    Function(...),
    Variable(JsVariableSymbol),
    ...
}

We can do something very similar with scope. For example, this is how Roslyn does it:
https://github.com/dotnet/roslyn/blob/315c2e149ba7889b0937d872274c33fcbfe9af5f/src/Compilers/Core/Portable/CodeGen/LocalScopeManager.cs#L22

pub enum JsScope {
    Function { ... },
    If { ... },
    Try { ... },
    ...
}

And we could use it like this:

let var_decl_node = ...;
let symbol: JsSymbol  = model.symbol(var_decl_node);
if let JsSymbol::Variable(var_decl) = symbol {
    if var_decl.is_constant() {
        // do whatever you need here
    }
}
let parent_try = symbol.scope().ancestors().find(|x| x.is_try())?;
dbg!(parent_try); // do whatever you need here
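
The scope().ancestors().find(...) walk used above could work roughly as follows. This is a sketch under assumptions: scopes are stored in a flat Vec with parent indices, and ScopeKind, ScopeTree and enclosing_try are invented names. Note that in this sketch ancestors() yields the starting scope first.

```rust
/// Hypothetical scope kinds, loosely mirroring the JsScope enum above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScopeKind {
    Function,
    If,
    Try,
}

struct Scope {
    kind: ScopeKind,
    parent: Option<usize>,
}

/// Scopes stored in a flat list; `parent` is an index into that list.
struct ScopeTree {
    scopes: Vec<Scope>,
}

impl ScopeTree {
    /// Walks from `start` up to the root, yielding `start` itself first.
    fn ancestors(&self, start: usize) -> impl Iterator<Item = (usize, &Scope)> + '_ {
        let mut current = Some(start);
        std::iter::from_fn(move || {
            let id = current?;
            let scope = &self.scopes[id];
            current = scope.parent;
            Some((id, scope))
        })
    }

    /// Index of the closest enclosing `try` scope, if any.
    fn enclosing_try(&self, start: usize) -> Option<usize> {
        self.ancestors(start)
            .find(|(_, scope)| scope.kind == ScopeKind::Try)
            .map(|(id, _)| id)
    }
}

/// function { try { if { } } }
fn example_tree() -> ScopeTree {
    ScopeTree {
        scopes: vec![
            Scope { kind: ScopeKind::Function, parent: None },
            Scope { kind: ScopeKind::Try, parent: Some(0) },
            Scope { kind: ScopeKind::If, parent: Some(1) },
        ],
    }
}
```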

In the "Semantic Tree", instead of an enum, we would have:

pub struct SemanticJsModule { ... }
pub struct SemanticJsClassDeclaration { ... }
pub struct SemanticJsFunctionDeclaration { ... }
pub struct SemanticJsVariableDeclaration { ... }

Scope would be similar, and the usage would be:

let var_decl_node = ...;
let var_decl: SemanticJsVariableDeclaration = var_decl_node.cast::<JsVariableDeclaration>()?.semantic(model);
if var_decl.is_constant() {
    // do whatever you need here
}
let parent_try = var_decl.scope().ancestors().find(|x| x.is_try())?;
dbg!(parent_try); // do whatever you need here

2 replies

MichaReiser May 25, 2022

I guess the thing that confuses me is that semantic returns a SemanticModel which kind of is the symbol of that model?

But it seems it also exposes more, like the scope of a node.

I think I prefer the other architecture. Feels a bit cleaner from where I'm querying data. You want some semantic information, ask the model, rather than query the node and pass the model.
I'm also wondering whether it would let us have fewer data structures overall compared to this approach.

xunilrj May 25, 2022
Author

semantic returns a SemanticModel... which kind of is the symbol of that model?

You meant a SemanticNode?
I think the best description would be that the SemanticNode is the entry point for everything semantic about that node.

But I think the biggest difference between the "Semantic Façade" model, which follows Roslyn's approach, and the "Semantic Tree" is that the latter gives us a single entry point for all semantic information about that node (symbols, scope, types, reflection, etc.) and exposes only what you can safely do with that node.
With the façade, you need to know what to ask in order to get the information you need.

In a sense, the semantic tree can use the façade behind the scenes.

For example:

var tree = CSharpSyntaxTree.ParseText(@"
	public class MyClass {
		int Method1(int x) { return 0; }
		void Method2()
		{
			var y = 1;
			int x = Method1(2) * 2.0f;
		}
	}");

var compilation = CSharpCompilation.Create("MyCompilation",
    syntaxTrees: new[] { tree }, references: Enumerable.Empty<MetadataReference>());
var model = compilation.GetSemanticModel(tree);

var multiply = tree.GetRoot().DescendantNodes().OfType<BinaryExpressionSyntax>().First();

Console.WriteLine(multiply);

// These are not useful at all, but I can still call them with the BinaryExpression above
Console.WriteLine($"GetSymbolInfo: {model.GetSymbolInfo(multiply).CandidateSymbols.Length}");
Console.WriteLine($"GetDeclaredSymbol: {model.GetDeclaredSymbol(multiply) == null}");
Console.WriteLine($"GetAliasInfo: {model.GetAliasInfo(multiply) == null}");
Console.WriteLine($"GetCollectionInitializerSymbolInfo: {model.GetCollectionInitializerSymbolInfo(multiply).CandidateSymbols.Length}");
Console.WriteLine($"GetConstantValue: {model.GetConstantValue(multiply).HasValue}");
Console.WriteLine($"GetIndexerGroup: {model.GetIndexerGroup(multiply).Length}");
Console.WriteLine($"GetMemberGroup: {model.GetMemberGroup(multiply).Length}");
Console.WriteLine($"GetPreprocessingSymbolInfo: {model.GetPreprocessingSymbolInfo(multiply).Symbol == null}");

// These are ok
Console.WriteLine($"GetConversion: {model.GetConversion(multiply).Exists}");
Console.WriteLine($"GetOperation: {model.GetOperation(multiply)}");
Console.WriteLine($"GetTypeInfo: {model.GetTypeInfo(multiply).Type}");

leops · 2022-05-30T12:10:14Z

leops
May 30, 2022

Overall I think having a generic semantic tree + a type safe facade on top sounds like the best option as this is how we already work with syntax trees (generic Rowan SyntaxNode + SyntaxToken tree with a type safe AstNode facade on top), so the general design pattern would be familiar and we may be able to reuse some architectural concepts and data structures between the two as well.

For the generic tree I think having a single SemanticNode might be too generic, just like we have the two SyntaxNode and SyntaxToken "node types" in rome_rowan for the semantic tree we could have three SemanticDeclaration, SemanticReference and SemanticScope node types in the semantic tree with specialized accessor methods for each, for instance:

fn references(&self) -> Iterator<SemanticReference> for SemanticDeclaration
fn declarations(&self) -> Iterator<SemanticDeclaration> for SemanticScope
fn declaration(&self) -> SemanticDeclaration for SemanticReference

The last one raises an interesting question in how we solve externals eg. in document.getElementById() what declaration node does document resolve to ? Does it return an error, a special "unknown" semantic node, should the semantic model be initialized with an "environment" declaring some well-known globals ?
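
A minimal sketch of the three node types and their accessors, assuming the semantic data lives in flat tables and the nodes are thin index handles into them. All names besides SemanticDeclaration, SemanticReference and SemanticScope are hypothetical. In this sketch declaration() returns an Option so that an unresolved external such as document is representable, which is just one of the options mentioned above.

```rust
struct DeclarationData {
    name: &'static str,
    scope: usize,
}

struct ReferenceData {
    /// None when the reference could not be resolved (e.g. an external).
    declaration: Option<usize>,
}

/// Hypothetical flat storage behind the semantic tree.
struct SemanticData {
    declarations: Vec<DeclarationData>,
    references: Vec<ReferenceData>,
}

#[derive(Clone, Copy)]
struct SemanticDeclaration<'a> { data: &'a SemanticData, index: usize }
#[derive(Clone, Copy)]
struct SemanticReference<'a> { data: &'a SemanticData, index: usize }
#[derive(Clone, Copy)]
struct SemanticScope<'a> { data: &'a SemanticData, index: usize }

impl<'a> SemanticScope<'a> {
    /// All declarations that live directly in this scope.
    fn declarations(self) -> impl Iterator<Item = SemanticDeclaration<'a>> + 'a {
        let (data, scope) = (self.data, self.index);
        data.declarations
            .iter()
            .enumerate()
            .filter(move |(_, d)| d.scope == scope)
            .map(move |(index, _)| SemanticDeclaration { data, index })
    }
}

impl<'a> SemanticDeclaration<'a> {
    fn name(self) -> &'static str {
        self.data.declarations[self.index].name
    }

    /// All references that resolve to this declaration.
    fn references(self) -> impl Iterator<Item = SemanticReference<'a>> + 'a {
        let (data, declaration) = (self.data, self.index);
        data.references
            .iter()
            .enumerate()
            .filter(move |(_, r)| r.declaration == Some(declaration))
            .map(move |(index, _)| SemanticReference { data, index })
    }
}

impl<'a> SemanticReference<'a> {
    /// The declaration this reference resolves to, if it was resolved.
    fn declaration(self) -> Option<SemanticDeclaration<'a>> {
        let data = self.data;
        data.references[self.index]
            .declaration
            .map(|index| SemanticDeclaration { data, index })
    }
}

/// `a` in scope 0 (two references), `b` in scope 1 (one), plus one unresolved reference.
fn example_data() -> SemanticData {
    SemanticData {
        declarations: vec![
            DeclarationData { name: "a", scope: 0 },
            DeclarationData { name: "b", scope: 1 },
        ],
        references: vec![
            ReferenceData { declaration: Some(0) },
            ReferenceData { declaration: Some(0) },
            ReferenceData { declaration: Some(1) },
            ReferenceData { declaration: None },
        ],
    }
}
```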

For the type-safe facade the exact design is a bit more unclear to me, in general I think we should try to have an abstract model that doesn't necessarily closely mirror the syntax (for instance function example() {} and var example = function () {} are both "function declarations"), but the actual implementation would probably need to be driven by actual usage to try and figure out exactly what node types we need. At the very least we'll probably need methods to get a typed AstNode from a semantic node, and maybe some infallible wrappers on AST nodes like JsFunctionDeclaration to query for the corresponding semantic declaration.

The last part of all this would be how it would actually be implemented: we probably won't be using the same red-green tree structure as the syntax tree for the semantic tree, then what data structures are internally being used to support this ? Are the semantic nodes Send and / or Sync ? What's calculated eagerly / lazily, and how is it cached ? Do we support incrementally updating the model and how ? Some of that can probably be left for the implementation PR to decide but it could be useful to at least discuss where we're headed beforehand.

0 replies

MichaReiser · 2022-06-01T14:12:15Z

MichaReiser
Jun 1, 2022

The last one raises an interesting question in how we solve externals eg. in document.getElementById() what declaration node does document resolve to ? Does it return an error, a special "unknown" semantic node, should the semantic model be initialized with an "environment" declaring some well-known globals ?

This isn't just a problem with unknown globals but is the case for all cases where our semantic analysis isn't able to resolve a symbol. That may be because of a syntax error or simply because it's (impossible?) to have a precise symbol resolution in JS.
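
One way to make both cases explicit is to return a small result type instead of a bare declaration. The sketch below assumes an "environment" of well-known globals as suggested in the quote; Resolution, Environment and resolve are hypothetical names, and the lookup is deliberately naive (a flat list of local names, no scoping).

```rust
use std::collections::HashSet;

/// Possible outcomes of resolving a name.
#[derive(Debug, PartialEq, Eq)]
enum Resolution {
    /// Index of a declaration found in the file.
    Declaration(usize),
    /// A well-known global provided by the environment.
    Global(String),
    /// Nothing matched (syntax error, dynamic code, unknown global, ...).
    Unresolved,
}

/// Hypothetical environment the model is initialized with.
struct Environment {
    globals: HashSet<String>,
}

fn resolve(name: &str, local_declarations: &[&str], env: &Environment) -> Resolution {
    if let Some(index) = local_declarations.iter().position(|d| *d == name) {
        Resolution::Declaration(index)
    } else if env.globals.contains(name) {
        Resolution::Global(name.to_string())
    } else {
        Resolution::Unresolved
    }
}

/// A toy browser-like environment.
fn browser_environment() -> Environment {
    Environment {
        globals: ["document", "window"].iter().map(|s| s.to_string()).collect(),
    }
}
```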

As for these semantic nodes: how do we access them, and what do we return for nodes that don't have any semantic information? For example, a token 5 doesn't have any references but we may still want to be able to resolve its enclosing scope.

Overall I think having a generic semantic tree + a type safe facade on top sounds like the best option as this is how we already work with syntax trees (generic Rowan SyntaxNode + SyntaxToken tree with a type safe AstNode facade on top), so the general design pattern would be familiar and we may be able to reuse some architectural concepts and data structures between the two as well.

I think this works with either of the proposed approaches? The facade approach could return different Symbol nodes depending on what the kind of the symbol is.

For the generic tree I think having a single SemanticNode might be too generic, just like we have the two SyntaxNode and SyntaxToken "node types" in rome_rowan for the semantic tree we could have three SemanticDeclaration, SemanticReference and SemanticScope node types in the semantic tree with specialized accessor methods for each, for instance:

What would be the unique identifier that SemanticDeclaration and SemanticReference share? Would they expose a symbol operation that gives access to the unique symbol, such that declaration.symbol() == reference.symbol()? What's the type of this symbol? Is there a single symbol type, or are there different symbols based on the kind of the underlying node?
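
To illustrate the question, here is one possible answer sketched out: a single opaque symbol type shared by declarations and references, with the kind stored on the side. SymbolId, SymbolKind and SymbolTable are hypothetical; nothing in the proposals above commits to this.

```rust
use std::collections::HashMap;

/// Hypothetical opaque identifier shared by a declaration and its references.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct SymbolId(u32);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SymbolKind {
    Function,
    Variable,
}

struct SemanticDeclaration {
    symbol: SymbolId,
}

struct SemanticReference {
    symbol: SymbolId,
}

impl SemanticDeclaration {
    fn symbol(&self) -> SymbolId {
        self.symbol
    }
}

impl SemanticReference {
    fn symbol(&self) -> SymbolId {
        self.symbol
    }
}

/// A single symbol type, with the kind looked up in a side table.
struct SymbolTable {
    kinds: HashMap<SymbolId, SymbolKind>,
}

impl SymbolTable {
    fn kind(&self, symbol: SymbolId) -> Option<SymbolKind> {
        self.kinds.get(&symbol).copied()
    }
}
```

The alternative raised in the question, different symbol types per node kind, would replace the side table with an enum or a set of typed wrappers.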

0 replies
