Commit 3b70f05

parser_stream: Produce green tree traversal rather than token ranges
## Background

I've written about 5 parsers that use the general red tree/green tree pattern. Now that we're using JuliaSyntax in base, I'd like to replace some of them by a version based on JuliaSyntax, so that I can avoid having to maintain multiple copies of similar infrastructure. As a result, I'm taking a close look at some of the internals of JuliaSyntax.

## Current Design

One thing that I really like about JuliaSyntax is that the parser basically produces a flat output buffer (well, two in the current design, after #19). In essence, the output is a post-order depth-first traversal of the parse tree, with each node annotated with the range it covers. From there, it is possible to recover the parse tree without re-parsing by partitioning the token list according to the ranges of the non-terminals.

One particular application of this is to rebuild a pointer-y green tree structure that stores relative byte ranges and serves the same incremental parsing purpose as green tree representations in other systems.

The single-output-buffer design is a great innovation over the pointer-y system. It's much easier to handle and it also enforces important invariants by construction (or at least makes them easy to check).

However, I think the whole post-parse tree construction logic significantly reduces its value. In particular, green trees are supposed to be able to serve as compact, persistent representations of the parse tree. However, here the compact, persistent representation (the output memory buffer) is not usable as a green tree. We do have the pointer-y `GreenNode` tree, but this has all the same downsides that the single-buffer system was supposed to avoid. It uses explicit vectors in every node, and even constructing it from the parser output allocates a nontrivial amount of memory to recover the tree structure.
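To make the cost of the current design concrete, here is a minimal sketch of recovering a pointer-y tree from a token list plus post-order ranges. The `Token`, `NodeRange`, `Node` and `build_tree` names are hypothetical simplifications for illustration, not the actual JuliaSyntax types, and the sketch assumes every range covers at least one token.

```julia
# Hypothetical, simplified stand-ins for the parser output described above.
struct Token
    kind::Symbol
end

struct NodeRange
    kind::Symbol
    first_token::Int   # index of the first token covered by this node
    last_token::Int    # index of the last token covered by this node
end

struct Node
    kind::Symbol
    children::Vector{Node}   # explicit per-node vector (the "pointer-y" part)
end

# Rebuild the tree by partitioning the token list according to the
# post-order ranges. This needs a temporary stack and one vector per node.
function build_tree(tokens::Vector{Token}, ranges::Vector{NodeRange})
    stack = Tuple{Int,Node}[]   # (index of first covered token, subtree)
    next_token = 1
    for r in ranges
        # Shift tokens covered by this range that have not been consumed yet.
        while next_token <= r.last_token
            push!(stack, (next_token, Node(tokens[next_token].kind, Node[])))
            next_token += 1
        end
        # Everything on the stack that starts inside the range is a child.
        children = Node[]
        while !isempty(stack) && stack[end][1] >= r.first_token
            pushfirst!(children, pop!(stack)[2])
        end
        push!(stack, (r.first_token, Node(r.kind, children)))
    end
    return stack[end][2]
end

# Tokens of `a+b*c`, with the inner call (b*c) emitted before the outer one.
tokens = [Token(:a), Token(:plus), Token(:b), Token(:times), Token(:c)]
ranges = [NodeRange(:call, 3, 5), NodeRange(:call, 1, 5)]
root = build_tree(tokens, ranges)
```

Even in this toy version, the tree structure only becomes available after allocating the stack and a child vector for every node.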
## Proposed design

This PR proposes to change the parser output to be directly usable as a green tree in-situ, by changing the post-order DFS traversal to instead produce (byte, node) spans (note that this is the same data as in the current `GreenNode`, except that there the node span is implicit in the length of the vector, and that here the children are implicit in the position in the output). This does essentially mean semantically reverting #19, but the representation proposed here is more compact than both main and the pre-#19 representation. In particular, the output is now a sequence of:

```
struct RawGreenNode
    head::SyntaxHead                  # Kind,flags
    byte_span::UInt32                 # Number of bytes covered by this range
    # If NON_TERMINAL_FLAG is set, this is the total number of child nodes
    # Otherwise this is a terminal node (i.e. a token) and this is orig_kind
    node_span_or_orig_kind::UInt32
end
```

The structure is used for both terminals and non-terminals, with the interpretation of the last field differing between them.

This is marginally more compact than the token list representation on current `main`, because we do not store the `next_byte` pointer (which would instead have to be recovered from the green tree using the usual `O(log n)` algorithm). However, because we store `node_span`, this data structure provides linear-time traversal (in reverse order) over the children of the current node. In particular, this means that the tree structure is manifest and does not require the allocation of temporary stacks to recover it. As a result, the output buffer can now be used as an efficient, persistent green tree representation.

I think the primary weird thing about this design is that the iteration over the children must happen in reverse order. The current `GreenNode` design has constant-time access to all children. Of course, a lookup table for this can be computed in linear time with less memory than the `GreenNode` design, but it's important to point out this limitation. That said, for transformation use cases (e.g. to `Expr` or `SyntaxNode`), constant-time access to the children is not really required (although the children are being produced backwards, which looks a little funny).

To avoid any disruption to downstream users, the `GreenNode` design itself is not changed to use this faster alternative. We can consider doing so in a later PR.

## Benchmark

The motivation for this change is not performance, but rather representational cleanliness. That said, it's of course imperative that this not degrade performance. Fortunately, the benchmarks show that this is in fact marginally faster for `Expr` construction, largely because we get to avoid the additional memory allocation traffic from having the tree structure explicitly represented. Parse time itself is essentially unchanged (which is unsurprising, since we're primarily changing what's being put into the output - although the parser does a few lookback-style operations in a few places).
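As a rough illustration of the reverse-order child traversal described under "Proposed design", here is a minimal sketch over a flat post-order buffer. `RawNode` and `reverse_children` are hypothetical simplifications: the real `RawGreenNode` packs kind and flags into a `SyntaxHead`, and the sketch assumes that the node span counts all descendants of a node, not only its direct children.

```julia
# Hypothetical, simplified stand-in for RawGreenNode.
struct RawNode
    kind::Symbol
    is_nonterminal::Bool   # stands in for NON_TERMINAL_FLAG
    byte_span::UInt32      # number of bytes covered by this node
    node_span::UInt32      # assumed: number of descendant nodes (0 for tokens)
end

# Buffer indices of the direct children of the node at index `i`, last child
# first. No stack is needed: each child's subtree is skipped via its node span.
function reverse_children(buf::Vector{RawNode}, i::Int)
    out = Int[]
    buf[i].is_nonterminal || return out
    stop = i - Int(buf[i].node_span)   # first index belonging to this subtree
    j = i - 1                          # the last child sits right before its parent
    while j >= stop
        push!(out, j)
        # Step over the child and, if it is a non-terminal, all its descendants.
        j -= 1 + (buf[j].is_nonterminal ? Int(buf[j].node_span) : 0)
    end
    return out
end

# Post-order buffer for `a+b*c`: five tokens, the inner call, then the root.
buf = [
    RawNode(:a, false, 1, 0),
    RawNode(:plus, false, 1, 0),
    RawNode(:b, false, 1, 0),
    RawNode(:times, false, 1, 0),
    RawNode(:c, false, 1, 0),
    RawNode(:call, true, 3, 3),   # b*c
    RawNode(:call, true, 5, 6),   # a+(b*c)
]

root_children = reverse_children(buf, 7)    # [6, 2, 1]
forward_children = reverse(root_children)   # a linear-time lookup table
```

Reversing the result, as in the last line, is one way to get the forward-order lookup table mentioned above when constant-time child access is needed.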
1 parent eceaa39 commit 3b70f05

16 files changed

+817
-589
lines changed

Project.toml

Lines changed: 0 additions & 2 deletions
@@ -7,8 +7,6 @@ version = "1.0.2"
 Serialization = "1.0"
 julia = "1.0"

-[deps]
-
 [extras]
 Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
 Serialization = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

src/JuliaSyntax.jl

Lines changed: 3 additions & 2 deletions
@@ -73,7 +73,7 @@ export @K_str, kind

 export SyntaxNode

-@_public GreenNode,
+@_public GreenNode, RedTreeCursor, GreenTreeCursor,
     span

 # Helper utilities
@@ -95,7 +95,8 @@ include("parser_api.jl")
 include("literal_parsing.jl")

 # Tree data structures
-include("green_tree.jl")
+include("tree_cursors.jl")
+include("green_node.jl")
 include("syntax_tree.jl")
 include("expr.jl")
