Parsing for Unordered Syntax Nodes #1407
faultyserver
started this conversation in
RFC
Replies: 1 comment 5 replies
-
I really love that! I guess the more verbose way looks good, especially since it's autogenerated. We need to cover both a syntax factory and a node factory for these nodes. Do you have anything in mind? I agree with the Do you know if it's possible to have combined nodes? I mean, in cases where some nodes are in a static position while others are dynamic? |
Beta Was this translation helpful? Give feedback.
5 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
The CSS grammar is incredibly precise, and yet unrelentingly flexible at the same time. As an example, both of these declarations mean the exact same thing according to the specification, despite looking very different as a whole:
The reason this is possible is because the CSS grammar includes a number of Combinators that allow for arbitrary ordering and multiplication of elements. Specifically,
||
and&&
, which say "any or all of these elements, in any order". The formal syntax for abackground
value, as demonstrated above, is:In this case, at least one of the 6 elements must appear in the value, but up to all 6 can be given, and the order does not matter. Even properties given multiple times, like
<visual-box>
in that example, can appear anywhere and be in different positions according to this grammar (as shown above with thepadding-box
andborder-box
).For general parsers using asbtract syntax trees, this is relatively trivial to implement. The AST doesn't care about ordering, just that a node is present or not, so the definition of a
BackgroundLayer
node could just look something like:In fact, this is what most other tooling out there seems to currently do, such as the struct that LightningCSS uses. It works well, and they're able to easily parse with just a loop and a set of conditions to test each value, then take each value as it's seen and put it in the appropriate spot in the struct.
CSTs and Ordering
The problem for Biome is that we have a lossless, concrete syntax tree, and with Rowan we specifically have a syntax tree built linearly from a simple token stream. Nodes are created by taking an ordered list of tokens and child nodes from the token stream and putting each one, in order, into the Slots of the desired Node type. Referencing those properties of the Node is then done purely using a static offset. As a very simplified example, the structure is effectively.
This ordering of tokens is also important to preserve, since we want to represent the input text exactly (meaning the token stream is an exact, ordered representation of the input text, with no gaps and no re-ordering). The reason we want that is to be able to cleanly represent and edit Bogus nodes (things that don't match the expected syntax for parsing), whitespace, trivia, and anything else from the text while perfectly preserving that input.
With this representation, the node is opaque from the outside, so a consumer can just query for the property of it, like
node.name()
ornode.value()
. But internally, the struct is just looking up a statically-known index from theSyntaxNode
, which uses the iterator on its list of tokens to get the slot. Thankfully, the functions to retrieve the properties hide this implementation detail, but it is still present and is the core challenge to overcome here: how can we represent the different possible orderings of properties while preserving their original order?Dynamic Slot Assignment
While these special combinators don't care about the ordering of the elements, they do still provide one guarantee: each element of the syntax may only appear exactly once. With the double bar (
||
), the elements are optional, but they will only appear once. They can't be repeated within the value unless they use another Multiplier (like+
or*
), in which case the value is still a single "element" in the value, but that element is itself a single List node, so the number of elements in the value is still statically known.What this guarantee means is that we can still use the linear list of tokens and create a Node with the same public-facing structure, but we can re-write the accessor methods to dynamically determine which slot to read from, rather than using a static offset. To do that, we just need to keep a tiny map of elements to slot numbers and use that when accessing into the token list:
The
slot_map
here is the new piece. It provides a way for the consumer of the node to specify which element of the grammar exists at each index in the underlying token stream while keeping that information opaque from anyone just wanting to read properties off of the Node:node.magnitude()
will always return the right value, regardless of whether the magnitude was specified first or last in the input.In addition to this, should we want to, we could expose additional methods for getting the index value from the Node, which could be useful for linting or other analyses in the future:
This could also be expanded to work as an iterator over the types, allowing it to be generic over all the node types that use this type of combinator.
Optional entries
The above example uses the
&&
combinator, meaning all of the elements are required. Optionality introduces a slight hiccup here, because the index map wouldn't have a known entry for values that weren't given in the input.For this, we could just turn the
slot_map
into a list ofOption<u8>
values and then use a match for both the access and index-access methods. However, because0
is an important value here (the first slot of the syntax),Option<u8>
can't benefit from the Null Pointer Optimization and would take up additional memory and access time. Instead, we can use a sentinel value that we can be relatively confident won't ever conflict to indicateMISSING
and match based on that:This is slightly more verbose than using
Option
would be, but keeps theslot_map
as a plain[u8; N]
array, which is better for memory layout and access times. The public-facing API still has all of the results useOption
types, though, so this detail is kept internal-only.Representation in Ungrammar
So we can represent nodes that are defined with these combinators in the syntax tree, but we still need a way to define them as part of the grammar. Biome uses
ungrammar
for all of the language grammar definitions, which makes generating the nodes and syntax kinds really simple, but is also limited in the syntax that it supports. There is no support for the various additional combinators and multipliers that CSS uses in their grammar (the||
and&&
as shown above, but also the#
multiplier for comma-separated lists).One way that we've worked around this limitation so far is just to rely on naming conventions for the Node types. A node that is defined as a union of different rules should use the
Any*
prefix, which generates specific code for casting values into and out of the "slot". Nodes ending withList
use the*
or+
multiplier in the grammar, and generate additional trait implementations to handle lists and iterators.So one possibility here is to use a similar convention for generating nodes from these combinators. The background layer grammar definition from the start of this post could be written like:
Here,
SomeUnordered
is the naming convention that would represent the||
combinator and would generate the Optional version of the node structure shown above. For the&&
combinator where all fields are required, we could use the nameAllUnordered*
, or something similar. I'm not positive about the naming here, but that's a possibility here.Another option is to extend the Ungrammar syntax itself to support these alternate combinators. With that, we could keep the Node name similar to all of the others (just
CssBackgroundLayer
), but then the grammar itself would use the combinator, just like the others like|
for unions:I think this is much nicer from a flexibility point of view, and reduces the mental burden of remembering what convention each type of node uses: from a consumer perspective, this is just another normal node with optional properties.
We already have other additions we'd like to make to Ungrammar (like supporting doc comments, and most likely the
#
multiplier for the same reason as this new syntax), so I think making the additions to Ungrammar itself is worthwhile.Beta Was this translation helpful? Give feedback.
All reactions