Parsing for Unordered Syntax Nodes #1407

faultyserver · 2024-01-02T18:43:41Z

faultyserver
Jan 2, 2024
Maintainer

The CSS grammar is incredibly precise, and yet unrelentingly flexible at the same time. As an example, both of these declarations mean the exact same thing according to the specification, despite looking very different as a whole:

background: url(/background.png) repeat-x padding-box border-box local left 10% center / 100px auto;
background: local padding-box url("/background.png") left 10% center / 100px auto repeat-x border-box;

This is possible because the CSS grammar includes a number of Combinators that allow elements to appear in arbitrary order. Specifically, || says "one or more of these elements, in any order", and && says "all of these elements, in any order". The formal syntax for a background value, as demonstrated above, is:

<bg-layer> = 
  <bg-image>                      ||
  <bg-position> [ / <bg-size> ]?  ||
  <repeat-style>                  ||
  <attachment>                    ||
  <visual-box>                    ||
  <visual-box>

In this case, at least one of the 6 elements must appear in the value, but up to all 6 can be given, and the order does not matter. Even properties given multiple times, like <visual-box> in that example, can appear anywhere and be in different positions according to this grammar (as shown above with the padding-box and border-box).

For general parsers using abstract syntax trees, this is relatively trivial to implement. The AST doesn't care about ordering, just whether a node is present or not, so the definition of a BackgroundLayer node could look something like:

struct BackgroundLayer {
    image: Option<BackgroundImage>,
    position: Option<BackgroundPosition>,
    size: Option<BackgroundSize>,
    repeat_style: Option<RepeatStyle>,
    attachment: Option<Attachment>,
    // The first `<visual-box>` represents the background origin positioning
    origin: Option<VisualBox>,
    // The second `<visual-box>` represents the clipping boundary.
    clip: Option<VisualBox>
}

In fact, this is what most other tooling out there seems to currently do, such as the struct that LightningCSS uses. It works well, and they're able to easily parse with just a loop and a set of conditions to test each value, then take each value as it's seen and put it in the appropriate spot in the struct.
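As a rough illustration of that "loop and a set of conditions" approach, here is a minimal sketch. It is not LightningCSS's actual code: the `Value` enum, the pre-classified input, and `parse_layer` are all hypothetical, and position/size handling is left out to keep it short.

```rust
/// Hypothetical, pre-classified input values (a real parser would
/// classify tokens itself).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Value<'a> {
    Image(&'a str),
    RepeatStyle(&'a str),
    Attachment(&'a str),
    VisualBox(&'a str),
}

#[derive(Debug, Default, PartialEq)]
struct BackgroundLayer<'a> {
    image: Option<&'a str>,
    repeat_style: Option<&'a str>,
    attachment: Option<&'a str>,
    origin: Option<&'a str>,
    clip: Option<&'a str>,
}

/// Walks the values in written order and drops each one into the
/// matching field, regardless of where it appeared in the input.
fn parse_layer<'a>(values: &[Value<'a>]) -> Result<BackgroundLayer<'a>, String> {
    if values.is_empty() {
        return Err("`||` requires at least one value".to_string());
    }
    let mut layer = BackgroundLayer::default();
    for value in values {
        let (field, text) = match *value {
            Value::Image(t) => (&mut layer.image, t),
            Value::RepeatStyle(t) => (&mut layer.repeat_style, t),
            Value::Attachment(t) => (&mut layer.attachment, t),
            // The first `<visual-box>` seen is the origin, the second the clip.
            Value::VisualBox(t) if layer.origin.is_none() => (&mut layer.origin, t),
            Value::VisualBox(t) => (&mut layer.clip, t),
        };
        // Each element of the grammar may be filled at most once.
        if field.replace(text).is_some() {
            return Err(format!("duplicate value: {text}"));
        }
    }
    Ok(layer)
}
```

The written order is discarded as soon as each value lands in its field, which is exactly the information a lossless tree has to keep.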

CSTs and Ordering

The problem for Biome is that we have a lossless, concrete syntax tree, and with Rowan we specifically have a syntax tree built linearly from a simple token stream. Nodes are created by taking an ordered list of tokens and child nodes from the token stream and putting each one, in order, into the Slots of the desired Node type. Referencing those properties of the Node is then done purely using a static offset. As a very simplified example, the structure is effectively:

// MyVector =
//   magnitude: Number
//   direction: 'up' | 'down'
//
// This grammar accepts values like `10 up`, or `-15 down`, but not
// `down 5`, or `up 20` because the syntax is _ordered_.
struct MyVector {
    syntax: SyntaxNode // This is the wrapper around the list of nodes/tokens
}

impl MyVector {
    fn magnitude(&self) -> Result<Identifier> {
        get_slot(&self.syntax, 0)
    }
    fn direction(&self) -> Result<SyntaxToken> {
        get_slot(&self.syntax, 1)
    }
}

This ordering of tokens is also important to preserve, since we want to represent the input text exactly (meaning the token stream is an exact, ordered representation of the input text, with no gaps and no re-ordering). The reason we want that is to be able to cleanly represent and edit Bogus nodes (things that don't match the expected syntax for parsing), whitespace, trivia, and anything else from the text while perfectly preserving that input.

With this representation, the node is opaque from the outside, so a consumer can just query for the property of it, like node.name() or node.value(). But internally, the struct is just looking up a statically-known index from the SyntaxNode, which uses the iterator on its list of tokens to get the slot. Thankfully, the functions to retrieve the properties hide this implementation detail, but it is still present and is the core challenge to overcome here: how can we represent the different possible orderings of properties while preserving their original order?

Dynamic Slot Assignment

While these special combinators don't care about the ordering of the elements, they do still provide one guarantee: each element of the syntax may appear at most once. With the double ampersand (&&), every element appears exactly once; with the double bar (||), the elements are optional, but each still appears no more than once. They can't be repeated within the value unless they use another Multiplier (like + or *), in which case the repeated items are wrapped in a single List node that counts as one "element", so the number of elements in the value is still statically known.

What this guarantee means is that we can still use the linear list of tokens and create a Node with the same public-facing structure, but we can re-write the accessor methods to dynamically determine which slot to read from, rather than using a static offset. To do that, we just need to keep a tiny map of elements to slot numbers and use that when accessing into the token list:

// MyVector =
//   magnitude: Number &&
//   direction: 'up' | 'down'
//
// This grammar accepts all of `10 up`, or `-15 down`, `down 5`,
// _and_ `up 20` because the syntax is _unordered_, using the double-
// ampersand combinator.
struct MyVector {
    /// This is the wrapper around the list of nodes/tokens
    syntax: SyntaxNode, 
    /// This is the map of indices. The index of this array is the
    /// _declared_ order from the grammar (so magnitude => 0,
    /// direction => 1), and the value is the _written_ order,
    /// determined from parsing the input stream. Since we know the
    /// exact size, this can be allocated inline and accessed quickly
    /// with no real overhead.
    slot_map: [u8; 2],
}

impl MyVector {
    pub fn magnitude(&self) -> Result<Identifier> {
        get_slot(&self.syntax, self.slot_map[0])
    }
    pub fn direction(&self) -> Result<SyntaxToken> {
        get_slot(&self.syntax, self.slot_map[1])
    }
}

The slot_map here is the new piece. It lets whoever constructs the node record which slot in the underlying token stream holds each element of the grammar, while keeping that information opaque from anyone who just wants to read properties off of the Node: node.magnitude() will always return the right value, regardless of whether the magnitude was specified first or last in the input.
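To make the idea concrete, here is a self-contained toy version of the same node. It is only a sketch: `Token` and this `SyntaxNode` are stand-ins invented for the example rather than Rowan's real types, and `cast` is one hypothetical way to populate the map.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Number(i32),
    Direction(&'static str),
}

/// Toy stand-in for a Rowan `SyntaxNode`: tokens kept in written order.
struct SyntaxNode {
    slots: Vec<Token>,
}

struct MyVector {
    syntax: SyntaxNode,
    /// Index = declared order (0 = magnitude, 1 = direction),
    /// value = the slot the element was actually written in.
    slot_map: [u8; 2],
}

impl MyVector {
    /// Builds the slot map by looking at what each written slot holds.
    fn cast(syntax: SyntaxNode) -> Option<Self> {
        let mut magnitude = None;
        let mut direction = None;
        for (index, token) in syntax.slots.iter().enumerate() {
            let target = match token {
                Token::Number(_) => &mut magnitude,
                Token::Direction(_) => &mut direction,
            };
            // `&&` means every element appears exactly once.
            if target.replace(index as u8).is_some() {
                return None;
            }
        }
        Some(Self { slot_map: [magnitude?, direction?], syntax })
    }

    fn magnitude(&self) -> &Token {
        &self.syntax.slots[self.slot_map[0] as usize]
    }

    fn direction(&self) -> &Token {
        &self.syntax.slots[self.slot_map[1] as usize]
    }
}
```

With this, casting the tokens for `10 up` and for `down 5` both succeed, and `magnitude()` returns the number in each case even though it sits in a different slot.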

In addition to this, should we want to, we could expose additional methods for getting the index value from the Node, which could be useful for linting or other analyses in the future:

impl MyVector {
    fn index_of_magnitude(&self) -> u8 {
        self.slot_map[0]
    }
    fn index_of_direction(&self) -> u8 {
        self.slot_map[1]
    }
}

This could also be expanded to work as an iterator over the types, allowing it to be generic over all the node types that use this type of combinator.
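One possible shape for that generic helper, sketched under the assumption that every such node exposes its map as a plain `[u8; N]` with all entries present (the function name `written_order` is made up for this example):

```rust
/// Returns the declared-order field indices, sorted by the slot each
/// field was actually written in. Works for any `&&` node, since it
/// only needs the slot map.
fn written_order<const N: usize>(slot_map: &[u8; N]) -> Vec<usize> {
    let mut fields: Vec<usize> = (0..N).collect();
    fields.sort_by_key(|&field| slot_map[field]);
    fields
}
```

For `down 5` the map is `[1, 0]`, so the result lists field 1 (direction) before field 0 (magnitude), which is the kind of information a lint rule about preferred ordering could use.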

Optional entries

The above example uses the && combinator, meaning all of the elements are required. Optionality introduces a slight hiccup here, because the index map wouldn't have a known entry for values that weren't given in the input.

For this, we could just turn the slot_map into a list of Option<u8> values and then use a match for both the access and index-access methods. However, because 0 is an important value here (the first slot of the syntax), Option<u8> can't benefit from the Null Pointer Optimization and would take up additional memory and access time. Instead, we can use a sentinel value that we can be relatively confident won't ever conflict to indicate MISSING and match based on that:

// UpTo2DVector =
//   magnitude: Number ||
//   y: 'up' | 'down' ||
//   z: 'left' | 'right'
//
// This grammar accepts `1`, `1 up`, `left`, `down`, `down left 5`,
// `up right`, `right 10 up` and so on, because each entry is optional
// and unordered.
struct UpTo2DVector {
    /// This is the wrapper around the list of nodes/tokens
    syntax: SyntaxNode,
    /// Maps each element's _declared_ index to its _written_ slot, or
    /// to `SLOT_MAP_EMPTY_VALUE` if the element is missing from the input.
    slot_map: [u8; 3],
}
const SLOT_MAP_EMPTY_VALUE: u8 = u8::MAX;

impl UpTo2DVector {
    fn magnitude(&self) -> Option<Identifier> {
        match self.slot_map[0] {
            SLOT_MAP_EMPTY_VALUE => None,
            index => get_slot(&self.syntax, index),
        }
    }
    fn y(&self) -> Option<SyntaxToken> {
        match self.slot_map[1] {
            SLOT_MAP_EMPTY_VALUE => None,
            index => get_slot(&self.syntax, index),
        }
    }
    fn z(&self) -> Option<SyntaxToken> {
        match self.slot_map[2] {
            SLOT_MAP_EMPTY_VALUE => None,
            index => get_slot(&self.syntax, index),
        }
    }

    fn index_of_magnitude(&self) -> Option<u8> {
        match self.slot_map[0] {
            SLOT_MAP_EMPTY_VALUE => None,
            index => Some(index),
        }
    }
    fn index_of_y(&self) -> Option<u8> {
        match self.slot_map[1] {
            SLOT_MAP_EMPTY_VALUE => None,
            index => Some(index),
        }
    }
    fn index_of_z(&self) -> Option<u8> {
        match self.slot_map[2] {
            SLOT_MAP_EMPTY_VALUE => None,
            index => Some(index),
        }
    }
}

This is slightly more verbose than using Option would be, but keeps the slot_map as a plain [u8; N] array, which is better for memory layout and access times. The public-facing API still has all of the results use Option types, though, so this detail is kept internal-only.
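The memory difference is easy to check. The sketch below compares the two layouts for a three-element map and shows the sentinel decoding in isolation; `slot_map_sizes` and `decode` are names invented for this example.

```rust
use std::mem::size_of;

const SLOT_MAP_EMPTY_VALUE: u8 = u8::MAX;

/// Sizes in bytes of the plain map and the `Option`-based map.
fn slot_map_sizes() -> (usize, usize) {
    (size_of::<[u8; 3]>(), size_of::<[Option<u8>; 3]>())
}

/// Converts the internal sentinel representation into the `Option`
/// that the public API exposes.
fn decode(raw: u8) -> Option<u8> {
    match raw {
        SLOT_MAP_EMPTY_VALUE => None,
        index => Some(index),
    }
}
```

The plain array is three bytes. Every bit pattern of a `u8` is a valid value, so `Option<u8>` needs separate storage for its discriminant and the `Option`-based array comes out larger (twice the size on current compilers, as far as I'm aware).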

Representation in Ungrammar

So we can represent nodes that are defined with these combinators in the syntax tree, but we still need a way to define them as part of the grammar. Biome uses ungrammar for all of the language grammar definitions, which makes generating the nodes and syntax kinds really simple, but is also limited in the syntax that it supports. There is no support for the various additional combinators and multipliers that CSS uses in its grammar (the || and && as shown above, but also the # multiplier for comma-separated lists).

One way that we've worked around this limitation so far is just to rely on naming conventions for the Node types. A node that is defined as a union of different rules should use the Any* prefix, which generates specific code for casting values into and out of the "slot". Nodes ending with List use the * or + multiplier in the grammar, and generate additional trait implementations to handle lists and iterators.

So one possibility here is to use a similar convention for generating nodes from these combinators. The background layer grammar definition from the start of this post could be written like:

SomeUnorderedCssBackgroundLayer =
    image: CssBackgroundImage
    position: CssBackgroundPositionAndSize
    repeat_style: CssRepeatStyle
    attachment: CssAttachment
    origin: CssVisualBox
    clip: CssVisualBox

Here, SomeUnordered is the naming convention that would represent the || combinator and would generate the Optional version of the node structure shown above. For the && combinator where all fields are required, we could use the name AllUnordered*, or something similar. I'm not set on the exact names, but that's one possibility.

Another option is to extend the Ungrammar syntax itself to support these alternate combinators. With that, we could keep the Node name consistent with all of the others (just CssBackgroundLayer), and the grammar itself would use the combinator, the same way it already uses | for unions:

CssBackgroundLayer =
    image: CssBackgroundImage
    || position: CssBackgroundPositionAndSize
    || repeat_style: CssRepeatStyle
    || attachment: CssAttachment
    || origin: CssVisualBox
    || clip: CssVisualBox

I think this is much nicer from a flexibility point of view, and reduces the mental burden of remembering what convention each type of node uses: from a consumer perspective, this is just another normal node with optional properties.

We already have other additions we'd like to make to Ungrammar (like supporting doc comments, and most likely the # multiplier for the same reason as this new syntax), so I think making the additions to Ungrammar itself is worthwhile.

denbezrukov · 2024-01-03T11:16:36Z

denbezrukov
Jan 3, 2024
Maintainer

I really love that! I guess the more verbose way looks good, especially since it's autogenerated.

We need to cover both a syntax factory and a node factory for these nodes. Do you have anything in mind?

I agree with the || syntax for ungram. However, this syntax allows having both || and && within a single node declaration.

Do you know if it's possible to have combined nodes? I mean, in cases where some nodes are in a static position while others are dynamic?

5 replies

faultyserver Jan 3, 2024
Maintainer Author

I believe there are some cases where a property has some ordered and some unordered values. I remember that the syntax uses brackets [ ] to group the parts, since the precedence rules they set up make the || and && combinators bind more loosely than plain juxtaposition.

It looks like the font shorthand does this:

font = 
  [ [ <'font-style'> || <font-variant-css2> || <'font-weight'> || <font-stretch-css3> ]? <'font-size'> [ / <'line-height'> ]? <'font-family'> ]  |
  <system-family-name>

so 'font-style', font-variant-css2, 'font-weight', and font-stretch-css3 are all unordered and optional, but then they have to appear before 'font-size' and so on.

I don't have an immediate idea of how best to handle this... My initial thought was just to extract them into a separate node, but I think that's actually a bad idea now that I've seen it like this. I think some of the work to ensure ordering is just going to have to be done by the parser itself rather than the grammar, but for writing the grammar it definitely still feels useful. As a quick thought, maybe a good ungrammar for font could look like this:

AnyCssFontPropertyValue =
  CssFontPropertyValue
  | CssFontSystemFamilyName // This is really just `CssIdentifier*`, but named

CssFontPropertyValue =
  (
    style: AnyCssFontStylePropertyValue
     || variant: CssFontVariantCss2
     || weight: AnyCssFontWeightPropertyValue
     || stretch: CssFontStretchCss3
  )?
  size: AnyCssFontSizePropertyValue
  ( '/' line_height: AnyCssLineHeightPropertyValue )?
  family: CssFontFamilyPropertyValue

Note that this uses the AnyCss*PropertyValue nodes to embed the grammars for those node types, which is how <'property'> tokens are defined to work by the css spec.

On the parser side, it would still have a slot_map that covers all of the nodes/tokens i think, but the parser would just ensure that all of the ordered properties are in the correct slots (it has to track them anyway, since the || piece could be anywhere from 1 to 4 nodes long, so a static offset still won't work there).

This...feels very complicated to make work, but I think it's also doable? I think we can make the parenthesized || piece just work with ungrammar as it currently is, since the inside of that is just turned into its own Rule and we can flatten it in the code generation. The '/' line_height: part is probably the same? Just flatten and make them both optional, maybe the / token could be named like line_height_slash_token or something to make it more apparent for what and why.

faultyserver Jan 3, 2024
Maintainer Author

> However, this syntax allows having both || and && within a single node declaration.

I think that's also fine in terms of the Ungrammar syntax. We can just panic in the code generation if we encounter both without some additional parentheses to show precedence.
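A minimal sketch of what that check could look like, assuming codegen sees the combinators of one parenthesization level as a flat list. The `Combinator` enum and `node_combinator` are hypothetical names, and returning an error here stands in for the panic:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Combinator {
    /// `||`: one or more elements, in any order.
    UnorderedSome,
    /// `&&`: all elements, in any order.
    UnorderedAll,
}

/// Returns the single combinator used between the fields of one
/// parenthesization level, `None` if there are no combinators at all,
/// or an error if `||` and `&&` are mixed without grouping.
fn node_combinator(between_fields: &[Combinator]) -> Result<Option<Combinator>, String> {
    let Some(&first) = between_fields.first() else {
        return Ok(None);
    };
    if between_fields.iter().all(|&c| c == first) {
        Ok(Some(first))
    } else {
        Err("`||` and `&&` mixed at the same level; add parentheses to show precedence".to_string())
    }
}
```

Codegen could then call `.expect(...)` on the result to get the panic behavior described above.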

denbezrukov Jan 3, 2024
Maintainer

It sounds perfect,
Do you need any help?

faultyserver Jan 3, 2024
Maintainer Author

I think I have a good idea of how to get going on it, and it seems like it'll be sort of linear work (have to implement the ungrammar syntax first, then add it to codegen, then add it to our grammar), so i'm not sure how much it can be split up. Maybe we could agree on the naming convention to use and then i'll take on the ungrammar side if you want to handle the codegen on biome's side? otherwise i think i can work it all through.

faultyserver Jan 4, 2024
Maintainer Author

I managed to get the ungrammar syntax implemented and started working on codegen, so I think I should be good to get it all the way finished. I did notice an interesting property that I think we will have to utilize for the node and syntax factory generation: while the values can be unordered, they are still parsed in the order specified by the grammar. Consider this bg-layer grammar definition from the original post:

<bg-layer> = 
  <bg-image>                      ||
  <bg-position> [ / <bg-size> ]?  ||
  <repeat-style>                  ||
  <attachment>                    ||
  <visual-box>                    ||
  <visual-box>

It has two <visual-box> elements at the end, but it says they can be unordered... so the parser has to attempt to fill each of these slots in the order they are defined, otherwise there would be ambiguity about which visual-box is which (the first is the origin, the second is the clip).

So when we're parsing, constructing, or casting a node, we don't actually need to keep track of the order they appear in separately to pass in as a slot_map. We just need to have the constructor check the kind() of each CompletedMarker it contains, and then it can build up the slot_map internally.

The reason I realized this was because the AstNode trait has the cast method that only accepts a SyntaxNode, and we can't just add a requirement for a slot_map parameter to that for every node. So for these "dynamic" nodes (as i've started calling them in the code), the implementation of cast for that trait can just use the grammar definition to build the slot_map and then store that in the constructed AstNode struct.

The unfortunate cost of this is that each cast would have to re-evaluate the ordering, but I think that's fine. It's only when casting from the raw SyntaxNode that the cost is incurred, and will likely be pretty rare (plus the actual overhead is pretty low, just a quick loop one time over the children). If that ends up being too costly, then we can probably consider a deeper level change to Rowan itself to understand unordered nodes and how to cache/cast them on the syntax level.
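Roughly, the loop inside such a `cast` could look like the sketch below. This is not the generated code: the `Kind` enum, the `DECLARED` table, and `build_slot_map` are placeholders invented for the example, with `<bg-position> [ / <bg-size> ]?` collapsed into a single `Position` kind.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Image,
    Position,
    RepeatStyle,
    Attachment,
    VisualBox,
}

/// The fields of `<bg-layer>` in the order the grammar declares them.
/// The two `VisualBox` entries are origin, then clip.
const DECLARED: [Kind; 6] = [
    Kind::Image,
    Kind::Position,
    Kind::RepeatStyle,
    Kind::Attachment,
    Kind::VisualBox,
    Kind::VisualBox,
];

const SLOT_MAP_EMPTY_VALUE: u8 = u8::MAX;

/// One pass over the children in written order. Each child fills the
/// first still-empty declared field of its kind, so the first
/// `<visual-box>` written becomes the origin and the second the clip.
fn build_slot_map(children: &[Kind]) -> Option<[u8; 6]> {
    if children.is_empty() {
        return None; // `||` requires at least one element
    }
    let mut slot_map = [SLOT_MAP_EMPTY_VALUE; 6];
    for (written, kind) in children.iter().enumerate() {
        let field = (0..DECLARED.len())
            .find(|&f| DECLARED[f] == *kind && slot_map[f] == SLOT_MAP_EMPTY_VALUE)?;
        slot_map[field] = written as u8;
    }
    Some(slot_map)
}
```

For the second declaration in the original post (attachment, box, image, position, repeat, box), this yields `[2, 3, 4, 0, 1, 5]`, and a third `<visual-box>` makes the cast fail because no empty field of that kind is left.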
