📎 Embedded language formatting #3334

ah-yu · 2024-07-02T02:32:03Z

Preface

Some popular libraries allow code snippets in other languages to be embedded within JavaScript code. Users want to format these embedded code snippets within JavaScript to enhance the development experience.

Design

Simply put, the idea is to extract the code snippets from template strings, format them using the respective language's formatter, and then replace them back into the template string.

Handling Interpolation

We need to parse the entire template string and then format it based on the parse result. However, a template string with interpolations is not valid CSS (using CSS as the example here). Therefore, we need to preprocess the interpolations so that the template string becomes parseable CSS. We plan to replace each interpolation with a special placeholder string and reinsert the original interpolation after formatting.

To maximize parsing success, we chose to replace interpolations with grit metavariables. The reasoning behind this choice is in #3228 (comment).
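The replace-then-reinsert step can be sketched as follows. This is a minimal illustration, not the implementation from #3228: the function names and the `__interp_N__` placeholder format are hypothetical (the actual design uses grit metavariables), and the brace matching does not account for `}` inside string literals or comments within an interpolation.

```rust
/// Hypothetical placeholder format; the real design uses grit metavariables.
fn placeholder(index: usize) -> String {
    format!("__interp_{index}__")
}

/// Replaces each `${...}` interpolation with a numbered placeholder and
/// returns the rewritten text plus the original interpolation sources.
fn extract_interpolations(template: &str) -> (String, Vec<String>) {
    let mut output = String::new();
    let mut interpolations = Vec::new();
    let mut rest = template;

    while let Some(start) = rest.find("${") {
        output.push_str(&rest[..start]);
        // Walk forward to the matching `}`, tracking nested braces.
        // The scan starts at the `{` of `${`, so depth is 1 before any `}`.
        let mut depth = 0usize;
        let mut end = None;
        for (offset, ch) in rest[start + 1..].char_indices() {
            match ch {
                '{' => depth += 1,
                '}' => {
                    depth -= 1;
                    if depth == 0 {
                        end = Some(start + 1 + offset);
                        break;
                    }
                }
                _ => {}
            }
        }
        let Some(end) = end else {
            // Unterminated interpolation: keep the remainder untouched.
            output.push_str(&rest[start..]);
            return (output, interpolations);
        };
        output.push_str(&placeholder(interpolations.len()));
        interpolations.push(rest[start..=end].to_string());
        rest = &rest[end + 1..];
    }
    output.push_str(rest);
    (output, interpolations)
}

/// Puts the original interpolations back into the formatted text.
fn restore_interpolations(formatted: &str, interpolations: &[String]) -> String {
    let mut result = formatted.to_string();
    for (index, source) in interpolations.iter().enumerate() {
        result = result.replace(&placeholder(index), source);
    }
    result
}
```

In this sketch, the text returned by `extract_interpolations` is what would be handed to the CSS parser and formatter, and `restore_interpolations` runs on the formatted output.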

Changes to the Public API

Since JavaScript formatters cannot directly format code in other languages, we need to use external tools to format these other languages' code. To achieve this, we designed a generic trait instead of relying on specific implementations, maximizing the decoupling between different language formatters.

enum JsForeignLanguage {
    Css,
}

trait JsForeignLanguageFormatter {
    fn format(&self, language: JsForeignLanguage, source: &str) -> FormatResult<Document>;
}

Then we can add a new parameter to the format_node function to pass in the formatter for other languages.

pub fn format_node(
    options: JsFormatOptions,
+   foreign_language_formatter: impl JsForeignLanguageFormatter,
    root: &JsSyntaxNode,
) -> FormatResult<Formatted<JsFormatContext>> {
    biome_formatter::format_node(
        root,
        JsFormatLanguage::new(options, foreign_language_formatter),
    )
}
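To illustrate the decoupling, here is a self-contained sketch of a caller-provided implementation of the trait. `Document`, `FormatError`, and `FormatResult` are simplified stand-ins for the Biome types, and `TrimmingCssFormatter` and `format_css_template` are hypothetical names; a real implementor would delegate to the CSS formatter with the CSS settings.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JsForeignLanguage {
    Css,
}

// Simplified stand-ins for the Biome types (assumptions for this sketch).
#[derive(Debug, PartialEq)]
struct Document(String);

#[derive(Debug, PartialEq)]
struct FormatError(String);

type FormatResult<T> = Result<T, FormatError>;

trait JsForeignLanguageFormatter {
    fn format(&self, language: JsForeignLanguage, source: &str) -> FormatResult<Document>;
}

/// Hypothetical implementor that only trims lines and drops empty ones.
struct TrimmingCssFormatter;

impl JsForeignLanguageFormatter for TrimmingCssFormatter {
    fn format(&self, language: JsForeignLanguage, source: &str) -> FormatResult<Document> {
        match language {
            JsForeignLanguage::Css => {
                // Reject obviously broken input so the error path is exercised.
                let open = source.matches('{').count();
                let close = source.matches('}').count();
                if open != close {
                    return Err(FormatError("unbalanced braces".to_string()));
                }
                let lines: Vec<&str> = source
                    .lines()
                    .map(str::trim)
                    .filter(|line| !line.is_empty())
                    .collect();
                Ok(Document(lines.join("\n")))
            }
        }
    }
}

/// The JS side only sees the trait, never a concrete CSS formatter.
fn format_css_template(
    formatter: &impl JsForeignLanguageFormatter,
    content: &str,
) -> FormatResult<Document> {
    formatter.format(JsForeignLanguage::Css, content)
}
```

Because the JS side depends only on the trait, the CLI and the LSP could each pass in an implementor configured with their own CSS settings.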

CLI

When formatting JavaScript files, we need to be aware of other languages' settings. For example, when formatting CSS code, we need to know the CSS formatter's settings.

LSP

The LSP provides a feature called format_range that formats code snippets. This feature relies on SourceMarkers generated during the printing process. Generating a SourceMarker depends on the position information of tokens in the source code. This position information is contained in the following two FormatElements:

biome/crates/biome_formatter/src/format_element.rs

Lines 36 to 50 in ce00685

    
    DynamicText {
        /// There's no need for the text to be mutable, using `Box<str>` safes 8 bytes over `String`.
        text: Box<str>,
        /// The start position of the dynamic token in the unformatted source code
        source_position: TextSize,
    },
    /// A token for a text that is taken as is from the source code (input text and formatted representation are identical).
    /// Implementing by taking a slice from a `SyntaxToken` to avoid allocating a new string.
    LocatedTokenText {
        /// The start position of the token in the unformatted source code
        source_position: TextSize,
        /// The token text
        slice: TokenText,
    },

Since embedded languages are formatted by extracting, preprocessing, and then separately parsing and formatting them, the source_position values in these two FormatElement variants are inaccurate, and the entire template string is handled as a whole. Therefore, I recommend erasing these inaccurate source_position values. Erasing them is acceptable because the format_range function will still be able to find the SourceMarker closest to the start and end of the range. If there is a need to format parts of the embedded code in the future, we can revisit this issue.
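A rough sketch of what "erasing" could look like, using a simplified stand-in for the two variants. The `Element` enum, the `Option<u32>` position, and the function name are assumptions for illustration; the real `FormatElement` stores a `TextSize` directly, and how an erased position is represented there is not decided in this issue.

```rust
/// Simplified stand-in for the position-carrying format elements.
#[derive(Debug, Clone, PartialEq)]
enum Element {
    DynamicText { text: String, source_position: Option<u32> },
    LocatedTokenText { source_position: Option<u32>, text: String },
    HardLine,
}

/// Drops the positions of elements produced by the embedded formatter,
/// since they refer to the preprocessed snippet, not the original file.
fn erase_source_positions(elements: &mut [Element]) {
    for element in elements.iter_mut() {
        match element {
            Element::DynamicText { source_position, .. }
            | Element::LocatedTokenText { source_position, .. } => *source_position = None,
            Element::HardLine => {}
        }
    }
}
```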

Tasks

introduce grit metavariable in CSS
change the public API
preprocess template strings and handle the generated format elements.

Sec-ant · 2024-07-02T03:17:53Z

Great great work 🚀 . I'd like to add some additional notes. I will update them in this comment later.

TLDR, I think we should broaden the concept of embedded language. I'll provide my rationale below, gradually (been struggling with a tight schedule).

The Concept and Scope of Embedded Language

We should envision the introduction of embedded language to serve two primary purposes:

Enhancing the Developer Experience: This feature aims to improve the developer experience when formatting foreign languages within their code. For example, when writing CSS template literals in a JavaScript file, it could be structured as follows and the formatter should be able to format the CSS code:
```
const style = css`
  div {
    color: red;
  }
`;
```
Supporting Composable Parsing Infrastructure: This feature should also strengthen our parser infrastructure to support languages composed of multiple embedded languages. For instance, an HTML file can contain <style> tags with embedded CSS and <script> tags with embedded JavaScript. This composable parsing infrastructure allows us to fully support modern frameworks like Vue, Svelte, and Astro. By reusing the same parsing infrastructure for each embedded language (HTML, JavaScript, and CSS), we can seamlessly integrate and compose them in various ways.

Configuration of Embedded Language

Possible Configuration

I believe the configuration of embedded languages includes these aspects at least:

Identifying CST Structures: The first step is to determine which CST (Concrete Syntax Tree) structures within a language can be considered as embedded languages. For example, in JavaScript, we might consider JsTemplateExpression as a block that can contain an embedded language.
Determining the Language/Parser: Next, we need to identify the language or parser for the embedded language. For instance, we would use the tag within a JsTemplateExpression to identify the language and select the appropriate parser to parse this embedded language.
Providing Language-Specific Options: We should offer language-specific options for embedded languages that can override the global language-specific options, so that embedded code can be parsed, formatted, and linted with a different set of options.
A Switch to Opt Out: There should be a switch for users to fully opt out from embedded language detection.

For the first two configuration items, I believe we can leverage our plugin infrastructure. This means users can configure different Grit patterns to inform Biome which CST structures in a language should be considered embedded language blocks and which parser should be used to parse them.

Regarding the third configuration item, we should utilize our existing configuration file structure for language-specific options. This can be done in a way similar to overrides, allowing users to specify different configurations for embedded languages.

Regarding the fourth configuration item, the opt-out option should be configurable for each language, and it should also be a top-level option.

Extent of Configurability

Relating to the two primary purposes mentioned earlier:

For the first purpose (enhancing the developer experience), users should be able to configure the aforementioned settings freely. This is because support for embedded languages in this context is not inherently part of the language itself. For example, different libraries may have different APIs for embedding CSS, such as css`...`, style`...`, or css.style`...`. Additionally, users might have their own SDKs or coding styles. Therefore, we should not hardcode these configurations into our infrastructure. Instead, we should provide users with a configuration interface using Grit patterns. I believe we can ship sane default presets for a nice out-of-the-box experience, but users should be able to override them as they want.
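As a rough sketch of "default presets that users can override", the lookup below maps a template tag to an embedded language and honors an opt-out switch. All names here are hypothetical, and plain tag strings stand in for the Grit patterns proposed above, which would be more expressive.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EmbeddedLanguage {
    Css,
    Html,
}

/// Hypothetical configuration: tag text mapped to the language to use.
struct EmbeddedLanguageConfig {
    /// Opt-out switch for embedded language detection.
    enabled: bool,
    tags: HashMap<String, EmbeddedLanguage>,
}

impl EmbeddedLanguageConfig {
    /// Presets for an out-of-the-box experience (illustrative choices).
    fn with_default_presets() -> Self {
        let mut tags = HashMap::new();
        tags.insert("css".to_string(), EmbeddedLanguage::Css);
        tags.insert("html".to_string(), EmbeddedLanguage::Html);
        Self { enabled: true, tags }
    }

    /// Adds or overrides a mapping, e.g. for a project-specific SDK.
    fn set_tag(&mut self, tag: &str, language: EmbeddedLanguage) {
        self.tags.insert(tag.to_string(), language);
    }

    /// Returns the language for a tag, or `None` if detection is disabled
    /// or the tag is unknown.
    fn language_for_tag(&self, tag: &str) -> Option<EmbeddedLanguage> {
        if !self.enabled {
            return None;
        }
        self.tags.get(tag).copied()
    }
}
```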

For the second purpose (supporting composable parsing infrastructure), there should be more restrictions. For example, when parsing Astro files, the frontmatter is defined as JavaScript code according to Astro's specification. In this case, we should not let users override this behavior, but rather enforce the logic in our parser. Users shouldn't be allowed to opt out of the enforced logic. However, I think we should still offer configurability to the extent that users can add more patterns to target certain CST nodes as embedded language blocks.

Note

I will add an example of what the configuration would look like here.

Integration Phase of Embedded Language

Another critical consideration is determining at which phase embedded language support should be implemented: the parsing phase or the formatting phase. In #3228, we placed it in the formatting phase, which aligns with Prettier's approach. However, I believe Biome can improve on this.

The ideal phase for supporting embedded languages is the parsing phase. The rationale behind this is that, for languages like Astro (related to purpose 2), in the CST, the frontmatter can be mapped to JavaScript CST nodes, rather than being treated as a literal text node. This approach enhances the support experience for such compositional languages. Additionally, it allows our plugin and linter systems to reuse the same CST structure to handle the content within these embedded languages, so we can add embedded linting/checking support for them later.

One thing to consider in this approach is that we might also want to preserve the original CST nodes, so, for example, linter or formatter rules will still work for the container of the embedded language. Also, for languages that have interpolations, such as template literals in JavaScript, we should also keep the nodes of the expressions in the CST.

Nested Embedded Languages

Note

TBD. One example you can think of is embedded HTML in a JS file, in which there're inline <style> tags.

Indentation Handling in Embedded Language Formatting

Note

This is a tricky one because leading spaces on a line can be significant when they appear in multi-line template literals. I'll elaborate later.

Interpolation Handling

Note

This has been discussed in #3228

Examples of Embedded Languages

Note

TBD. I'll come back later.

dyc3 · 2024-09-17T11:10:28Z

Something that I've realized while working on the html parser:

Transitioning into another language while parsing is not necessarily the hard part -- it's knowing when to transition back out instead of emitting an invalid syntax error.

Take this HTML for example, where javascript is embedded within <script>.

<script>
const foo = "</script>";
</script>

I could see a more naive approach failing on the < operator in something simple like this:

<script>
const foo = 5;
if (foo < 10) {
    console.log("foo");
}
</script>

It's not enough to simply look ahead for the end tokens for an embedded language. You can even see the syntax highlighting here on github fail for this case.

In short: We can only transition out of an embedded language when the embedded language is at the root level node.

Another thing to consider is that we will need to be able to determine the language dynamically. Consider a vue template file that specifies:

the script language is typescript
the markup template language is pug

<template lang="pug">
ul
  li(v-for="item in items")
    a(v-if="item.type == 'link'" :href="item.url") some link title: {{item.title}}
    p(v-else) {{item.content}}
</template>

<script lang="ts" setup>
const items = [
    { title: "foo", link: "http://example.com" }
];
</script>

For this scenario, we should disable html parsing for the template section, and parse the script content as typescript (instead of javascript, which would be the default).
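A minimal sketch of that decision, assuming the block's tag name and `lang` attribute have already been extracted. The enum, the function name, and the set of recognized `lang` values are illustrative, not an existing Biome API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BlockLanguage {
    Html,
    JavaScript,
    TypeScript,
    Css,
    /// No parser available: leave the block content untouched.
    Unsupported,
}

/// Picks the parser for a top-level Vue block from its tag and `lang` attribute.
fn language_for_block(tag: &str, lang: Option<&str>) -> BlockLanguage {
    match (tag, lang) {
        ("template", None | Some("html")) => BlockLanguage::Html,
        ("script", None | Some("js")) => BlockLanguage::JavaScript,
        ("script", Some("ts")) => BlockLanguage::TypeScript,
        ("style", None | Some("css")) => BlockLanguage::Css,
        // e.g. <template lang="pug">: HTML parsing is disabled for this block.
        _ => BlockLanguage::Unsupported,
    }
}
```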

arendjr · 2024-09-17T16:40:20Z

<script>
const foo = "</script>";
</script>

Does this actually work in any browser? I would expect it to actually be invalid syntax, so the correct result is for it to fail.

Basically, I think it’s the responsibility of the parent parser (the HTML one in this case) to determine where the snippet ends, and then you only need to hand the substring to the embedded parser. So finding the end token should never be the responsibility of the embedded parser.

dyc3 · 2024-09-17T16:59:36Z

Actually, I just tested it in firefox and it didn't work.

There's also language in the spec that specifically talks about this too: https://html.spec.whatwg.org/#serialising-html-fragments:text

So that makes parsing way easier, we just scan until we hit </script>.
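A sketch of that scan, with a hypothetical function name. It matches `</script` case-insensitively when followed by whitespace, `/`, or `>`, and deliberately ignores the HTML tokenizer's escaped states for `<!--` inside scripts, so it is a simplification rather than a spec-complete implementation.

```rust
/// Returns the byte offset where the script content ends, i.e. the start of
/// the first closing `</script` tag at or after `content_start`.
fn find_script_end(source: &str, content_start: usize) -> Option<usize> {
    const NEEDLE: &[u8] = b"</script";
    let bytes = source.as_bytes();
    let mut index = content_start;
    while index + NEEDLE.len() <= bytes.len() {
        let candidate = &bytes[index..index + NEEDLE.len()];
        if candidate.eq_ignore_ascii_case(NEEDLE) {
            // The tag name must be terminated, so `</scripts>` does not match.
            let next = bytes.get(index + NEEDLE.len()).copied();
            if matches!(next, Some(b'>' | b'/' | b' ' | b'\t' | b'\n' | b'\r' | b'\x0C')) {
                return Some(index);
            }
        }
        index += 1;
    }
    None
}
```

With this approach the `<` operator in `foo < 10` is harmless, while a `"</script>"` string literal ends the block early, which matches the browser behavior observed above.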

dyc3 · 2024-09-25T20:52:48Z

I've been tinkering with integrating embedded language parsing for HTML a little bit.

It's pretty clear at this point that we will need to mark certain areas in the CST as "this is where embedded languages can be". I created a HtmlEmbeddedLanguage node for this purpose.

With my current understanding of how biome_rowan works, I believe it could be possible to kinda shove parent and child CSTs together in the same tree. This is how SyntaxNode is currently defined:

#[derive(Clone, PartialEq, Eq, Hash)]
pub struct SyntaxNode<L: Language> {
    raw: cursor::SyntaxNode,
    _p: PhantomData<L>,
}

The only thing that's technically enforcing the language constraint is the PhantomData<L>.

So I tried something like this:

for node in html.descendants() {
    if node.kind() == HtmlSyntaxKind::HTML_EMBEDDED_LANGUAGE && !node.text().is_empty() {
        let code = node.text().to_string();
        dbg!("Found embedded language", &code);

        let js_root = biome_js_parser::parse(
            &code,
            // TODO: determine the correct options
            JsFileSource::js_script(),
            JsParserOptions::default(),
        );
        let as_html = HtmlSyntaxNode::from_foriegn(js_root.syntax());
        // TODO: then replace this node with as_html
    }
}

It works (sorta), but it panics when you try to print it (because the SyntaxKind is different).
Of course there are several things wrong with this approach. But it gave me a couple of other ideas for implementation.

One option is that we could have a "SuperSyntaxNode" that could be any language. For formatting, when we reach a node for that language, we just use the formatter for that language to format the node. (Obviously not that simple, I'm handwaving the details for now.)

Another option could be that we have a special "transition node" with a marker to indicate what language to transition into. This node would only ever have a single child, which would be the root node for the embedded language. I think this makes the most sense, but I have absolutely no clue how we would represent that in rowan.

arendjr · 2024-09-25T21:18:04Z

Might be good to know that we also have GritTargetLanguageNode which is also intended to be agnostic over language type. So it’s actually an enum where the variants are language-specific syntax kinds (JS is the only variant for now). We could move it from there and rename it if it’s useful for other things too.

arendjr · 2024-09-25T21:21:12Z

That said, it feels like a “transition node” might actually be the better choice here. I guess you may not need to perform any special Rowan tricks if you just put the embedded tree in the transition node’s data?

dyc3 · 2024-10-01T17:13:22Z

I've hit another blocker. Our parsing infra expects all SyntaxKind to map to a specific u16 value.

impl From<u16> for HtmlSyntaxKind {
    fn from(d: u16) -> HtmlSyntaxKind {
        assert!(d <= (HtmlSyntaxKind::__LAST as u16));
        unsafe { std::mem::transmute::<u16, HtmlSyntaxKind>(dbg!(d)) }
    }
}

This assumption causes lots of problems when trying to force trees from two different languages into the same tree. It doesn't seem like it's easy to add a new SyntaxKind::FOREIGN_TOKEN(u16) to handle these in the same tree.
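The collision can be illustrated with two toy kind enums. These are not the real `HtmlSyntaxKind` or `JsSyntaxKind`, the variant values are made up, and a safe `from_raw` is used here in place of the `transmute`.

```rust
/// Toy HTML kinds; the numeric values are invented for this sketch.
#[repr(u16)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HtmlKind {
    Eof = 0,
    Element = 1,
    EmbeddedLanguage = 2,
}

impl HtmlKind {
    /// Decodes a raw kind the way an HTML-typed tree would.
    fn from_raw(raw: u16) -> Option<Self> {
        match raw {
            0 => Some(Self::Eof),
            1 => Some(Self::Element),
            2 => Some(Self::EmbeddedLanguage),
            _ => None,
        }
    }
}

/// Toy JS kinds that reuse the same numeric range.
#[repr(u16)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[allow(dead_code)]
enum JsKind {
    Eof = 0,
    Module = 1,
    IfStatement = 2,
}
```

Here a raw `2` written by the JS parser would be decoded as `HtmlKind::EmbeddedLanguage` when read through an HTML-typed node, which is the kind of mismatch behind the printing panic mentioned earlier.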

I'm starting to think that the better approach would be to just do the embedded language parsing after the initial parsing pass and just keep them as separate trees.

arendjr · 2024-10-01T17:28:51Z

Right, I think that's kinda what I was getting at with the "put the embedded tree in the transition node’s data". The way I would expect this to work:

Parent (HTML) parser discovers a "foreign node" that contains embedded content.
Parent parser continues looking for the closing tag. Now it knows the range of the embedded content.
Parent parser invokes the correct child (CSS, JS) parser.
Child parser produces a separate tree.
Parent parser stores the produced tree in the data of the foreign node.
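The five steps above can be sketched with toy types. Everything here is hypothetical: `HtmlNode`, `EmbeddedTree`, and the "parsers" are stand-ins that only show the control flow, not how rowan nodes would store the embedded tree.

```rust
use std::ops::Range;

/// Stand-in for the separate tree produced by a child parser.
#[derive(Debug, PartialEq)]
struct EmbeddedTree {
    language: &'static str,
    statements: Vec<String>,
}

#[derive(Debug, PartialEq)]
enum HtmlNode {
    Text(String),
    /// The "foreign node": keeps the source range and the child's tree.
    Foreign { range: Range<usize>, tree: EmbeddedTree },
}

/// Toy child parser: splits the snippet on `;`.
fn parse_js(source: &str) -> EmbeddedTree {
    let statements = source
        .split(';')
        .map(str::trim)
        .filter(|statement| !statement.is_empty())
        .map(str::to_string)
        .collect();
    EmbeddedTree { language: "js", statements }
}

/// Toy parent parser that only understands `<script>` blocks.
fn parse_html(source: &str) -> Vec<HtmlNode> {
    const OPEN: &str = "<script>";
    const CLOSE: &str = "</script>";
    let mut nodes = Vec::new();
    let mut cursor = 0;
    // Step 1: discover a foreign node.
    while let Some(found) = source[cursor..].find(OPEN) {
        let open_at = cursor + found;
        if open_at > cursor {
            nodes.push(HtmlNode::Text(source[cursor..open_at].to_string()));
        }
        let content_start = open_at + OPEN.len();
        // Step 2: look for the closing tag to learn the embedded range.
        let Some(relative_end) = source[content_start..].find(CLOSE) else {
            // Unterminated block: keep the rest as plain text.
            nodes.push(HtmlNode::Text(source[open_at..].to_string()));
            return nodes;
        };
        let content_end = content_start + relative_end;
        // Steps 3 and 4: hand only the substring to the child parser.
        let tree = parse_js(&source[content_start..content_end]);
        // Step 5: store the separate tree in the foreign node.
        nodes.push(HtmlNode::Foreign { range: content_start..content_end, tree });
        cursor = content_end + CLOSE.len();
    }
    if cursor < source.len() {
        nodes.push(HtmlNode::Text(source[cursor..].to_string()));
    }
    nodes
}
```

The child parser never sees the closing tag, so finding the end of the snippet stays the parent's responsibility, and the parent simply continues after step 5.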

ematipico · 2024-10-01T18:57:36Z

Yes, we can't do multiple parsing in the same phase because of L: Language, which is one of the greatest constraints we have in the code base (good or bad).

Theoretically, what @arendjr suggested could be done, but this would involve mutating an existing CST using the same mutations we use for the analyzer, which could be slow, but maybe we could overlook it.

I'll have to study the problem a bit more when I have more time.

arendjr · 2024-10-01T19:45:56Z

Do we need to mutate the entire tree for that? I’d think we need to mutate at most the most recently parsed node. Because the parent parser can continue parsing the remaining nodes after step 5.

I’m not exactly sure how the nodes are laid out in memory, but I think “storing the embedded tree” could even be as simple as moving the root syntax node for the embedded tree into the foreign node? If I’m not mistaken, those are fixed size themselves, so the foreign node could reserve a slot for that like any other node fields.

orimay · 2024-11-19T13:51:23Z

We also need to be able to indent script and style contents (Prettier's vueIndentScriptAndStyle option allows that):

<script>
  const foo = 5;
  if (foo < 10) {
    console.log("foo");
  }
</script>
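A minimal sketch of such an option, applied to already formatted embedded content. The function name and parameters are hypothetical. Note that blindly indenting every line would also shift the contents of multi-line template literals inside the script, which is the indentation concern raised earlier in this thread.

```rust
/// Indents every non-empty line of the embedded content by `indent`
/// when the option is enabled; otherwise returns the lines unchanged.
fn indent_embedded(content: &str, indent: &str, enabled: bool) -> String {
    content
        .lines()
        .map(|line| {
            if enabled && !line.trim().is_empty() {
                format!("{indent}{line}")
            } else {
                line.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```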

ah-yu self-assigned this Jul 2, 2024

ah-yu added A-Formatter Area: formatter L-JavaScript Language: JavaScript and super languages S-Feature Status: new feature to implement labels Jul 2, 2024

This was referenced Jul 3, 2024

feat(css_parser): introduce grit metavariable #3340

Merged

refactor(formatter_test): refactor TestFormatLanguage trait #3395

Merged

ah-yu mentioned this issue Jul 10, 2024

feat(js_formatter): provide the JS formatter with the ability to format other languages #3403

Draft

ah-yu mentioned this issue Jul 23, 2024

📝 The behavior of handling nested template literals is weird #3500

Closed

1 task

uncenter mentioned this issue Aug 2, 2024

☂️ CSS Formatter #1285

Open

11 tasks

ematipico mentioned this issue Aug 15, 2024

☂️ Biome - current developments #2455

Open

dyc3 mentioned this issue Sep 4, 2024

📎 markdown support #3718

Open

ematipico mentioned this issue Oct 17, 2024

📎 add support for graphql-tag formatting #430

Closed

fregante mentioned this issue Nov 19, 2024

Meta: Automatically format Svelte files refined-github/refined-github#8084

Merged

dyc3 mentioned this issue Dec 12, 2024

☂️ HTML Parsing and Formatting #4726

Open

11 tasks

vohoanglong0107 mentioned this issue Dec 23, 2024

📎 GraphQL support #1927

Closed

3 tasks

damassi mentioned this issue Dec 24, 2024

chore: Re-add prettier for formatting due to missing graphql support in Biome formatter artsy/force#15044

Merged
