Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(parser): GritQL parser #1998

Merged
merged 22 commits into from
Mar 22, 2024
Merged

feat(parser): GritQL parser #1998

merged 22 commits into from
Mar 22, 2024

Conversation

arendjr
Copy link
Contributor

@arendjr arendjr commented Mar 7, 2024

Summary

Implements the GritQL parser based on the merged grammar.

Most of the parser is a pretty straightforward implementation following the grammar. Things of note:

  • Grit patterns support code snippets, and the code inside those code snippets needs parsing too. This functionality is NOT covered by this parser since it requires parsing in the target language, with a bit of magic to support variable placeholders. In the open-source GritQL repo this is handled by TreeSitter grammars. I have not yet decided whether it makes sense for us to use TreeSitter too, or if we can modify Biome's grammars in a compatible way.
  • Math operators were the only part in the GritQL grammar that relied on left-associativity, so I implemented an operator-precedence parser for them (see parse_math_pattern()).
  • I noticed in the CONTRIBUTING.md it is suggested to use .ok() when parsing nodes that may be absent, while I had already used let _ = for this purpose. Just let me know if you like me to change it :)
  • Some error recovery is already implemented, let me know if you see ways to improve it further.
  • I used pretty straightforward loops to deal with lists, instead of specialized infrastructure. I'm not exactly sure if or why ParseBlockBody might be applicable here. Please advise on the best approach here.
  • As far as encoding is concerned, it appears Grit files are always UTF8, while identifiers are even limited to pure ASCII.

TODO:

  • Verify precedence rules.

Test Plan

Tests are added.

@github-actions github-actions bot added the A-Tooling Area: internal tools label Mar 7, 2024
Copy link

netlify bot commented Mar 7, 2024

Deploy Preview for biomejs ready!

Name Link
🔨 Latest commit 1857435
🔍 Latest deploy log https://app.netlify.com/sites/biomejs/deploys/65fd4313c6cd5f000847f576
😎 Deploy Preview https://deploy-preview-1998--biomejs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.
Lighthouse
Lighthouse
1 paths audited
Performance: 96 (🔴 down 3 from production)
Accessibility: 97 (no change from production)
Best Practices: 100 (no change from production)
SEO: 93 (no change from production)
PWA: -
View the detailed breakdown and full score reports

To edit notification comments on pull requests, go to your Netlify site configuration.

Copy link

codspeed-hq bot commented Mar 7, 2024

CodSpeed Performance Report

Merging #1998 will not alter performance

Comparing arendjr:gritql-parser (1857435) with main (644d2f4)

Summary

✅ 93 untouched benchmarks

@ematipico
Copy link
Member

I will review this eventually.

Can you help us and explain how we should review the PR? How does error recovery work? What's the encoding of grit files? Do we handle UTF-16 strings?

@arendjr
Copy link
Contributor Author

arendjr commented Mar 15, 2024

Yeah, I’m finishing up the precedence rules then I’ll document it a bit more. AFAIK, it’s all UTF8 only :)

@github-actions github-actions bot added the A-Parser Area: parser label Mar 16, 2024
@github-actions github-actions bot removed the A-Parser Area: parser label Mar 16, 2024
@arendjr
Copy link
Contributor Author

arendjr commented Mar 16, 2024

Okay, precedence rules seem to work. I've updated the PR description to ask for guidance so I can move it over the finish line.

@arendjr arendjr requested review from a team, DaniGuardiola and denbezrukov March 16, 2024 12:23
@ematipico
Copy link
Member

It's really weird that grit files are UTF-8, while JavaScript is WTF-16 (https://simonsapin.github.io/wtf-8/#ill-formed-utf-16) This means that we already found the first limitation for possible extensions/plugins for Biome.

@arendjr
Copy link
Contributor Author

arendjr commented Mar 16, 2024

It's really weird that grit files are UTF-8, while JavaScript is WTF-16 (https://simonsapin.github.io/wtf-8/#ill-formed-utf-16) This means that we already found the first limitation for possible extensions/plugins for Biome.

I don't fully understand this comment. JavaScript doesn't specify an explicit encoding for JavaScript files, although I've yet to encoder the first one in my life that wasn't encoded with UTF-8. JavaScript uses UTF-16 (WTF-16) for its internal string object representation, but that only affects runtimes, not parsers.

@morgante
Copy link

For the record, we never explicitly chose an encoding for .grit files either. I'm not sure why this would be a limitation.

Copy link
Member

@ematipico ematipico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is very long and it will take a while to look at it. I wished it was broken down to smaller pieces.

I left some comments, here the summary of the things you should look after:

  • Diagnostics: we can do better, and we can provide better messages to help users to fix the possible grammar errors. Some errors can leverage the internal APIs to provide better UI in those messages.
  • Lists: you need to use the infra provided by biome_parser, e.g. ParseNodeList or ParseSeparatedList. The CSS parser uses it, for example.
  • Bogus: unfortunately I couldn't review much around parsing, but it seems that we aren't using the bogus nodes defined in grammar. An example is GritBogusDefinition, which I would expected to find in definitions.rs or where we parse the definitions (I didn't find GRIT_BOGUS_DEFINITION)
  • Error cases: I think we should add more error cases, especially cases that should prove the error recovery

crates/biome_grit_parser/Cargo.toml Outdated Show resolved Hide resolved
crates/biome_grit_parser/src/lexer/mod.rs Outdated Show resolved Hide resolved
diagnostics: Vec<ParseDiagnostic>,
}

impl<'src> Lexer<'src> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the trait Lexer that is provided by biome_parser.

Using it would help us to make the all lexers all the same, at least having all the same methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm noticing a few more methods in the JS lexer that can be removed as well, since they match what's inside the generic lexer. Hope you don't mind if I move them in this PR.

Copy link
Member

@ematipico ematipico Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer not. This PR is already big as I said before, and it would be best to ease the job of the reviewers instead of making it harder.

You can raise a PR against this branch or main

crates/biome_grit_parser/src/lexer/mod.rs Outdated Show resolved Hide resolved
crates/biome_grit_parser/src/lexer/mod.rs Outdated Show resolved Hide resolved
crates/biome_grit_parser/src/parser/mod.rs Outdated Show resolved Hide resolved
crates/biome_grit_parser/src/parser/mod.rs Outdated Show resolved Hide resolved
crates/biome_grit_parser/src/parser/mod.rs Outdated Show resolved Hide resolved
crates/biome_grit_parser/src/parser/definitions.rs Outdated Show resolved Hide resolved
Copy link
Member

@ematipico ematipico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is very long and it will take a while to look at it. I wished it was broken down to smaller pieces.

I left some comments, here the summary of the things you should look after:

  • Diagnostics: we can do better, and we can provide better messages to help users to fix the possible grammar errors. Some errors can leverage the internal APIs to provide better UI in those messages.
  • Lists: you need to use the infra provided by biome_parser, e.g. ParseNodeList or ParseSeparatedList. The CSS parser uses it, for example.
  • Bogus: unfortunately I couldn't review much around parsing, but it seems that we aren't using the bogus nodes defined in grammar. An example is GritBogusDefinition, which I would expected to find in definitions.rs or where we parse the definitions (I didn't find GRIT_BOGUS_DEFINITION)
  • Error cases: I think we should add more error cases, especially cases that should prove the error recovery

@arendjr
Copy link
Contributor Author

arendjr commented Mar 20, 2024

I will definitely try to add some more test cases, especially around error recovery, but I think I've addressed almost all review comments. Thanks a ton!

@arendjr
Copy link
Contributor Author

arendjr commented Mar 21, 2024

Alright, there might be more improvements for error handling down the line, but as you said, this PR is already too big, so I'm going to leave it at this. I think it's generally in a quite decent shape, just let me know if you think there are any blockers still.

@arendjr arendjr requested a review from ematipico March 21, 2024 19:45
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example doesn't have any bogus node, which means we aren't able to recover to parser.

Considering the grammar, I would expect at least a top-level bogus GritBogusPattern.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does a missing required node though, and there's also a diagnostic, so what kind of recovery would be necessary here? I thought recovery means advancing the parser to a point where it can continue parsing again, but that seems unnecessary here, because it already sees an else token and picks up from there. But I'm happy to fix it in a follow-up PR if I understand what I'm missing :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does a missing required node though

completely missed that :) sorry

@arendjr arendjr merged commit a7ef325 into biomejs:main Mar 22, 2024
14 of 16 checks passed
@arendjr arendjr mentioned this pull request Apr 15, 2024
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Tooling Area: internal tools
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants