Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

V2: Don't expose text nodes #33

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

V2: Don't expose text nodes #33

wants to merge 4 commits into from

Conversation

stsewd
Copy link
Owner

@stsewd stsewd commented Dec 30, 2023

It's very common for users to want to match TODO in addition to TODO: (ending in colon). To match that, users needed to match over all text nodes, since every word is a text node, that was slow. So the main improvement here is to not expose text nodes, and instead expose any uppercase word as a node. Unless you write all of your comments very angry in SCREAMING case, performance should improve.

Changes

  • Don't include text nodes at all, instead expose all uppercase words as tag nodes.
    This should improve performance, but it will yield more false positives, but I think it's better if users explicitly match the tag names they want (TODO, NOTE, etc), so it shouldn't be a problem in practice.
  • Tags are divided in three types:
    • Simple tags: They are basically any uppercase word.
    • Normal tags: They are uppercase words immediately followed by a :.
    • Annotated tags: They are like a normal tag, but they can include an annotation TODO (something):.
  • The (user) node was renamed to (annotation)
  • No longer exposing text nodes implies that you can no longer match things like #10 or !10. If people want, I can introduce a way to match those things, maybe a (reference) node /[@#!]([a-zA-Z0-9)/.

While this new approach makes the parser less flexible without the text nodes, it will make it faster, or at least faster to match, since the parsing itself is/was already fast (doing a match over text nodes is the slow operation). Another approach I was thinking is to parse the keywords directly, but that makes it even less flexible, since any new keywords will need to be added in the parser instead of a query.

Other notes

I tried moving everything to JS in f2318ba, but it was slow, due to the use of conflicts.

Original comment:

Moving everything to JS means losing a little bit of control over the exact matching of the tag name, resulting in matches for:

  • TODO:nospace (previously, it was needed to have a space after :).
  • TODO(\n note \n): (\n is a literal new line).

I think it's fine for those to match, don't think you see much of those in the wild, if they are a problem, we can try adjusting the rules, or go back to have that small logic in the external scanner.

UPDATE: Did a benchmark, it's SLOWWWW, from 36 ms to 1264 ms, this is parsing a file with 10K lines full of comments. This is probably due to the conflicts that tree-sitter needs to resolve...

@stsewd stsewd changed the title V2: Try another approach V2: Don't expose text nodes Dec 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant