V2: Don't expose `text` nodes #33

stsewd · 2023-12-30T00:06:07Z

It's very common for users to want to match TODO in addition to TODO: (ending in colon). To match that, users needed to match over all text nodes, since every word is a text node, that was slow. So the main improvement here is to not expose text nodes, and instead expose any uppercase word as a node. Unless you write all of your comments very angry in SCREAMING case, performance should improve.

Changes

Don't include text nodes at all, instead expose all uppercase words as tag nodes.
This should improve performance, but it will yield more false positives, but I think it's better if users explicitly match the tag names they want (TODO, NOTE, etc), so it shouldn't be a problem in practice.
Tags are divided in three types:
- Simple tags: They are basically any uppercase word.
- Normal tags: They are uppercase words immediately followed by a :.
- Annotated tags: They are like a normal tag, but they can include an annotation TODO (something):.
The (user) node was renamed to (annotation)
No longer exposing text nodes implies that you can no longer match things like #10 or !10. If people want, I can introduce a way to match those things, maybe a (reference) node /[@#!]([a-zA-Z0-9)/.

While this new approach makes the parser less flexible without the text nodes, it will make it faster, or at least faster to match, since the parsing itself is/was already fast (doing a match over text nodes is the slow operation). Another approach I was thinking is to parse the keywords directly, but that makes it even less flexible, since any new keywords will need to be added in the parser instead of a query.

Other notes

I tried moving everything to JS in f2318ba, but it was slow, due to the use of conflicts.

Original comment:

Moving everything to JS means losing a little bit of control over the exact matching of the tag name, resulting in matches for:

TODO:nospace (previously, it was needed to have a space after :).
TODO(\n note \n): (\n is a literal new line).

I think it's fine for those to match, don't think you see much of those in the wild, if they are a problem, we can try adjusting the rules, or go back to have that small logic in the external scanner.

UPDATE: Did a benchmark, it's SLOWWWW, from 36 ms to 1264 ms, this is parsing a file with 10K lines full of comments. This is probably due to the conflicts that tree-sitter needs to resolve...

stsewd added 4 commits December 29, 2023 17:10

Try another approach

f2318ba

Go back to parse the tag name from C

4cd6881

Typo

2413d83

Update readme

e5403e2

stsewd changed the title ~~V2: Try another approach~~ V2: Don't expose text nodes Dec 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

V2: Don't expose `text` nodes #33

V2: Don't expose `text` nodes #33

stsewd commented Dec 30, 2023 •

edited

Loading

V2: Don't expose text nodes #33

Are you sure you want to change the base?

V2: Don't expose text nodes #33

Conversation

stsewd commented Dec 30, 2023 • edited Loading

Changes

Other notes

V2: Don't expose `text` nodes #33

V2: Don't expose `text` nodes #33

stsewd commented Dec 30, 2023 •

edited

Loading