Add reserved words#117

Merged
314eter merged 6 commits into tree-sitter:master from ddickstein:reserved-words
May 31, 2025

@ddickstein
Contributor

Tree-sitter now supports reserved keywords for better error recovery. This commit updates the OCaml grammar to mark reserved words. For example, before

let x =

type t = int

was parsed as

(compilation_unit ; [0, 0] - [4, 0]
  (value_definition ; [0, 0] - [2, 12]
    "let" ; [0, 0] - [0, 3]
    (let_binding ; [0, 4] - [2, 12]
      pattern: (value_name) ; [0, 4] - [0, 5]
      "=" ; [0, 6] - [0, 7]
      body: (infix_expression ; [2, 0] - [2, 12]
        left: (application_expression ; [2, 0] - [2, 6]
          function: (value_path ; [2, 0] - [2, 4]
            (value_name)) ; [2, 0] - [2, 4]
          argument: (value_path ; [2, 5] - [2, 6]
            (value_name))) ; [2, 5] - [2, 6]
        operator: (rel_operator) ; [2, 7] - [2, 8]
        right: (value_path ; [2, 9] - [2, 12]
          (value_name)))))) ; [2, 9] - [2, 12]

and now it is parsed as

(compilation_unit ; [0, 0] - [4, 0]
  (value_definition ; [0, 0] - [0, 5]
    "let" ; [0, 0] - [0, 3]
    (let_binding ; [0, 4] - [0, 5]
      pattern: (value_name))) ; [0, 4] - [0, 5]
  (ERROR ; [0, 6] - [0, 7]
    "=") ; [0, 6] - [0, 7]
  (type_definition ; [2, 0] - [2, 12]
    "type" ; [2, 0] - [2, 4]
    (type_binding ; [2, 5] - [2, 12]
      name: (type_constructor) ; [2, 5] - [2, 6]
      "=" ; [2, 7] - [2, 8]
      equation: (type_constructor_path ; [2, 9] - [2, 12]
        (type_constructor))))) ; [2, 9] - [2, 12]
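
The difference between the two trees can be illustrated with a toy model of keyword extraction. This is only a sketch, not tree-sitter's real lexer: the function `classify` and both keyword sets are hypothetical and deliberately tiny. It assumes that, without reserved words, a word that is not a valid keyword in the current parse state falls back to an identifier, which is how `type` ended up as a function name in the first tree.

```python
# Toy model of keyword extraction; NOT tree-sitter's real lexer.
# The keyword sets below are small illustrative subsets chosen for this sketch.

def classify(word, valid_here, reserved=frozenset()):
    """Return "keyword" or "identifier" for a lexed word.

    valid_here: keywords the parser could accept in its current state.
    reserved:   words that can never be identifiers, whatever the state.
    """
    if word in valid_here or word in reserved:
        return "keyword"
    return "identifier"


# After `let x =` the parser wants an expression, so only
# expression-starting keywords are valid (hypothetical subset).
EXPRESSION_START = frozenset({"let", "fun", "if", "match"})
RESERVED = frozenset({"let", "type", "in", "fun", "if", "match"})

# Without reserved words, `type` falls back to an identifier ...
print(classify("type", EXPRESSION_START))            # identifier
# ... with them, it stays a keyword, so the expression cannot continue.
print(classify("type", EXPRESSION_START, RESERVED))  # keyword
```

In the second case the parser has no way to continue the `let` binding, so it reports an error at `=` and resumes at the `type` definition, which matches the second tree.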

@ddickstein
Contributor Author

List of reserved words taken from https://ocaml.org/manual/5.3/lex.html#sss:keywords

@aryx
Contributor

aryx commented Apr 21, 2025

nice!

@314eter
Collaborator

314eter commented Apr 21, 2025

I was working on this myself at 314eter/tree-sitter-ocaml. But the tests are failing because the Python bindings don't support 0.25 yet, so I was waiting on that to get released to create a PR.

Some things I did that are missing here:

  • Upgraded the dependencies to tree-sitter 0.25
  • Excluded the nonrec keyword. It's new since OCaml 4.02, so old code may be using it as a variable.
  • Included the binary operators or, lor, lxor, mod, land, lsl, lsr and asr by making them tokens in the grammar.
  • Used a different set of keywords for attribute_id.

@ddickstein
Contributor Author

ddickstein commented Apr 21, 2025 via email

@ddickstein
Contributor Author

ddickstein commented May 28, 2025

It looks like tree-sitter/py-tree-sitter#333 has stalled (the maintainers aren't responding to the author's write-access requests). Can we consider merging this without waiting on 0.25?

Re: nonrec, is OCaml 4.02 (released over 10 years ago) an important backwards-compatibility target? The floor of the minimum tested OCaml version has been raised to 4.08.

For the other grammar changes to binary operators and attribute_id, feel free to incorporate them into this PR.

@314eter
Collaborator

314eter commented May 28, 2025

If we merge this, tree-sitter-ocaml will become incompatible with py-tree-sitter. That's not a huge problem, but it's annoying, and there's nothing urgent about the reserved keywords feature. It just improves error recovery.

About nonrec: it's indeed not very important to support 4.02, but there aren't many downsides to excluding nonrec from the reserved keywords. It's again just error recovery that will be impacted in some specific cases.

It's a logical decision not to test new opam packages on 4.02, since probably nobody is still writing new code using 4.02. But old 4.02 code does still exist, so I think tools like tree-sitter should try to support a wide range of versions for a longer time.

@ddickstein
Contributor Author

I don't have visibility into how annoying incompatibility with py-tree-sitter is, but I can say more about my use case. It's not just minor cosmetics: I have custom query logic (providing functionality, not highlighting) that does not work properly with the previous grammar. It tries to detect an error case that does not appear in the expected position because the tree is fundamentally changed (e.g., the in keyword becomes a function name). I am currently working off my fork of the grammar, but I'd ideally like to publish what I've built, and other people cannot use it without this change.
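
As a rough illustration of why error detection depends on the tree shape, the sketch below searches the two trees from the PR description for ERROR nodes. It is a stand-in, not the query logic mentioned above: `parse_sexp` and `error_nodes` are hypothetical helpers that work on the printed S-expressions (with the position comments stripped) rather than on a real tree-sitter tree.

```python
import re

# Tokens: parentheses, quoted anonymous tokens such as "=", and bare words.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()]+')


def parse_sexp(text):
    """Parse a tree-sitter style S-expression into nested lists.

    Each node becomes [node_type, child, ...]; field labels such as
    `pattern:` are dropped and anonymous tokens stay as quoted strings.
    """
    tokens = [t for t in TOKEN.findall(text) if not t.endswith(":")]
    pos = 0

    def node():
        nonlocal pos
        pos += 1                    # consume "("
        result = [tokens[pos]]      # node type
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                result.append(node())
            else:
                result.append(tokens[pos])
                pos += 1
        pos += 1                    # consume ")"
        return result

    return node()


def error_nodes(tree):
    """Yield every ERROR subtree in depth-first order."""
    if isinstance(tree, list):
        if tree[0] == "ERROR":
            yield tree
        for child in tree[1:]:
            yield from error_nodes(child)


BEFORE = """
(compilation_unit
  (value_definition "let"
    (let_binding pattern: (value_name) "="
      body: (infix_expression
        left: (application_expression
          function: (value_path (value_name))
          argument: (value_path (value_name)))
        operator: (rel_operator)
        right: (value_path (value_name))))))
"""

AFTER = """
(compilation_unit
  (value_definition "let"
    (let_binding pattern: (value_name)))
  (ERROR "=")
  (type_definition "type"
    (type_binding name: (type_constructor) "="
      equation: (type_constructor_path (type_constructor)))))
"""

print(list(error_nodes(parse_sexp(BEFORE))))  # no ERROR node to find
print(list(error_nodes(parse_sexp(AFTER))))   # one ERROR node wrapping "="
```

With the old grammar the incomplete `let` binding produces no ERROR node at all, so any logic keyed on errors has nothing to match.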

Re: nonrec, fair point about successfully parsing old code, though I think a breakage would have to be old code that happens to use "nonrec" as an identifier. Which could conceivably exist, but doesn't seem that likely. Either way, I don't feel strongly about this.

@314eter
Collaborator

314eter commented May 29, 2025

Ok, I didn't expect error recovery to make such a difference.

The problem with being one of the first to move to tree-sitter ABI 15 (none of the officially supported grammars have been updated) is that many tools (language bindings for Python and Swift, editors like Emacs) will not be ready yet. So new features that get added or bugs that get fixed will not be available to many users.

I'd prefer to get at least OCaml 5.4 support done first in a 0.24 version, and then we can move to 0.25 (temporarily disabling Python and Swift tests).

ddickstein and others added 6 commits May 29, 2025 17:19
@clason

clason commented May 30, 2025

> The problem with being one of the first to move to tree-sitter ABI 15 (none of the officially supported grammars have been updated)

That's not true (and "officially supported" doesn't mean much these days). I'd have to check with the Emacs people, but Neovim definitely supports ABI 15. It all depends on whether a project uses a language-specific binding or uses the lib directly (in C or Rust) -- which I believe is what the majority of "large" consumers do. But of course OCaml is special, and there may be important language-specific projects you know and care about that use the Python bindings.

(To put some numbers to it, out of the 319 parsers I track, 44 are ABI 15, 249 are ABI 14, and 26 are still ABI 13.)
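
As a quick check of those figures, the snippet below confirms that the three counts add up to 319 and converts them to shares. The counts come from the comment above; the variable names are only illustrative.

```python
# Parser counts per ABI version, as quoted in the comment above.
counts = {15: 44, 14: 249, 13: 26}
total = sum(counts.values())
shares = {abi: n / total for abi, n in counts.items()}

print(total)  # 319
for abi, n in counts.items():
    print(f"ABI {abi}: {n} parsers ({shares[abi]:.1%})")
# ABI 15: 44 parsers (13.8%)
# ABI 14: 249 parsers (78.1%)
# ABI 13: 26 parsers (8.2%)
```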

In any case, moving to ABI 15 just means consumers stuck on ABI 14 can't update your parser to the latest version and miss out on the benefits of this PR. Just mark the next release as breaking and let users decide.

And all this is moot since you already bumped the ABI to 15 in #123 ;) (Tree-sitter 0.25 defaults to ABI 15; you need to run tree-sitter generate --abi 14 explicitly to keep it at ABI 14.)

(You would be the first language to use the new reserved words feature, though ;))

@314eter
Collaborator

314eter commented May 30, 2025

I only looked at the version numbers of other grammars. Most of them follow the tree-sitter version. So if they are still on version 0.24, I was assuming they're not on ABI 15, but it looks like that's not true anymore. That, combined with the fact that 2 out of 5 official language bindings (official meaning they're in the tree-sitter organization and tree-sitter-cli templates) don't support ABI 15 yet, made me conclude it's not widely supported yet.

For most grammars that doesn't really matter, since you can regenerate with ABI 14 if necessary. But if we start using reserved keywords immediately, that will be impossible.
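
The trade-off can be written down as a small compatibility check. This is a sketch under stated assumptions: the ABI ranges per library version are assumed for illustration (a 0.24 library loading ABI 13 to 14, a 0.25 library loading ABI 13 to 15), and `minimum_abi` and `can_load` are hypothetical helpers, not part of any tree-sitter API.

```python
# Assumed ABI ranges, for illustration only: a tree-sitter 0.24 library is
# taken to load parser ABIs 13..14 and a 0.25 library ABIs 13..15.
SUPPORTED_ABIS = {"0.24": range(13, 15), "0.25": range(13, 16)}


def minimum_abi(uses_reserved_words):
    """Lowest ABI discussed in this thread that the grammar can target.

    A grammar with reserved words cannot be regenerated for ABI 14.
    """
    return 15 if uses_reserved_words else 14


def can_load(library_version, parser_abi):
    """True if a consumer on this library version can load the parser."""
    return parser_abi in SUPPORTED_ABIS[library_version]


for uses_reserved in (False, True):
    abi = minimum_abi(uses_reserved)
    for lib in ("0.24", "0.25"):
        print(f"reserved={uses_reserved} abi={abi} lib={lib}: {can_load(lib, abi)}")
```

Under these assumptions, only the combination of reserved words and a 0.24 consumer fails to load, which is the case the Python and Swift bindings were in at the time.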

The master branch is indeed on ABI 15 now, as I said I'd do after #122. I rebased this PR on master and added my fixes, so I think it's ready to be merged now.

@clason

clason commented May 30, 2025

> I only looked at the version numbers of other grammars. Most of them follow the tree-sitter version.

No, that's not true. That was a convention that some -- but by no means all -- stick to. The C parser is definitely at ABI 15, and more will follow.

> For most grammars that doesn't really matter, since you can regenerate with ABI 14 if necessary. But if we start using reserved keywords immediately, that will be impossible.

That is true.

@314eter 314eter merged commit 3ef7c00 into tree-sitter:master May 31, 2025
6 checks passed