tree-sitter-grammars: refactor grammar sources#408414
Conversation
4e7f3a6 to
da9fdd1
Compare
|
Wow, this looks like a big improvement. I'll need to take a closer look but this is amazing work so far! |
|
The usage examples you've gone over in your initial PR comment would be great as more official documentation. Would that make sense in this PR or would you rather add it in a follow up?
This is a good question. I'm curious how that would work?
Different projects using Tree-sitter are likely at some point going to differ on which grammar to use for same language, not to mention that Tree-sitter grammars (unfortunately) frequently go unmaintained which leads to forks and new alternatives popping up. You also said:
I think this is fine, it's good, intuitive behavior. It's also still possible to use a custom commit hash by changing the flake ref so I don't see any problems with it. |
|
The docs will absolutely be an addition here. After adding a grammer, wanting to improve the process for contributor n+1 started this. I had originally planned for that just to be docs, but I'm very easily distracted. Re multiple grammars for one language: alternatives can still be used either by calling If there was a desire to change that grammars/default.nix currently maps the grammar-sources list to an attrset. That can easilly be changed from a Are you aware of precendent or patterns elsewhere in nixpkgs for packaging multiple forks of the same tool? |
|
Some thoughts on approaches to supporting multiple grammars for the same language without introducing extra boilerplate for the majority:
# grammar-sources.nix
[
{
language = "indentlang";
attrName = "indentlang-spaces";
version = "0.1.0";
url = "github:indentlang/tree-sitter-indentlang";
hash = "";
}
{
language = "indentlang";
attrName = "indentlang-tabs";
version = "1.2.3";
url = "github:tabs4life/tree-sitter-indentlang";
hash = "";
}
]
# grammar-sources.nix
{
nix = {
# language = "nix" will be inferred here
# ...
};
indentlang-spaces = {
language = "indentlang";
# ...
};
indentlang-tabs = {
language = "indentlang";
# ...
};
} |
I'm not. And I think you're right it could get messy, I suggest leaving that for now - seems like it wouldn't be a breaking change if we were to move to |
|
Not sure if you're aware of #394870, I'd also like to introduce WASM as a target, especially since we wouldn't have to build system-specific files for something that should be simple. |
|
Just double checking - are there further steps needed on this pull request before we can move forward? I'd like to avoid having it sit for so long and not get merged! |
There was a problem hiding this comment.
First of all, great contribution! In your PR description you said that it heavily abuses IFD. Is that still the case? Chances are low that anyone wants to merge this if it still contains IFD.
I did some nits below, but overall this saves ~1600 LOC and makes the maintainer's experience a whole lot better! Thanks!
Is there a strong argument for supporting multiple grammars for a single language?
I don't think so, AFAIK most grammars live under tree-sitter or tree-sitter-grammars anyways on github. So the rule of thumb is that the grammars are first-class. I can't even think of a reason why someone would want to create their own version of, say, tree-sitter-markdown, that differs enough so that it both (a) doesn't/can't get upstreamed in some way and (b) needs to be packaged in nixpkgs ...
Oh, and also, this should probably go into nevermind I guess: #414233 (comment)staging as this would most likely trigger a rebuild of the grammars and results in 500-1000 rebuilds...
pkgs/development/tools/parsing/tree-sitter/grammars/grammar-sources.nix
Outdated
Show resolved
Hide resolved
pkgs/development/tools/parsing/tree-sitter/grammars/default.nix
Outdated
Show resolved
Hide resolved
Yes sorry - I've had a few things happening in personal life since raising that have blocked further dev. Hoping to get back onto it this week. Let me flag this as draft until that is resolved. This will also conflict with the work in #391520. Getting tests working and marking grammars as broken where appropriate are both good things IMO. If that's approaching mergable state happy to for it to sequence first, then I can integrate those same changes in what's defined here. Likewise for other grammar updates. I have a script that automates the re-write from the old JSON definitions to what's included in this PR so rework to update is minimal. Please don't let the presence of this block other maintenance and updates. |
That makes me happy. I just submitted a PR today that does a whole bunch of updates ( Also, with 3 big |
Yes. I'm keen for this change to just re-arrange internals, not what's actually built. I'm very good at making horrendously trivial mistakes so wanted to minimise manual process in that.
Good call. Done. |
|
@nekowinston I still need to come back to this and include your suggestions around For any grammars that don't have a tagged version the |
Not sure I understand, what do you mean by metadata, |
Each of the grammar repos have a |
|
Thanks for the feedback so far all. That should be most items discussed captured. I'll keep the seperate commits here for a few days to help simplify review and then if we're looking stable rebase ready for a merge. |
A little foggy on if this is still relevant but I'd argue there is a strong argument for supporting multiple grammars for a single language. For example, I recently came across https://arborium.bearcove.eu/#rust which uses https://codeberg.org/grammar-orchard/tree-sitter-rust-orchard as the Rust grammar rather than the "official" https://github.com/tree-sitter/tree-sitter-rust. Can the current system handle a differently named grammar for the same language? In this case "tree-sitter-rust-orchard" but for the same "rust" language as "tree-sitter-rust"? Sorry if this not relevant any longer. |
|
That can be accommodated in the current implementation. The attrset name defines the nix package name. If
rust-orchard = {
language = 'rust_orchard';
# ...
};Let's say it used {
rust = {
# there is an implicit `language = 'rust'` here
version = "0.23.2";
url = "github:tree-sitter/tree-sitter-rust";
hash = "sha256-aT+tlrEKMgWqTEq/NHh8Vj92h6i1aU6uPikDyaP2vfc=";
};
rust-orchard = {
language = 'rust';
# ...
};
} |
|
Great thank you for clarifying. This looks really good, documentation is great to finally properly have, excited for this to land. |
|
|
That should be the python packages fixed up. Review on that change welcome! The failing But in the grammar itself this is I'll see if I can get something working locally that detects a hyphen/underscore naming mismatch and logs a warning rather than build failure as this is likely a more general problem if tooling here becomes more widely used. Might be a couple of days before I have the bandwidth for that though as a heads up. edit: I had some time tonight so have implemented this as a logged error rather than build fail: nixpkgs/pkgs/development/tools/parsing/tree-sitter/build-grammar.nix Lines 57 to 64 in 9bfc2e2 The failing vimPlugins.nvim-treesitter-parsers.* packages have a number of different naming mismatches. This keeps builds that were previously passing still alive (as this was not previously checked) but does log an error. At some future point I think it may be good to consider moving that to a hard fail.
|
Major refactor of the approach to defining tree-sitter grammars. This provides a clearer abstraction to declare grammar sources and removes custom update logic in favour of nix-update.
cf2227d to
19713c7
Compare
|
|
I ran the updater after a Also the README.md should migrate to the manual ideally oc/languages-frameworks/ to become more visible. This being said, both points are easy to address in followups so I would like to merge this first. |
Refactor of the approach to defining tree-sitter grammars. This is intended to provide a simpler way to declare grammar sources as well as removing custom update logic in favour of
nix-update.Key items in this PR:
fetchgit.buildGrammar.With tree-sitter usage becoming more widespread this may also provide a means to define grammars in one point, that can then be re-used and referenced across other tools in nixpkgs.
Example usage
To declare a new grammar, insert an entry in grammars-sources.nix with a fake hash placeholder:
This will expand to an element within
pkgs.tree-sitter.grammarswhen built that matches common conventions for tree-sitter grammar repos:Each of these entries are passed to buildGrammar and in turn populate
pkgs.tree-sitter-grammars.Attempt to build the new grammar:
nix-build -A tree-sitter-grammars.tree-sitter-latex. This will fail - review the the downloaded source, extract the hash and update the source definition.To pin to a specific ref (either tag or commit hash), append this to the source url to override the default version tag:
The URL field curently supports
github:(fetchFromGitHub),gitlab:(fetchFromGitLab), andsourcehut:(fetchFromSourcehut) but can be exanded as other sources are required. To use other fetchers, specify thesrcattribute directly:Additional attributes in the source are forwarded to
buildGrammarand upward tomkDerivationto enable default to be modified as required. This includes metadata, as well as any other attributes used by the build process.General notes
tree-sitterorg but these are not what's packaged currently in nixpkgs. In some cases there are likely good reasons for this, others maybe not.Things done
nix.conf? (See Nix manual)sandbox = relaxedsandbox = truenix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage./result/bin/)Add a 👍 reaction to pull requests you find important.