Investigate tree-sitter to replace syntect #1787

Closed
Keats opened this issue Mar 3, 2022 · 93 comments

Comments

@Keats
Collaborator

Keats commented Mar 3, 2022

Has anyone used it? The last time I looked at tree-sitter it didn't have many grammars but a quick look shows it's getting better. Our syntect syntaxes are stuck on old versions of the grammars because of new features in the Sublime grammar format not supported by Syntect.
See https://github.com/nvim-treesitter/nvim-treesitter#supported-languages for a list of supported languages.

An alternative would be a basic textmate highlighter using VSCode syntaxes/themes since that's what everyone seems to be using these days.

@mwcz
Contributor

mwcz commented Mar 4, 2022

I haven't used tree-sitter as a library but it's really nice in nvim.

@jakelogemann

jakelogemann commented Mar 18, 2022

👍 ts is the bee's knees.

The official tree-sitter-highlight crate has a few nice examples in the README... I haven't used it before, but I'm interested to try...

ref

@Keats
Collaborator Author

Keats commented Mar 18, 2022

If it's adopted (I don't know yet, I need to see the theming capabilities and just try it on various inputs), it would be its own package that can probably be on crates.io as well. I'd like to move all the line numbers/highlights etc. into it.

@Keats
Collaborator Author

Keats commented May 20, 2022

@Jieiku
Contributor

Jieiku commented Jun 8, 2022

What would the tree-sitter output look like? Would it list a ton of classes in the generated HTML like the current syntect solution does when you use CSS mode? Or would it use classes that refer to CSS variables which we can then set to specific colors and styles, e.g.:
--z-1, --z-2, --z-3, etc... with a couple modifiers for bold, italic, and underline: --z-b, --z-i, --z-u

@Keats
Collaborator Author

Keats commented Jun 8, 2022

Ideally it would be the same kind of output as the current syntect one.

@Jieiku
Contributor

Jieiku commented Jun 8, 2022

All the class definitions make the page source code much larger in size, but I can see how it would make things simpler as far as generating goes. If you simply used classes which refer to colors, then you would have to have some sort of lookup table per programming language. (because a bracket in one language might be colored, but in another language it might not be colored or colored differently)

I just wish there was a simple way to have much leaner generated html for syntax highlighting while using the css method.

@Keats
Collaborator Author

Keats commented Aug 11, 2022

I have the HTML renderer working, now to figure out how to use a VS Code theme to link scopes with tree-sitter to know which colour to use...

@Keats
Collaborator Author

Keats commented Aug 18, 2022

I think I'll forget the VSCode themes as they can be in JSON, YAML or even JS. It can probably just be a tiny ~20-line key-value INI file since it's not like we are going to have hundreds of scopes.
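
To make the idea concrete, here is a rough sketch of what reading such a flat theme could look like. It is only an illustration, not giallo's code: the `scope = colour` line syntax, the `;` comment marker, and the names `parse_theme` and `colour_for` are all assumptions. The lookup falls back from a specific scope to its parents, which is one possible way to cope with themes that define language-specific scopes.

```rust
use std::collections::HashMap;

/// Parse a flat `scope = colour` theme (hypothetical format).
/// Blank lines, lines starting with `;`, and lines without `=` are skipped.
fn parse_theme(src: &str) -> HashMap<String, String> {
    src.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with(';'))
        .filter_map(|line| line.split_once('='))
        .map(|(scope, colour)| (scope.trim().to_string(), colour.trim().to_string()))
        .collect()
}

/// Look up a scope, falling back to its parents:
/// `keyword.control.rust` -> `keyword.control` -> `keyword`.
fn colour_for<'a>(theme: &'a HashMap<String, String>, scope: &str) -> Option<&'a str> {
    let mut candidate = scope;
    loop {
        if let Some(colour) = theme.get(candidate) {
            return Some(colour.as_str());
        }
        // Drop the last dotted segment; give up when there is none left.
        candidate = &candidate[..candidate.rfind('.')?];
    }
}
```

With a theme containing only `keyword = #c678dd`, calling `colour_for(&theme, "keyword.control.rust")` would resolve to the `keyword` colour, so a short theme still covers more specific scopes.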

Not much to show so far but I've set up https://github.com/getzola/giallo which is still very much a clone of the html renderer in tree-sitter-highlight so far since I got stuck on theming.
The plan is to hardcode a theme in giallo for now and compare the output with the same theme in VSCode to make sure it's kinda close. The main issue is that the theme example I took (OneDark-Pro.json) has tons of language dependent colours (with scopes like constant.other.php) so it's never going to be really close if it's a common thing to have a lot of language-specific scopes.
Does anyone know more about that?

@Jieiku
Contributor

Jieiku commented Aug 18, 2022

This solution will be able to work with CSS, right? I ask because the description in the top right of giallo says:

"Syntax highlighter to HTML using tree-sitter, using VSCode theme"

The wording does not mention CSS, e.g. "HTML/CSS".

Really appreciate the work on this, I do hope css will still be supported, let me know if you need any help/testing.

I am actually ok without language specific scopes on most things because the resulting output will likely be a lot leaner.

I cannot speak to what is normal because a LOT of my editing over the years was in Notepad, until a few years ago when I switched to Atom. Just recently I switched to Kate because of how long Atom takes to load.

@Keats
Collaborator Author

Keats commented Aug 18, 2022

> I do hope css will still be supported

Yes, it's just a matter of exporting a theme as CSS, which is trivial.
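
As a rough illustration of why the export is simple, a theme that is just a list of scope/colour pairs can be turned into CSS rules with a single loop. This is a sketch under assumptions, not giallo's or syntect's actual output: the `hl-` class prefix, the dot-to-dash class naming, and the function name `theme_to_css` are made up for the example.

```rust
/// Render (scope, colour) pairs as CSS rules.
/// Dots in a scope become dashes so the scope fits in one class name;
/// the `hl-` prefix is an arbitrary choice for this sketch.
fn theme_to_css(theme: &[(&str, &str)]) -> String {
    let mut css = String::new();
    for (scope, colour) in theme {
        let class = scope.replace('.', "-");
        css.push_str(&format!(".hl-{class} {{ color: {colour}; }}\n"));
    }
    css
}
```

For example, `theme_to_css(&[("keyword", "#c678dd")])` yields one rule for the class `hl-keyword`, which a site could then override or restyle in its own stylesheet.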

@Jieiku
Contributor

Jieiku commented Aug 18, 2022

> The main issue is that the theme example I took (OneDark-Pro.json) has tons of language dependent colours (with scopes like constant.other.php) so it's never going to be really close if it's a common thing to have a lot of language-specific scopes.
> Anyone knows more about that?

So are you mostly asking for feedback from people that use VSCode? Are you asking if it is common to have a lot of language specific scopes, or are you asking if it would be ok to have less language-specific scopes?

I can install vscode in a VM and play around with it for a bit (never used it before)

I did not realize VSCode was open source; installing it was as easy as sudo pacman -S vscode. Unfortunately it is an Electron app, but I went ahead and installed it so I can play around with it for a bit.

@Keats
Collaborator Author

Keats commented Aug 18, 2022

Mostly asking people with knowledge of tree-sitter what they know about scopes. Also curious about nvim-treesitter and how themes/scopes are defined there.

VSCode
Screenshot 2022-08-19 at 00 00 59

Giallo
Screenshot 2022-08-19 at 00 01 12

Differences are probably due to me getting some scopes wrong when looking at the theme and/or missing some necessary scopes, I'll try to fix it when I'm not tired but it's kind of acceptable.

@xse

xse commented Aug 23, 2022

Hey,
I've played around with it, tree-sitter is really fast and has lots of languages supported!
However something to keep in mind is that it works with programming languages, and does not have syntaxes for stuff that isn't a programming language.

Don't get me wrong it's an awesome tool and on top of that it's really fast, I just think that for a web facing thing it would be nice to have syntax highlighting for the kind of stuff you could have on a website, like for example:

  • bits of an nginx/apache config file
  • a diff
  • a git commit
  • ...
xse@krkrkr ~ $ ls -l /usr/local/share/nvim/runtime/syntax/ | wc -l
     660

PS: Still think tree-sitter is a nice replacement. I understand that the kind of tool able to do that might not be easy to deal with, and it's really not that hard to use :tohtml with a bit of sed to get ready to copy/paste html with inline css for all those things that are not programming languages.

@Keats
Collaborator Author

Keats commented Aug 23, 2022

The issue with syntect is that we are stuck with two-year-old, buggy Sublime syntaxes (the JS one, for example, can take forever to highlight a snippet), since Sublime introduced new syntax features that syntect does not support.

The choices are:

  1. stay on syntect + the current outdated syntaxes that kinda work (e.g. no async/await highlighting in Rust)
  2. move to a Pygments port in Rust for a much simpler highlighting system (I started with that initially); it is still based on regexes, but it is easy to add support for many languages to it
  3. move to a tree-sitter based highlighter, which gives us the same highlighting as an editor with no regexes and is (probably) much faster than syntect

I think 1 is a dead end in the long run as the Sublime Text people can keep changing their spec however they want. I started porting Pygments to Rust a while back, and it would be an easy solution for people wanting to add syntaxes since a syntax could be just a YAML/TOML file. It would, however, be annoying to use VSCode/Sublime themes with it as the scopes are very different.

Tree-sitter is nicer in that the highlights are much more accurate, it's easy to port TextMate themes and I wouldn't have to maintain it. It's harder to provide custom syntaxes like Zola currently allows, though.

@Keats
Collaborator Author

Keats commented Aug 30, 2022

I've started using Helix themes and queries and the result is really good.

With their OneDark theme and the default Rust highlight query:

Screenshot 2022-08-30 at 21 15 50

With their OneDark theme and their Rust highlight query:

Screenshot 2022-08-30 at 21 15 55

The last screenshot is pretty much the same as opening that file in VSCode. Helix is a really great match as they have already a great collection of themes and a lot of improvements to the default queries. I'll see if they are ok with moving those bits out of the main repo for collaboration, otherwise it can be solved with copying and licensing.

@Jieiku
Contributor

Jieiku commented Aug 31, 2022

Yes that bottom one does indeed look really nice!

@the-mikedavis

Tree-sitter is capable of really nice syntax highlighting but there are some drawbacks to consider.

For the 109 languages supported in Helix, the total size of the compiled parsers is 108.5 MiB. Most compiled parsers are somewhere on the order of hundreds of KiB with some larger parsers on the order of ones or tens of MiB. The queries are altogether very small: only 1.7 MiB for all of them. The parsers are also C and many languages have C++ external scanners, so you would need to add compile-time dependencies on a C++ toolchain.

It's a large amount of work to add support for a language which doesn't have a tree-sitter parser yet. With regular expression based highlighting you can work incrementally - start with a few highlights and add more as you go - but it's hard to write a parser that incrementally covers the full syntax of a language. Language support has become very mature recently with tree-sitter though and there are even parsers for non-programming languages (I have a few for git commits, configs, rebase syntax, diffs).

We're happy to take those tradeoffs with Helix since tree-sitter can be used to build so many features (syntax highlighting, syntax-based motions, textobjects, indentation, rainbow brackets) but those tradeoffs are worth some consideration for Zola. There's a similar project which could be more appropriate: https://lezer.codemirror.net/ but admittedly I haven't used it and I think the language support is less full. Plus then the syntax highlighting would need to be done client-side.

All of that being said, I would really love to see tree-sitter syntax highlighting in Zola. At least selfishly since the Helix website uses Zola :)

@Jieiku
Contributor

Jieiku commented Aug 31, 2022

I don't like solutions that require client-side highlighting (unnecessary JavaScript), the page would load significantly slower. I prefer a solution that makes efficient use of html/css to style the page. I went out of my way to make the back to top button CSS only for the abridge theme so that it would be one less JavaScript file. I am not completely against JavaScript, I make plenty of use of it in abridge, I just don't like using JavaScript when there is a more efficient way of solving a problem (page speed performance).

I wonder if Zola makes use of any other tools/libraries that are also C/C++, or if supporting Helix would be the first one?

Very cool that there are parsers for git commits, configs, rebase syntax and diffs.

@Keats
Collaborator Author

Keats commented Aug 31, 2022

Argh, I didn't know the parsers were that big :o. From Helix 22.05:

-rwxr-xr-x  1 admin    51K Aug 31 20:56 twig.so*
-rwxr-xr-x  1 admin    51K Aug 31 20:56 iex.so*
-rwxr-xr-x  1 admin    51K Aug 31 20:56 eex.so*
-rwxr-xr-x  1 admin    51K Aug 31 20:56 regex.so*
-rwxr-xr-x  1 admin    51K Aug 31 20:56 gowork.so*
-rwxr-xr-x  1 admin    51K Aug 31 20:56 gomod.so*
-rwxr-xr-x  1 admin    51K Aug 31 20:56 embedded-template.so*
-rwxr-xr-x  1 admin    52K Aug 31 20:56 json.so*
-rwxr-xr-x  1 admin    52K Aug 31 20:56 gitignore.so*
-rwxr-xr-x  1 admin    52K Aug 31 20:56 git-rebase.so*
-rwxr-xr-x  1 admin    52K Aug 31 20:56 git-config.so*
-rwxr-xr-x  1 admin    52K Aug 31 20:56 tsq.so*
-rwxr-xr-x  1 admin    53K Aug 31 20:56 toml.so*
-rwxr-xr-x  1 admin    54K Aug 31 20:56 comment.so*
-rwxr-xr-x  1 admin    68K Aug 31 20:56 heex.so*
-rwxr-xr-x  1 admin    68K Aug 31 20:56 cpon.so*
-rwxr-xr-x  1 admin    68K Aug 31 20:56 gitattributes.so*
-rwxr-xr-x  1 admin    84K Aug 31 20:56 graphql.so*
-rwxr-xr-x  1 admin    84K Aug 31 20:56 meson.so*
-rwxr-xr-x  1 admin    84K Aug 31 20:56 git-diff.so*
-rwxr-xr-x  1 admin    84K Aug 31 20:56 dockerfile.so*
-rwxr-xr-x  1 admin    84K Aug 31 20:56 devicetree.so*
-rwxr-xr-x  1 admin    85K Aug 31 20:56 nix.so*
-rwxr-xr-x  1 admin    90K Aug 31 20:56 html.so*
-rwxr-xr-x  1 admin    91K Aug 31 20:56 vue.so*
-rwxr-xr-x  1 admin   100K Aug 31 20:56 git-commit.so*
-rwxr-xr-x  1 admin   101K Aug 31 20:56 css.so*
-rwxr-xr-x  1 admin   102K Aug 31 20:56 tablegen.so*
-rwxr-xr-x  1 admin   110K Aug 31 20:56 svelte.so*
-rwxr-xr-x  1 admin   116K Aug 31 20:56 protobuf.so*
-rwxr-xr-x  1 admin   116K Aug 31 20:56 wgsl.so*
-rwxr-xr-x  1 admin   117K Aug 31 20:56 cmake.so*
-rwxr-xr-x  1 admin   134K Aug 31 20:56 fish.so*
-rwxr-xr-x  1 admin   134K Aug 31 20:56 gdscript.so*
-rwxr-xr-x  1 admin   166K Aug 31 20:56 nu.so*
-rwxr-xr-x  1 admin   181K Aug 31 20:56 ledger.so*
-rwxr-xr-x  1 admin   182K Aug 31 20:56 lua.so*
-rwxr-xr-x  1 admin   182K Aug 31 20:56 nickel.so*
-rwxr-xr-x  1 admin   197K Aug 31 20:56 cairo.so*
-rwxr-xr-x  1 admin   197K Aug 31 20:56 sql.so*
-rwxr-xr-x  1 admin   197K Aug 31 20:56 make.so*
-rwxr-xr-x  1 admin   232K Aug 31 20:56 elm.so*
-rwxr-xr-x  1 admin   233K Aug 31 20:56 yaml.so*
-rwxr-xr-x  1 admin   245K Aug 31 20:56 go.so*
-rwxr-xr-x  1 admin   261K Aug 31 20:56 hare.so*
-rwxr-xr-x  1 admin   262K Aug 31 20:56 gleam.so*
-rwxr-xr-x  1 admin   280K Aug 31 20:56 python.so*
-rwxr-xr-x  1 admin   282K Aug 31 20:56 hcl.so*
-rwxr-xr-x  1 admin   294K Aug 31 20:56 llvm-mir.so*
-rwxr-xr-x  1 admin   294K Aug 31 20:56 java.so*
-rwxr-xr-x  1 admin   295K Aug 31 20:56 javascript.so*
-rwxr-xr-x  1 admin   310K Aug 31 20:56 r.so*
-rwxr-xr-x  1 admin   326K Aug 31 20:56 odin.so*
-rwxr-xr-x  1 admin   342K Aug 31 20:56 erlang.so*
-rwxr-xr-x  1 admin   358K Aug 31 20:56 c.so*
-rwxr-xr-x  1 admin   439K Aug 31 20:56 sshclientconfig.so*
-rwxr-xr-x  1 admin   456K Aug 31 20:56 rescript.so*
-rwxr-xr-x  1 admin   476K Aug 31 20:56 org.so*
-rwxr-xr-x  1 admin   519K Aug 31 20:56 glsl.so*
-rwxr-xr-x  1 admin   537K Aug 31 20:56 scala.so*
-rwxr-xr-x  1 admin   585K Aug 31 20:56 dart.so*
-rwxr-xr-x  1 admin   616K Aug 31 20:56 vala.so*
-rwxr-xr-x  1 admin   621K Aug 31 20:56 bash.so*
-rwxr-xr-x  1 admin   648K Aug 31 20:56 solidity.so*
-rwxr-xr-x  1 admin   717K Aug 31 20:56 php.so*
-rwxr-xr-x  1 admin   763K Aug 31 20:56 rust.so*
-rwxr-xr-x  1 admin   777K Aug 31 20:56 zig.so*
-rwxr-xr-x  1 admin   990K Aug 31 20:56 markdown.so*
-rwxr-xr-x  1 admin   1.0M Aug 31 20:56 ruby.so*
-rwxr-xr-x  1 admin   1.2M Aug 31 20:56 scheme.so*
-rwxr-xr-x  1 admin   1.2M Aug 31 20:56 julia.so*
-rwxr-xr-x  1 admin   1.4M Aug 31 20:56 typescript.so*
-rwxr-xr-x  1 admin   1.4M Aug 31 20:56 tsx.so*
-rwxr-xr-x  1 admin   1.5M Aug 31 20:56 llvm.so*
-rwxr-xr-x  1 admin   1.6M Aug 31 20:56 cpp.so*
-rwxr-xr-x  1 admin   1.6M Aug 31 20:56 latex.so*
-rwxr-xr-x  1 admin   1.7M Aug 31 20:56 elixir.so*
-rwxr-xr-x  1 admin   2.3M Aug 31 20:56 ocaml-interface.so*
-rwxr-xr-x  1 admin   2.7M Aug 31 20:56 ocaml.so*
-rwxr-xr-x  1 admin   2.9M Aug 31 20:56 haskell.so*
-rwxr-xr-x  1 admin   2.9M Aug 31 20:56 c-sharp.so*
-rwxr-xr-x  1 admin   3.3M Aug 31 20:56 perl.so*
-rwxr-xr-x  1 admin   3.5M Aug 31 20:56 swift.so*
-rwxr-xr-x  1 admin   3.6M Aug 31 20:56 kotlin.so*
-rwxr-xr-x  1 admin    15M Aug 31 20:56 lean.so*
-rwxr-xr-x  1 admin    18M Aug 31 20:56 verilog.so*

I'm not planning to add all of those but just the Verilog one is the same size as current Zola x)
I've had a look at the tree-sitter issue tracker and it does have some issues about generating huge parsers, but the sizes of some grammars don't make much sense to me and there seems to be a 50 KB minimum size. Some of the sizes are pretty fishy, e.g. Markdown being 1M despite HTML being 90K.
I don't care too much about the end binary size but that seems a bit extreme as an increase...

> I wonder if Zola makes use of any other tools/libraries that are also C/C++, or if supporting Helix would be the first one?

Zola is super annoying to build on Windows because of libsass requirements. Definitely not the first one.

@lf-
Contributor

lf- commented Oct 9, 2022

I have a working prototype of tree-sitter highlighting working for zola with Helix themes on branch https://github.com/lf-/zola/tree/tree-painter

It uses https://github.com/matze/tree-painter/ as a back end instead of the one @Keats was writing a couple months ago, just because it seems to have all the highlighting to HTML done already.

It's probably not upstreamable as is, containing a good many hacks, and also compiles the treesitter stuff statically which is more convenient but makes LTO infeasible due to absurd link times (I've not investigated how to selectively do LTO). Feel free to take any amount of it that you'd like; I don't have resources to clean it up to upstream it.

Also, the perf is Not Good. Even with the LTO build I had, my site build went from 36ms to 800ms. I don't know why it's slow, and probably the best way to figure that out is to instrument Zola with tracing, which is something that I don't have resources for as the bad perf is not bad enough to motivate doing it for my site.

Note: This perf issue is almost certainly not due to tree-sitter itself being bad, but instead something very silly happening in Rust land. For instance, the standalone tree-sitter tool takes 3s to parse the entirety of compiler/ in GHC, 400k lines of Haskell. On one thread. Although that doesn't involve running queries, so maybe that's the bottleneck? Anyway it probably needs profiling.

I don't have the same constraints as Zola is designed for (binary size does not bother me, generation time does not bother me as long as it's not a workflow blocker), and I've got it good enough to power my site, so I'm stopping where I got to.

Regarding the highlight groups, it's distinctly possible that the different clients are using different queries. The treesitter parsers often come with highlight queries, but nvim-treesitter seems to vendor theirs. My speculation is that a big reason for this is that nvim-treesitter has some nonstandard features such as the (#make-range! ..) "predicate", as well as the @spell capture group.

The way that I've debugged these is by using my nvim which has nvim-treesitter-playground installed and using :TSHighlightCapturesUnderCursor on the syntax in question, and comparing it to the CSS output of tree-painter.

Anyway, good luck! Good highlighting is really important to programming blogs, and I almost got rid of Zola over it before realizing it was probably easier to hack it in instead.

Sample:

Before (notice that some -- comments are completely misparsed):

image

After:

image

@lf-
Contributor

lf- commented Oct 9, 2022

One difficulty with any form of tree-sitter integration is that building parsers is a nightmare due to Cargo being quite bad at submodules. More details here: matze/tree-painter#3

This could be done either statically linked or dynamically linked, but I would lean toward dynamic linking since it is otherwise impossible to add more parsers to the system without forking it.

But dynamic linking would compromise the current single-executable nature of Zola (not something that I'm bothered about, but I understand it is a design goal).

@Keats
Collaborator Author

Keats commented Oct 10, 2022

I can't build your fork for some reason on rustc 1.64 or nightly. I'll have a deeper look when I get more time. Can you tell me how big the generated Zola binary is?

> the perf is Not Good. Even with the LTO build I had, my site build went from 36ms to 800ms.

That's surprising. Sounds like something being instantiated too often? I'm expecting the tree-sitter parsers themselves to be faster than tons of regexes from syntect.
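
To illustrate what "instantiated too often" could mean in practice: if some expensive per-language setup runs for every code block instead of once per language, caching it by language name removes the repeated cost. Whether this is what actually happens in the fork is not known; the sketch below is purely hypothetical, and `LanguageConfig` and `ConfigCache` are stand-in names rather than tree-sitter or tree-painter types.

```rust
use std::collections::HashMap;

/// Stand-in for an expensive per-language highlighter setup (hypothetical).
struct LanguageConfig {
    name: String,
}

/// Builds each language's setup at most once and reuses it afterwards.
struct ConfigCache {
    configs: HashMap<String, LanguageConfig>,
    /// How many times the expensive setup actually ran.
    builds: usize,
}

impl ConfigCache {
    fn new() -> Self {
        ConfigCache { configs: HashMap::new(), builds: 0 }
    }

    fn get(&mut self, lang: &str) -> &LanguageConfig {
        if !self.configs.contains_key(lang) {
            // The real, costly initialisation would happen here.
            self.builds += 1;
            self.configs
                .insert(lang.to_string(), LanguageConfig { name: lang.to_string() });
        }
        &self.configs[lang]
    }
}
```

Rendering four code blocks in two languages through such a cache would run the setup twice rather than four times; the `builds` counter is only there to make that visible.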

> This could be done either statically linked or dynamically linked, but I would lean toward dynamic linking since it is otherwise impossible to add more parsers to the system without forking it.

That's the issue, yes. If the sizes of the generated parsers were manageable, we could just add everything to the library. Of course that's not going to work for home-made languages but hey...

@phisch

phisch commented May 13, 2024

For me personally, I'd rather have a (even significant) performance hit, but better syntax highlighting. There is also the option to implement both, make syntect the default (for performance), and treesitter an optional alternative through a config option.

Current syntax highlighting is just a bit disappointing in most cases I have used so far.
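
A sketch of how such a config option might be modelled, with syntect staying the default. Nothing here exists in Zola: the `HighlightBackend` name and the accepted strings are assumptions made up for this example.

```rust
use std::str::FromStr;

/// Which highlighter to use (hypothetical option, not an existing Zola setting).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum HighlightBackend {
    /// Stays the default, so existing sites are unaffected.
    #[default]
    Syntect,
    /// Opt-in alternative for sites that accept the extra cost.
    TreeSitter,
}

impl FromStr for HighlightBackend {
    type Err = String;

    fn from_str(value: &str) -> Result<Self, Self::Err> {
        match value.trim().to_ascii_lowercase().as_str() {
            "syntect" => Ok(Self::Syntect),
            "tree-sitter" | "treesitter" => Ok(Self::TreeSitter),
            other => Err(format!("unknown highlight backend: {other}")),
        }
    }
}
```

A missing option would then map to `HighlightBackend::default()`, and an unknown value would surface as an error rather than silently falling back.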

@Walnut356
Contributor

Walnut356 commented May 15, 2024

If anyone wants an (admittedly jank) solution for the time being, .sublime-syntax files are essentially just a YAML file with regex instructions inside and aren't all that hard to modify in-place. Zola doesn't really care if it matches whatever sublimetext actually wants/expects, so you can define new regex matches and/or apply whatever custom scopes you want. If you use the highlight_theme = css config option, zola will automatically apply your scopes as css classes from the modified sublime-syntax file and then you can manually style those classes yourself. I'm not an expert at regex or css, nor have I ever used sublime text but I was able to get this working with a few hours effort.

Here's a .zip of the files I'm using for Rust currently - the sublime-syntax file is based on Rust Enhanced, styled to look like One Dark in VSCode. The modifications aren't pretty, but it does the job. Below is an example screenshot from my website:

image

image

@clarfonthey
Contributor

clarfonthey commented Jul 27, 2024

I was thinking of maybe looking into this considering how a lot more programs are migrating to using tree-sitter, and at least from what I see, the existing sublime packages are slowly growing out of date due to the fact that very few people use Sublime any more.

Like, the thing that really pushed me to feel this way was the fact that the Java syntax just cannot recognise multiline comments formatted like this:

/*
 *
 */

Instead of this:

/*
*
*/

And it just, completely breaks the syntax entirely if you have comments formatted this way. On the other hand, tree-sitter feels a lot more robust, although I do have some other apprehensions about the way it's built.

I might poke around and see what a minimal version of tree-sitter highlighting might look like, without worrying about how the config looks just yet. (Presumably, the initial implementation would provide an option to use the existing syntect version or tree-sitter, since themes would have to be updated. But I won't worry about that for now.)

@Keats
Collaborator Author

Keats commented Jul 27, 2024

I've already built the minimal version with tree-sitter, it's easy. The issues are on tree-sitter side:

  1. slow initialization: it can take more time to just init the syntaxes in tree-sitter than zola can render a big site with syntax highlighting with syntect. The tree-sitter team is planning to address that but it's not coming soon afaik
  2. huge parsers: some parsers can be 80MB... to replicate the current supported languages the zola binary size will easily add 100MB

Honestly I wish someone did a port of https://github.com/microsoft/vscode-textmate to Rust just to tap into the VSCode ecosystem. It's not ideal but there are still going to be more TextMate grammars than tree-sitter for the foreseeable future.

@phisch

phisch commented Jul 27, 2024

> I've already built the minimal version with tree-sitter, it's easy.

Why not implement it, and let the user optionally pick that one? Speed might not be an issue for some people, and it's gonna get better in the future.

@clarfonthey
Contributor

I mean, it seems pretty clear why to not implement it for now; 100MB extra binary size for a program meant to be in one binary is a hard sell.

@phisch

phisch commented Jul 27, 2024

Fair, though I personally wouldn't mind that at all.

@Keats
Collaborator Author

Keats commented Jul 28, 2024

Look at tree-sitter/tree-sitter#1799, just the SQL parser got to 89MB.
It's not really an issue for an editor but for a CLI tool like Zola it is.
I think a textmate parser is probably the best for Zola, considering it can tap into the whole VSCode ecosystem and they are not going to switch to tree-sitter anytime soon from what I've seen.

@everdrone

> Fair, though I personally wouldn't mind that at all.

Me neither.
Maybe cumbersome but it could become a build flag to conditionally use tree-sitter instead of the current highlighter.
Personally, I find the number of grammars and the quality of the output impressive.

@Caellian

Had an eye out for this because I'm using tree-sitter for one of my projects (also SSG highlighting).

Adding rules to a TextMate grammar is much simpler than adding them to a tree-sitter one, as far as I can tell (I've only modified tree-sitter grammars and haven't touched TextMate ones): TextMate is just regex, while tree-sitter has its own grammar generator. Tree-sitter actually requires proper syntax support in a lot of cases or you get incorrect highlighting; I guess TextMate is even more limited in that regard, though it's easier to write something that kinda works. That being said, the result tree-sitter produces is also a lot better because all tokens are semantically annotated.

In my case, using tree-sitter means that I can do stuff like recognizing declared variables (or generics/variable types) from code in rest of the blog post MD and highlight them with the appropriate color (single backtick ones).

I started writing a very complicated thing that clones repos listed in a JSON file, patches them, compiles them using system CC and then runs those parsers. I can't recommend this, it's a lot of pain, requires a lot of data annotation and is likely too slow for large blogs - in my case I don't mind because it's my personal blog that I rarely update.
However, this leads me to a possible solution to size problems - have the user list languages they'll use, or collect them on first run/change, and then download only parsers the blog will be using to some cache location.
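
The "collect them on first run" part could start from something as small as scanning the Markdown sources for fence info strings. The sketch below is only an assumption of how that might look: `fenced_languages` is a made-up name, it handles backtick fences only (no tilde fences, no nested or longer fences), and the actual downloading and caching of parsers is left out entirely.

```rust
use std::collections::BTreeSet;

/// Collect the language tags of backtick-fenced code blocks in a Markdown
/// document. Tilde fences and indented code blocks are ignored.
fn fenced_languages(markdown: &str) -> BTreeSet<String> {
    // Three backticks, built at runtime to keep this listing fence-safe.
    let fence = "`".repeat(3);
    let mut languages = BTreeSet::new();
    let mut in_fence = false;
    for line in markdown.lines() {
        let info = match line.trim_start().strip_prefix(fence.as_str()) {
            Some(info) => info,
            None => continue,
        };
        if in_fence {
            in_fence = false; // this line closes the current block
            continue;
        }
        in_fence = true;
        // The language is the first token of the info string, e.g. "rust,linenos".
        let lang = info
            .trim()
            .split(|c: char| c == ',' || c.is_whitespace())
            .next()
            .unwrap_or("");
        if !lang.is_empty() {
            languages.insert(lang.to_lowercase());
        }
    }
    languages
}
```

Taking the union of the returned sets over all pages would give the list of parsers a site actually needs, which is the set to fetch into the cache location.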

If you go tree-sitter route, you will very likely have to maintain your own clones of grammars (and build them using CI) because they vary wildly in how they name tokens (some don't mark numbers, some use different names for same things, etc.), but that's very easy to replicate for many grammars. You could possibly store them all in a single repo. This probably applies to textmate to some extent, but is less apparent due to VSCode using those.

@Keats
Collaborator Author

Keats commented Sep 24, 2024

> If you go tree-sitter route, you will very likely have to maintain your own clones of grammars (and build them using CI) because they vary wildly in how they name tokens (some don't mark numbers, some use different names for same things, etc.), but that's very easy to replicate for many grammars. You could possibly store them all in a single repo. This probably applies to textmate to some extent, but is less apparent due to VSCode using those.

I would probably just use Helix repo as a source of truth for the grammar and the queries.

It is much more complex than textmate though and I'm still debating whether to write a textmate parser since it's so much easier for the end user and it's clearly enough for all the VSCode users

@Keavon

Keavon commented Sep 24, 2024

The fact that Textmate doesn't require hundreds of megabytes and is used by the largest editor ecosystem (VS Code and Monaco) is a pretty compelling draw towards that approach, if I might just add my two cents. I'd personally love to have my site's syntax highlighting match the VS Code appearance as a starting point (perhaps with color theming applied, of course).

@Keats
Collaborator Author

Keats commented Sep 24, 2024

That's a good point yes. If they don't get to reduce the size of some parsers dramatically, it's not going to happen in Zola even if they fix fast startups. We can't really make the zola binary 1GB...

@Keavon

Keavon commented Sep 24, 2024

Yeah, that would be utterly bonkers and totally out of the question. I really have to wonder just what the heck they're doing to bloat the grammar files that immensely. That's a lot of information 🤔.

@jalil-salame

tree-sitter parsers are parsers, not grammar files. They are dynamic libraries that are loaded by tree-sitter. I am assuming a lot of code is not shared and the parser generator/compiler does not generate size optimized code.

@Caellian

> I am assuming a lot of code is not shared and the parser generator/compiler does not generate size optimized code.

Main causes are:

  • Schema is directly translated into lookup tables afaict, so they're basically a binary representation of the tree structure defined in grammar/schema.
    • Lookup tables can't be optimized.
  • The binary stores names of all identifiers, parameters, etc.
  • Depending on the language spec complexity and edge cases (e.g. HTML's got a lot of those), all this can amount to a lot of duplication in the generated lookup tables.
  • Absolutely no shared code between parsers, not even for things like "strings" and numbers.
    • Sharing wouldn't help much, though: this causes the least bloat of all the causes listed above.

None of that can be fixed without performance/quality regression. Correctness of a proper parser, and performance gains of LUTs inherently increase the binary size.

@jalil-salame

jalil-salame commented Oct 6, 2024

I was looking at tree sitter stuff and found tree-sitter/tree-sitter#410, apparently the wasm versions of the tree sitter grammars are less "bloated" (and can hit ~30KiB per grammar? I need to verify that claim).

I don't know what the penalties would be if we include a wasm runtime and all the grammars into zola, but it does seem like a nice way to reduce the binary size c:

wasmtime and wasmi seem to be the smallest dependencies, but they do add about ~7MB. I don't know how much this would affect the performance of syntax highlighting...

@Caellian

Caellian commented Oct 6, 2024

I looked into porting TextMate to Rust; here are the details:

  • onig-rs is barely maintained: the last git update was over a year ago and the oniguruma submodule hasn't been updated in over 2 years (sha a279b13b). Updating the submodule to the latest release didn't fail to compile, however, though it might have UB at this point.
    • c2rust handles oniguruma transpilation to rust really well, which would make porting it easier, though it uses a lot of macros and jumptables which produces very messy rust sources.
    • a lot of the code is just support for 10+ different encodings, only about half seem supported by existing rust crates.
  • Autogenerating TMLanguage structures and serde support from schema is straightforward with typical.
  • vscode-textmate has a "Copyright (C) Microsoft Corporation. All rights reserved." header on all relevant sources, so licensing isn't very clear because the README says the license is MIT.
    • It can be easily rewritten though because it's just glue code for TMLanguage and oniguruma.

But all that brings me to the question: "why?". For SSG, binary size and performance should come second after quality of the output. I'd understand the concern if zola were something every browser would have to load and run when they open a generated blog, but assuming incremental blog post compilation, having the build server hang for a second or two while it loads tree-sitter seems like a non-issue. Size can be traded for performance by downloading only the needed parsers which is something that will have to happen with TM grammars too, but it's also not as big of an issue with file caching. Free tier GH workers have 14GB SSD storage.
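As a rough sketch of what "downloading only the needed parsers" could start from, the snippet below collects the languages named on fenced code blocks in a Markdown source, so that only those grammars would need to be fetched or loaded. This is an assumption about how it might be done, not how zola works: `needed_languages` is a hypothetical helper, the fence handling is deliberately simplistic (backtick fences only, no nesting), and the actual download and cache step is left out.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: returns the set of languages named on opening code
/// fences, e.g. "rust" for a fence whose info string is `rust,linenos`.
fn needed_languages(markdown: &str) -> BTreeSet<String> {
    let mut langs = BTreeSet::new();
    let mut in_fence = false;

    for line in markdown.lines() {
        // Only lines starting with three backticks open or close a fence here.
        if let Some(info) = line.trim_start().strip_prefix("```") {
            if in_fence {
                // Closing fence: nothing to record.
                in_fence = false;
            } else {
                in_fence = true;
                // The language is the first token of the info string, before
                // any whitespace or comma-separated options.
                let lang = info
                    .split(|c: char| c.is_whitespace() || c == ',')
                    .next()
                    .unwrap_or("");
                if !lang.is_empty() {
                    langs.insert(lang.to_ascii_lowercase());
                }
            }
        }
    }
    langs
}
```

A build could run this over all pages first and then load grammars only for the returned set, caching them between runs.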

@clarfonthey
Contributor

clarfonthey commented Oct 6, 2024

Thing is, tree-sitter doesn't have the quality of output. A lot of the grammars are obtuse and difficult to modify, and they contain basic errors.

For example, I filed this bug a while ago for the HTML syntax: tree-sitter/tree-sitter-html#75

An enclosed tag caused an entire document to fail to parse. And this is supposed to be designed for editing code, where this happens often!

And I tried looking into the grammar, but it genuinely made no sense where I'd begin to go about fixing this. I dunno, it feels like just a bunch of regexes for how to highlight things, while brittle, will at least do the job of highlighting, which is more than I can say about tree-sitter.

Like, to be clear, the HTML parser still doesn't work, even though the case it fails to parse is listed directly in the spec. Imagine maintaining a syntax parser and not even reading the spec!

@Keats
Collaborator Author

Keats commented Oct 6, 2024

onig-rs is barely maintained: the last git update was over a year ago, and the oniguruma submodule hasn't been updated in over 2 years (sha a279b13b). Updating the submodule to the latest release didn't fail to compile, but it might have UB at this point.

Not a huge deal, zola currently uses syntect, which uses that same version of onig-rs.

For SSG, binary size and performance should come second after quality of the output. I'd understand the concern if zola were something every browser would have to load and run when they open a generated blog, but assuming incremental blog post compilation, having the build server hang for a second or two while it loads tree-sitter seems like a non-issue. Size can be traded for performance by downloading only the needed parsers which is something that will have to happen with TM grammars too, but it's also not as big of an issue with file caching. Free tier GH workers have 14GB SSD storage.

You get slower startup in general. The size of TM grammars is a non-issue: you can fit the grammars for all programming languages in the size of one tree-sitter grammar like Swift's. Bumping the size of the binary by 80MB minimum (more like 200MB if we see some like the newer version of the SQL one) for no visible changes except some slowdown is not super compelling.
Like @clarfonthey mentions, I also had some so-so experiences when playing with the highlighting a while back, not to mention some parsers that aren't really maintained anymore.

I'm leaning towards textmate just to leverage the whole VSCode ecosystem which is kinda the editor that more and more people are using so it will have the bigger ecosystem of syntaxes.

@nrdxp

nrdxp commented Dec 26, 2024

fwiw, this would be a compelling reason for me to switch to zola, which I am considering anyway.

It is a bit shocking, to me, that I can't seem to find a single static site generator that leverages an actual AST for highlighting over dumb regex. It might be a nice way to stand out.

I get the size issue as a developer facing concern, but is that really more important than presentation to the user? Maybe you can make it a configurable option, at least?

@Keats
Collaborator Author

Keats commented Dec 26, 2024

I get the size issue as a developer facing concern, but is that really more important than presentation to the user? Maybe you can make it a configurable option, at least?

Clearly all the VSCode/Sublime/Jetbrains users are ok with the output they have while working. Personally I haven't seen a compelling difference between tree-sitter and textmate for code snippets.
Configuration is a no-go: the two options would support different languages/themes, and even the people not using tree-sitter would have a much bigger binary for something they don't use.

@apiraino
Contributor

I get the size issue as a developer facing concern, but is that really more important than presentation to the user? Maybe you can make it a configurable option, at least?

I think the size issue is not just a developer facing concern. Zola is also used in CI jobs to compile websites to HTML. A substantial increase in the size of the package for a feature CI jobs don't use might not be great.

@Keavon

Keavon commented Dec 26, 2024

I think we should focus the scope of this problem-solving by definitively writing off the idea of Tree-sitter due to its size and complexity that make it a total no-go for Zola's use case as a small, lightweight static site generator. Instead, it would be productive to focus on strategies for solving the language barriers that prevent Zola from currently using VSCode Textmate grammar. What's the status of its ecosystem of libraries, and what would need to be ported to Rust? Could that task be clearly spelled out and presented in a way that's approachable for community contributors to take a stab at?

@Jieiku
Contributor

Jieiku commented Dec 26, 2024

If that is the idea, then it might be a good idea to rename this issue to something else, to avoid confusion.

Either that or close this issue and open a new one: Investigate VSCode Textmate to replace syntect, or something similar.

@Keavon

Keavon commented Dec 26, 2024

I'd be in favor of closing and opening a new focused issue, since there is a lot of off-topic discussion here debating (and often rehashing) the merits of the choices, rather than being focused on what's relevant to someone interested in contributing a port. But that's just my suggestion, it's up to @Keats.

@Keats
Collaborator Author

Keats commented Jan 2, 2025

Closing in favour of #2758

@Keats Keats closed this as completed Jan 2, 2025
@Keats Keats unpinned this issue Jan 2, 2025