Parser size testing (arrows) #144
Yep. I was able to reproduce it. The unicode operators affect the parser size in weird ways. That's part of the reason the ASCII operators are listed first (to make it easy to comment out the unicode operators), but to be honest I never looked too much into it. After some of the discussions regarding binary size, I've been thinking that maybe we could generate two parsers (like typescript/tsx in tree-sitter-typescript): a small one for editors, and a bigger one for other tools. The small one could omit most unicode operators, implicit multiplications, etc.
Did some further testing. Line 970 in f1baa5f
Removing the \\p{Emoji} here also fixes the issue and results in a smaller parser. This would probably be a less intrusive change than deleting a whole bunch of operators.
Could also think about using a custom scanner for the identifiers. The Base Julia definition is already in C, so it could be a relatively simple copy-paste.
The external scanner is an option. We do exactly that in JuliaPluto/lezer-julia. To use Julia's C code directly we'd have to link JuliaStrings/utf8proc, or find another way to check unicode categories. tree-sitter used to have utf8proc as a dependency, but I've got no idea why they dropped it.
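To make "check unicode categories" concrete, here is a minimal sketch in Python using only the standard `unicodedata` module. It is not Julia's actual identifier rule and not what a C external scanner would link against: the category set below is an assumed, deliberately incomplete subset chosen for illustration, and `is_ident_start` is a hypothetical helper name.

```python
import unicodedata

# Assumption: an illustrative subset of Unicode general categories that may
# start an identifier. Julia's real rule accepts more than this.
IDENT_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def is_ident_start(ch: str) -> bool:
    """Return True if `ch` may start an identifier under the assumed subset."""
    if ch == "_":
        return True
    # unicodedata.category returns the two-letter general category, e.g. "Ll".
    return unicodedata.category(ch) in IDENT_START_CATEGORIES


if __name__ == "__main__":
    for ch in ["x", "α", "_", "1", "↔"]:
        print(repr(ch), unicodedata.category(ch), is_ident_start(ch))
```

An external scanner would need an equivalent lookup table or library call on the C side, which is where the utf8proc dependency question comes from.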
Wasm parsers, which is also why pulling that in for an external scanner is a no-go. (It's still a dependency for the lib.) Tree-sitter (and its base tooling) seems to have difficulties with certain unicode character classes, especially if you pick-and-choose from them rather than taking them as-is (Perl had a related issue). This is really something that should be solved upstream, so please open an issue about it there with the "binary change" here -- which might be helpful in tracking down a bug or possible optimization. (There have already been significant improvements in this area in recent versions, but obviously there's still much room for improvement.) My recommendation here is to bisect the glyphs to identify low-hanging fruit (individually or in groups) and comment them out until the upstream issue is fixed. Parser size is also directly linked to load time, and you won't see many people use this parser if opening a Julia file freezes the editor for a second or half... (Personally, since actually using Julia I have transitioned from "Unicode is cute!" to "Unicode is evil!" since it makes searching and replacing in editors so much harder, and over-use quickly makes code unreadable.)
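The bisection suggested above could look roughly like the following Python sketch. It assumes the size change is binary (as reported in this issue) and that a single glyph is responsible, which may not hold in general. `find_culprit` and the `is_large` predicate are hypothetical names: in practice `is_large` would regenerate the parser with only the given glyphs enabled and compare the resulting state count or file size against a threshold.

```python
from typing import Callable, List, Sequence


def find_culprit(
    glyphs: Sequence[str],
    is_large: Callable[[Sequence[str]], bool],
) -> str:
    """Bisect `glyphs` down to the one glyph that triggers the large parser.

    Assumes exactly one culprit and a binary (large / not large) outcome.
    """
    if not is_large(glyphs):
        raise ValueError("the full glyph set does not trigger the large parser")

    candidates: List[str] = list(glyphs)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if is_large(half):
            # The culprit is in the first half.
            candidates = half
        else:
            # Otherwise it must be in the remaining glyphs.
            candidates = candidates[len(half):]
    return candidates[0]


if __name__ == "__main__":
    # Stand-in predicate for demonstration only: pretends "↔" is the culprit.
    fake_is_large = lambda enabled: "↔" in enabled
    print(find_culprit(["←", "→", "↔", "↚", "↛"], fake_is_large))
```

Each call to the predicate costs one parser generation, so bisecting needs about log2(n) rebuilds instead of n. If several glyphs interact, the single-culprit assumption breaks and groups would have to be tested instead.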
I was doing some testing to bring down the parser size. Tried deleting some of the operators and noticed there is a significant difference when changing the arrow operators.

Case 1 (master, f1baa5f)
Case 2 (https://github.com/ChrHorn/tree-sitter-julia/commit/e66d1bf1a73e4e42e86a70830e0d02c2016cc92d)
Deleted most of the arrow operators.

No visible change in states and parser size.
Case 3 (https://github.com/ChrHorn/tree-sitter-julia/commit/e64b8fcfd7fcc78fdfeacd54b145ab265367799f)
Notice the only difference to Case 2 is the one deleted ↔ arrow operator.

Leads to a pretty significant reduction in states and parser size.
Not really sure what's going on. I don't think it's Unicode, for example
also results in a smaller parser. I also only noticed this behavior when changing the arrow operators. The change is always binary (either the smaller or the current larger parser size), nothing in between.

@savq are you able to reproduce this on your end? Any idea?