Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser size testing (arrows) #144

Open
ChrHorn opened this issue Jul 30, 2024 · 4 comments
Open

Parser size testing (arrows) #144

ChrHorn opened this issue Jul 30, 2024 · 4 comments

Comments

@ChrHorn
Copy link
Contributor

ChrHorn commented Jul 30, 2024

I was doing some testing to bring down the parser size. Tried deleting some of the operators and noticed there is a significant difference when changing the arrow operators.

Case 1 (master, f1baa5f)

  arrow: `
    <-- --> <-->
    ← → ↔ ↚ ↛ ↞ ↠ ↢ ↣ ↦ ↤ ↮ ⇎ ⇍ ⇏ ⇐ ⇒ ⇔ ⇴ ⇶ ⇷ ⇸ ⇹ ⇺ ⇻ ⇼ ⇽ ⇾ ⇿ ⟵ ⟶ ⟷ ⟹ ⟺ ⟻ ⟼ ⟽ ⟾ ⟿
    ⤀ ⤁ ⤂ ⤃ ⤄ ⤅ ⤆ ⤇ ⤌ ⤍ ⤎ ⤏ ⤐ ⤑ ⤔ ⤕ ⤖ ⤗ ⤘ ⤝ ⤞ ⤟ ⤠ ⥄ ⥅ ⥆ ⥇ ⥈ ⥊ ⥋ ⥎ ⥐ ⥒ ⥓ ⥖ ⥗ ⥚ ⥛ ⥞
    ⥟ ⥢ ⥤ ⥦ ⥧ ⥨ ⥩ ⥪ ⥫ ⥬ ⥭ ⥰ ⧴ ⬱ ⬰ ⬲ ⬳ ⬴ ⬵ ⬶ ⬷ ⬸ ⬹ ⬺ ⬻ ⬼ ⬽ ⬾ ⬿ ⭀ ⭁ ⭂ ⭃ ⥷ ⭄ ⥺ ⭇ ⭈ ⭉
    ⭊ ⭋ ⭌ ← → ⇜ ⇝ ↜ ↝ ↩ ↪ ↫ ↬ ↼ ↽ ⇀ ⇁ ⇄ ⇆ ⇇ ⇉ ⇋ ⇌ ⇚ ⇛ ⇠ ⇢ ↷ ↶ ↺ ↻
  `,
❯ du -sh src/parser.c
49M     src/parser.c

❯ cat src/parser.c | rg "#define.*STATE"
#define STATE_COUNT 19881
#define LARGE_STATE_COUNT 9618

Case 2 (https://github.com/ChrHorn/tree-sitter-julia/commit/e66d1bf1a73e4e42e86a70830e0d02c2016cc92d)

Deleted most of the arrow operators.

  arrow: `
   <-- --> <-->
   ← → ↔
  `,

No visible change in states and parser size.

❯ du -sh src/parser.c
49M     src/parser.c

❯ cat src/parser.c | rg "#define.*STATE"
#define STATE_COUNT 19881
#define LARGE_STATE_COUNT 9618

Case 3 (https://github.com/ChrHorn/tree-sitter-julia/commit/e64b8fcfd7fcc78fdfeacd54b145ab265367799f)

Notice the only difference to Case 2 is the one deleted arrow operator.

  arrow: `
    <-- --> <-->
    ← →
  `,

Leads to a pretty significant reduction in states and parser size.

❯ du -sh src/parser.c
32M     src/parser.c

❯ cat src/parser.c | rg "#define.*STATE"
#define STATE_COUNT 12760
#define LARGE_STATE_COUNT 5869

Not really sure what's going on. I don' think it's Unicode, for example

  arrow: `
    ⥟ ⥢ ⥤ ⥦ ⥧ ⥨ ⥩ ⥪ ⥫ ⥬ ⥭ ⥰ ⧴ ⬱ ⬰ ⬲ ⬳ ⬴ ⬵ ⬶ ⬷ ⬸ ⬹ ⬺ ⬻ ⬼ ⬽ ⬾ ⬿ ⭀ ⭁ ⭂ ⭃ ⥷ ⭄ ⥺ ⭇ ⭈ ⭉
  `,

also results in a smaller parser. I also only noticed this behavior when changing the arrow operators. The change is always binary (either smaller, or current larger parser size), nothing in between.

@savq are you able to reproduce this on your end, any idea?

@savq
Copy link
Collaborator

savq commented Aug 1, 2024

Yep. I was able to reproduce it.

The unicode operators affect the parser size in weird ways. That's part of the reason the ASCII operators are listed first (to comment out the unicode operators easily), but to be honest I never looked to much into it.

After some of the discussions regarding binary size, I've been thinking that maybe we could generate two parsers (like typescript/tsx in tree-sitter-typescript. A small one for editors, and a bigger one for other tools. The small one could omit most unicode operators, implicit multiplications, etc.

@ChrHorn
Copy link
Contributor Author

ChrHorn commented Aug 5, 2024

Did some further testing.

const start = `[_\\p{XID_Start}${validMathSymbols}\\p{Emoji}&&[^0-9#*]]`;

Removing the \\p{Emoji} here also fixes the issue and results in a smaller parser. This would probably be a less intrusive change instead of deleting a whole bunch of operators.

Could also think about using a custom scanner for the identifiers. The Base Julia definition is already in C, so it could be a relatively simple copy-paste.

https://github.com/JuliaLang/julia/blob/996351f5f6651d1508aef3c35c7d37eb22a0fb1e/src/flisp/julia_extensions.c#L127

@savq
Copy link
Collaborator

savq commented Aug 5, 2024

Could also think about using a custom scanner for the identifiers. The Base Julia definition is already in C, so it could be a relatively simple copy-paste.

The external scanner is an option. We do exactly that in JuliaPluto/lezer-julia.

To use Julia's C code directly we'd have to link JuliaStrings/utf8proc, or find another way to check unicode categories. tree-sitter used to have utf8proc as a dependency, but I've got no idea why they dropped it tho.

@clason
Copy link

clason commented Aug 9, 2024

tree-sitter used to have utf8proc as a dependency, but I've got no idea why they dropped it tho.

Wasm parsers, which is also why pulling that in for an external scanner is a no-go. (It's still a dependency for the lib.)

Tree-sitter (and its base tooling) seems to have difficulties with certain unicode character classes, especially if you pick-and-choose from them rather than taking them as-is (Perl had a related issue). This is really something that should be solved upstream, so please open an issue about it there with the "binary change" here -- which might be helpful in tracking down a bug or possible optimization. (There have already been significant improvements in this area in recent version but obviously there's still much room for improvement.)

My recommendation here is to bisect the glyphs to identify low-hanging fruits (individually or in groups) and comment them out until the upstream issue is fixed. Parser size is also directly linked to load time, and you won't see many people use this parser if opening a Julia file freezes the editor for a second or half...

(Personally, since actually using Julia I have transitioned from "Unicode is cute!" to "Unicode is evil!" since it makes searching and replacing in editors so much harder, and over-use quickly makes code unreadable.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants