using Unicode Category #175
Comments
This is definitely something to be added post |
It is possible with nom + regex (see the re_find_static macro). If character matching is done via a function call, the same can be done. |
There's the The |
I think the best way to go about adding Unicode Property support is to add a new terminal - a regex literal, using the regex crate. Note that this shouldn't break any complexity promises provided by the PEG grammar structure, as the Rust So, for my identifier rule, instead of writing something like:
I could write the following:
When pest is turning this into parser code, it should prepend a Of course, this would not preclude using pest's composition tools directly; it adds, alongside the literal terminal, another terminal option that gives a bit more power while still being O(1) to match that terminal. If someone can give me pointers on where to start adding this new terminal, I'm interested in using it (I think a PEG grammar is just what my project wants!), and I'd love to help with implementing it. In the meantime, I could dump Unicode categories into pest rules if anyone's interested. And speaking as a co-developer for unic, I can't really recommend |
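For readers following along, here is a minimal std-only sketch of what an identifier terminal driven by Unicode properties has to do: match only at the current position and report how many bytes it consumed. It is not the regex-backed terminal proposed above. The function name `match_identifier` is hypothetical, and `char::is_alphabetic` / `char::is_alphanumeric` are used as rough stand-ins for whatever identifier properties a real grammar would name.

```rust
/// Hypothetical helper: returns the byte length of an identifier located at
/// the very start of `input`, or `None` if the input does not begin with one.
fn match_identifier(input: &str) -> Option<usize> {
    let mut chars = input.char_indices();

    // First character: stand-in for an "identifier start" property.
    match chars.next() {
        Some((_, c)) if c.is_alphabetic() || c == '_' => {}
        _ => return None,
    }

    // Remaining characters: stand-in for an "identifier continue" property.
    for (idx, c) in chars {
        if !(c.is_alphanumeric() || c == '_') {
            // `idx` is the byte offset of the first non-identifier character.
            return Some(idx);
        }
    }

    // The whole input is one identifier.
    Some(input.len())
}

fn main() {
    for src in ["naïve_1 = 2", "переменная", "1abc"] {
        println!("{:?} -> {:?}", src, match_identifier(src));
    }
}
```

The match is anchored at offset 0 and never scans ahead for a later match, which is the behavior a PEG terminal needs.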
Here's some very WIP progress (it doesn't even come close to working yet): master...CAD97:regex-terminal The basic outline that I have drafted out:
I'm actually not exactly sure which steps of the parse I've missed that make the derive panic, but that's probably partially due to the fact that it's really late now and I'm going to bed as soon as I post this. Derive panic
|
OK, I've got a somewhat-working version at #246! I've yet to try actually using it, but the added test rule under |
@CAD97 Thank you for opening up the discussion and this PR. 🎉 Here are my philosophical 2 cents on the matter: One of the goals with pest was to have a library with a small number of dependencies, making it a pretty bare-bones solution for parsing. We've also been constantly working on making compilation times smaller since the first alpha releases of pest. The regex crate is an amazing piece of engineering, but it's also a pretty big project with a considerable amount of overlap with pest's functionality. I personally consider a more focused approach to be the winning solution here. What do you think about using As a side note, one of the first designs with pest was to have a regex-based terminal parser, but I finally went with the current approach because:
Really excited to hear your take and to finally add properties to pest! |
Here's a short summary of some of the points I considered before prototyping out the regex terminals: Reasons not to avoid regex/lazy_static:
Reasons to reuse regex
Reasons to avoid regex
On using
On a more targeted implementation
|
I've just created a tool that translates ucd-generate generated tables for binary properties (this includes General_Category value inclusion) into hopefully correct pest code. https://github.com/CAD97/pest-unicode/tree/master/pest Note that this does assume that an optimization pass for transforming I'd be perfectly happy to bring that in-tree and have that close #246. Then this issue could be closed when #197 is added -- maybe they could even be provided by default. |
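To make the table-to-grammar translation concrete, here is a rough sketch of the idea, not the linked tool's actual code. It assumes the property table is available as a slice of inclusive `(char, char)` ranges and renders it as one pest rule of range alternatives. The function name `ranges_to_pest_rule` and the rule name `DECIMAL_EXCERPT` are hypothetical, and the two ranges are only a small excerpt of decimal digits rather than a full Unicode table.

```rust
/// Hypothetical converter: renders inclusive code point ranges as a single
/// pest rule whose body is an ordered choice of character ranges.
fn ranges_to_pest_rule(name: &str, ranges: &[(char, char)]) -> String {
    let alternatives: Vec<String> = ranges
        .iter()
        // Each range becomes '\u{LO}'..'\u{HI}' using hexadecimal escapes.
        .map(|&(lo, hi)| format!("'\\u{{{:X}}}'..'\\u{{{:X}}}'", lo as u32, hi as u32))
        .collect();
    format!("{} = {{ {} }}", name, alternatives.join(" | "))
}

fn main() {
    // Excerpt only: ASCII digits and Arabic-Indic digits.
    let excerpt = [('0', '9'), ('\u{660}', '\u{669}')];
    println!("{}", ranges_to_pest_rule("DECIMAL_EXCERPT", &excerpt));
}
```

A long chain of alternatives like this is matched linearly, which is why the later discussion turns to optimized lookups.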
@CAD97 I quite like this approach. I agree with the arguments about regex and I think that the door is very much open to using it in order to gain better performance without altering the current grammar. #197 requires a non-trivial amount of redesign to implement, I'm afraid. It would probably come to fruition after the 2.0 launch; however, these ranges could be lazily generated as needed in the grammar that's using the property. I would imagine a Map from property names to pest rules defined in Rust code, similar to builtins. After #197, these could be migrated into actual pest grammars. In terms of optimization, I think it's a good idea to make use of ucd-trie. |
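One way to picture the name-to-rule map is the std-only sketch below. It swaps the suggested ucd-trie lookup for a plain binary search over sorted ranges, purely to stay dependency-free. The names `property_table` and `table_contains` are hypothetical, and the tables are small excerpts, not complete Unicode data.

```rust
use std::cmp::Ordering;

// Illustrative excerpts only; real tables would be generated from the UCD.
// Ranges are inclusive, sorted, and non-overlapping.
const DECIMAL_NUMBER_EXCERPT: &[(char, char)] = &[('0', '9'), ('\u{660}', '\u{669}')];
const WHITE_SPACE_EXCERPT: &[(char, char)] =
    &[('\t', '\r'), (' ', ' '), ('\u{85}', '\u{85}'), ('\u{A0}', '\u{A0}')];

/// Hypothetical lookup from a property name to its range table.
fn property_table(name: &str) -> Option<&'static [(char, char)]> {
    match name {
        "DECIMAL_NUMBER" => Some(DECIMAL_NUMBER_EXCERPT),
        "WHITE_SPACE" => Some(WHITE_SPACE_EXCERPT),
        _ => None,
    }
}

/// Binary search for `c` in a sorted table of inclusive ranges.
fn table_contains(table: &[(char, char)], c: char) -> bool {
    table
        .binary_search_by(|&(lo, hi)| {
            if hi < c {
                Ordering::Less // this range lies entirely below `c`
            } else if lo > c {
                Ordering::Greater // this range lies entirely above `c`
            } else {
                Ordering::Equal // `c` falls inside this range
            }
        })
        .is_ok()
}

fn main() {
    let digits = property_table("DECIMAL_NUMBER").expect("known property");
    println!("{}", table_contains(digits, '\u{665}'));
    println!("{}", table_contains(digits, 'a'));
}
```

Binary search costs O(log n) per character in the number of ranges; a trie would trade memory for fewer comparisons per lookup.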
Alright, I'll take a look at adding some on-demand builtins for these properties then! |
@CAD97 Feel free to experiment with this. I'm far from an expert when it comes to Unicode properties and the optimizations possible, so I'm really excited for your proposal. 😄 |
I noticed something while doing this: The set of pest_keywords is { ANY, DROP, EOI, PEEK, PEEK_ALL, POP, POP_ALL, PUSH, skip, SOI }. However, there are already further builtins beyond this: { DIGIT, NONZERO_DIGIT, BIN_DIGIT, OCT_DIGIT, HEX_DIGIT, ALPHA_LOWER } and so on. The behavior of all builtins is that their parse functions only get generated if they're asked for. Thus, these builtins already behave how we'd want the Unicode Property builtins to behave -- provided when asked for, but silently shadowable. The design that I've settled on initially is to have So, if I just add all of these properties to the built-in list, then we get plumbing from existing infrastructure. I'll send the proof of concept PR later today. |
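The "provided when asked for, but silently shadowable" behavior can be sketched as a set computation. This is only an illustration of the described behavior, not pest's generator code; `builtins_to_generate` and the rule name `ident` are hypothetical.

```rust
use std::collections::BTreeSet;

/// Hypothetical selection step: a builtin is generated only when the grammar
/// references it and does not define a rule with the same name.
fn builtins_to_generate<'a>(
    builtins: &BTreeSet<&'a str>,
    referenced: &BTreeSet<&'a str>,
    user_defined: &BTreeSet<&'a str>,
) -> BTreeSet<&'a str> {
    builtins
        .intersection(referenced) // asked for by the grammar
        .filter(|name| !user_defined.contains(*name)) // not shadowed by the user
        .copied()
        .collect()
}

fn main() {
    let builtins: BTreeSet<&str> = ["DIGIT", "HEX_DIGIT", "ALPHA_LOWER"].iter().copied().collect();
    let referenced: BTreeSet<&str> = ["DIGIT", "ALPHA_LOWER", "ident"].iter().copied().collect();
    // The user redefines ALPHA_LOWER, so the builtin version is skipped.
    let user_defined: BTreeSet<&str> = ["ident", "ALPHA_LOWER"].iter().copied().collect();

    println!("{:?}", builtins_to_generate(&builtins, &referenced, &user_defined));
}
```

Under this scheme, adding many property names to the builtin list costs nothing for grammars that never mention them.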
@CAD97 This sounds great! Having particularly long and many |
I just submitted #247 for the smaller change using |
I want to do lexical analysis by the Unicode category.
Is there any way to get enough performance?