
using Unicode Category #175

Closed
ekicyou opened this issue Dec 6, 2017 · 15 comments · Fixed by #247

@ekicyou

ekicyou commented Dec 6, 2017

I want to do lexical analysis by the Unicode category.
Is there any way to get enough performance?

id          = id1 idn*
id1         = [\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}]
idn         = [\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Cf}]
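The "function call" approach mentioned below can be sketched with plain Rust. This is a minimal illustration only: it uses the standard library's coarse `char::is_alphabetic` / `char::is_numeric` as stand-ins for the full category sets (`Lu`, `Ll`, `Lt`, `Lm`, `Lo`, `Nl`, ...) that the grammar above requires.

```rust
// Sketch: identifier lexing driven by char-predicate functions.
// The std predicates are only rough stand-ins for the real category sets.

fn is_id_start(c: char) -> bool {
    // stand-in for [\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}]
    c.is_alphabetic()
}

fn is_id_continue(c: char) -> bool {
    // stand-in for the larger idn set (adds marks, digits, connectors, ...)
    is_id_start(c) || c.is_numeric() || c == '_'
}

/// Byte length of the identifier at the start of `input`, if any.
fn lex_id(input: &str) -> Option<usize> {
    let mut chars = input.char_indices();
    let (_, first) = chars.next()?;
    if !is_id_start(first) {
        return None;
    }
    for (idx, c) in chars {
        if !is_id_continue(c) {
            return Some(idx);
        }
    }
    Some(input.len())
}

fn main() {
    assert_eq!(lex_id("héllo world"), Some("héllo".len()));
    assert_eq!(lex_id("1abc"), None);
}
```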
@dragostis
Contributor

dragostis commented Dec 8, 2017

This is definitely something to be added post 1.0. Is there any crate that is able to do this on crates.io?

@ekicyou
Author

ekicyou commented Dec 9, 2017

Is there any crate that is able to do this on crates.io?

nom + regex is possible, for example via the re_find_static macro.

If character matching could be done through a function call, the same thing would be possible.
Alternatively, it would be appreciated if the peg grammar could be converted to a regex call.

@dragostis dragostis added this to the v1.1 milestone Jan 22, 2018
@wirelyre
Contributor

There's the unic-ucd-category crate (part of unic).

The regex crate uses a special tool to directly parse the Unicode Character Database into Rust code.

@CAD97
Contributor

CAD97 commented May 31, 2018

I think the best way to go about adding Unicode Property support is to add a new terminal - a regex literal, using the regex crate.

Note that this shouldn't break any complexity promises provided by the PEG grammar structure, as the Rust regex crate is non-backtracking.

So, for my identifier rule, instead of writing something like:

identifier = @{ identifier_head ~ identifier_tail* }
identifier_head = { 'a'..'z' | 'A'..'Z' }
identifier_tail = { identifier_head | '0'..'9' }

I could write the following:

identifier = { r"\p{XID_START}\p{XID_CONTINUE}*" }

When pest is turning this into parser code, it should prepend a ^ to the regex to anchor it to the start of the input. This will allow the regex crate to use its highly optimized byte-level recognizer.
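The anchoring point can be illustrated without the regex crate: a PEG terminal must match at the current cursor only, never scan ahead for a match later in the input. A minimal std-only sketch of anchored matching (the real proposal would instead prepend `^` to the pattern and let `regex` do this):

```rust
// Anchored match: length matched at the *start* of `input`, or None.
// An unanchored search would keep scanning forward; a PEG terminal must not.
fn match_anchored(input: &str, pred: fn(char) -> bool) -> Option<usize> {
    let matched: usize = input
        .chars()
        .take_while(|&c| pred(c))
        .map(char::len_utf8)
        .sum();
    if matched == 0 { None } else { Some(matched) }
}

fn main() {
    // Matches only because the terminal starts at the cursor:
    assert_eq!(match_anchored("abc!", char::is_alphabetic), Some(3));
    // An unanchored search would still find "abc" here; anchored, it fails fast:
    assert_eq!(match_anchored("!abc", char::is_alphabetic), None);
}
```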

Of course, this would not preclude using pest's composition tools directly; rather, in addition to the literal terminal, it adds another terminal option that gives a bit more power while still being O(1) to match that terminal.

If someone can give me pointers on where to go about adding this new terminal, I'm interested in using it (I think a PEG grammar is just what my project wants!), and I'd love to help with implementing it.

In the mean time, I could dump Unicode categories into pest rules if anyone's interested.


And speaking as a co-developer for unic, I can't really recommend unic-ucd-category for this use case. For this specific use case, the regex crate beats us out in every possible way currently. We are very much just trying for exposing every bit of information from ICU in a modular way rather than optimization or even stability currently. That plus the limited time either I or Behnam have been able to spend on the project recently... regex is a better fit for this use case.

@CAD97
Contributor

CAD97 commented May 31, 2018

Here's some very WIP progress (it doesn't even come close to working yet): master...CAD97:regex-terminal

The basic outline that I have drafted out:

  • Using a regex terminal adds a peer runtime dependency on regex and lazy_static for the parser
  • Collect regex literal terminals in the form of raw strings
    • e.g. r#"\p{XID_START}\p{XID_CONTINUE}"#
  • Supporting # demarcation is just to be nice
  • Prepend ^ to regex literal to anchor to start of remaining input
  • When "optimizing", associate each unique regex string with a unique ID
  • On emitting the parser, create a lazy_static!ed array of Regex structs
  • On starting the parse, dereference that lazy static
  • During parse, index into that array to refer to each regex
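The lazy-static-array step of this outline can be sketched with std only. Here `OnceLock` stands in for `lazy_static!`, and a boxed anchored-matcher closure stands in for a compiled `regex::Regex` (the slot-0 pattern and its `is_alphanumeric` body are loose stand-ins, not the real `\p{XID_START}` sets):

```rust
use std::sync::OnceLock;

// Stand-in for a compiled regex: a boxed anchored matcher.
type Matcher = Box<dyn Fn(&str) -> Option<usize> + Send + Sync>;

// Stand-in for the lazy_static!ed array: one slot per unique regex
// literal in the grammar, compiled on first use.
static MATCHERS: OnceLock<Vec<Matcher>> = OnceLock::new();

fn matchers() -> &'static [Matcher] {
    MATCHERS.get_or_init(|| {
        vec![
            // Slot 0: would be Regex::new(r"^\p{XID_START}\p{XID_CONTINUE}*")
            Box::new(|s: &str| {
                let n: usize = s
                    .chars()
                    .take_while(|c| c.is_alphanumeric())
                    .map(char::len_utf8)
                    .sum();
                if n == 0 { None } else { Some(n) }
            }),
        ]
    })
}

fn main() {
    // During the parse, a regex terminal with ID 0 indexes into the array:
    assert_eq!(matchers()[0]("id42 rest"), Some(4));
    assert_eq!(matchers()[0](" leading"), None);
}
```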

I'm actually not exactly sure which steps of the parse I've missed that make the derive panic, but that's probably partially because it's really late now and I'm going to bed as soon as I post this.

Derive panic
error: proc-macro derive panicked
  --> derive\tests\grammar.rs:15:10
   |
15 | #[derive(Parser)]
   |          ^^^^^^
   |
   = help: message: error parsing "grammar.pest"
           
             --> 26:12
              |
           26 | regex = { r"\p{XID_START}\p{XID_CONTINUE}*" }
              |            ^---
              |
              = expected `{`, `}`, `&`, `|`, `?`, `*`, or `+`

@CAD97
Contributor

CAD97 commented Jun 1, 2018

OK, I've got a somewhat-working version at #246! I've yet to try actually using it, but the added test rule under pest_derive suggests that it's working.

@dragostis
Contributor

@CAD97 Thank you for opening up the discussion and this PR. 🎉Here are my philosophical 2 cents on the matter:

One of the goals with pest was to have a library with a small number of dependencies, making it a pretty bare-bones solution for parsing. We've also been constantly working on making compilation times smaller since the first alpha releases of pest. The regex crate is an amazing piece of engineering, but it's also a pretty big project with a considerable amount of overlap with pest's functionality. I personally consider a more focused approach to be the winning solution here. What do you think about using regex-syntax as a dependency instead? (or maybe even a fork?) From there, we can make direct use of the Unicode ranges without adding needless complexity to the pest crate.

As a side note, one of the first designs for pest was to have a regex-based terminal parser, but I finally went with the current approach because:

  • regex knowledge should not be a prerequisite for pest users; pest should be particularly aimed at newcomers
  • having a unified, integrated syntax from terminal parsing all the way to high level rules instead of having to combine regex with PEG
  • as mentioned in your PR, pest's greediness would be confusing when mixed with regex

Really excited to hear your take and to finally add properties to pest!

@CAD97
Contributor

CAD97 commented Jun 1, 2018

Here's a short summary of some of the points I considered before prototyping out the regex terminals:

Reasons not to avoid regex/lazy_static:

  • Both crates are version 1.0 with strong promises of stability.
  • regex will only affect first-compile times -- subsequent compiles will reuse the compiled crate.
  • regex is a rust-lang repository and thus backed by the rust core team.
  • As implemented in [WIP] Add regex terminal #246, it's a strictly additive feature to the pest grammar. Everything that worked before still works, and I'd expect the standard style to remain the same, only dropping into regex terminals where required.
  • Though regex has a bias towards dense blobs, you can write good clear regex.
  • The public/peer dependencies can be hidden.
    • Add a newtype to pest that just opaquely wraps regex::Regex, then
    • Expose a macro pest_regexes! or similar that wraps up a set of regexes as in [WIP] Add regex terminal #246.

Reasons to reuse regex

  • More often than not, parsing starts off as a blob of regexps to extract information.
  • We get the performance of a highly tuned byte-based parser for these property-based terminals.
  • Code addition to pest is minimal (I was able to do (most of) it in less than a day with no prior pest internals knowledge!).

Reasons to avoid regex

  • Semantics of + and * repetitions differ between pest and regex, which could be confusing
  • Adds a second way of doing produced-terminal parsing ({r"regex"} instead of @{pest}).

On using regex-syntax instead

  • The crate is explicitly not stabilized, so this would add a 0.X dependency to chase updates on.

On a more targeted implementation

  • I could use ucd-generate to create char-based parsers.
  • Sharing rules between grammars? #197 would be a really interesting way to solve this:
    • Turn binary properties into pest matchers, and
    • Use these external pest files as fragments.
  • I'd be happy to maintain these fragments (in @pest-parser or externally)!
  • Having done the experimentation around adding regex, this might actually be the best path of action.
  • Just translating the property into pest, it gets the pest behavior and optimizations automagically!

@CAD97
Contributor

CAD97 commented Jun 1, 2018

I've just created a tool that translates ucd-generate generated tables for binary properties (this includes General_Category value inclusion) into hopefully correct pest code.

https://github.com/CAD97/pest-unicode/tree/master/pest
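For a sense of the shape of such output, a generated binary-property rule might look roughly like the following. This is a hypothetical excerpt, not the tool's actual text; the Latin-1 ranges shown are real Lowercase_Letter runs, and the single-codepoint range `'µ'..'µ'` is exactly the kind of thing the `Range(c, c)` cleanup discussed below would rewrite:

```
LOWERCASE_LETTER = { 'a'..'z' | 'µ'..'µ' | 'ß'..'ö' | 'ø'..'ÿ' }
```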

Note that this does assume that an optimization pass for transforming Expr::Range(c, c) into Expr::Str(c) exists for simplicity during generation. Given the amount of Expr::Choice involved, I'm not sure whether that will actually have a noticeable effect on the generated code or not, but it's definitely a simple pass that should be done to clean up many of the single-char ranges created.
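That optimization pass is simple enough to model on a toy expression tree. The variant names mirror pest's internal `Expr`, but this is a pared-down sketch (real pest uses a different tree shape, e.g. binary choices):

```rust
// Toy model of the Range(c, c) -> Str(c) cleanup pass.
#[derive(Debug, PartialEq)]
enum Expr {
    Range(char, char),
    Str(String),
    Choice(Vec<Expr>),
}

/// Rewrite every single-codepoint Range(c, c) into Str(c), recursing
/// through choices so nested single-char ranges are cleaned up too.
fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::Range(a, b) if a == b => Expr::Str(a.to_string()),
        Expr::Choice(alts) => Expr::Choice(alts.into_iter().map(simplify).collect()),
        other => other,
    }
}

fn main() {
    let before = Expr::Choice(vec![Expr::Range('a', 'z'), Expr::Range('µ', 'µ')]);
    let after = simplify(before);
    assert_eq!(
        after,
        Expr::Choice(vec![Expr::Range('a', 'z'), Expr::Str("µ".to_string())])
    );
}
```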

I'd be perfectly happy to bring that in-tree and have that close #246. Then this issue could be closed when #197 is added -- maybe they could even be default-provided for using.

@dragostis
Contributor

@CAD97 I quite like this approach. I agree with the arguments about regex and I think that the door is very much open to using it in order to gain better performance without altering the current grammar.

#197 requires a non-trivial amount of redesign to implement, I'm afraid. It would probably come to fruition post the 2.0 launch; however, these ranges could be lazily generated as needed by the grammar that's using the property. I would imagine a Map from property names to pest rules defined in Rust code, similar to builtins. After #197, these could be migrated into actual pest grammars.

In terms of optimization, I think it's a good idea to make use of ucd-trie.

@CAD97
Contributor

CAD97 commented Jun 1, 2018

Alright, I'll take a look at adding some on-demand builtins for these properties then!

@dragostis
Contributor

dragostis commented Jun 1, 2018

@CAD97 Feel free to experiment with this. I'm far from an expert when it comes to Unicode properties and the optimizations possible, so I'm really excited for your proposal. 😄

@CAD97
Contributor

CAD97 commented Jun 1, 2018

I noticed something while doing this:

The set of pest_keywords is { ANY, DROP, EOI, PEEK, PEEK_ALL, POP, POP_ALL, PUSH, skip, SOI }. However, there are already further builtins beyond this: { DIGIT, NONZERO_DIGIT, BIN_DIGIT, OCT_DIGIT, HEX_DIGIT, ALPHA_LOWER } and so on.

The behavior of all builtins is that their parse functions only get generated if they're asked for. Thus, these builtins already behave how we'd want the Unicode Property builtins to behave -- provided when asked for, but silently shadowable.

The design that I've initially settled on is to have a #[doc(hidden)] pub fn(char) -> bool in pest for each property, and to rely on dead code elimination to remove unused tables. This allows us to encapsulate the ucd-trie dependency by internalizing it.
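A std-only sketch of that `pub fn(char) -> bool` shape: the range table here is a tiny hand-written stand-in queried by binary search, whereas the real tables come from ucd-generate and are queried through ucd-trie.

```rust
// Sorted, non-overlapping inclusive ranges (stand-in data, not the real
// UCD XID_Start table; '_' is included here only for illustration).
const XID_START_STANDIN: &[(char, char)] = &[('A', 'Z'), ('_', '_'), ('a', 'z')];

#[doc(hidden)]
pub fn xid_start(c: char) -> bool {
    // Binary search for the range containing `c`.
    XID_START_STANDIN
        .binary_search_by(|&(lo, hi)| {
            if c < lo {
                std::cmp::Ordering::Greater
            } else if c > hi {
                std::cmp::Ordering::Less
            } else {
                std::cmp::Ordering::Equal
            }
        })
        .is_ok()
}

fn main() {
    assert!(xid_start('q'));
    assert!(xid_start('_'));
    assert!(!xid_start('3'));
}
```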

So, if I just add all of these properties to the built-in list, then we get plumbing from existing infrastructure.

I'll send the proof of concept PR later today.

@dragostis
Contributor

@CAD97 This sounds great! Having many, particularly long quote! macro invocations might affect compile times dramatically, but this is very hard to predict.

@CAD97
Contributor

CAD97 commented Jun 2, 2018

I just submitted #247 for the smaller change using ucd-generate-generated ucd-trie tables to power non-keyword builtins.
