
using Unicode Category #175

Closed
ekicyou opened this issue Dec 6, 2017 · 15 comments · Fixed by #247

@ekicyou

ekicyou commented Dec 6, 2017

I want to do lexical analysis by the Unicode category.
Is there any way to get enough performance?

id          = id1 idn*
id1         = [\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}]
idn         = [\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Cf}]
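The "function call" approach mentioned below can be sketched with plain Rust. This is a minimal illustration only: it uses the standard library's coarse `char::is_alphabetic` / `char::is_numeric` as stand-ins for the full category sets (`Lu`, `Ll`, `Lt`, `Lm`, `Lo`, `Nl`, ...) that the grammar above requires.

```rust
// Sketch: identifier lexing driven by char-predicate functions.
// The std predicates are only rough stand-ins for the real category sets.

fn is_id_start(c: char) -> bool {
    // stand-in for [\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}]
    c.is_alphabetic()
}

fn is_id_continue(c: char) -> bool {
    // stand-in for the larger idn set (adds marks, digits, connectors, ...)
    is_id_start(c) || c.is_numeric() || c == '_'
}

/// Byte length of the identifier at the start of `input`, if any.
fn lex_id(input: &str) -> Option<usize> {
    let mut chars = input.char_indices();
    let (_, first) = chars.next()?;
    if !is_id_start(first) {
        return None;
    }
    for (idx, c) in chars {
        if !is_id_continue(c) {
            return Some(idx);
        }
    }
    Some(input.len())
}

fn main() {
    assert_eq!(lex_id("héllo world"), Some("héllo".len()));
    assert_eq!(lex_id("1abc"), None);
}
```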
@dragostis
Contributor

dragostis commented Dec 8, 2017

This is definitely something to be added post 1.0. Is there any crate that is able to do this on crates.io?

@ekicyou
Author

ekicyou commented Dec 9, 2017

Is there any crate that is able to do this on crates.io?

nom + regex is possible, for example via the re_find_static macro.

If character matching could be done through a function call, the same thing would be possible.
Alternatively, it would be appreciated if the peg grammar could be converted to a regex call.

@dragostis dragostis added this to the v1.1 milestone Jan 22, 2018
@wirelyre
Contributor

There's the unic-ucd-category crate (part of unic).

The regex crate uses a special tool to directly parse the Unicode Character Database into Rust code.

@CAD97
Contributor

CAD97 commented May 31, 2018

I think the best way to go about adding Unicode Property support is to add a new terminal - a regex literal, using the regex crate.

Note that this shouldn't break any complexity promises provided by the PEG grammar structure, as the Rust regex crate is non-backtracking.

So, for my identifier rule, instead of writing something like:

identifier = @{ identifier_head ~ identifier_tail* }
identifier_head = { 'a'..'z' | 'A'..'Z' }
identifier_tail = { identifier_head | '0'..'9' }

I could write the following:

identifier = { r"\p{XID_START}\p{XID_CONTINUE}*" }

When pest is turning this into parser code, it should prepend a ^ to the regex to anchor it to the start of the input. This will allow the regex crate to use its highly optimized byte-level recognizer.
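The anchoring point can be illustrated without the regex crate: a PEG terminal must match at the current cursor only, never scan ahead for a match later in the input. A minimal std-only sketch of anchored matching (the real proposal would instead prepend `^` to the pattern and let `regex` do this):

```rust
// Anchored match: length matched at the *start* of `input`, or None.
// An unanchored search would keep scanning forward; a PEG terminal must not.
fn match_anchored(input: &str, pred: fn(char) -> bool) -> Option<usize> {
    let matched: usize = input
        .chars()
        .take_while(|&c| pred(c))
        .map(char::len_utf8)
        .sum();
    if matched == 0 { None } else { Some(matched) }
}

fn main() {
    // Matches only because the terminal starts at the cursor:
    assert_eq!(match_anchored("abc!", char::is_alphabetic), Some(3));
    // An unanchored search would still find "abc" here; anchored, it fails fast:
    assert_eq!(match_anchored("!abc", char::is_alphabetic), None);
}
```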

Of course, this would not preclude using pest's composition tools directly; rather, in addition to the literal terminal, it adds another terminal option that gives a bit more power while still being O(1) to match that terminal.

If someone can give me pointers on where to go about adding this new terminal, I'm interested in using it (I think a PEG grammar is just what my project wants!), and I'd love to help with implementing it.

In the mean time, I could dump Unicode categories into pest rules if anyone's interested.


And speaking as a co-developer for unic, I can't really recommend unic-ucd-category for this use case. For this specific use case, the regex crate beats us out in every possible way currently. We are very much just trying for exposing every bit of information from ICU in a modular way rather than optimization or even stability currently. That plus the limited time either I or Behnam have been able to spend on the project recently... regex is a better fit for this use case.

@CAD97
Contributor

CAD97 commented May 31, 2018

Here's some very WIP progress (it doesn't even come close to working yet): master...CAD97:regex-terminal

The basic outline that I have drafted out:

  • Using a regex terminal adds a peer runtime dependency on regex and lazy_static for the parser
  • Collect regex literal terminals in the form of raw strings
    • e.g. r#"\p{XID_START}\p{XID_CONTINUE}"#
  • Supporting # demarcation is just to be nice
  • Prepend ^ to regex literal to anchor to start of remaining input
  • When "optimizing", associate each unique regex string with a unique ID
  • On emitting the parser, create a lazy_static!ed array of Regex structs
  • On starting the parse, dereference that lazy static
  • During parse, index into that array to refer to each regex
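The lazy-static-array step of this outline can be sketched with std only. Here `OnceLock` stands in for `lazy_static!`, and a boxed anchored-matcher closure stands in for a compiled `regex::Regex` (the slot-0 pattern and its `is_alphanumeric` body are loose stand-ins, not the real `\p{XID_START}` sets):

```rust
use std::sync::OnceLock;

// Stand-in for a compiled regex: a boxed anchored matcher.
type Matcher = Box<dyn Fn(&str) -> Option<usize> + Send + Sync>;

// Stand-in for the lazy_static!ed array: one slot per unique regex
// literal in the grammar, compiled on first use.
static MATCHERS: OnceLock<Vec<Matcher>> = OnceLock::new();

fn matchers() -> &'static [Matcher] {
    MATCHERS.get_or_init(|| {
        vec![
            // Slot 0: would be Regex::new(r"^\p{XID_START}\p{XID_CONTINUE}*")
            Box::new(|s: &str| {
                let n: usize = s
                    .chars()
                    .take_while(|c| c.is_alphanumeric())
                    .map(char::len_utf8)
                    .sum();
                if n == 0 { None } else { Some(n) }
            }),
        ]
    })
}

fn main() {
    // During the parse, a regex terminal with ID 0 indexes into the array:
    assert_eq!(matchers()[0]("id42 rest"), Some(4));
    assert_eq!(matchers()[0](" leading"), None);
}
```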

I'm actually not exactly sure which steps of the parse I've missed that make the derive panic, but that's probably partially because it's really late now and I'm going to bed as soon as I post this.

Derive panic
error: proc-macro derive panicked
  --> derive\tests\grammar.rs:15:10
   |
15 | #[derive(Parser)]
   |          ^^^^^^
   |
   = help: message: error parsing "grammar.pest"
           
             --> 26:12
              |
           26 | regex = { r"\p{XID_START}\p{XID_CONTINUE}*" }
              |            ^---
              |
              = expected `{`, `}`, `&`, `|`, `?`, `*`, or `+`

@CAD97
Contributor

CAD97 commented Jun 1, 2018

OK, I've got a somewhat-working version at #246! I've yet to try actually using it, but the added test rule under pest_derive suggests that it's working.

@dragostis
Contributor

@CAD97 Thank you for opening up the discussion and this PR. 🎉Here are my philosophical 2 cents on the matter:

One of the goals with pest was to have a library with a small number of dependencies, making it a pretty bare-bones solution for parsing. We've also been constantly working on making compilation times smaller since the first alpha releases of pest. The regex crate is an amazing piece of engineering, but it's also a pretty big project with a considerable amount of overlap with pest's functionality. I personally consider a more focused approach to be the winning solution here. What do you think about using regex-syntax as a dependency instead? (or maybe even a fork?) From there, we can make direct use of the Unicode ranges without adding needless complexity to the pest crate.

As a side note, one of the first designs for pest was to have a regex-based terminal parser, but I finally went with the current approach because:

  • regex knowledge should not be a prerequisite for pest users; pest should be particularly aimed at newcomers
  • having a unified, integrated syntax from terminal parsing all the way to high level rules instead of having to combine regex with PEG
  • as mentioned in your PR, pest's greediness would be confusing when mixed with regex

Really excited to hear your take and to finally add properties to pest!

@CAD97
Contributor

CAD97 commented Jun 1, 2018

Here's a short summary of some of the points I considered before prototyping out the regex terminals:

Reasons not to avoid regex/lazy_static:

  • Both crates are version 1.0 with strong promises of stability.
  • regex will only affect first-compile times -- subsequent compiles will reuse the compiled crate.
  • regex is a rust-lang repository and thus backed by the rust core team.
  • As implemented in [WIP] Add regex terminal #246, it's a strictly additive feature to the pest grammar. Everything that worked before still works, and I'd expect the standard style to remain the same, only dropping into regex terminals where required.
  • Though regex has a bias towards dense blobs, you can write good clear regex.
  • The public/peer dependencies can be hidden.
    • Add a newtype to pest that just opaquely wraps regex::Regex, then
    • Expose a macro pest_regexes! or similar that wraps up a set of regexes as in [WIP] Add regex terminal #246.

Reasons to reuse regex

  • More often than not, parsing starts off as a blob of regexps to extract information.
  • We get the performance of a highly tuned byte-based parser for these property-based terminals.
  • Code addition to pest is minimal (I was able to do (most of) it in less than a day with no prior pest internals knowledge!).

Reasons to avoid regex

  • Semantics of + and * repetitions differ between pest and regex, which could be confusing
  • Adds a second way of doing produced-terminal parsing ({r"regex"} instead of @{pest}).

On using regex-syntax instead

  • The crate is explicitly not stabilized, so this would add a 0.X dependency to chase updates on.

On a more targeted implementation

  • I could use ucd-generate to create char-based parsers.
  • Sharing rules between grammars? #197 would be a really interesting way to solve this:
    • Turn binary properties into pest matchers, and
    • Use these external pest files as fragments.
  • I'd be happy to maintain these fragments (in @pest-parser or externally)!
  • Having done the experimentation around adding regex, this might actually be the best path of action.
  • Just translating the property into pest, it gets the pest behavior and optimizations automagically!

@CAD97
Contributor

CAD97 commented Jun 1, 2018

I've just created a tool that translates ucd-generate generated tables for binary properties (this includes General_Category value inclusion) into hopefully correct pest code.

https://github.com/CAD97/pest-unicode/tree/master/pest
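For a sense of the shape of such output, a generated binary-property rule might look roughly like the following. This is a hypothetical excerpt, not the tool's actual text; the Latin-1 ranges shown are real Lowercase_Letter runs, and the single-codepoint range `'µ'..'µ'` is exactly the kind of thing the `Range(c, c)` cleanup discussed below would rewrite:

```
LOWERCASE_LETTER = { 'a'..'z' | 'µ'..'µ' | 'ß'..'ö' | 'ø'..'ÿ' }
```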

Note that this does assume that an optimization pass for transforming Expr::Range(c, c) into Expr::Str(c) exists for simplicity during generation. Given the amount of Expr::Choice involved, I'm not sure whether that will actually have a noticeable effect on the generated code or not, but it's definitely a simple pass that should be done to clean up many of the single-char ranges created.
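That optimization pass is simple enough to model on a toy expression tree. The variant names mirror pest's internal `Expr`, but this is a pared-down sketch (real pest uses a different tree shape, e.g. binary choices):

```rust
// Toy model of the Range(c, c) -> Str(c) cleanup pass.
#[derive(Debug, PartialEq)]
enum Expr {
    Range(char, char),
    Str(String),
    Choice(Vec<Expr>),
}

/// Rewrite every single-codepoint Range(c, c) into Str(c), recursing
/// through choices so nested single-char ranges are cleaned up too.
fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::Range(a, b) if a == b => Expr::Str(a.to_string()),
        Expr::Choice(alts) => Expr::Choice(alts.into_iter().map(simplify).collect()),
        other => other,
    }
}

fn main() {
    let before = Expr::Choice(vec![Expr::Range('a', 'z'), Expr::Range('µ', 'µ')]);
    let after = simplify(before);
    assert_eq!(
        after,
        Expr::Choice(vec![Expr::Range('a', 'z'), Expr::Str("µ".to_string())])
    );
}
```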

I'd be perfectly happy to bring that in-tree and have that close #246. Then this issue could be closed when #197 is added -- maybe they could even be default-provided for using.

@dragostis
Contributor

@CAD97 I quite like this approach. I agree with the arguments about regex and I think that the door is very much open to using it in order to gain better performance without altering the current grammar.

#197 requires a non-trivial amount of redesign to implement, I'm afraid. It would probably come to fruition post the 2.0 launch; however, these ranges could be lazily generated as needed by the grammar that's using the property. I would imagine a Map from property names to pest rules defined in Rust code, similar to builtins. After #197, these could be migrated into actual pest grammars.

In terms of optimization, I think it's a good idea to make use of ucd-trie.

@CAD97
Contributor

CAD97 commented Jun 1, 2018

Alright, I'll take a look at adding some on-demand builtins for these properties then!

@dragostis
Contributor

dragostis commented Jun 1, 2018

@CAD97 Feel free to experiment with this. I'm far from an expert when it comes to Unicode properties and the optimizations possible, so I'm really excited for your proposal. 😄

@CAD97
Contributor

CAD97 commented Jun 1, 2018

I noticed something while doing this:

The set of pest_keywords is { ANY, DROP, EOI, PEEK, PEEK_ALL, POP, POP_ALL, PUSH, skip, SOI }. However, there are already further builtins beyond this: { DIGIT, NONZERO_DIGIT, BIN_DIGIT, OCT_DIGIT, HEX_DIGIT, ALPHA_LOWER } and so on.

The behavior of all builtins is that their parse functions only get generated if they're asked for. Thus, these builtins already behave how we'd want the Unicode Property builtins to behave -- provided when asked for, but silently shadowable.

The design that I've initially settled on is to have a #[doc(hidden)] pub fn(char) -> bool in pest for each property, and to rely on dead code elimination to remove unused tables. This allows us to encapsulate the ucd-trie dependency by internalizing it.
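A std-only sketch of that `pub fn(char) -> bool` shape: the range table here is a tiny hand-written stand-in queried by binary search, whereas the real tables come from ucd-generate and are queried through ucd-trie.

```rust
// Sorted, non-overlapping inclusive ranges (stand-in data, not the real
// UCD XID_Start table; '_' is included here only for illustration).
const XID_START_STANDIN: &[(char, char)] = &[('A', 'Z'), ('_', '_'), ('a', 'z')];

#[doc(hidden)]
pub fn xid_start(c: char) -> bool {
    // Binary search for the range containing `c`.
    XID_START_STANDIN
        .binary_search_by(|&(lo, hi)| {
            if c < lo {
                std::cmp::Ordering::Greater
            } else if c > hi {
                std::cmp::Ordering::Less
            } else {
                std::cmp::Ordering::Equal
            }
        })
        .is_ok()
}

fn main() {
    assert!(xid_start('q'));
    assert!(xid_start('_'));
    assert!(!xid_start('3'));
}
```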

So, if I just add all of these properties to the built-in list, then we get plumbing from existing infrastructure.

I'll send the proof of concept PR later today.

@dragostis
Contributor

@CAD97 This sounds great! Having many, particularly long quote! macro invocations might affect compile times dramatically, but this is very hard to predict.

@CAD97
Contributor

CAD97 commented Jun 2, 2018

I just submitted #247 for the smaller change using ucd-generate-generated ucd-trie tables to power non-keyword builtins.
