pest language evolution #333

Open
dragostis opened this issue Nov 6, 2018 · 30 comments
@dragostis
Contributor

dragostis commented Nov 6, 2018

Summary

This RFC hopes to address the concerns in #197, #261, #271, and #329 by laying the foundation of pest's evolution and transition.

Motivation

While pest grammars offer an expressive language for building grammars, they lack certain features we've become accustomed to in programming languages, which weakens their effectiveness as expressive and reusable tools. With the growing popularity of the project, more and more discussion has been focused on improving the predictability of pest as a language and a number of needs have been put forth: trivia handling, reusability, expressiveness, and general consistency.

Trivia handling complexity

Probably the hardest concept to grasp when first learning the ropes is how trivia, i.e. whitespace and comments, are handled. pest has an automatic mechanism that simply permits trivia to live between expressions which is controlled by atomicity. Since atomic rules are cascading, it's not immediately obvious if two sequenced expressions a ~ b accept trivia—it wholly depends on whether or not the current rule inherits atomicity.

atomic                = @{ confusing }
not_atomic            =  { confusing }
definitely_not_atomic = !{ confusing }

confusing = { a ~ b }

The example above illustrates how confusing can accept trivia in some cases but not others.
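The cascading behaviour can be modelled in a few lines of Rust. This is only a sketch of the inheritance rule described above, not pest's implementation: the `Modifier` enum and the `accepts_trivia` function are hypothetical names, and the model assumes that `@` turns atomicity on for everything below it until a `!` rule turns it off again.

```rust
/// Rule modifiers relevant to implicit trivia (hypothetical model, not pest's API).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Modifier {
    Normal,    // plain `{ ... }`: inherits whatever the caller had
    Atomic,    // `@{ ... }`: switches implicit trivia off, cascading downwards
    NonAtomic, // `!{ ... }`: switches implicit trivia back on
}

/// Walks the chain of rules from the entry rule down to the rule of interest
/// and reports whether `a ~ b` inside that last rule accepts trivia.
fn accepts_trivia(call_path: &[Modifier]) -> bool {
    let mut atomic = false;
    for modifier in call_path {
        match modifier {
            Modifier::Atomic => atomic = true,
            Modifier::NonAtomic => atomic = false,
            Modifier::Normal => {} // keeps the inherited state
        }
    }
    !atomic
}
```

The same `Normal` rule gives different answers depending on the path that reaches it: `[Atomic, Normal]` yields `false`, while `[Normal, Normal]` and `[NonAtomic, Normal]` yield `true`.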

Reusability of expressions and rules

While rules can be composed from one another, there is currently no means to parametrize them. Parametrization can be extremely useful in cases where some idioms are often reused, e.g. repeated, separated values. Currently, you need to repeat some form of e ~ ("," ~ e)* which is less readable than separated(e, ",").

Though less immediately useful, another addition would be to be able to use rules from different grammars.

Expressiveness

Improving expressiveness is somewhat of a continuously open question. In 2.0 we've added additional stack calls that help recognize indentation-sensitive languages, namely PEEK_ALL, POP_ALL, and DROP. This conservative design was adopted in order to better understand what exactly is needed in real-world examples.

However, legitimate need of more refined localization within the stack has been illustrated in #329. Being able to accurately slice the stack for every one of the PEEK, POP, DROP calls seems to be required going forward.

General consistency

With the introduction of built-in rules, capitalization has been selected as a form of differentiation from user-defined rules. Capitalized are also stack calls, start- and end-of-input calls, and unicode categories. The only way of differentiating between them is to simply know ahead of time what they do.

Guide-level explanation

Versioning

The pest language will be versioned according to the semver guide and grammar language versions will be optionally selected before parsing. This will ensure a smoother transition to 3.0 and beyond, enabling users to opt in to the newer version early on.

Modules

Akin to Rust's modules, a module can contain rules or other modules. This removes the need for capitalization of built-in rules. They can be part of separate modules.

/// Modules can be created by importing other grammars and are immediately public.
use "cool.pest";
use "this.pest" as that;

/// pest has its own sub-modules.
any     = { pest::any }
stack   = { pest::stack::peek }
unicode = { pest::unicode::binary::punctuation }

Parametrizable rules

Rules will have optional arguments. Their definition will be parametrizable with argument names, all of them being valid pest expressions.

/// Definition
separated(e, s) = _{ e ~ (s ~ e)* }

/// Use
comma_separated(e) = _{ separated(e, ",") }
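For readers more used to parser combinators, a parametrized rule behaves much like a higher-order function. The Rust sketch below mirrors `separated(e, s)` under that analogy; the names `separated`, `lit`, and `digits` are illustrative assumptions, parsers are modelled as functions returning the remaining input, and nothing here is pest's generated code.

```rust
/// Matches a fixed string and returns the rest of the input (hypothetical helper).
fn lit<'a>(tok: &'static str) -> impl Fn(&'a str) -> Option<&'a str> {
    move |input: &'a str| input.strip_prefix(tok)
}

/// Matches one or more ASCII digits (hypothetical helper).
fn digits(input: &str) -> Option<&str> {
    let n = input.chars().take_while(|c| c.is_ascii_digit()).count();
    if n == 0 { None } else { Some(&input[n..]) }
}

/// Equivalent of `e ~ (s ~ e)*` without implicit trivia:
/// one `e`, then as many `s e` pairs as possible.
fn separated<'a>(
    e: impl Fn(&'a str) -> Option<&'a str>,
    s: impl Fn(&'a str) -> Option<&'a str>,
    input: &'a str,
) -> Option<&'a str> {
    let mut rest = e(input)?;
    loop {
        // A separator only counts if another element follows it.
        match s(rest).and_then(|r| e(r)) {
            Some(r) => rest = r,
            None => return Some(rest),
        }
    }
}
```

Calling `separated(digits, lit(","), "1,22,3;")` leaves `";"` unconsumed, and a trailing separator as in `"1,"` is left in the input rather than swallowed.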

Controlled trivia

The infix sequence operator ~ itself will be a user-defined rule:

~(lhs, rhs) = { lhs ~ " "* ~ rhs }

Without any ~ defined, ~, *, +, and {} operators will all run according to their definitions without accepting any trivia between expressions. When it is defined, the repetitions will make use of the sequence operator:

*(e) = { e? ~ e* }
+(e) = { e ~ e* }
/// ... etc.

In order to be able to have both trivia-accepting and non-trivia-accepting operators working together, separate non-trivia operators will be introduced, namely - for sequence and all repetitions preceded by it:

| Operator                         | Trivia  | Non-trivia |
|----------------------------------|---------|------------|
| Sequence                         | ~       | -          |
| Repeat zero or more times        | *       | -*         |
| Repeat one or more times         | +       | -+         |
| Repeat exactly n times           | {n}     | -{n}       |
| Repeat minimum of n times        | {n..}   | -{n..}     |
| Repeat maximum of n - 1 times    | {..n}   | -{..n}     |
| Repeat maximum of n times        | {..=n}  | -{..=n}    |
| Repeat between m and n - 1 times | {m..n}  | -{m..n}    |
| Repeat between m and n times     | {m..=n} | -{m..=n}   |
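The repetition bounds in the table follow Rust's range conventions: `..` excludes the upper end and `..=` includes it. As a rough illustration, the Rust function below turns the text between the braces into a `(min, max)` pair. The function name `bounds` and the string-based input are assumptions made for this sketch; the RFC does not specify how the bounds are represented internally.

```rust
/// Parses the inside of a `{...}` repetition, e.g. "3", "2..", "..3", "2..=5",
/// into (minimum, optional inclusive maximum). Returns None on malformed input.
fn bounds(spec: &str) -> Option<(usize, Option<usize>)> {
    match spec.split_once("..") {
        // No range operator: repeat exactly n times.
        None => {
            let n = spec.parse::<usize>().ok()?;
            Some((n, Some(n)))
        }
        Some((lo, hi)) => {
            let min = if lo.is_empty() { 0 } else { lo.parse::<usize>().ok()? };
            let max = if hi.is_empty() {
                None // open-ended: `{n..}`
            } else if let Some(n) = hi.strip_prefix('=') {
                Some(n.parse::<usize>().ok()?) // inclusive: `{..=n}`
            } else {
                // Exclusive upper end: at most n - 1 repetitions.
                Some(hi.parse::<usize>().ok()?.checked_sub(1)?)
            };
            Some((min, max))
        }
    }
}
```

For example, `bounds("2..5")` gives `Some((2, Some(4)))` and `bounds("2..")` gives `Some((2, None))`.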

Stack slicing

Stack slicing will work similarly to Rust slicing with the exception that ranges will accept negative end values, similarly to Python. Slicing will happen from bottom to top such that for a stack [a, b, c, d, e]:

  • [0] == a
  • [-1] == e
  • [1..4] == [b, c, d]
  • [1..-1] == [b, c, d]
  • [1..=-1] == [b, c, d, e]
  • [..-2] == [a, b, c]

As such, pest::stack::*, i.e. peek, pop, drop, can be optionally sliced or indexed, e.g. pest::stack::peek[..-1]. The indices will be constant with the exception of those relative to the top of the stack since the stack's size is variable.
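To make the index arithmetic concrete, here is a small Rust sketch of how such slices could be resolved against a stack stored bottom-to-top in a slice. The helper names `resolve`, `index_stack`, and `slice_stack` are hypothetical, and the handling of out-of-range indices (returning `None`) is an assumption, since the RFC does not say what should happen in that case.

```rust
/// Turns a possibly negative index into an offset from the bottom of the stack.
/// Negative values count from the top, as in Python.
fn resolve(index: isize, len: usize) -> Option<usize> {
    if index >= 0 {
        let i = index as usize;
        if i <= len { Some(i) } else { None }
    } else {
        len.checked_sub(index.unsigned_abs())
    }
}

/// `[i]`: a single element, or None if it does not exist.
fn index_stack<T>(stack: &[T], index: isize) -> Option<&T> {
    stack.get(resolve(index, stack.len())?)
}

/// `[start..end]` or `[start..=end]`; `end == None` means "up to the top".
fn slice_stack<T>(stack: &[T], start: isize, end: Option<isize>, inclusive: bool) -> Option<&[T]> {
    let len = stack.len();
    let lo = resolve(start, len)?;
    let mut hi = match end {
        Some(e) => resolve(e, len)?,
        None => len,
    };
    if inclusive {
        hi += 1;
    }
    if lo <= hi && hi <= len { Some(&stack[lo..hi]) } else { None }
}
```

With the stack `['a', 'b', 'c', 'd', 'e']`, `slice_stack(&s, 1, Some(-1), false)` yields `[b, c, d]` and `slice_stack(&s, 0, Some(-2), false)` yields `[a, b, c]`. Because negative indices depend on the current stack length, they can only be resolved at parse time.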

Reference-level explanation

The grammar's version will be selected through the grammar attribute:

#[grammar = "grammar.pest", version = "3.0"]

pest_meta will handle both grammar language versions during the 2.* transition period, then migrate to 3.0. This will need to be enforced if we want to take advantage of the more concise grammars during optimization and generation.

Much of the rest of this RFC is straightforward:

  1. add second grammar
  2. implement validation
  3. add module resolution to AST (in pest_meta and pest_generator)
  4. add rule parameters to AST (in pest_meta and pest_generator)

Drawbacks

Breaking compatibility so early could be dangerous, but we can offer help for people migrating to 3.0. If need be, we could also offer a pest fix tool that would be able to convert 2.0 to 3.0 grammars.

Some of the syntax introduced in the trivia handling might be a little heavy on the eye and we might want to fine tune it before it's set in stone.

@felix91gr
Contributor

This sounds quite good, especially the part about modules. It would make things like ASCII_LOWER and WHITESPACE feel less magical c:

I probably need more time to give you a more complete opinion (as mentioned in gitter, this semester is getting quite hectic in assignments 😿) but overall I think the issues this RFC addresses make a lot of sense!

@CAD97
Contributor

CAD97 commented Nov 6, 2018

Would it not make sense to declare the pest version in the pest file? That keeps the information closer (and I wouldn't have to duplicate the attribute for pest-ast).

Draft: pest files MAY start with a version_attribute = { "#" ~ "!" ~ "[" ~ "version" ~ "=" ~ semver_bound ~ "]" }. If not specified, it defaults to #![version = ^2.0].

The rest of the file is then parsed with the appropriate parser. (Or if only semantics change and the new grammar is a superset, one parser is used and the semantics are set.)
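A rough Rust sketch of the lookup this draft implies is shown below. It assumes the attribute sits on the first non-empty line and is written as `#![version = ...]`, as the draft rule suggests; the function name `declared_version` is made up for illustration, and real tooling would also need to validate the semver bound.

```rust
/// Returns the version bound declared at the top of a grammar file,
/// falling back to "^2.0" when no version attribute is present.
fn declared_version(source: &str) -> &str {
    source
        .lines()
        .map(str::trim)
        .find(|line| !line.is_empty())
        // Peel off `#![`, the closing `]`, the `version` key and the `=`.
        .and_then(|line| line.strip_prefix("#!["))
        .and_then(|line| line.strip_suffix(']'))
        .and_then(|line| line.trim().strip_prefix("version"))
        .and_then(|line| line.trim().strip_prefix('='))
        .map(str::trim)
        .unwrap_or("^2.0")
}
```

The returned bound would then decide which parser (or which semantics) handles the rest of the file.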

@Kroc

Kroc commented Nov 6, 2018

Because trivia behaves in unexpected and non-obvious ways when looking at the grammar alone, I feel like you should use these operators instead:

| Operator                         | Trivia   | Non-trivia |
|----------------------------------|----------|------------|
| Sequence                         | ~        | -          |
| Repeat zero or more times        | ~*       | *          |
| Repeat one or more times         | ~+       | +          |
| Repeat exactly n times           | ~{n}     | {n}        |
| Repeat minimum of n times        | ~{n..}   | {n..}      |
| Repeat maximum of n - 1 times    | ~{..n}   | {..n}      |
| Repeat maximum of n times        | ~{..=n}  | {..=n}     |
| Repeat between m and n - 1 times | ~{m..n}  | {m..n}     |
| Repeat between m and n times     | ~{m..=n} | {m..=n}    |

These would communicate behaviour and intent more clearly as well as being a bit more 'obvious' on first look. I think most programmers would expect the non-trivia operation to be the default -- i.e. one thing follows another without surprise and the inclusion of the ~ in the operators makes it visually consistent that trivia is being processed.

Keeping the non-trivia operators plain / spare better matches standard regex behaviour which is where the majority of programmers are going to recognise these operators from.

@flying-sheep
Contributor

I agree with @Kroc: ~+ can more intuitively mean “repeat with something in between” than -+ can mean “directly repeat things”

@Kroc

Kroc commented Nov 6, 2018

Other than concern over operators, this spec is a skilful and elegant summary of Pest that aims to increase functionality in critical areas without causing a combinatorial explosion of complexity.

@CAD97
Contributor

CAD97 commented Nov 7, 2018

@Kroc

I think most programmers would expect the non-trivia operation to be the default

As a counterpoint: "traditional" parsing that uses a lexer tends to drop trivia and ignore it in the first preprocessing step. e.g. one two three becomes Identifier("one"), Identifier("two"), Identifier("three"), not Identifier("one"), WS(" "), Identifier("two"), WS(" "), Identifier("three").
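A minimal Rust sketch of that traditional approach follows. The `Token` type and `lex` function are invented for this illustration and treat every run of non-whitespace characters as an identifier; the point is only that whitespace never reaches the token stream.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Identifier(String),
}

/// Splits the input into identifier tokens, silently discarding whitespace.
fn lex(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            // Trivia: consumed here and never emitted as a token.
            chars.next();
            continue;
        }
        let mut word = String::new();
        while let Some(&c) = chars.peek() {
            if c.is_whitespace() {
                break;
            }
            word.push(c);
            chars.next();
        }
        tokens.push(Token::Identifier(word));
    }
    tokens
}
```

A parser working on the output of `lex("one two three")` sees three identifiers and has no way to tell how they were spaced.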

The whole idea of "trivia" in programming language grammars is that the trivia can be ignored; that it can appear anywhere and not interfere with anything. The only time I'd see a "normal" language using strict no-trivia sequences is in the definition of the trivia itself and in tokens. In my "modern parser tool" sketch, every repetition consumed trailing trivia. In the last version of syn that used nom instead of the proc_macro lexer, each "terminal" parser was expected to eat (and ignore) leading trivia.

I don't disagree that making it clear is a good idea, just that no-trivia is the expected default mode of operation.

@CAD97
Contributor

CAD97 commented Nov 7, 2018

Minor note:

Currently, repetitions match trailing trivia.

This doesn't cause much issue in practice, but in a world where strict sequencing is easier to do we should probably try to avoid that footgun and make a~+ be a - (TRIVIA? - a)-* and not (a - TRIVIA?)-+.
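The difference between the two expansions shows up in how much input a repetition consumes. The Rust sketch below hand-writes both strategies for an item `a` with spaces as trivia; the function names are hypothetical and each returns the number of bytes consumed.

```rust
/// TRIVIA?: skips any number of spaces.
fn skip_trivia(input: &str) -> &str {
    input.trim_start_matches(' ')
}

/// The repeated item: a single literal `a`.
fn item(input: &str) -> Option<&str> {
    input.strip_prefix('a')
}

/// `(a - TRIVIA?)-+`: trivia is consumed after every item, including the last.
fn repeat_trailing(input: &str) -> Option<usize> {
    let mut rest = skip_trivia(item(input)?);
    while let Some(r) = item(rest) {
        rest = skip_trivia(r);
    }
    Some(input.len() - rest.len())
}

/// `a - (TRIVIA? - a)-*`: trivia is consumed only when another item follows.
fn repeat_interleaved(input: &str) -> Option<usize> {
    let mut rest = item(input)?;
    while let Some(r) = item(skip_trivia(rest)) {
        rest = r;
    }
    Some(input.len() - rest.len())
}
```

On the input `"a a  b"`, `repeat_trailing` consumes 5 bytes (it swallows the two spaces before `b`), while `repeat_interleaved` consumes 3 and leaves those spaces for whatever rule comes next.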

@dragostis
Contributor Author

dragostis commented Nov 7, 2018

I think the right choice here would be to simply have the uglier version where it will be used least. To me, that seems to be the non-trivia case, but I think it's best to actually take a look over multiple real-world use cases in order to figure this one out. I guess @CAD97 already has this in nafi.

Also, @CAD97, that seems to be a bug! It's caused by the fact that e+ is rewritten as e ~ e* which are not actually equivalent when trivia is present. Really looking forward to the formal optimizer!

@dragostis
Contributor Author

I'm still not completely sure whether in-file version definition is entirely possible in the long run. I'm approaching a second RFC for an intermediate representation that the optimizer and generator would use, which is long overdue. The optimizer should have a much cleaner approach that is smarter and less error-prone.

@vi

vi commented Nov 14, 2018

The whole idea of "trivia" in programming language grammars is that the trivia can be ignored.

When compiling or interpreting. But not when e.g. auto-refactoring.

@grncdr

grncdr commented Dec 26, 2018

I'd like to propose a third sequencing operator that requires trivia between two other rules. To keep things symmetric (and because it's already reserved) I like _. The use case is parsing SQL-like languages, which are usually parsed by tokenizing first. I currently define the following rule in the grammar for a language I'm working on:

__ = _{ (" " | "\t" | "\n")+ }

And have quite a few rules that look something like:

some_statement = ${ "do" ~ __ ~ identifier ~ __ ~ "with" ~ __ ~ argument_list ~ ";" }

It would be great to just write { "do" _ identifier _ "with" _ argument_list ";" } instead. (See also #337, it seems I'm not the only person trying to do this).

@grncdr

grncdr commented Dec 26, 2018

Also, I tend to agree with @Kroc and @flying-sheep regarding repetition operators. ~ has more implicit behaviour than - and should have louder notation for repetitions, even if it's relatively common.

It's also worth considering how parameterized rules will shift the trade-offs. For example:

// this definition of `separated` does not allow trivia by default
separated(e, sep) = { e - (sep - e)* }
// we can choose to allow trivia at the usage site
statement_list = { separated((TRIVIA? - statement), (TRIVIA? - ";" - TRIVIA?)) }
// or abstract that pattern
separated_allowing_trivia(e, sep) = { separated((TRIVIA? - e), (TRIVIA? - sep - TRIVIA?)) }
statement_list = { separated_allowing_trivia(statement, ";") }

Generally speaking, the expressive power of parameterized rules makes any implicit behaviour of builtin operators much less of a clear win, as it's trivial to explicitly define reusable precise parsing behaviour.

@grncdr

grncdr commented Dec 29, 2018

I missed the "Controlled trivia" section of the RFC when I wrote the above comments. It seems like I might be able to do what I want by redefining ~ or another infix rule? I'm confused about one thing though: what does ~ mean on the right-hand side of this example?

~(lhs, rhs) = { lhs ~ " "* ~ rhs }

counter proposal

It seems to me that the grammar language could be simplified a lot (without losing much in the way of power) by completely removing automatic trivia and its associated operators and adopting two straightforward rules.

  1. Sequencing is expressed by adjacency. { a b } means a followed by b with nothing in between. (this also has the benefit of looking like almost every other PEG-based parser generator).

Given the above, users can trivially define the current behaviour of ~:

~ = { (WHITESPACE | COMMENT)* }

To get shorthand operators for repetition, we add a second rule:

  2. Rule names consisting solely of punctuation ([^a-zA-Z_]) do not require whitespace to separate them from adjacent expressions. When there is no whitespace, the adjacent expressions are considered parenthesized by repetition operators. In other words, { a~b } == { a ~ b }, { a~* } == { (a ~)* }, and { a~b* } == { (a ~ b)* }.

I imagine that many grammars are relying heavily on the trivia handling of * and +, so this provides a "search and replace" upgrade path.

benefits

  1. Much less magic, everything about parsing for a particular rule is explicitly defined in the rule.
  2. ${ ... } and !{ ... } can go away. There's possibly still a use for @ to force an expression combined into a single pair, but I expect it would be used much less than it currently is.
  3. The grouping of { a b | c | d e } is arguably more visually clear than { a ~ b | c | d ~ e }.
  4. None of the above actually requires parameterized rules yet.

trade-offs

  1. It's a breaking change for close to 100% of existing Pest grammars. It's trivial to update these grammars, but still potentially annoying.
  2. The grouping of { a b | c | d e } is arguably more visually ambiguous than { a ~ b | c | d ~ e }. 😉

@dragostis
Contributor Author

@grncdr, this is a really good proposal. I think it's fairly safe to say that the transition will not be very hard to do with a small tool. The only issue I have with this approach is that a~* would expand to a repetition with trailing whitespace. One could make an exception for ~ and not have it accept trailing, but that would probably be too confusing.

@CAD97
Contributor

CAD97 commented Dec 29, 2018

@grncdr The biggest problem with "simple adjacency" a b meaning strict sequencing is that we use ( <expr> ) to group things, and we do want to have parameterizeable rules with (). We could use {} to group instead, a la gll or lalrpop, but this runs into human ambiguity problems with a{2} style specified repetitions.

I actually do like the purity of having "simple adjacency" be "strict sequencing" and ~ is "just another production". I doubt a~* being {a~}* instead of separated(a, ~) will be a problem in most cases, either; as I keep saying, trivia should be what's allowed basically everywhere, the kind of things that would be skipped over if you were pre-tokenizing. (Case in point: a+ is currently a ~ a* and not a - ("" ~ a)* due to a bug in translation/optimization.)

Also, the definition ~(lhs, rhs) = { lhs ~ TRIVIA* ~ rhs } would be forbidden as left-recursive. The proper way to define it would be ~(lhs, rhs) = { lhs - TRIVIA-* - rhs }.

@dragostis
Contributor Author

We could have the following approach:

  1. ~ is allowed to be a production like any other.
  2. Repetitions can have an optional symbol production prepended which would mean separation, i.e. a~*
  3. This could then be optionally extended for more cases. E.g. a#* where # = _{ "," }.

@grncdr

grncdr commented Dec 30, 2018

problems with adjacency

The biggest problem with "simple adjacency" a b meaning strict sequencing is that we use ( ) to group things, and we do want to have parameterizeable rules with ().

I agree that a(b | c) meaning "apply a to b c" and a (b | c) meaning "a followed by b or c" is probably too subtle. Maybe use [...] for parameterized rules instead? (see also the note about higher-order rules)

should repetition include trailing productions?

There's some difference of opinion here:

@CAD97: I doubt a~* being {a~}* instead of separated(a, ~) will be a problem in most cases

@dragostis: Repetitions can have an optional symbol production prepended which would mean separation, i.e. a~*

It boils down to a subjective choice: repetitions allowing trailing productions are common (e.g. trailing whitespace, comments, and even commas in many languages) but so is strict separated behaviour. With parameterized rules it's easy to express both, so I don't have a strong opinion on which one should get the short-hand notation.

Parenthesis ambiguity and higher-order rules

I separated this out because it's quite a bit more "hypothetical" than my other comments

Regarding the () ambiguity, the subtlety could probably be largely mitigated by making it an error to refer to a parameterized rule without supplying parameters. Consider the following grammar.

~ = { " "* }
p(e) = { "(" ~ e ~ ")" }
ident = { ("a".."z")+ }
number = { ("0".."9")+ }
call = { ident ~ p (number | ident) }

If Pest generated an error like this:

call = { ident ~ p (number | ident) }
------------------^

Parameters to rule `p` were not supplied. Perhaps you need to remove a space?

I think most people would not be confused by the distinction between p(number | ident) and p (number | ident).

The only thing that prevents me from proposing this more strongly is the possibility of passing unapplied parameterized rules as parameters. E.g. should the following grammar be legal?

~ = { " "* }
p(e) = { "(" ~ e ~ ")" }
angle(e) = { "<" ~ e ~ ">" }
ap(id, delim, e) = { id ~ delim(e) }
ident = { ("a".."z")+ }
number = { ("0".."9")+ }
call = { ap(ident, p, number | ident) } // matches foo(3) and foo(bar)
type_ap = { ap(ident, angle, ident) }   // matches foo<bar> but not foo<3>

If not, then overloading the meaning of parenthesis should be fine.

If you do want to allow higher-order parameterized rules, then an exception like "a parameterized rule can be referenced without applying it only when passed as a parameter" would probably work, but I don't (yet?) see much purpose for this degree of indirection in grammars.

@wiogit

wiogit commented Jan 18, 2019

Questions

I'm familiar with other peg parsers, so there are some things about pest that make me scratch my head. It would help if I understood the motivation of some decisions.

The first would be that pest doesn't let you specify code that executes and specifies the return value for each production variant of a rule. Instead, it seems pest just creates its own AST and lets you traverse through that to do your own execution. I'm curious about the advantages of doing things this way.

The second would be that pest requires ~ for concatenation, but it also has ~ consume trivia. To get around cases where you don't want to allow whitespace, it has atomic @{ ... } rules, but this also makes the rule's AST shallow. To get around that it has compound atomic ${ ... } rules. To me, this seems like adding weird syntax to overcome a design oversight. What's the reason for this?

The third would be that pest needs to have { ... } wrapping around expansions. We know all expansions begin with = so all they really need is a termination specifier. Alternatively, you don't need an = since { should be sufficient. I'm wondering if I'm missing something here. Is this due to how nested rules work?

Suggestions

I think pest should have versioning at the top of the .pest file. It can default to version 1.0 if none is specified. This would allow future versions of pest to check if it can handle the specified version, and specify which crate version to use otherwise. It could also specify a tool to convert the grammar to a later version.

I like the proposal @grncdr put forth of using simple adjacency for concatenation and having ~ being a default trivia rule that can be overridden. I think it eliminates the need for ${ ... } compound syntax. The controlled trivia feature, which introduces a sort of operator overloading, seems too confusing.

The _{ ... } silent and @{ ... } atomic rules seem necessary because one doesn't get to specify the rule's return value via code execution. I'm going to assume inline code execution is a no-go. I'm thinking there could be other ways this could be accomplished. For example, you could allow syntax like rule = my_transform{ ... } where my_transform is a user-defined function that takes a pest AST and returns an optional pest AST. Then perhaps _ and @ are just built-in functions.

For parametrized rules, I have some questions.
Would nested rules work with them?

sum = { list(term, "+") }
term = { ASCII_DIGIT+ }

list(item, delim) = ${ item ~ (sep ~ item)* }
    sep = _{ delim }

Would it allow quantifiers as arguments?

array = { open ~ list(elem, ",", *)  ~ close }
    open = _{ "[" }
    close = _{ "]" }
    elem = { ASCII_ALPHANUMERIC+ }

list(item, delim, quantifier) = ${ item ~ (sep ~ item) quantifier }
    sep = _{ delim }

If so, this would justify the need for a concatenation operator, or alternatively a quantifier operator would be needed. For example doing so with [ ... ] would look like:

list(item, delim, quantifier) = ${ item (sep item)[ quantifier ] }
    sep = _{ delim }

@CAD97
Contributor

CAD97 commented Jan 18, 2019

Disclaimer: I joined the party pretty recently, only just before the release of 2.0. I don't speak about the history from an authoritative source, but more from what I've picked up along the way.

The first would be that pest doesn't let you specify code that executes and specifies the return value for each production variant of a rule. Instead, it seems pest just creates its own AST and lets you traverse through that to do your own execution. I'm curious about the advantages of doing things this way.

The main benefit is purity. Pest's area of focus is as a parsing library, not as an AST. If you want to use something like rowan to handle your syntax tree, then you'll need to adapt your parsing tools' output anyway. When I was using ANTLR4, which allows arbitrary actions to be attached to rules, I personally still found myself mapping the generated parse tree to a custom abstract syntax tree.

Another benefit of that purity is that things like the web VM are possible. ANTLR has an interpreter mode, but it chokes on any real grammar because those contain actions rather than everything being semantic in the grammar. Because pest doesn't allow actions, the interpreter works just as perfectly as the compiled version. Pest's ideal is also that the grammar serves as a formal grammar for your language, rather than just a series of instructions on how to generate a parser.

Pest's job is to generate the parse tree, and full control over the structure of the parse tree allows it to do some clever things to reduce the required nesting (as in, the pest tree is really just a single vector behind the tree abstraction). It's the same reason that the unified rule typing exists: working with a singular type is more convenient than a hundred or more. In anything much bigger than an example project, you'll probably want any parse tree to convert into a different tree structure so that the rest of your program can ignore information that has to be stored in order to parse.
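The "single vector behind the tree abstraction" idea can be sketched in Rust as a flat queue of start and end markers, where each start marker records the index of its matching end. The `Token` enum and `children` function below are a simplified, hypothetical model written for this illustration; pest's actual internal types carry more information (such as input positions) and may differ in detail.

```rust
/// One entry of the flat queue (simplified model, not pest's real type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Token {
    /// Opens a pair for `rule`; `end` is the index of the matching End token.
    Start { rule: &'static str, end: usize },
    /// Closes the pair opened for `rule`.
    End { rule: &'static str },
}

/// Indices of the direct children of the pair that starts at index `start`.
/// Nested grandchildren are skipped by jumping straight past each child's End.
fn children(queue: &[Token], start: usize) -> Vec<usize> {
    let mut out = Vec::new();
    if let Token::Start { end, .. } = queue[start] {
        let mut i = start + 1;
        while i < end {
            if let Token::Start { end: child_end, .. } = queue[i] {
                out.push(i);
                i = child_end + 1;
            } else {
                i += 1;
            }
        }
    }
    out
}
```

A `sum` containing two `term` pairs occupies six consecutive slots, and walking its children needs no pointers or per-node allocations, only index jumps.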

pest-ast has proven the feasibility of building a syntax tree layer on top of pest's parse tree. That said, we are definitely interested in potentially making 3.0's user-exposed API a lot more typed, since that's a common pain point. It's mostly just blocked on @dragostis's work on the optimizing parsing engine that it's planned to be built on.

The second would be that pest requires ~ for concatenation, but it also has ~ consume trivia. To get around cases where you don't want to allow whitespace, it has atomic @{ ... } rules, but this also makes the rule's AST shallow. To get around that it has compound atomic ${ ... } rules. To me, this seems like adding weird syntax to overcome a design oversight. What's the reason for this?

IIUC, the initial reason for using ~ for concatenation was a limitation of macro_rules!, which the first version of pest was built with. The reason for ~ consuming trivia by default is that trivia should be trivial. Traditional lexing approaches don't even pass trivia along to the parser, but just strip that information. That's what trivia is designed for: stuff that doesn't really need to matter.

That said, you still have to consider the atomic rules. The keyword for cannot be followed by another identifier letter, otherwise by the longest-match rule encoded in traditional lexers, it'd be an identifier. So the rule becomes @{ "for" ~ !id_cont }. @ is called "atomic" because that's what it's for: atoms of the grammar, the smallest pieces. $ and ! were added later to enable higher-level nesting productions to opt-out (and back in) of implicit trivia, but they're still supposed to be niche uses.
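Written out by hand in Rust, the atomic rule `@{ "for" ~ !id_cont }` amounts to a literal match followed by a negative lookahead. The sketch below assumes that `id_cont` means ASCII letters, digits, and underscore, which is a guess made for this example; the function names are likewise hypothetical.

```rust
/// Assumed definition of an identifier-continuation character.
fn is_id_cont(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Matches the keyword `for` only when it is not the prefix of a longer identifier.
/// Returns the remaining input on success; the lookahead consumes nothing.
fn keyword_for(input: &str) -> Option<&str> {
    let rest = input.strip_prefix("for")?;
    match rest.chars().next() {
        Some(c) if is_id_cont(c) => None, // e.g. `format`, `for_each`
        _ => Some(rest),
    }
}
```

`keyword_for("for x")` succeeds and leaves `" x"`, while `keyword_for("format")` fails, so the input can still be parsed as an identifier. No trivia may appear between `"for"` and the lookahead, which is why the rule has to be atomic.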

I realize real languages are messy. I personally agree that the conflation of atomic rules with removing implicit trivia is problematic; that's the reason for the proposed trivia handling being ~/- and @/_ being used solely for control over emitted parse nodes.

The third would be that pest needs to have { ... } wrapping around expansions. We know all expansions begin with = so all they really need is a termination specifier. Alternatively, you don't need an = since { should be sufficient. I'm wondering if I'm missing something here. Is this due to how nested rules work?

It's just a stylistic choice, though I believe it's once again rooted in pest originally being implemented by macro_rules. Technically we wouldn't even need a termination, even without adding a new lookahead case to the meta grammar, as id ~ id isn't a valid expression anyway. But it's purely a stylistic choice.

Maybe you'd like the draft I have of a syn-compatible parser generator?

Grammar: {
    lexers: {&"use" LexerImport}*
    items: {syn::Attribute NamedItem}*
}

For parametrized rules, I have some questions.

Would nested rules work with them?

I'm not sure what you mean about nested rules. You seem to be implying that indenting rules somehow "scope" them to the prior rule? No, they're just normal rules.

Would it allow quantifiers as arguments?

No, arguments must themselves be expressions.

@wiogit

wiogit commented Jan 19, 2019

You seem to be implying that indenting rules somehow "scope" them to the prior rule? No, they're just normal rules.

Yeah, that's what I was thinking. I took a look at the sample in the introduction and was not sure what the indentation was doing, so I got carried away. There have been a few times where I've wanted rule scoping in the past.

@dragostis
Contributor Author

The third would be that pest needs to have { ... } wrapping around expansions. We know all expansions begin with = so all they really need is a termination specifier. Alternatively, you don't need an = since { should be sufficient. I'm wondering if I'm missing something here. Is this due to how nested rules work?

It's just a stylistic choice, though I believe it's once again rooted in pest originally being implemented by macro_rules. Technically we wouldn't even need a termination, even without adding a new lookahead case to the meta grammar, as id ~ id isn't a valid expression anyway. But it's purely a stylistic choice.

Apart from a few other quality of life additions like raw strings, the 3.0 grammar will probably make braces optional. They were added for historical reasons, as @CAD97 mentioned, and kept for the sake of consistency. With 3.0 we should be able to take more time and refine as much as possible before the final release.

I currently have a working parser for 3.0 that I'm experimenting with and the solution I found most intuitive is having overridable sequence operators. Apart from -, one can use ~ and ^ as infix sequence operators. E.g. ~ could be used for optional trivia, while ^ could be used for mandatory trivia. With this addition, repetition can then follow the trend rule~*, rule^*, rule~{3}, rule^{5}, and so on. I marginally prefer this approach over @grncdr's more implicit solution, since I think it's easier to teach and more intuitive.

WIP of what I've been playing with.

@Keats
Contributor

Keats commented Oct 30, 2019

Is there a way to track progress of v3?

@dragostis
Contributor Author

@Keats, the somewhat little progress I've made lives here: https://github.com/pest-parser/pest3 I'd like to invest more time, but it would be far easier for me if there was a way to collaborate with someone on this and talk about what needs to be done.

@Keats
Contributor

Keats commented Nov 2, 2019

Is the initial comment up to date with the goals?

Maybe the scope could be lowered a bit and not handle multiple versions if it's a lot of work. A pest2 parser will still work and someone can update to pest3 if wanted/needed.
I don't have much comment on the grammar itself, I'm satisfied with the current state. The only two things I would like are:

  1. input parameters to the parser. It can be as simple as passing a HashMap<String, String> where the key would be a variable name to use in the parser. My use case is the following: Tera uses {{, {%, }} etc. as delimiters, which conflict with LaTeX. It would be amazing if a user could pass [ and ] instead of the default { and } and have the parser just work. Right now, since they need to be hardcoded in the .pest, I have no way short of copy/pasting the .pest file, changing the delimiters to another hardcoded thing and having another parser.
  2. have a way to not depend on pest_derive: no clue what that would look like or if it is even possible, but it would lower compilation time. Maybe a CLI tool to expand the .pest into a nicely formatted Rust file, with a watch mode when developing instead of a derive? This would mean committing a generated file in git, but in exchange we get faster compilation + easy to debug.
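To illustrate what the first request asks for, here is a small hand-written Rust scanner whose delimiters are supplied at run time instead of being hardcoded. It is neither pest nor Tera code: the `Delimiters` struct and `extract_expressions` function are hypothetical, and a real implementation would need the grammar itself to reference such parameters.

```rust
/// Delimiters chosen by the caller at run time (hypothetical type).
struct Delimiters<'a> {
    open: &'a str,
    close: &'a str,
}

/// Collects the trimmed contents of every `open ... close` span in the input.
/// An opening delimiter without a matching close ends the scan.
fn extract_expressions<'a>(input: &'a str, d: &Delimiters<'_>) -> Vec<&'a str> {
    let mut found = Vec::new();
    let mut rest = input;
    while let Some(start) = rest.find(d.open) {
        let after_open = &rest[start + d.open.len()..];
        match after_open.find(d.close) {
            Some(end) => {
                found.push(after_open[..end].trim());
                rest = &after_open[end + d.close.len()..];
            }
            None => break,
        }
    }
    found
}
```

With `{{` and `}}` the scanner finds `name` in `"Hi {{ name }}!"`; switching to `[[` and `]]` lets a LaTeX fragment such as `\frac{a}{b} [[ x ]]` pass through with only `x` extracted.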

@Nadrieril
Contributor

Small idea regarding adjacency and parentheses: we could require parameterized rules to be macro-like:

separated!(e, s) = _{ e (s e)* }
comma_separated!(e) = _{ separated!(e, ",") }

I believe this removes all ambiguity and we can use adjacency for sequencing.

@Nadrieril
Contributor

To make the transition smooth, it would be easy to support two syntaxes for the grammar for a while: the code generation need not change, and we would be able to deprecate the old syntax over a long period of time. No need for any breaking change, in fact, I think.

@jhoobergs

Would COMMENT and WHITESPACE still exist with this proposal? If they do, how would they work with modules? Would each module have its own definitions of them that are not overridable? Or will they be overridable?

@tomtau
Contributor

tomtau commented Jul 14, 2023

I tried to re-hash this discussion and other ideas from different issues into multiple threads here: #885

@tomtau
Contributor

tomtau commented May 17, 2024

@Kroc @dragostis any thoughts on this: #1016 (reply in thread) ?

@tomtau
Contributor

tomtau commented May 17, 2024

@jhoobergs in the prototype, those two special builtin rules don't make sense (given one can define the trivia on the operators). I guess it'll make sense if they are scoped to each module?
