Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exponential blowup in recursive parsers #74

Closed
p-e-w opened this issue Jan 15, 2022 · 3 comments
Closed

Exponential blowup in recursive parsers #74

p-e-w opened this issue Jan 15, 2022 · 3 comments

Comments

@p-e-w
Copy link

p-e-w commented Jan 15, 2022

Consider this simple parser:

use chumsky::prelude::*;

fn parser() -> impl Parser<char, String, Error = Simple<char>> {
  recursive(|expr| {
    let atom = text::ident()
      .or(expr.clone().delimited_by('(', ')'));

    let expression = atom
      .clone()
      .then_ignore(just('+'))
      .then(atom.clone())
      .map(|(a, b)| format!("{}{}", a, b))
      .or(atom);

    expression
  }).then_ignore(end())
}

fn main() {
  println!("{:?}", parser().parse("((((((((((((((((((((((a+b))))))))))))))))))))))"));
}

Parsing the string

((((((((((((((((((((((a+b))))))))))))))))))))))

takes 8 seconds when compiling in release mode (around a minute in debug mode), and that time rapidly grows as more pairs of parentheses are added.

When the parser inside the recursive closure is more complex, the problem becomes even worse. With the parser I'm working on, I'm seeing multi-second parse times already for 3-4 levels of nested brackets.

Boxing parsers does not solve this issue. What can be done here?

@zesterer
Copy link
Owner

zesterer commented Feb 7, 2022

The problem here is subtle, but is actually a more general problem with backtracking parsers, rather than an issue with chumsky itself. A similar problem will appear with a hand-written recursive descent parser.

Within your code, you have two patterns that can be parsed that both look very similar. They are:

expr

and

expr + expr

Notice something important: both patterns are ambiguous right the way up until the point at which the + is hit, if it exists. This means that chumsky (which is a recursive descent PEG parser internally) is forced to try one of the patterns first, then the next pattern if the first fails.

Normally, this is not a problem. For an expression like a + b, it's obvious - very quickly - that the second pattern is correct after just the second symbol. What really causes this to become a problem is that these patterns are defined recursively: at each level of recursion, the parser needs to try one pattern and then the other. If it guesses wrong the first time, it's just done a lot of work for nothing!

In fact, for every layer of recursion for which the wrong guess is made, the amount of work required to parse the input doubles. Your input has 21 layers of nesting, where the parser defaults to the wrong guess each time, meaning that the resulting parser is doing 2^21 = 2,097,152x the amount of work that strictly needs to be done. Hopefully, it should now be clear why we're seeing such pessimistic numbers here.

Fixing this

Fixing this problem requires restating the grammar in a way that gives the parser less work to do. While your parser is logically correct, in practice it explodes the space of possible parse routes and recursive descent isn't smart enough to handle such cases (there are techniques such as memoization that can avoid this exponential blowup, but they usually have worse performance for non-pessimistic grammars and inputs so chumsky does not implement them).

A better approach is to restate the grammar like so:

use chumsky::prelude::*;

fn parser() -> impl Parser<char, String, Error = Simple<char>> {
  recursive(|expr| {
    let atom = text::ident()
      .or(expr.clone().delimited_by(just('('), just(')')));

    let expression = atom.clone()
        .then(just('+').ignore_then(atom).or_not())
        .map(|(a, b)| if let Some(b) = b {
            format!("{}{}", a, b)
        } else {
            a
        });

    expression
  }).then_ignore(end())
}

fn main() {
  println!("{:?}", parser().parse("((((((((((((((((((((((a+b))))))))))))))))))))))"));
}

Here, we use or_not() to express the fact that + expr may or may not be present after the initial expression, allowing the parser to avoid backtracking and preventing the exponential blowup.

I hope this helps, and sorry about not replying to this earlier: GitHub has not been showing me all of my notifications for some reason.

@p-e-w
Copy link
Author

p-e-w commented Feb 9, 2022

Thank you for the in-depth explanation. I completely understand the underlying issue now and can easily fix it in my own parser.

I guess the recommended approach can for many cases be distilled down to "prefer conditional matching over alternation where possible, in order to prevent branching during recursion".

That being said, I can't help but feel that my original parser looks a lot more intuitive than your modified version. It would be nice if it were possible to write code like the first version that performs like the second version. Can't Chumsky do that transformation automatically somehow, at least in simple cases? That is, if there is an alternation (or) between two patterns, where one is a prefix of the other, perform parsing by conditionally matching the suffix instead?

@zesterer
Copy link
Owner

zesterer commented Feb 9, 2022

I'm not sure whether it's feasible to transform one into the other, but memoisation would at least prevent the exponential blowup by 'remembering' past parse failures at each position, at the cost of increased memory usage and poorer performance in less pessimistic cases.

I'll open an issue for memoisation, because it might be desirable for some particularly ambiguous grammars. It's not going to be an immediate focus for me, but it's possible that it might be worked into #82.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants