Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Earley parser produces wrong parse Tree #1283

Closed
MegaIng opened this issue Jun 1, 2023 · 8 comments · Fixed by #1433
Closed

Earley parser produces wrong parse Tree #1283

MegaIng opened this issue Jun 1, 2023 · 8 comments · Fixed by #1433
Labels
Earley Issues regarding the Earley parser

Comments

@MegaIng
Copy link
Member

MegaIng commented Jun 1, 2023

Over on SO someone asked this question: https://stackoverflow.com/questions/76366280/parsing-formulas-using-lark-ebnf/76381256

As far as I can tell, it shows a bug in the earley parser where it duplicates some Tokens because of enormous amounts of ambiguities. I don't have the expertise with earley to figure out what is happening or the time to create a minimal example right now.

@chanicpanic
Copy link
Contributor

I have it down to:

grammar = """
start: (a+)*
!a.1: "A" |
"""

l = Lark(grammar)
tree = l.parse("A")
print(tree.pretty())

Output:

start
  a
  a     A
  a
  a     A

I suspect that the issue is somewhere in ForestToParseTree with resolve_ambiguity=True. I do not see the same problem in the tree returned when using ambiguity="explicit". I'll look into it more when I have a chance.

@chanicpanic
Copy link
Contributor

The issue is that when ambiguity="resolve", the ParseTreeBuilder will use the ChildFilterLALR filters. However, the ForestToParseTree cache breaks the ChildFilterLALR assumption that it is safe to change the parse tree. We have two options:

  1. Disable the ForestToParseTree cache when ambiguity="resolve".
  2. Always use the regular ChildFilter for earley.

Either will work, but they may have different performance implications.

@erezsh
Copy link
Member

erezsh commented Jun 7, 2023

@chanicpanic Thanks for looking into it!

Can you explain more about the performance implications?

@chanicpanic
Copy link
Contributor

The ForestToParseTree cache and ChildFilterLALR filter are both optimizations that happen to be incompatible with each other. So, the question is: which optimization gives the most performance benefit for earley with ambiguity resolution?

While the ForestToParseTree cache provides a significant speed-up for explicit ambiguous parses, its impact on ambiguity resolution is likely minimal. My hypothesis is that the cache is only relevant when a rule matches an empty string, and even then, the forest nodes have to be visited in a particular order.

On the other hand, I believe that child filtering is an operation that is performed for many grammars and inputs. So, the ChildFilterLALR would be used much more often.

Thus, I am leaning toward option 1 because I expect the ChildFilterLALR to have a greater overall performance benefit.

@erezsh
Copy link
Member

erezsh commented Jun 25, 2023

@chanicpanic Thanks for the explanation.

I think it's best to make an evidence-based choice, which makes me wonder, do you have an idea for what a good benchmark grammar&input for Earley would be?

I could run benchmarks on trick-grammars like start: start start | "(" start ")" |, and normal grammars that are low on ambiguity. But maybe you know some real-world examples that are ambiguous and would serve as a better measurement?

cc: @MegaIng

@Shildifreak
Copy link

I stumbled across what I think is the same problem.
Here is the minimal example from my grammar:

GRAMMAR = r'''
start : _s x _s
x : "X"?
_s : " "?
'''

l = Lark(GRAMMAR)

# behave as expected
print(l.parse( " X " ).pretty())
print(l.parse( "X" ).pretty())
print(l.parse( " " ).pretty())

# produces two x nodes
print(l.parse( "" ).pretty())

output:

start
  x

start
  x

start
  x

start
  x
  x

@chanicpanic
Copy link
Contributor

I think it's best to make an evidence-based choice, which makes me wonder, do you have an idea for what a good benchmark grammar&input for Earley would be?

I could run benchmarks on trick-grammars like start: start start | "(" start ")" |, and normal grammars that are low on ambiguity. But maybe you know some real-world examples that are ambiguous and would serve as a better measurement?

@erezsh I don't have any real-world grammars that I think would be a better benchmark for Earley.
For this issue, I think it would be good to use the grammars in this thread, or variations that incorporate the same patterns, which are known to utilize the ForestToParseTree cache with ambiguity="resolve". Also, it would be good to benchmark with normal grammars that reflect more typical lark usage.

I made branches for the two options that can be used for comparison:

@erezsh erezsh added the Earley Issues regarding the Earley parser label Aug 23, 2023
@erezsh
Copy link
Member

erezsh commented Jun 26, 2024

Hi @chanicpanic and everyone,

Sorry it took me this long to look into this!

I did some benchmarks, and my conclusion is it that - it really doesn't matter.

Both approaches seem to have the same performance in my tests, with less than 5% difference.

So, just choose whatever seems more "correct". I will accept a PR of either approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Earley Issues regarding the Earley parser
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants