Earley parser produces wrong parse Tree #1283
Comments
I have it down to:
from lark import Lark

grammar = """
start: (a+)*
!a.1: "A" |
"""
l = Lark(grammar)
tree = l.parse("A")
print(tree.pretty())
Output:
I suspect that the issue is somewhere in |
The issue is that when
Either will work, but they may have different performance implications.
@chanicpanic Thanks for looking into it! Can you explain more about the performance implications?
The While the On the other hand, I believe that child filtering is an operation that is performed for many grammars and inputs. So, the Thus, I am leaning toward option 1 because I expect the |
@chanicpanic Thanks for the explanation. I think it's best to make an evidence-based choice, which makes me wonder, do you have an idea for what a good benchmark grammar&input for Earley would be? I could run benchmarks on trick-grammars like cc: @MegaIng |
I stumbled across what I think is the same problem.
from lark import Lark

GRAMMAR = r'''
start : _s x _s
x : "X"?
_s : " "?
'''
l = Lark(GRAMMAR)
# behaves as expected
print(l.parse(" X ").pretty())
print(l.parse("X").pretty())
print(l.parse(" ").pretty())
# produces two x nodes
print(l.parse("").pretty())
output:
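A check like the following could turn the "two x nodes" observation into a regression assertion. This is only a sketch: `count_rules` and `FakeTree` are hypothetical names, and the two hand-built trees illustrate the shapes described in the comment above (one `x` child expected, two reported). They are not captured parser output. The helper relies only on nodes having `.data` and `.children`, which is the shape of a lark `Tree`.

```python
from collections import Counter


def count_rules(tree):
    """Count occurrences of each rule name in a tree of .data/.children nodes."""
    counts = Counter()
    stack = [tree]
    while stack:
        node = stack.pop()
        # Tokens (plain strings) have no .data/.children and are skipped.
        if hasattr(node, "data") and hasattr(node, "children"):
            counts[str(node.data)] += 1
            stack.extend(node.children)
    return counts


class FakeTree:
    """Hypothetical stand-in with the same attributes as lark.Tree."""

    def __init__(self, data, children):
        self.data = data
        self.children = children


# Illustrative shapes only, built by hand from the description in the report.
expected = FakeTree("start", [FakeTree("x", [])])
duplicated = FakeTree("start", [FakeTree("x", []), FakeTree("x", [])])

print(count_rules(expected)["x"])    # 1
print(count_rules(duplicated)["x"])  # 2
```

With lark installed, the same helper should be usable on the real result, for example `count_rules(l.parse(""))["x"]`, where a value other than 1 would indicate the duplication described here.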
@erezsh I don't have any real-world grammars that I think would be a better benchmark for Earley. I made branches for the two options that can be used for comparison: |
Hi @chanicpanic and everyone, Sorry it took me this long to look into this! I did some benchmarks, and my conclusion is that it really doesn't matter. Both approaches seem to have the same performance in my tests, with less than 5% difference. So, just choose whichever seems more "correct". I will accept a PR of either approach.
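For anyone who wants to repeat a comparison of this kind, a minimal harness using only the standard library might look like the sketch below. It is not the script used for the benchmarks mentioned above. The names `best_time`, `relative_difference` and `compare` are hypothetical, and the 5% threshold is taken from the figure quoted in this comment. The two callables would be something like a parse call run against each branch.

```python
import timeit


def best_time(fn, repeat=5, number=100):
    """Best per-call time in seconds over several timeit runs."""
    return min(timeit.repeat(fn, repeat=repeat, number=number)) / number


def relative_difference(t_a, t_b):
    """Difference between two timings, relative to the faster one."""
    fastest = min(t_a, t_b)
    if fastest == 0:
        return 0.0 if t_a == t_b else float("inf")
    return abs(t_a - t_b) / fastest


def compare(fn_a, fn_b, threshold=0.05, repeat=5, number=100):
    """Time both callables and flag whether the gap exceeds the threshold."""
    t_a = best_time(fn_a, repeat=repeat, number=number)
    t_b = best_time(fn_b, repeat=repeat, number=number)
    diff = relative_difference(t_a, t_b)
    return {"a": t_a, "b": t_b, "diff": diff, "significant": diff > threshold}


# Placeholder workloads; replace with a parse call for each implementation.
result = compare(lambda: sum(range(100)), lambda: sum(range(100)), repeat=3)
print(result)
```

Taking the minimum over repeats rather than the mean is a common choice for micro-benchmarks, since it is less sensitive to noise from other processes.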
Over on SO someone asked this question: https://stackoverflow.com/questions/76366280/parsing-formulas-using-lark-ebnf/76381256
As far as I can tell, it shows a bug in the Earley parser where it duplicates some Tokens because of an enormous amount of ambiguity. I don't have the expertise with Earley to figure out what is happening, or the time to create a minimal example right now.