Inconsistent grabbing of whitespace #519

Closed
twilight-flower opened this issue Jun 11, 2021 · 4 comments

@twilight-flower

Consider the following grammar:

integer = { ASCII_DIGIT+ }
d = { "d" }
WHITESPACE = _{ " " }
diceroll = { integer ~ d ~ integer }

When parsing the input "2 d 3" as a diceroll, its bottom-level tokens are the integer "2 ", the d "d", and the integer "3 ". Note how spaces are captured in the integers but not in the d.

If the d rule is changed to d = { "d"+ }, suddenly it starts capturing the space after the d too. The same doesn't happen for d = { "d"* }, which continues to capture only d.

If you increase the number of spaces in a given position, they'll all be captured; so, given the input:

2     d 3

The first integer will be captured as:

2     

(Apologies for the full-line code blocks there; inline ones collapse multiple spaces down to just one in traditional HTML fashion, whereas full-line ones don't.)

If the integers are more than one digit long, spaces are no longer captured after them. 22 d 3's first integer is 22, no space included. This holds irrespective of how many spaces are present in the source text; the number always turns to zero, it doesn't just decrement by one per digit or anything along those lines.

This behavior all seems to ultimately flow down from the WHITESPACE rule; with the rules as defined here, parsing "2 " as an integer yields "2 ", but with the WHITESPACE rule removed, it instead yields just "2".

I'm not at all sure what the intended behavior here is—whether the intention is that the whitespace be consistently captured in the token to its left, or that it be consistently not-captured—but I'm almost certain this isn't the intended behavior. There are too many weird inconsistencies, with the spaces only being captured after one-character inputs whose rules use plus signs. But I figure it's worth raising this issue as an alert of "things are probably not working the way they're supposed to", even in the absence of knowledge of exactly how they are supposed to work.

@ishitatsuyuki

I'm new to pest and also facing this, and this sounds like a bug to me.

According to the manual, a silent rule will never appear in parse results. Therefore, I think whitespace appearing inside the pair violates this specification.

@CAD97
Contributor

CAD97 commented Aug 23, 2021

I'm not certain about the difference between + and *, but the reason you're capturing whitespace at all in integer here is that it's not atomic.

I've never worked with this part of the codebase, so this is just conjecture, but the following "desugaring" should help illustrate why I believe this is occurring:

WHITESPACE = _{ " " }

integer = { ASCII_DIGIT+ }
d = { "d"* }
diceroll = { integer ~ d ~ integer }

// inline all the rules for clarity, fake syntax
diceroll:{
  integer:{ ASCII_DIGIT+ }
~ d:{ "d"* }
~ integer:{ ASCII_DIGIT+ }
}

// "desugar" + to *
diceroll:{
  integer:{ ASCII_DIGIT ~ ASCII_DIGIT* }
~ d:{ "d"* }
~ integer:{ ASCII_DIGIT ~ ASCII_DIGIT* }
}

// now I introduce the idea of `-`; `~` without WHITESPACE, for clarity

diceroll:{
  integer:{ ASCII_DIGIT - WHITESPACE-* - ASCII_DIGIT~* }
- WHITESPACE-*
- d:{ "d"~* }
- WHITESPACE-*
- integer:{ ASCII_DIGIT - WHITESPACE-* - ASCII_DIGIT~* }
}

I don't actually know how "~*" handles capturing WHITESPACE. But I think this desugaring of rule+ to rule ~ rule* at least explains how the inconsistency emerges.
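
To make the conjecture concrete, here is a toy model of that desugaring in Rust. It is not pest's actual generated code: the helper names (skip_ws, one, star, plus, integer_span) are made up for this sketch, WHITESPACE is assumed to be a single space, and e* is assumed to roll back any skip that is not followed by another e.

```rust
/// Models the implicit skip: consume zero or more spaces starting at `pos`.
fn skip_ws(input: &[u8], mut pos: usize) -> usize {
    while pos < input.len() && input[pos] == b' ' {
        pos += 1;
    }
    pos
}

/// Match a single byte accepted by `pred`; return the new position on success.
fn one(input: &[u8], pos: usize, pred: fn(u8) -> bool) -> Option<usize> {
    match input.get(pos) {
        Some(&b) if pred(b) => Some(pos + 1),
        _ => None,
    }
}

/// Assumed model of `e*`: `(e ~ (skip ~ e)*)?`.
/// A skip that is not followed by another `e` is rolled back,
/// so `e*` on its own never ends on whitespace.
fn star(input: &[u8], pos: usize, pred: fn(u8) -> bool) -> usize {
    let mut end = match one(input, pos, pred) {
        Some(p) => p,
        None => return pos,
    };
    while let Some(p) = one(input, skip_ws(input, end), pred) {
        end = p;
    }
    end
}

/// Assumed model of `e+` desugared to `e ~ skip ~ e*`.
/// The skip between the two halves is kept even when `e*` matches nothing.
fn plus(input: &[u8], pos: usize, pred: fn(u8) -> bool) -> Option<usize> {
    let after_first = one(input, pos, pred)?;
    let after_skip = skip_ws(input, after_first);
    Some(star(input, after_skip, pred))
}

fn is_digit(b: u8) -> bool {
    b.is_ascii_digit()
}

fn is_d(b: u8) -> bool {
    b == b'd'
}

/// Text the modelled `integer = { ASCII_DIGIT+ }` would capture at the start of `input`.
fn integer_span(input: &str) -> Option<&str> {
    plus(input.as_bytes(), 0, is_digit).map(|end| &input[..end])
}
```

Under these assumptions, integer_span("2 d 3") yields "2 " (the skip after the only digit is kept), while integer_span("22 d 3") yields "22" (the skip happens inside the repetition and is rolled back when no third digit follows). Likewise, plus on "d 3" with is_d ends after the space, but star ends right after the d, which lines up with the + versus * difference in the original report.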

Ultimately I agree that this manifests as an inconsistency to the user, and ideally should be improved. However, I'm really not certain how to go about doing so.

I think the behavior end users would most expect is to never include trailing whitespace in a capture. But this is problematic in practice; consider the simple case of rule ~ rule?, which "desugars" to rule - WHITESPACE-* - rule?. Instead, it'd need to expand to something more along the lines of rule - { WHITESPACE-* - rule }?, which breaks the simple interpretation of ~ as just - WHITESPACE-* -.

> silent rule will never appear in parse results

This refers specifically to the fact that the WHITESPACE node itself is not present. It's perfectly valid to capture a nonsilent node that contains a silent node, and the outer node's text will obviously contain the silent inner node's text.

doy added a commit to doy/nbsh that referenced this issue Jan 6, 2022
the end of rule whitespace handling is weird and inconsistent, see
pest-parser/pest#396 and
pest-parser/pest#519
@tomtau
Contributor

tomtau commented Dec 18, 2023

This looks like #396 which was fixed in #878
(currently feature-guarded under "grammar-extras" in order not to break grammars that rely on the old behaviour).

But feel free to re-open the issue if it was something else

@tomtau closed this as completed Dec 18, 2023
@wongjiahau

This behavior is expected, see https://pest.rs/book/examples/ini.html?highlight=whitespace#whitespace.

To fix your problem, mark the integer rule as atomic (by prefixing it with @).

integer = @{ ASCII_DIGIT+ }

For non-atomic rules, the WHITESPACE rule is implicitly injected, thus the rule { "d"+ } desugars into:

"d" ~ WHITESPACE* ~ "d"*

Automatic whitespace handling, on the other hand, is disabled for atomic rules.
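
As a rough illustration of what that means for the captured text, here is a toy model in Rust of an atomic integer rule. It is only a sketch and not pest's implementation; the function name atomic_integer_span is hypothetical. With no implicit skip, the match is just the leading run of ASCII digits.

```rust
/// Toy model of `integer = @{ ASCII_DIGIT+ }`: one or more digits,
/// with no implicit whitespace skipping between or after them.
fn atomic_integer_span(input: &str) -> Option<&str> {
    // Count the leading ASCII digits; each is one byte, so the count is a valid slice index.
    let end = input.bytes().take_while(|b| b.is_ascii_digit()).count();
    if end == 0 {
        None // `+` requires at least one digit
    } else {
        Some(&input[..end])
    }
}
```

In this model, atomic_integer_span("2 d 3") gives "2" and atomic_integer_span("22 d 3") gives "22", so the one-digit and multi-digit cases behave the same way.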
