Inconsistent grabbing of whitespace #519

twilight-flower · 2021-06-11T16:35:37Z

Consider the following grammar:

integer = { ASCII_DIGIT+ }
d = { "d" }
WHITESPACE = _{ " " }
diceroll = { integer ~ d ~ integer }

When parsing the input 2 d 3 as a diceroll, its bottom-level tokens are integer 2 , the d d, and the integer 3 . Note how spaces are captured in the integers but not in the d.

If the d rule is changed to d = { "d"+ }, suddenly it starts capturing the space after the d too. The same doesn't happen for d = { "d"* }, which continues to capture only d.

If you increase the number of spaces in a given position, they'll all be captured; so, given the input:

2     d 3

The first integer will be captured as:

(Apologies for the full-line code blocks there; inline ones collapse multiple spaces down to just one in traditional HTML fashion, whereas full-line ones don't.)

If the integers are more than one digit long, spaces are no longer captured after them. 22 d 3's first integer is 22, no space included. This holds irrespective of how many spaces are present in the source text; the number always turns to zero, it doesn't just decrement by one per digit or anything along those lines.

This behavior all seems to ultimately flow down from the WHITESPACE rule; with the rules as defined here, parsing 2 as an integer yields "2 " (trailing space included), but with the WHITESPACE rule removed, it instead yields just "2".

I'm not at all sure what the intended behavior here is—whether the intention is that the whitespace be consistently captured in the token to its left, or that it be consistently not-captured—but I'm almost certain this isn't the intended behavior. There are too many weird inconsistencies, with the spaces only being captured after one-character inputs whose rules use plus signs. But I figure it's worth raising this issue as an alert of "things are probably not working the way they're supposed to", even in the absence of knowledge of exactly how they are supposed to work.

ishitatsuyuki · 2021-08-22T14:06:03Z

I'm new to pest and also facing this, and this sounds like a bug to me.

According to the manual, a silent rule will never appear in parse results. Therefore, I think whitespace appearing inside the pair violates this specification.

CAD97 · 2021-08-23T06:33:06Z

I'm not certain about the difference between + and *, but the reason you're capturing whitespace at all in integer here is that it's not atomic.

I've never worked with this part of the codebase, so this is just conjecture, but the following "desugaring" should help illustrate why I believe this is occurring:

WHITESPACE = _{ " " }

integer = { ASCII_DIGIT+ }
d = { "d"* }
diceroll = { integer ~ d ~ integer }

// inline all the rules for clarity, fake syntax
diceroll:{
  integer:{ ASCII_DIGIT+ }
~ d:{ "d"* }
~ integer:{ ASCII_DIGIT+ }
}

// "desugar" + to *
diceroll:{
  integer:{ ASCII_DIGIT ~ ASCII_DIGIT* }
~ d:{ "d"* }
~ integer:{ ASCII_DIGIT ~ ASCII_DIGIT* }
}

// now I introduce the idea of `-`; `~` without WHITESPACE, for clarity

diceroll:{
  integer:{ ASCII_DIGIT - WHITESPACE-* - ASCII_DIGIT~* }
- WHITESPACE-*
- d:{ "d"~* }
- WHITESPACE-*
- integer:{ ASCII_DIGIT - WHITESPACE-* - ASCII_DIGIT~* }
}

I don't actually know how "~*" handles capturing WHITESPACE. But I think this desugaring of rule+ to rule ~ rule* at least explains how the inconsistency emerges.
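To make the conjecture concrete, here is a small toy model in plain Rust. It is not pest's actual implementation; the function names (skip, digit, digit_star, integer, captured) are invented for illustration, and it assumes that a failed "whitespace then digit" group inside the repetition is rolled back as a whole, while the whitespace inserted by the ~ after the first digit is not.

```rust
// Toy model of the desugaring sketched above; this is NOT pest's real code.
// Each matcher takes the input bytes and a start position and reports where
// it stopped.

/// Implicit whitespace: consume zero or more spaces, never failing.
fn skip(s: &[u8], mut pos: usize) -> usize {
    while pos < s.len() && s[pos] == b' ' {
        pos += 1;
    }
    pos
}

/// Match a single ASCII digit.
fn digit(s: &[u8], pos: usize) -> Option<usize> {
    if pos < s.len() && s[pos].is_ascii_digit() {
        Some(pos + 1)
    } else {
        None
    }
}

/// `ASCII_DIGIT*` in a non-atomic rule (assumed shape): one digit, then
/// repeated "whitespace, digit" groups. A group whose digit fails is rolled
/// back entirely, so its whitespace is not consumed.
fn digit_star(s: &[u8], pos: usize) -> usize {
    let mut end = match digit(s, pos) {
        Some(p) => p,
        None => return pos,
    };
    while let Some(p) = digit(s, skip(s, end)) {
        end = p;
    }
    end
}

/// `ASCII_DIGIT+` modelled as `ASCII_DIGIT ~ ASCII_DIGIT*`: the whitespace
/// inserted by `~` stays consumed even if the `*` part then matches nothing.
fn integer(s: &[u8], pos: usize) -> Option<usize> {
    let after_first = digit(s, pos)?;
    Some(digit_star(s, skip(s, after_first)))
}

/// The text the toy `integer` rule captures at the start of `input`.
fn captured(input: &str) -> Option<&str> {
    integer(input.as_bytes(), 0).map(|end| &input[..end])
}

fn main() {
    println!("{:?}", captured("2 d 3"));
    println!("{:?}", captured("22 d 3"));
}
```

Under these assumptions the model reproduces the report: a one-digit integer keeps every space that follows it, while a multi-digit integer keeps none, because its trailing whitespace is only ever consumed inside a group that gets rolled back.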

Ultimately I agree that this manifests as an inconsistency to the user, and ideally should be improved. However, I'm really not certain how to go about doing so.

I think the behavior end users would most expect is to never include trailing whitespace in a capture. But this is problematic in practice; consider the simple case of rule ~ rule?, which "desugars" to rule - WHITESPACE-* - rule?. Instead, it would need to expand to something more along the lines of rule - { WHITESPACE-* - rule }?, which breaks the simple interpretation of ~ as just - WHITESPACE-* -.
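The difference between the two expansions of rule ~ rule? can be sketched with a toy matcher. This is only an illustration, not pest code: rule is assumed to be the literal "a", WHITESPACE a single space, and the names naive and grouped are made up here.

```rust
// Toy comparison of two ways to expand `rule ~ rule?`; not pest's real code.
// `rule` is the literal "a" and WHITESPACE is a single space (both assumed).

/// Implicit whitespace: consume zero or more spaces, never failing.
fn skip(s: &[u8], mut pos: usize) -> usize {
    while pos < s.len() && s[pos] == b' ' {
        pos += 1;
    }
    pos
}

/// The hypothetical `rule`: match one literal 'a'.
fn rule(s: &[u8], pos: usize) -> Option<usize> {
    if s.get(pos) == Some(&b'a') {
        Some(pos + 1)
    } else {
        None
    }
}

/// `rule - WHITESPACE-* - rule?`: whitespace is consumed before we know
/// whether the optional second `rule` is present, so it may be left trailing.
fn naive(input: &str) -> Option<&str> {
    let s = input.as_bytes();
    let first = rule(s, 0)?;
    let after_ws = skip(s, first);
    let end = rule(s, after_ws).unwrap_or(after_ws);
    Some(&input[..end])
}

/// `rule - { WHITESPACE-* - rule }?`: the whitespace belongs to the optional
/// group, so it is given back when the second `rule` does not match.
fn grouped(input: &str) -> Option<&str> {
    let s = input.as_bytes();
    let first = rule(s, 0)?;
    let end = rule(s, skip(s, first)).unwrap_or(first);
    Some(&input[..end])
}

fn main() {
    for input in ["a  b", "a  a"] {
        println!("{:?}: naive={:?} grouped={:?}", input, naive(input), grouped(input));
    }
}
```

Both variants agree when the second rule is present; they differ only when it is absent, which is exactly the case where trailing whitespace ends up in the capture.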

> silent rule will never appear in parse results

This refers specifically to the fact that the WHITESPACE node itself is not present. It's perfectly valid to capture a nonsilent node that contains a silent node, and the outer node's text will obviously contain the silent inner node's text.

tomtau · 2023-12-18T01:59:11Z

This looks like #396 which was fixed in #878
(currently feature-guarded under "grammar-extras" in order not to break grammars that rely on the old behaviour).

But feel free to re-open the issue if it was something else

wongjiahau · 2024-06-16T06:13:09Z

This behavior is expected, see https://pest.rs/book/examples/ini.html?highlight=whitespace#whitespace.

To fix your problem, mark the integer rule as atomic (by prefixing it with @).

integer = @{ ASCII_DIGIT+ }

For non-atomic rules, the WHITESPACE rule is implicitly injected, thus the rule { "d"+ } desugars into:

"d" ~ WHITESPACE ~ "d"*

Automatic whitespace handling, on the other hand, is disabled for atomic rules.
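As a rough illustration of what the atomic marker changes, the following toy model in plain Rust treats integer as atomic (digits only, no implicit whitespace inside it) while diceroll stays non-atomic and skips whitespace between its parts. It is a sketch under those assumptions, not pest's generated parser, and the names integer_atomic and diceroll are chosen here for illustration.

```rust
// Toy model of the grammar with `integer = @{ ASCII_DIGIT+ }`; NOT pest code.

/// Implicit whitespace between the parts of the non-atomic `diceroll` rule.
fn skip(s: &[u8], mut pos: usize) -> usize {
    while pos < s.len() && s[pos] == b' ' {
        pos += 1;
    }
    pos
}

/// Atomic integer: one or more digits with no whitespace handling inside.
fn integer_atomic(s: &[u8], start: usize) -> Option<usize> {
    let mut pos = start;
    while pos < s.len() && s[pos].is_ascii_digit() {
        pos += 1;
    }
    if pos > start {
        Some(pos)
    } else {
        None
    }
}

/// `diceroll = { integer ~ d ~ integer }`: returns the text captured for the
/// first integer, the `d`, and the second integer.
fn diceroll(input: &str) -> Option<(&str, &str, &str)> {
    let s = input.as_bytes();
    let int1_end = integer_atomic(s, 0)?;
    let d_start = skip(s, int1_end);
    if s.get(d_start) != Some(&b'd') {
        return None;
    }
    let d_end = d_start + 1;
    let int2_start = skip(s, d_end);
    let int2_end = integer_atomic(s, int2_start)?;
    Some((
        &input[..int1_end],
        &input[d_start..d_end],
        &input[int2_start..int2_end],
    ))
}

fn main() {
    println!("{:?}", diceroll("2     d 3"));
    println!("{:?}", diceroll("22 d 3"));
}
```

In this model the whitespace is consumed only between tokens, so no capture contains spaces. A side effect is that an input such as 2 2 d 3 no longer parses, because the atomic integer cannot span the space.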

doy added a commit to doy/nbsh that referenced this issue Jan 6, 2022

stop using implicit whitespace

3ec1f55

the end of rule whitespace handling is weird and inconsistent, see pest-parser/pest#396 and pest-parser/pest#519

tomtau added bug pest labels Jul 14, 2022

tomtau closed this as completed Dec 18, 2023
