
(design) Exterior whitespace handling #487


Merged on Oct 17, 2023 (14 commits)

Conversation

aphillips
Member

Provide a document with the options being considered for "pattern exterior whitespace"

@aphillips added the syntax (Issues related with syntax or ABNF) and design (Design document or issues related to design) labels on Oct 6, 2023
Collaborator

@eemeli left a comment


A few additions proposed inline, but looks good.

I fixed the numbering and some of the formatting directly on the branch.

_What is this proposal trying to achieve?_

The WG is discussing how to handle "pattern exterior" whitespace,
which is ASCII whitespace (tab, CR, LF, or U+0020) that is **_part_** of the pattern
Collaborator

@gibson042, Oct 10, 2023


#487 (comment) (updated):

This definition of "ASCII whitespace" differs from that of the web platform (which additionally includes U+000C FORM FEED) and of Unicode (which additionally includes both U+000C FORM FEED and U+000B VERTICAL TABULATION).

Response from @eemeli:

I'd be fine with any of these whitespace definitions, as long as it includes \n, \r, \t and space.
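
For reference, the three definitions mentioned in this comment can be written out as explicit sets. This is only an illustrative sketch: the names `PROPOSAL_WS`, `INFRA_ASCII_WS`, `UNICODE_ASCII_WS`, `REQUIRED`, and `describe` are made up for this example and do not come from the design document or from any specification.

```python
# Candidate whitespace sets discussed in this thread (names are hypothetical).
PROPOSAL_WS = {"\t", "\n", "\r", " "}        # tab, LF, CR, U+0020, as quoted from the design doc
INFRA_ASCII_WS = PROPOSAL_WS | {"\f"}        # web platform "ASCII whitespace": adds U+000C FORM FEED
UNICODE_ASCII_WS = INFRA_ASCII_WS | {"\v"}   # ASCII code points with White_Space: also adds U+000B

# The minimum that the response above asks for: \n, \r, \t and space.
REQUIRED = {"\n", "\r", "\t", " "}


def describe(chars):
    """Return the code points in `chars` as sorted U+XXXX labels."""
    return sorted(f"U+{ord(c):04X}" for c in chars)


for name, ws in [
    ("proposal", PROPOSAL_WS),
    ("infra", INFRA_ASCII_WS),
    ("unicode", UNICODE_ASCII_WS),
]:
    # Every candidate includes the required four; they differ only in FF and VT.
    print(name, describe(ws), REQUIRED <= ws)
```

All three candidates satisfy the "includes \n, \r, \t and space" condition; the only difference is whether U+000C and U+000B are also accepted.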

Member Author


We omitted form feed and vtab, which are ASCII whitespace characters. We are consistent with JSON, HTML, and CSS's definition of whitespace. Many other languages also observe this definition (Java and I believe C++, although not sure about the latter). Is there a technical reason for us to permit form feed and vtab into our whitespace definition? If we don't permit them in our whitespace production, they would not be "pattern exterior" (for many of the cases in this document). We don't currently allow them in expressions, declarations and the like.
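
To illustrate the "would not be pattern exterior" point, here is a rough sketch that treats exterior whitespace as plain leading/trailing trimming. That is an assumption made for the example and is simpler than the cases in the design document. `strip_exterior`, `NARROW`, and `WITH_FF_VT` are hypothetical names, not part of any MessageFormat implementation.

```python
def strip_exterior(pattern: str, whitespace: str) -> str:
    """Trim leading and trailing characters that belong to `whitespace`."""
    return pattern.strip(whitespace)


NARROW = "\t\n\r "          # tab, LF, CR, U+0020
WITH_FF_VT = "\t\n\v\f\r "  # additionally U+000B and U+000C

sample = "\f hello \v"

# With the narrow set, the form feed and vtab are ordinary pattern characters,
# so nothing is trimmed: the outermost characters are not whitespace.
print(repr(strip_exterior(sample, NARROW)))      # '\x0c hello \x0b'

# With the wider set, they count as whitespace and are trimmed with the spaces.
print(repr(strip_exterior(sample, WITH_FF_VT)))  # 'hello'
```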

Collaborator


We omitted form feed and vtab, which are ASCII whitespace characters. We are consistent with JSON, HTML, and CSS's definition of whitespace.

This is consistent with JSON (ref), but not with HTML or CSS by my reading of those specifications:

  • HTML uses the Infra Standard definition of ASCII whitespace (U+0009 TAB, U+000A LF, U+000C FF, U+000D CR, or U+0020 SPACE) that I linked to above (example).
  • CSS is more complicated because preprocessing replaces U+000D CARRIAGE RETURN and U+000C FORM FEED with U+000A LINE FEED, but basically boils down to the union of those with U+0009 CHARACTER TABULATION and U+0020 SPACE, aligning with HTML (ref).
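
The CSS bullet above can be sketched as follows. This covers only the newline normalization part of CSS preprocessing (CR, FF, and CR LF pairs becoming a single LF) and assumes nothing else about the tokenizer. The function name `css_preprocess_newlines` and the constant names are hypothetical.

```python
import re


def css_preprocess_newlines(text: str) -> str:
    """Replace each CR LF pair, lone CR, or FF with a single LF."""
    # The CR LF alternative comes first so a pair collapses to one LF, not two.
    return re.sub(r"\r\n|\r|\f", "\n", text)


# After this step only LF, TAB, and SPACE remain as whitespace in the input...
CSS_WS_AFTER_PREPROCESSING = {"\n", "\t", " "}

# ...so the characters that effectively act as whitespace in the raw input
# are the union with CR and FF, the same set as HTML's ASCII whitespace.
CSS_EFFECTIVE_WS = CSS_WS_AFTER_PREPROCESSING | {"\r", "\f"}

print(repr(css_preprocess_newlines("a\r\nb\rc\fd")))  # 'a\nb\nc\nd'
```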

Many other languages also observe this definition (Java and I believe C++, although not sure about the latter). Is there a technical reason for us to permit form feed and vtab into our whitespace definition? If we don't permit them in our whitespace production, they would not be "pattern exterior" (for many of the cases in this document). We don't currently allow them in expressions, declarations and the like.

I'm not arguing for inclusion or exclusion of the extra control characters, just for clarity. "ASCII whitespace" is a term of art in the web platform that includes U+000C FORM FEED, and in Unicode implies an interpretation like "code points that are both ASCII and White_Space" that includes U+000C FORM FEED and U+000B VERTICAL TABULATION. This text was therefore misleading, but cd188aa resolved that. 👍

Member Author


Can I get a "ship-it" (approval)?

@aphillips merged commit e5100b1 into main on Oct 17, 2023
@aphillips deleted the aphillips-whitespace branch on October 17, 2023 14:43