Commit 35ec5fe

[lex] Better specify whitespace characters
This commit defines a grammar term for _whitespace-character_ and uses it consistently wherever the plain-text term "whitespace character" was used. A whitespace character is defined as one of the five characters mentioned in the text that came closest to providing a definition. The Unicode character name is (mostly) consistently used to name these characters, and for consistency, similar changes were made to name Unicode characters rather than render specified characters in code font throughout [lex]. The one exception is backslash, which is retained as-is to avoid making more issues for P2348. Note that this commit is not a replacement for P2348, merely a clearer statement of the existing specification without any normative changes.
1 parent bf43925 commit 35ec5fe
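
For readers who prefer code to grammar, the five alternatives of _whitespace-character_ introduced by this commit can be sketched as a small predicate. This is an illustration only and not part of the commit: the function name `is_whitespace_character` is hypothetical, and mapping the abstract new-line character to U+000A is an assumption made for the sketch.

```cpp
#include <cassert>

// Hypothetical helper mirroring the whitespace-character production.
// Assumption: the abstract "new-line" character is represented as U+000A.
constexpr bool is_whitespace_character(char32_t c) noexcept {
    switch (c) {
    case 0x0009: // character tabulation
    case 0x000A: // new-line (assumed encoding)
    case 0x000B: // line tabulation
    case 0x000C: // form feed
    case 0x0020: // space
        return true;
    default:
        return false;
    }
}

// Compile-time checks: listed characters match, others do not.
static_assert(is_whitespace_character(U' '), "space is listed");
static_assert(!is_whitespace_character(U'a'), "letters are not listed");
static_assert(!is_whitespace_character(0x00A0), "no-break space is not listed");
```

Because the production is a closed list, a `switch` keeps the sketch in one-to-one correspondence with the grammar; any character outside the five alternatives falls through to `false`.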

1 file changed

+48
-29
lines changed

source/lex.tex

+48-29
@@ -110,9 +110,9 @@
 \indextext{line splicing}%
 If the first translation character is \unicode{feff}{byte order mark},
 it is deleted.
-Each sequence of a backslash character (\textbackslash)
+Each sequence of a backslash character (\unicode{005c}{reverse solidus})
 immediately followed by
-zero or more whitespace characters other than new-line followed by
+zero or more \grammarterm{whitespace-character}s other than new-line followed by
 a new-line character is deleted, splicing
 physical source lines to form \defnx{logical source lines}{source line!logical}. Only the last
 backslash on any physical source line shall be eligible for being part
@@ -126,9 +126,13 @@
 shall be processed as if an additional new-line character were appended
 to the file.
 
-\item The source file is decomposed into preprocessing
-tokens\iref{lex.pptoken} and sequences of whitespace characters
-(including comments). A source file shall not end in a partial
+\item
+\indextext{whitespace}%
+\indextext{comment}%
+\indextext{token!preprocessing}%
+The source file is decomposed into preprocessing
+tokens\iref{lex.pptoken} and whitespace\iref{lex.whitespace} (sequences of \grammarterm{whitespace-character}s
+and comments). A source file shall not end in a partial
 preprocessing token or in a partial comment.
 \begin{footnote}
 A partial preprocessing
@@ -140,9 +144,9 @@
 would arise from a source file ending with an unclosed \tcode{/*}
 comment.
 \end{footnote}
-Each comment\iref{lex.comment} is replaced by one space character. New-line characters are
-retained. Whether each nonempty sequence of whitespace characters other
-than new-line is retained or replaced by one space character is
+Each comment\iref{lex.comment} is replaced by one \unicode{0020}{space} character. New-line characters are
+retained. Whether each nonempty sequence of \grammarterm{whitespace-character}s other
+than new-line is retained or replaced by one \unicode{0020}{space} character is
 unspecified.
 As characters from the source file are consumed
 to form the next preprocessing token
@@ -178,10 +182,10 @@
 \item
 Adjacent \grammarterm{string-literal} tokens are concatenated\iref{lex.string}.
 
-\item Whitespace characters separating tokens are no longer
-significant. Each preprocessing token is converted into a
-token\iref{lex.token}. The resulting tokens
-constitute a \defn{translation unit} and
+\item
+Each preprocessing token is converted into a token\iref{lex.token}.
+Any \grammarterm{whitespace-character}s separating tokens are no longer significant.
+The resulting tokens constitute a \defn{translation unit} and
 are syntactically and
 semantically analyzed and translated.
 \begin{note}
@@ -467,7 +471,28 @@
 None of these names or aliases have leading or trailing spaces.
 \end{note}
 
-\rSec1[lex.comment]{Comments}
+\rSec1[lex.whitespace]{Whitespace}
+\indextext{whitespace|(}%
+
+\rSec2[lex.whitechar]{Whitespace Characters}
+
+\indextext{character!whitespace|(}%
+\begin{bnf}
+\nontermdef{whitespace-character}\br
+\unicode{0009}{character tabulation}\br
+\textnormal{new-line}\br
+\unicode{000b}{line tabulation}\br
+\unicode{000c}{form feed}\br
+\unicode{0020}{space}\br
+\end{bnf}
+
+\pnum
+\begin{note}
+Whitespace characters are used to separate elements of the \Cpp grammar.
+\end{note}
+\indextext{character!whitespace|)}
+
+\rSec2[lex.comment]{Comments}
 
 \pnum
 \indextext{comment|(}%
@@ -477,8 +502,8 @@
 characters \tcode{*/}. These comments do not nest.
 \indextext{comment!\tcode{//}}%
 The characters \tcode{//} start a comment, which terminates immediately before the
-next new-line character. If there is a form-feed or a vertical-tab
-character in such a comment, only whitespace characters shall appear
+next new-line character. If there is a \unicode{000c}{form feed} or a \unicode{000b}{line tabulation}
+character in such a comment, only \grammarterm{whitespace-character}s shall appear
 between it and the new-line that terminates the comment; no diagnostic
 is required.
 \begin{note}
@@ -489,6 +514,7 @@
 \tcode{/*} comment.
 \end{note}
 \indextext{comment|)}
+\indextext{whitespace|)}%
 
 \rSec1[lex.pptoken]{Preprocessing tokens}
 
@@ -506,7 +532,7 @@
 string-literal\br
 user-defined-string-literal\br
 preprocessing-op-or-punc\br
-\textnormal{each non-whitespace character that cannot be one of the above}
+\textnormal{each non-\grammarterm{whitespace-character} that cannot be one of the above}
 \end{bnf}
 
 \pnum
@@ -520,22 +546,17 @@
 (\grammarterm{import-keyword}, \grammarterm{module-keyword}, and \grammarterm{export-keyword}),
 identifiers, preprocessing numbers, character literals (including user-defined character
 literals), string literals (including user-defined string literals), preprocessing
-operators and punctuators, and single non-whitespace characters that do not lexically
+operators and punctuators, and single non-\grammarterm{whitespace-character}s that do not lexically
 match the other preprocessing token categories.
 If a \unicode{0027}{apostrophe} or a \unicode{0022}{quotation mark} character
 matches the last category, the program is ill-formed.
 If any character not in the basic character set matches the last category,
 the program is ill-formed.
 Preprocessing tokens can be separated by
 \indextext{whitespace}%
-whitespace;
+whitespace\iref{lex.whitespace};
 \indextext{comment}%
-this consists of comments\iref{lex.comment}, or whitespace characters
-(\unicode{0020}{space},
-\unicode{0009}{character tabulation},
-new-line,
-\unicode{000b}{line tabulation}, and
-\unicode{000c}{form feed}), or both.
+this consists of comments, \grammarterm{whitespace-character}s, or both.
 As described in \ref{cpp}, in certain
 circumstances during translation phase 4, whitespace (or the absence
 thereof) serves as more than preprocessing token separation. Whitespace
@@ -826,9 +847,7 @@
 \end{footnote}
 operators, and other separators.
 \indextext{whitespace}%
-Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments
-(collectively, ``whitespace''), as described below, are ignored except
-as they serve to separate tokens.
+Whitespace\iref{lex.whitespace} is ignored except to separate tokens.
 \begin{note}
 Whitespace can separate otherwise adjacent identifiers, keywords, numeric
 literals, and alternative tokens containing alphabetic characters.
@@ -1790,8 +1809,8 @@
 \begin{bnf}
 \nontermdef{d-char}\br
 \textnormal{any member of the basic character set except:}\br
-\bnfindent\textnormal{\unicode{0020}{space}, \unicode{0028}{left parenthesis}, \unicode{0029}{right parenthesis}, \unicode{005c}{reverse solidus},}\br
-\bnfindent\textnormal{\unicode{0009}{character tabulation}, \unicode{000b}{line tabulation}, \unicode{000c}{form feed}, and new-line}
+\bnfindent\textnormal{a \grammarterm{whitespace-character}, \unicode{0028}{left parenthesis}, \unicode{0029}{right parenthesis},}\br
+\bnfindent\textnormal{and \unicode{005c}{reverse solidus}}
 \end{bnf}
 
 \pnum
