clarify string descriptions #875

Open. Wants to merge 3 commits into base: main
Changes from 1 commit
41 changes: 27 additions & 14 deletions toml.md
@@ -259,12 +259,20 @@ String
------

There are four ways to express strings: basic, multi-line basic, literal, and
-multi-line literal. All strings must contain only valid UTF-8 characters.
+multi-line literal.

-**Basic strings** are surrounded by quotation marks (`"`). Any Unicode character
-may be used except those that must be escaped: quotation mark, backslash, and
-the control characters other than tab (U+0000 to U+0008, U+000A to U+001F,
-U+007F).
+All strings must contain only valid UTF-8 encoded characters as is the case for
+the TOML document as a whole. Certain control characters are not allowed to
+occur literally in any kind of string: U+0000 to U+0008, U+000B, U+000C, U+000E
+to U+001F, and U+007F. In basic strings and multi-line basic strings, but not in
+literal strings or multi-line literal strings, those control characters can be
+described with escapes as specified below. Additional restrictions are described
+below.

Contributor

If you do persist with this change, then I'd simplify this paragraph. There's just too much here.

Suggested change
-All strings must contain only valid UTF-8 encoded characters as is the case for
-the TOML document as a whole. Certain control characters are not allowed to
-occur literally in any kind of string: U+0000 to U+0008, U+000B, U+000C, U+000E
-to U+001F, and U+007F. In basic strings and multi-line basic strings, but not in
-literal strings or multi-line literal strings, those control characters can be
-described with escapes as specified below. Additional restrictions are described
-below.
+Strings must contain only valid UTF-8 encoded characters. Certain control characters are not allowed to occur literally in any kind of string: U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, and U+007F.

The point about the basic strings supporting escaped control characters is already covered in the "basic strings" section.

(An argument can be made that the list of disallowed characters should be represented as an actual bullet-point list, though that's a matter of taste, and beyond the scope of this PR since it wasn't that way to begin with.)

Contributor

I think moving the "Any Unicode character may be used except [..]" one paragraph up makes sense. Now it just looks like it applies only to basic strings, rather than all strings.

Can then also remove the same text for multi-line strings and "Control characters other than tab are not permitted in a literal string" at the end of the "Multi-line literal strings" section.

I'd write it as something like:

There are four ways to express strings: basic, multi-line basic, literal, and multi-line literal. All strings must be encoded as valid UTF-8, and can contain any codepoint except control characters other than tab (U+0000 to U+0008, U+000A to U+001F, U+007F). Multi-line strings can also contain newlines (U+000A) and carriage returns (U+000D).

This way you have "what bytes/characters can be in a string?" in a single concise paragraph.

Contributor

@arp242 I like that wording.

Contributor Author

@arp242 Thanks, that is an improvement. Putting those details at the beginning rather than at the end of each section makes all the sections easier to understand. I have also reworded the last paragraph in strings to make it clear it is not another part of the spec but just advice (paragraph starting "Because most control characters are not permitted...").
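
As a rough illustration of the character rules discussed in this thread, the Python sketch below encodes the control characters that the proposed text forbids in every kind of string, then checks that adding LF and CR yields the ranges in the alternative wording quoted above (U+0000 to U+0008, U+000A to U+001F, U+007F). It is not part of the PR; the names `DISALLOWED_EVERYWHERE`, `DISALLOWED_SINGLE_LINE`, and `find_disallowed` are hypothetical, and the sketch ignores the separate rule about CR in multi-line strings.

```python
# Control characters that the proposed text forbids literally in every kind of
# TOML string: U+0000-U+0008, U+000B, U+000C, U+000E-U+001F, and U+007F.
DISALLOWED_EVERYWHERE = frozenset(
    [chr(c) for c in range(0x00, 0x09)]
    + ["\x0b", "\x0c"]
    + [chr(c) for c in range(0x0E, 0x20)]
    + ["\x7f"]
)

# Single-line (basic and literal) strings additionally exclude LF and CR.
DISALLOWED_SINGLE_LINE = DISALLOWED_EVERYWHERE | {"\n", "\r"}


def find_disallowed(text, multiline=False):
    """Return the sorted code points in `text` that may not occur literally."""
    banned = DISALLOWED_EVERYWHERE if multiline else DISALLOWED_SINGLE_LINE
    return sorted({ord(ch) for ch in text if ch in banned})


# The single-line set equals the ranges U+0000-U+0008, U+000A-U+001F, U+007F.
alternative = (
    {chr(c) for c in range(0x00, 0x09)}
    | {chr(c) for c in range(0x0A, 0x20)}
    | {"\x7f"}
)
assert DISALLOWED_SINGLE_LINE == alternative
```

Under these assumptions the two formulations describe the same set for single-line strings; they differ only in where the LF and CR exception is stated.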


+**Basic strings** are surrounded by quotation marks (`"`). In addition to the
+characters disallowed for all strings mentioned above, U+000A (LF) and U+000D
+(CR) may not occur literally in basic strings. Backslash and quotation mark may
+only occur literally if they are part of a valid escape sequence.

```toml
str = "I'm a string. \"You can quote me\". Name\tJos\u00E9\nLocation\tSF."
@@ -340,10 +348,10 @@ str3 = """\
"""
```

-Any Unicode character may be used except those that must be escaped: backslash
-and the control characters other than tab, line feed, and carriage return
-(U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, U+007F). Carriage returns
-(U+000D) are only allowed as part of a newline sequence.
+In addition to the characters disallowed for all strings mentioned above, U+000D
+(CR) is allowed only as part of a newline sequence U+000D U+000A (CRLF). As
+with basic strings, backslash and quotation mark may only occur literally if
+they are part of a valid escape sequence.
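
The CR rule for multi-line basic strings can be checked with a small Python sketch. It is only an illustration, not part of the PR or of any TOML parser; `BARE_CR` and `has_bare_cr` are hypothetical names.

```python
import re

# A CR (U+000D) that is not immediately followed by LF (U+000A) is "bare",
# i.e. it is not part of a CRLF newline sequence.
BARE_CR = re.compile(r"\r(?!\n)")


def has_bare_cr(text):
    """Return True if `text` contains a CR that is not part of CRLF."""
    return BARE_CR.search(text) is not None
```

A validator following the proposed wording would presumably reject a multi-line string body for which `has_bare_cr` returns `True`, while accepting bodies that use only LF or CRLF line endings.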

You can write a quotation mark, or two adjacent quotation marks, anywhere inside
a multi-line basic string. They can also be written just inside the delimiters.
@@ -405,9 +413,12 @@ apos15 = "Here are fifteen apostrophes: '''''''''''''''"
str = ''''That,' she said, 'is still pointless.''''
```

-Control characters other than tab are not permitted in a literal string. Thus,
-for binary data, it is recommended that you use Base64 or another suitable ASCII
-or UTF-8 encoding. The handling of that encoding will be application-specific.
+As in all strings, most control characters are not permitted even in a literal
+string or multi-line literal string. Thus, these literal strings are not suited
+for representing blobs of binary data. It is recommended that you use Base64 or
+another suitable ASCII or UTF-8 encoding. The handling of that encoding will be
+application-specific.
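
A minimal Python sketch of the Base64 advice, assuming the application chooses Base64 and stores the result in a basic string. The key name `payload` and the sample bytes are made up for illustration.

```python
import base64

# Arbitrary bytes, including control characters that no TOML string may hold.
blob = bytes([0x00, 0x01, 0x7F, 0xFF, 0x0A])

# Standard Base64 output uses only ASCII letters, digits, '+', '/', and '='.
encoded = base64.b64encode(blob).decode("ascii")

# Hypothetical key name; the application decides how to interpret the value.
toml_line = f'payload = "{encoded}"'

# After parsing the TOML document, the application decodes the value itself.
decoded = base64.b64decode(encoded)
assert decoded == blob
```

Because the encoded text contains no quotation marks, backslashes, or control characters, it would fit in a basic or a literal string without escaping.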


Integer
-------
@@ -763,7 +774,8 @@ member_since = 1999-08-04

Dotted keys create and define a table for each key part before the last one. Any
such table must have all its key/value pairs defined under the current `[table]`
-header, or in the root table if defined before all headers, or in one inline table.
+header, or in the root table if defined before all headers, or in one inline
+table.

```toml
fruit.apple.color = "red"
@@ -1008,6 +1020,7 @@ When transferring TOML files over the internet, the appropriate MIME type is
ABNF Grammar
------------

-A formal description of TOML's syntax is available, as a separate [ABNF file][abnf].
+A formal description of TOML's syntax is available, as a separate
+[ABNF file][abnf].

[abnf]: ./toml.abnf