| title | description | published | date | tags | editor | dateCreated |
|---|---|---|---|---|---|---|
| Non-ASCII characters in RFCXML | | true | 2021-12-16 04:45:35 UTC | | markdown | 2021-12-16 02:40:30 UTC |
The use of non-ASCII characters in RFCXML is detailed in RFC 7997. Your file encoding must be set as UTF-8 (the default).
Non-ASCII characters in RFCXML (and I-Ds in general) may appear within the body of the document. The <u> element is required where non-ASCII characters are needed for correct protocol operation.
- The <author> and <contact> elements have fullname, initials, and surname attributes that can hold non-ASCII characters, as well as asciiFullname, asciiInitials, and asciiSurname attributes that hold the ASCII equivalents of non-ASCII characters that are not in the Unicode Latin blocks.
- Postal address elements <street>, <city>, <region>, <country>, and <email> also have an ascii attribute to hold the ASCII equivalent, which will also appear in the output format.
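As a rough illustration of how the native and ASCII values pair up, the Python sketch below parses a made-up <author> entry with the standard library and reads both forms. The name, the address, and the helper functions `name_pair` and `address_pairs` are assumptions for illustration only; this is not how xml2rfc itself processes these attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical author entry; the name and address are invented for illustration.
AUTHOR_XML = """
<author fullname="Γιώργος Παπαδόπουλος" asciiFullname="Giorgos Papadopoulos"
        initials="Γ." asciiInitials="G."
        surname="Παπαδόπουλος" asciiSurname="Papadopoulos">
  <address>
    <postal>
      <city ascii="Athens">Αθήνα</city>
      <country ascii="Greece">Ελλάδα</country>
    </postal>
  </address>
</author>
"""

def name_pair(author, attr):
    """Return (native, ascii) for a name attribute such as 'fullname'."""
    native = author.get(attr)
    ascii_attr = "ascii" + attr[0].upper() + attr[1:]  # fullname -> asciiFullname
    # Fall back to the native value when no ASCII equivalent is given.
    return native, author.get(ascii_attr, native)

def address_pairs(author):
    """Yield (tag, native text, ascii equivalent) for each child of <postal>."""
    for elem in author.iterfind("./address/postal/*"):
        yield elem.tag, elem.text, elem.get("ascii", elem.text)

author = ET.fromstring(AUTHOR_XML)
print(name_pair(author, "fullname"))
for row in address_pairs(author):
    print(row)
```

Falling back to the native value when no ASCII attribute is present is a choice made for this sketch, suitable for names that are already ASCII.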
When non-ASCII characters are needed for correct protocol operation, they must be wrapped in the <u> element, with the format attribute specifying how they are represented.
The simplified format consists of dash-separated keywords, where each keyword represents a possible expansion of the Unicode character or string. For example, use <u format="lit-num-name">foo</u> to expand the text to its literal value, code point values, and code point names.
A combination of up to 3 of the following keywords may be used, separated by dashes: "num", "lit", "name", "ascii", "char". The keywords are expanded as follows and combined, with the second and third enclosed in parentheses if present:
- "ascii" - The value of the 'ascii' attribute on the <u> element
- "char" - The literal element text, without quotes
- "lit" - The literal element text, enclosed in quotes
- "name" - The Unicode name(s) of the element text
- "num" - The numeric value(s) of the element text, in U+1234 notation
In order to ensure that no specification mistakes can result from rendering methods that cannot render all Unicode code points, "num" MUST always be part of the specified format.
The following RFCXML:
<t>Temperature changes are indicated by the character <u>Δ</u></t>
generates the following output, depending on the value of the format attribute:
- format="num-lit":
Temperature changes are indicated by the character U+0394 ("Δ")
- format="num-name":
Temperature changes are indicated by the character U+0394 (GREEK CAPITAL LETTER DELTA)
- format="num-lit-name":
Temperature changes are indicated by the character U+0394 ("Δ", GREEK CAPITAL LETTER DELTA)
- format="num-name-lit":
Temperature changes are indicated by the character U+0394 (GREEK CAPITAL LETTER DELTA, "Δ")
- format="name-lit-num":
Temperature changes are indicated by the character GREEK CAPITAL LETTER DELTA ("Δ", U+0394)
- format="lit-name-num":
Temperature changes are indicated by the character "Δ" (GREEK CAPITAL LETTER DELTA, U+0394)
The default value is "lit-name-num".
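The keyword rules and the example outputs above can be mirrored in a short Python sketch. It is an illustration under stated assumptions, not xml2rfc's implementation: the function names are hypothetical, and joining several Unicode names with a comma is a guess, since the text above only specifies the single-character case.

```python
import unicodedata

def expand_keyword(keyword, text, ascii_value=None):
    """Expand one keyword for the text enclosed in <u>."""
    if keyword == "num":
        return " ".join(f"U+{ord(c):04X}" for c in text)
    if keyword == "name":
        # Joining several names with ", " is an assumption of this sketch.
        return ", ".join(unicodedata.name(c, "<unnamed>") for c in text)
    if keyword == "lit":
        return f'"{text}"'
    if keyword == "char":
        return text
    if keyword == "ascii":
        if ascii_value is None:
            raise ValueError('"ascii" needs an ascii attribute value')
        return ascii_value
    raise ValueError(f"unknown keyword: {keyword}")

def expand_simplified(fmt, text, ascii_value=None):
    """Combine up to three keywords; the second and third go in parentheses."""
    keywords = fmt.split("-")
    if len(keywords) > 3:
        raise ValueError("at most three keywords may be combined")
    if "num" not in keywords:
        raise ValueError('"num" must be part of the format')
    first, *rest = (expand_keyword(k, text, ascii_value) for k in keywords)
    return f"{first} ({', '.join(rest)})" if rest else first

for fmt in ("num-lit", "num-name-lit", "lit-name-num"):
    print(f"{fmt}: {expand_simplified(fmt, 'Δ')}")
```

The three printed lines should correspond to the expansions listed above for the same format values.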
Expansion of <u> multi-codepoint strings
If the <u> element encloses a sequence of Unicode code points rather than a single one, the rendering reflects this. For example:
<u format="num-lit">ᏚᎢᎵᎬᎢᎬᏒ</u>
will be expanded to "U+13DA U+13A2 U+13B5 U+13AC U+13A2 U+13AC U+13D2 ("ᏚᎢᎵᎬᎢᎬᏒ")".
Unicode characters in document text which are not enclosed in <u> will be replaced with a question mark (?) and a warning will be issued.
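The per-code-point listing, and the question-mark fallback described above for text outside <u>, can be sketched in Python as follows. Both helper names are hypothetical, and the warning shown here only stands in for whatever diagnostic the real tool emits.

```python
import warnings

def code_points(text):
    """List every code point in U+XXXX notation, separated by spaces."""
    return " ".join(f"U+{ord(c):04X}" for c in text)

def ascii_fallback(text):
    """Replace each non-ASCII character with '?' and warn if any were found."""
    replaced = "".join(c if ord(c) < 128 else "?" for c in text)
    if replaced != text:
        warnings.warn("non-ASCII characters outside <u> were replaced with '?'")
    return replaced

print(code_points("ᏚᎢᎵᎬᎢᎬᏒ"))
print(ascii_fallback("Temperature change Δ"))
```

The first call lists one U+XXXX value per code point of the string, in order, which is the "num" part of the expansion quoted above.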
Non-simplified <u> format specifications
To cover cases where the simplified format above is insufficient, without giving up the requirement that the number of a code point must always be rendered, the format attribute also accepts a full format string. This format uses placeholders, each consisting of one of the keywords above enclosed in curly braces; outside the placeholders, any ASCII text is permissible. For example,
The <u format="{lit} character ({num})">Δ</u>
will be rendered as
The "Δ" character (U+0394).
As with the simplified format, "num" MUST always be part of the specified format, to ensure that no specification mistakes can result from rendering methods that cannot render all Unicode code points.
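A full format string behaves much like a template with named placeholders. The Python sketch below models it with `str.format`; this is an assumption made for illustration (the real tool may parse placeholders differently), and `expand_full` is a hypothetical name.

```python
import unicodedata

def expand_full(fmt, text, ascii_value=None):
    """Fill {num}, {lit}, {name}, {char} and {ascii} placeholders in fmt."""
    if "{num}" not in fmt:
        raise ValueError('"{num}" must be part of the format')
    values = {
        "num": " ".join(f"U+{ord(c):04X}" for c in text),
        "name": ", ".join(unicodedata.name(c, "<unnamed>") for c in text),
        "lit": f'"{text}"',
        "char": text,
    }
    if ascii_value is not None:
        values["ascii"] = ascii_value
    # str.format raises KeyError for a placeholder that has no value,
    # for example {ascii} when no ascii attribute was supplied.
    return fmt.format(**values)

print("The " + expand_full("{lit} character ({num})", "Δ") + ".")
```

Checking for `{num}` before substitution mirrors the requirement stated above that the code point number is always part of the output.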
Split expansion of <u> elements
Some cases cannot be handled with either the simplified or the full <u> format specification. One is exemplified in Table 1 of the CSS sample document at https://rfc-format.github.io/draft-iab-rfc-css-bis/sample2-v2.html#s-3. Rendering this with <u> elements requires that the non-ASCII content be rendered in one place (a table cell in one column) while the expansion is rendered in another (a cell in a different column). To provide for this, the expansion of <u> is modified when it is referenced by an <xref>. This table, with <u> elements referenced by <xref> instances:
<table>
<name>A Sample of Legal Nicknames</name>
<thead>
<tr>
<th>#</th>
<th>Nickname</th>
<th>Output for comparison</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>&lt;Foo&gt;</td>
<td>&lt;foo&gt;</td>
</tr>
<tr>
<td>2</td>
<td>&lt;foo&gt;</td>
<td>&lt;foo&gt;</td>
</tr>
<tr>
<td>3</td>
<td>&lt;Foo Bar&gt;</td>
<td>&lt;foo bar&gt;</td>
</tr>
<tr>
<td>4</td>
<td>&lt;foo bar&gt;</td>
<td>&lt;foo bar&gt;</td>
</tr>
<tr>
<td>5</td>
<td>
&lt;
<u format="name-num" anchor="greek-upper-sigma">Σ</u>
&gt;
</td>
<td> <xref target="greek-upper-sigma" /> </td>
</tr>
<tr>
<td>6</td>
<td>
&lt;
<u format="name-num" anchor="greek-lower-sigma">σ</u>
&gt;
</td>
<td> <xref target="greek-lower-sigma" /> </td>
</tr>
<tr>
<td>7</td>
<td>
&lt;
<u format="name-num" anchor="greek-final-sigma">ς</u>
&gt;
</td>
<td> <xref target="greek-final-sigma" /> </td>
</tr>
<tr>
<td>8</td>
<td>
&lt;
<u format="name-num" anchor="black-chess-king">♚</u>
&gt;
</td>
<td>
<xref target="black-chess-king" format="default"/>
</td>
</tr>
<tr>
<td>9</td>
<td>
&lt;Richard
<u format="{char}&gt; ({num})" anchor="richard-iv">Ⅳ</u>
&gt;
</td>
<td>&lt;richard iv&gt;</td>
</tr>
</tbody>
</table>
comes out as shown below:
| # | Nickname | Output for comparison |
| --: | :--------------------- | :-------------------------------------- |
| 1 | \<Foo\> | \<foo\> |
| 2 | \<foo\> | \<foo\> |
| 3 | \<Foo Bar\> | \<foo bar\> |
| 4 | \<foo bar\> | \<foo bar\> |
| 5 | \<Σ\> | GREEK CAPITAL LETTER SIGMA (U+03A3) |
| 6 | \<σ\> | GREEK SMALL LETTER SIGMA (U+03C3) |
| 7 | \<ς\> | GREEK SMALL LETTER FINAL SIGMA (U+03C2) |
| 8 | \<♚\> | BLACK CHESS KING (U+265A) |
| 9 | \<Richard Ⅳ\> (U+2163) | \<richard iv\> |
_Table 1: A Sample of Legal Nicknames_
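The split behaviour seen in rows 5 to 8 can be approximated in a few lines of Python: a <u> that is the target of an <xref> shows only its character in place, and the expansion appears where the <xref> sits. This is a sketch inferred from the table above, not xml2rfc's code; `expand`, `render_cells`, and the one-row sample are hypothetical, and only the simplified keywords that need no ascii attribute are handled.

```python
import unicodedata
import xml.etree.ElementTree as ET

# One row modelled on row 5 of the table above.
ROW_XML = """
<tr>
  <td>&lt;<u format="name-num" anchor="greek-upper-sigma">Σ</u>&gt;</td>
  <td><xref target="greek-upper-sigma"/></td>
</tr>
"""

def expand(fmt, text):
    """Expand a simplified format; later keywords go in parentheses."""
    values = {
        "num": " ".join(f"U+{ord(c):04X}" for c in text),
        "name": ", ".join(unicodedata.name(c, "<unnamed>") for c in text),
        "lit": f'"{text}"',
        "char": text,
    }
    first, *rest = [values[k] for k in fmt.split("-")]
    return f"{first} ({', '.join(rest)})" if rest else first

def render_cells(row):
    """Return the rendered text of each <td> in the row."""
    referenced = {x.get("target") for x in row.iter("xref")}
    anchors = {u.get("anchor"): u for u in row.iter("u") if u.get("anchor")}
    cells = []
    for td in row.iter("td"):
        out = td.text or ""
        for child in td:
            if child.tag == "u" and child.get("anchor") in referenced:
                out += child.text  # the expansion is shown at the <xref> instead
            elif child.tag == "u":
                out += expand(child.get("format", "lit-name-num"), child.text)
            elif child.tag == "xref":
                target = anchors[child.get("target")]
                out += expand(target.get("format", "lit-name-num"), target.text)
            out += child.tail or ""
        cells.append(out.strip())
    return cells

row = ET.fromstring(ROW_XML)
print(render_cells(row))
```

For the sample row, the two cells should correspond to the Nickname and Output for comparison columns of row 5.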