Commit 91df707

Merge pull request #670 from ndw/iss-668
Attempt to address the BOM
2 parents 6ebaafc + 1a3990b commit 91df707

3 files changed: +87 -3 lines changed

src/main/xml/bibliography.xml

+6
@@ -341,6 +341,12 @@ Internationalized Resource Identifiers (IRIs)</citetitle>.
 M. Duerst and M. Suignard, editors.
 Internet Engineering Task Force. January, 2005.</bibliomixed>

+<bibliomixed xml:id="rfc8259"><abbrev>RFC 8259</abbrev>
+<citetitle xlink:href="https://doi.org/10.17487/RFC8259">RFC 8259:
+The JavaScript Object Notation (JSON) Data Interchange Format.</citetitle>
+T. Bray, editor.
+Internet Engineering Task Force. December, 2017.</bibliomixed>
+
 <bibliomixed xml:id="unicodetr17"><abbrev>Unicode TR#17</abbrev>
 <citetitle xlink:href="https://unicode.org/reports/tr17/">Unicode Technical
 Report #17: Character Encoding Model</citetitle>.

steps/src/main/xml/references.xml

+1
@@ -22,6 +22,7 @@
 <bibliomixed xml:id="rfc3986"/>
 <bibliomixed xml:id="rfc4646"/>
 <bibliomixed xml:id="rfc4647"/>
+<bibliomixed xml:id="rfc8259"/>
 <bibliomixed xml:id="bcp47"/>
 <bibliomixed xml:id="bib.uuid"/>
 <bibliomixed xml:id="bib.sha"/>

steps/src/main/xml/steps/load.xml

+80 -3
@@ -102,18 +102,86 @@ the processor does not support DTD validation.</error></para>
 document is an XPath data model document consisting of a single text node.)</para>

 <para><error code="D0060">It is a <glossterm>dynamic error</glossterm> if the
-<option>content-type</option> specifies an encoding, which is not supported
-by the processor.</error></para>
+<option>content-type</option> specifies a charset (sometimes called the
+character encoding) that is not supported by the processor.</error></para>

 <para><impl>Text parameters are <glossterm>implementation-defined</glossterm>.
 </impl></para>

+<section xml:id="text-bom">
+<title>Byte order marks</title>
+
+<para>UTF-8 and UTF-16 inputs can begin with a byte order mark. The byte order
+mark is not considered part of the text and is not included in the text
+document.</para>
+
+<para>In order to identify the byte order mark, it is first necessary to
+identify the charset that is being
+used. The charset of an external resource is determined as follows:</para>
+
+<orderedlist>
+<listitem>
+<para>external charset information is used if available (for example, if the
+resource is loaded with HTTP or HTTPS and the server provided a charset), otherwise
+</para>
+</listitem>
+<listitem>
+<para>the charset from the content type if specified, otherwise
+</para>
+</listitem>
+<listitem>
+<para>the processor may use implementation-defined heuristics to determine the
+likely charset, otherwise
+</para>
+</listitem>
+<listitem>
+<para>UTF-8 is assumed.</para>
+</listitem>
+</orderedlist>
+
+<para>Processors <rfc2119>must</rfc2119> support UTF-8 and UTF-16. <impl>Support
+for other charsets is <glossterm>implementation-defined</glossterm>.</impl></para>
+
+<para>If the encoding is UTF-8, UTF-16, UTF-16LE, or UTF-16BE, a byte order mark
+may be present. (For any other encoding, there is no byte order mark and all of the text
+is returned.)</para>
+
+<itemizedlist>
+<listitem>
+<para>If the encoding is UTF-8 and the document begins with the bytes
+EF BB BF, those bytes are the byte order mark. They are discarded.</para>
+</listitem>
+<listitem>
+<para>If the encoding is UTF-16 and the document begins with the bytes
+FE FF or FF FE, those bytes are the byte order mark. They are discarded.</para>
+</listitem>
+<listitem>
+<para>If the encoding is UTF-16LE and the document begins with the bytes
+FF FE, those bytes are the byte order mark. They are discarded.</para>
+</listitem>
+<listitem>
+<para>If the encoding is UTF-16BE and the document begins with the bytes
+FE FF, those bytes are the byte order mark. They are discarded.</para>
+</listitem>
+<listitem>
+<para>If the encoding isn’t specified, but the file begins with a byte
+order mark (FE FF or FF FE), treat the charset as UTF-16 and
+discard the byte order mark.</para>
+</listitem>
+<listitem>
+<para>Otherwise, there is no byte order mark, nothing is discarded.
+</para>
+</listitem>
+</itemizedlist>
+
+</section>
 </section>

 <section xml:id="c.load.json">
 <title>Loading JSON data</title>

-<para>For a JSON media type, the content is loaded and parsed as JSON.</para>
+<para>For a JSON media type, the content is loaded and parsed as JSON
+<biblioref linkend="rfc8259"/>.</para>

 <para>The parameters specified for the <code>fn:parse-json</code> function
 in <biblioref linkend="xpath31-functions"/>
@@ -133,6 +201,15 @@ map contains an entry whose key is defined in the specification of
 <code>fn:parse-json</code> and whose value is not valid for that key, or if it contains
 an entry with the key fallback when the parameter <option>escape</option> with
 <literal>true()</literal> is also present.</error></para>
+
+<section xml:id="json-bom">
+<title>Byte order mark</title>
+
+<para>JSON data transmitted with the UTF-8 encoding may begin with a byte order
+mark. If it does, the byte order mark is discarded before parsing the
+input.</para>
+</section>
+
 </section>

 <section xml:id="c.load.html">
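
The byte order mark rules added to load.xml can be sketched in a few lines of Python. This is only an illustration of the prose in this commit, not code from any XProc processor. The function names (strip_bom, load_text, load_json) are hypothetical, and the implementation-defined heuristics step is left out. One assumption goes beyond the text: when no charset is specified and no UTF-16 byte order mark is found, the sketch falls back to UTF-8 and then applies the UTF-8 rule.

```python
import codecs
import json
from typing import Optional, Tuple

# UTF-16 byte order marks and the byte order each one implies.
UTF16_BOMS = {
    codecs.BOM_UTF16_BE: "UTF-16BE",  # FE FF
    codecs.BOM_UTF16_LE: "UTF-16LE",  # FF FE
}


def strip_bom(data: bytes, charset: Optional[str]) -> Tuple[bytes, str]:
    """Return (bytes without a leading BOM, charset name to decode with).

    charset=None means "not specified". Hypothetical helper, not a real API.
    """
    name = charset.upper() if charset else None
    if name in (None, "UTF-16"):
        # Either BOM is accepted; report the concrete byte order so the
        # caller can decode the remaining bytes correctly.
        for bom, endian in UTF16_BOMS.items():
            if data.startswith(bom):
                return data[len(bom):], endian
        if name is None:
            name = "UTF-8"  # assumption: final fallback of the ordered list
    if name == "UTF-8" and data.startswith(codecs.BOM_UTF8):  # EF BB BF
        return data[len(codecs.BOM_UTF8):], name
    if name == "UTF-16LE" and data.startswith(codecs.BOM_UTF16_LE):
        return data[2:], name
    if name == "UTF-16BE" and data.startswith(codecs.BOM_UTF16_BE):
        return data[2:], name
    # Any other charset, or no BOM present: nothing is discarded.
    return data, name


def load_text(data: bytes, external: Optional[str] = None,
              from_content_type: Optional[str] = None) -> str:
    """External charset information wins over the content-type charset."""
    payload, name = strip_bom(data, external or from_content_type)
    return payload.decode(name)


def load_json(data: bytes, charset: Optional[str] = "UTF-8"):
    """Discard the BOM before parsing, as the json-bom section describes."""
    payload, name = strip_bom(data, charset)
    return json.loads(payload.decode(name))
```

Returning the concrete byte order (UTF-16BE or UTF-16LE) alongside the stripped bytes matters in this sketch because the mark that identified the byte order is gone once it has been discarded. A UTF-16 input without a byte order mark is passed to the decoder unchanged, so its byte order is whatever the decoder assumes.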
