Commit 7257839

Author: José Valim
Improve the docs regarding UTF-8 encoded atoms
1 parent d14cae9 commit 7257839

2 files changed: +18 -32 lines changed

erts/doc/src/erl_ext_dist.xml

+5 -10
@@ -119,16 +119,11 @@
 <tcaption>Compressed Data Format when Expanded</tcaption></table>
 <marker id="utf8_atoms"/>
 <note>
-<p>As from ERTS 5.10 (OTP R16) support
-for UTF-8 encoded atoms has been introduced in the external format.
-However, only characters that can be encoded using Latin-1 (ISO-8859-1)
-are currently supported in atoms. The support for UTF-8 encoded atoms
-in the external format has been implemented to be able to support
-all Unicode characters in atoms in <em>some future release</em>.
-Until full Unicode support for atoms has been introduced,
-it is an <em>error</em> to pass atoms containing
-characters that cannot be encoded in Latin-1, and <em>the behavior is
-undefined</em>.</p>
+<p>As from ERTS 9.0 (OTP 20), UTF-8 encoded atoms may contain any Unicode
+character. Although the support for UTF-8 encoded atoms in the external
+format is available since ERTS 5.10 (OTP R16), passing atoms that cannot
+be encoded in Latin-1 is an <em>error</em> in versions earlier than
+Erlang/OTP 20, and <em>the behavior is undefined</em>.</p>
 <p>When distribution flag <seealso marker="erl_dist_protocol#dflags">
 <c>DFLAG_UTF8_ATOMS</c></seealso> has been exchanged between both nodes
 in the <seealso marker="erl_dist_protocol#distribution_handshake">
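
The note changed in this hunk concerns atoms in the external term format. As a minimal sketch, assuming Erlang/OTP 20 or later, the following round-trips an atom containing a character above 255 through term_to_binary/1 and binary_to_term/1. The module and function names are hypothetical and are not part of this commit.

```erlang
%% Minimal sketch; assumes Erlang/OTP 20 or later.
%% The module name and function name are hypothetical.
-module(ext_atom_demo).
-export([roundtrip/0]).

%% Encodes an atom containing U+0400 in the external term format and
%% decodes it again. Returns true if the decoded atom is the original.
roundtrip() ->
    %% Build the atom from a code point so the source file stays ASCII.
    Atom = list_to_atom([1024]),
    Encoded = term_to_binary(Atom),
    Atom =:= binary_to_term(Encoded).
```

On releases earlier than OTP 20 the call to list_to_atom/1 is expected to fail instead, since those releases restricted atoms to Latin-1.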

erts/doc/src/erlang.xml

+13 -22
@@ -325,16 +325,11 @@ Z = erlang:adler32_combine(X,Y,iolist_size(Data2)).</code>
 is <c>latin1</c>, one byte exists for each character
 in the text representation. If <c><anno>Encoding</anno></c> is
 <c>utf8</c> or
-<c>unicode</c>, the characters are encoded using UTF-8
-(that is, characters from 128 through 255 are
-encoded in two bytes).</p>
+<c>unicode</c>, the characters are encoded using UTF-8 where
+characters may require multiple bytes.</p>
 <note>
-<p><c>atom_to_binary(<anno>Atom</anno>, latin1)</c> never
-fails, as the text representation of an atom can only
-contain characters from 0 through 255. In a future release,
-the text representation
-of atoms can be allowed to contain any Unicode character and
-<c>atom_to_binary(<anno>Atom</anno>, latin1)</c> then fails if the
+<p>As from Erlang/OTP 20, atoms can contain any Unicode character
+and <c>atom_to_binary(<anno>Atom</anno>, latin1)</c> may fail if the
 text representation for <c><anno>Atom</anno></c> contains a Unicode
 character &gt; 255.</p>
 </note>
@@ -402,13 +397,11 @@ Z = erlang:adler32_combine(X,Y,iolist_size(Data2)).</code>
 translation of bytes in the binary is done.
 If <c><anno>Encoding</anno></c>
 is <c>utf8</c> or <c>unicode</c>, the binary must contain
-valid UTF-8 sequences. Only Unicode characters up
-to 255 are allowed.</p>
+valid UTF-8 sequences.</p>
 <note>
-<p><c>binary_to_atom(<anno>Binary</anno>, utf8)</c> fails if
-the binary contains Unicode characters &gt; 255.
-In a future release, such Unicode characters can be allowed and
-<c>binary_to_atom(<anno>Binary</anno>, utf8)</c> does then not fail.
+<p>As from Erlang/OTP 20, <c>binary_to_atom(<anno>Binary</anno>, utf8)</c>
+is capable of encoding any Unicode character. Earlier versions would
+fail if the binary contained Unicode characters &gt; 255.
 For more information about Unicode support in atoms, see the
 <seealso marker="erl_ext_dist#utf8_atoms">note on UTF-8
 encoded atoms</seealso>
@@ -419,9 +412,7 @@ Z = erlang:adler32_combine(X,Y,iolist_size(Data2)).</code>
 > <input>binary_to_atom(&lt;&lt;"Erlang"&gt;&gt;, latin1).</input>
 'Erlang'
 > <input>binary_to_atom(&lt;&lt;1024/utf8&gt;&gt;, utf8).</input>
-** exception error: bad argument
-in function binary_to_atom/2
-called as binary_to_atom(&lt;&lt;208,128&gt;&gt;,utf8)</pre>
+'Ѐ'</pre>
 </desc>
 </func>
 
@@ -2401,10 +2392,10 @@ os_prompt%</pre>
 <desc>
 <p>Returns the atom whose text representation is
 <c><anno>String</anno></c>.</p>
-<p><c><anno>String</anno></c> can only contain ISO-latin-1
-characters (that is, numbers &lt; 256) as the implementation does not
-allow Unicode characters equal to or above 256 in atoms.
-For more information on Unicode support in atoms, see
+<p>As from Erlang/OTP 20, <c><anno>String</anno></c> may contain
+any Unicode character. Earlier versions allowed only ISO-latin-1
+characters as the implementation did not allow Unicode characters
+above 255. For more information on Unicode support in atoms, see
 <seealso marker="erl_ext_dist#utf8_atoms">note on UTF-8
 encoded atoms</seealso>
 in section "External Term Format" in the User's Guide.</p>
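
The behavior documented by the erlang.xml hunks above can be checked with a short module. This is a minimal sketch, assuming Erlang/OTP 20 or later. The module and function names are hypothetical, and the code point 1024 (U+0400) is taken from the binary_to_atom/2 example in the diff.

```erlang
%% Minimal sketch; assumes Erlang/OTP 20 or later.
%% The module name and function name are hypothetical.
-module(unicode_atom_demo).
-export([run/0]).

run() ->
    %% U+0400 encoded as UTF-8 is the two bytes 208,128.
    Bin = <<1024/utf8>>,
    Atom = binary_to_atom(Bin, utf8),
    %% Converting back with utf8 must return the original bytes
    %% (Bin is already bound, so this match acts as an assertion).
    Bin = atom_to_binary(Atom, utf8),
    %% list_to_atom/1 accepts the same code point and yields the same atom.
    Atom = list_to_atom([1024]),
    %% latin1 cannot represent code points above 255, so the call
    %% raises badarg, which is caught and returned as a plain atom.
    Latin1Result =
        try atom_to_binary(Atom, latin1)
        catch error:badarg -> badarg
        end,
    {Atom, Latin1Result}.
```

Calling unicode_atom_demo:run() returns a pair of the new atom and badarg. The second element shows the case in which atom_to_binary(Atom, latin1) fails, as described in the updated note.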
