compiler: Support long UTF-8 encoded atoms #8913

Merged (1 commit) on Oct 10, 2024

@bjorng bjorng commented Oct 8, 2024

Support for atoms containing any Unicode code point was added in Erlang/OTP 20 (#1078).

After that change, an atom can contain up to 255 Unicode code points. However, atoms used in Erlang source code are still limited to 255 bytes, because the atom table in the BEAM file has only a single byte for holding the length in bytes of the atom text. For instance, the 🟦 character has a four-byte encoding (<<240,159,159,166>>), so 64 such characters take 256 bytes, meaning that Erlang source code containing a literal atom consisting of 64 or more such characters cannot be compiled.
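
The byte arithmetic behind that limit can be checked with a small sketch. The module name `long_atom_demo` and the function `atom_text_bytes/1` are made up for illustration; only `lists:duplicate/2`, `unicode:characters_to_binary/1`, and `byte_size/1` are standard calls.

```erlang
-module(long_atom_demo).
-export([atom_text_bytes/1]).

%% Hypothetical helper: size in bytes of the UTF-8 encoding of an atom
%% text made of N copies of U+1F7E6 (the 🟦 character).
atom_text_bytes(N) ->
    Text = lists:duplicate(N, 16#1F7E6),
    %% characters_to_binary/1 encodes a list of code points as UTF-8.
    byte_size(unicode:characters_to_binary(Text)).
```

With 63 characters the text is 252 bytes and still fits in a one-byte length field; with 64 characters it is 256 bytes and does not.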

This pull request changes the atom table in BEAM files to use a variable length encoding for the length of each atom. For atoms up to 15 bytes, the length is encoded in one byte. The header for the atom table is also changed to indicate that the new encoding of lengths is used. Attempting to load a BEAM file compiled with Erlang/OTP 28 in Erlang/OTP 27 or earlier will result in the following error message:
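
As a rough illustration of what a variable length encoding with a one-byte form for lengths up to 15 can look like, here is a minimal sketch. It assumes a scheme modelled on the compact term encoding that BEAM files already use for instruction operands; the description above does not say which scheme the pull request actually uses, so the bit layout, the module name `varlen_demo`, and the function `encode_length/1` are all assumptions.

```erlang
-module(varlen_demo).
-export([encode_length/1]).

%% Illustrative only, not necessarily the format used by the compiler.
%% Lengths 0..15: one byte, value in the upper four bits.
encode_length(N) when is_integer(N), N >= 0, N < 16 ->
    <<(N bsl 4)>>;
%% Lengths 16..2047: two bytes. The three high bits of the value go in
%% the top of the first byte, bit 2#1000 flags the two-byte form, and
%% the low eight bits go in the second byte.
encode_length(N) when is_integer(N), N >= 16, N < 16#800 ->
    <<(((N bsr 3) band 2#11100000) bor 2#00001000), (N band 16#FF)>>.
```

Under this assumed layout two bytes are enough for any atom: 255 code points of at most four bytes each give at most 1020 bytes, which is below 2048. Larger values are deliberately left unhandled in the sketch.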

1> l(t).
=ERROR REPORT==== 8-Oct-2024::08:49:01.750424 ===
beam/beam_load.c(150): Error loading module t:
  corrupt atom table

{error,badfile}

beam_lib is updated to handle the new format. External tools that use beam_lib:chunks(Beam, [atoms]) to read the atom table will continue to work. External tools that do their own parsing of the atom table will need to be updated.
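
For tools that want to stay independent of the on-disk format, reading the atom table through `beam_lib` might look like the sketch below. The module name `atom_table_demo` and the function `atoms/1` are hypothetical; `beam_lib:chunks/2` and the shape of its return value are the documented API.

```erlang
-module(atom_table_demo).
-export([atoms/1]).

%% Hypothetical helper: return the atoms of a BEAM file in index order.
%% Beam may be a module name, a file name, or a binary with BEAM code.
atoms(Beam) ->
    case beam_lib:chunks(Beam, [atoms]) of
        {ok, {_Module, [{atoms, IndexedAtoms}]}} ->
            %% IndexedAtoms is a list of {Index, Atom} pairs.
            [Atom || {_Index, Atom} <- lists:sort(IndexedAtoms)];
        {error, beam_lib, Reason} ->
            {error, Reason}
    end.
```

For example, `atom_table_demo:atoms(code:which(lists))` should return a list that contains the atom `lists`. Because the parsing is left to `beam_lib`, code like this does not depend on how the lengths are encoded.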

@bjorng bjorng added team:VM Assigned to OTP team VM enhancement testing currently being tested, tag is used by OTP internal CI labels Oct 8, 2024
@bjorng bjorng self-assigned this Oct 8, 2024
github-actions bot commented Oct 8, 2024

CT Test Results

     5 files    538 suites   1h 46m 46s ⏱️
 4 247 tests 4 154 ✅  93 💤 0 ❌
10 049 runs  9 932 ✅ 117 💤 0 ❌

Results for commit 04b168d.


@bjorng bjorng requested a review from jhogberg October 8, 2024 07:16
@bjorng bjorng force-pushed the bjorn/long-utf8-atoms/OTP-19285 branch from 0b18e23 to 04b168d Compare October 8, 2024 11:01
@bjorng bjorng merged commit 26d72b4 into erlang:master Oct 10, 2024
19 checks passed
@bjorng bjorng deleted the bjorn/long-utf8-atoms/OTP-19285 branch October 10, 2024 13:16
bettio added a commit to atomvm/AtomVM that referenced this pull request Oct 13, 2024
CI: build-and-test: disable OTP master

A recent change made it impossible to use BEAM files compiled with OTP master, so
let's disable it.

See also #1321 and erlang/otp#8913

These changes are made under both the "Apache 2.0" and the "GNU Lesser General
Public License 2.1 or later" license terms (dual license).

SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later