Add new AtU8 beam chunk #1078

Conversation
What is meant by a UTF-8 code point? I thought that by the time you're in UTF-8 you no longer deal in code points, just in bytes? The Unicode Consortium does specify that lengths can be counted in one of four ways. From the FAQ, using the string aनि亜𐂃 (U+0061, U+0928, U+093F, U+4E9C, U+10083).

The two latter are not encoding-specific. What is meant exactly by UTF-8 code points? Just code points, or is it a bad term for code units (and hence bytes)?
Hey @ferd, I meant literally code points, as in your definition. It is also worth mentioning that the implementation today (prior to this patch) also checks code points. This can be verified in the source code or by trying a quick example:

3> binary_to_atom(binary:copy(<<"é"/utf8>>, 255), utf8).
ééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééé
4> byte_size(binary:copy(<<"é"/utf8>>, 255)).
510
5> binary_to_atom(binary:copy(<<"é"/utf8>>, 256), utf8).
** exception error: a system limit has been reached
in function binary_to_atom/2
called as binary_to_atom(<<"éééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééé"/utf8...>>,
                          utf8)

So this pull request is currently backwards incompatible (hence the pending tasks). The question is mostly how to implement the new binary_to_atom efficiently.
@@ -6332,7 +6355,7 @@ erts_make_stub_module(Process* p, Eterm Mod, Eterm Beam, Eterm Info)
         goto error;
     }
     define_file(stp, "atom table", ATOM_CHUNK);
-    if (!load_atom_table(stp)) {
+    if (!load_atom_table(stp, ERTS_ATOM_ENC_LATIN1)) {
You'll need to go through the same if (stp->chunks[UTF8_ATOM_CHUNK].size > 0) ... charade down here, or code:make_stub_module/3 (and thus HiPE) will break. There are some tests for it in erts/emulator/test/code_SUITE*
Thank you. I have run the code suite, fixed this and the remaining failing tests, and updated this PR.
Patch has passed first testings and has been assigned to be reviewed. (I am a script, I am not human.)
We will not have time to do a thorough review until OTP 19.0 has been released, but here is some quick feedback: the test case
Not modifying
@bjorng Can we name the chunk 'JOSÉ'? PLS.
We will take a closer look at this pull request when you have fixed the failing test case mentioned above and rebased it on the latest master.
I think you interpreted that incorrectly. Currently only UTF-8 encoded atoms that can be converted to Latin-1 are supported, so it just counts the number of Latin-1 code points, which equals the number of Latin-1 code units, which equals the number of bytes in the result.
@bjorng I will allocate time in the next weeks to work on all of my pending pull requests. :D

While talking to @nox, we came up with another solution for the second bullet above. @nox commented that it would be much simpler to count bytes instead of counting code points. However, if we count bytes and keep the limit of 255 bytes, it would be backwards incompatible, because an atom made only of the letter 'é' (as in the example above) would be limited to 127 characters instead of 255.

However, if we double the maximum atom size, we could keep the conceptually simpler and faster code that only counts bytes and remain backwards compatible. From the Elixir perspective, increasing the atom limit would be a positive change: since function names must be atoms, we sometimes reach this limit when building functions dynamically.

@bjorng thoughts on this suggestion?
I think that the BIFs that create atoms should count code points, and let the limit be 255 code points. When converting from a list, the number of code points is the number of elements in the list (assuming that we don't try to combine code points that can be combined). The chunk in the BEAM file can give the size of each atom in bytes, and the loader can trust the compiler not to create overlong atoms. (There could be a sanity check to abort the loading if any atom is longer than 1024 bytes.)
Rationale for counting code points: we don't want to handle support requests from Klingon users who complain that they can only put 85 Klingon characters into their atoms, while speakers of most west-European languages can have 255 characters in their atoms.
What about following what ROK proposed in EEP20? 2 bytes for byte length, 2 bytes for code point count. |
(force-pushed 8b5121e to 7367a1f)
-Eterm
-erts_atom_put(const byte *name, int len, ErtsAtomEncoding enc, int trunc)
+int
+erts_atom_put_index(const byte *name, int len, ErtsAtomEncoding enc, int trunc)
@bjorng I have introduced this function that returns the atom index instead of the atom, so we have control upstream over whether we should error with badarg or system_limit. I have two questions I would love your input on:

1. Should the name have the erts_ prefix?
2. Should we use #define ATOM_BAD_ENCODING and #define ATOM_MAX_CHARS instead of using integers? If so, is it fine to define those in atom.h?

Notice the new function and its return values are used in binary_to_atom (erl_unicode.c). The PR has also been rebased. Once those questions are answered, I will squash everything and we should be good to go.
The convention for the erts_ prefix is to use it for global functions. Older functions, or some very often used functions, e.g. size_object, do not use it though.
Answer for 2: Yes, and yes.
(force-pushed 7367a1f to e426d18)
@bjorng This is ready for review. I have rebased and addressed all feedback. I have run the emulator, compiler and stdlib test suites with the following results:

I have also added a test that compiles a module from the Erlang Abstract Format with utf8 atoms, to ensure the whole compilation chain works, and also a test for the r19 option.
(force-pushed e426d18 to f4424da)
Three test cases fail. I updated the primary bootstrap and compiled from a clean repository.
How do I update the primary bootstrap? I remember doing a clean build, but I did not touch the bootstrap files. Thanks!
I found and fixed another minor bug. Please squash and rebase on latest master.
(force-pushed 4bb624c to b58cd32)
Done and done! By the way, the compile-time memory reduction work I had to rebase on may help Elixir too. 👍
Can you please rebase again? We will test the branch in our daily builds again to see that there are no remaining failed test cases.
(force-pushed b58cd32 to d14cae9)
@bjorng done!
This branch no longer seems to cause any problems in our daily builds. We are thinking about merging it soon. More work is probably needed in some applications, for example to ensure that "~ts" is used as the format specifier when an atom is displayed, but we can do that in separate branches. However, I think that the documentation should be updated in this branch to reflect the changes, at least for the atom BIFs. Do you think you could update the documentation?
I will push the docs soon. :) Two questions:
I have pushed a separate commit with docs. Once review is done, I can squash it all together.
I agree. Let me know what your decisions are regarding 1 and 2, and I will gladly add tests and improve the docs.
There may be some time before we can reach a decision. We will merge this branch and do further corrections/improvements in separate pull requests. Please squash the commits. When you have done that, we will run the branch once more in our daily builds and then merge it.
The new chunk stores atoms encoded in UTF-8. beam_lib has also been modified to handle the new 'utf8_atoms' attribute while the 'atoms' attribute may be a missing chunk from now on. The binary_to_atom/2 BIF can now encode any utf8 binary with up to 255 characters. The list_to_atom/1 BIF can now accept codepoints higher than 255 with up to 255 characters (thanks to Björn Gustavsson).
(force-pushed 7257839 to 26b59df)
Squashed and pushed.
Like renaming the chunk to 'JOSÉ', right?
🎉 |
Support for decompiling
The new chunk stores atoms encoded in UTF-8. This is still work in progress and the following tasks are still pending:

- Add the compile option r19 that will compile atoms to the old "Atom" chunk, as mentioned by @bjorng in the mailing list.
- Support 255 UTF-8 code points in binary_to_atom instead of 255 bytes. The issue is that calculating the number of characters is linear in the binary size. Although erts_atom_put performs such a calculation, it returns a NON-VALUE for both invalid-encoding and large-binary errors, while the test suite expects badarg for invalid encoding and system_limit for large binaries. To solve this, we can either: ... in the binary_to_atom BIF and continue raising a system limit error (effectively traversing the binary twice).

Thoughts? Any other options? Once this is changed, the docs shall be properly updated.

Feedback request:

- beam_lib has been modified to support a new 'utf8_atoms' chunk that maps to the new "AtU8" chunk. This means the old "Atom" chunk can now be missing, although it is straightforward to handle with a case statement.
- list_to_atom has not been modified and it will still fail if given a character above 255. The reason I have decided not to change list_to_atom is that its current documentation does not say it may support characters above 255 in the future (while binary_to_atom's documentation has a very explicit warning about that).