-
Notifications
You must be signed in to change notification settings - Fork 647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Unicode property escape support to RegExp #1295
Conversation
(I'm including this outside of the PR description, in case that's used for automation purposes.) Looking for Feedback
|
Hi! Thank you for this contribution, it must have been a lot of work. We would like to land this in Hermes, but since this is a lot of code, I hope you can bear through our review process, which can be very detail oriented. At first glance I see some things that likely need to be tweaked, like the static constructors of In any case, this is required ECMAScript functionality, so we are very willing to work with you to land this, and I hope you are willing to work with us.
As a first step, I think it would be definitely very helpful if you break this into commits, to make it easier to review and iterate on each. The contribution guideline to squash everything is generally well intentioned, because it assumes most contributions will be smaller and because Github does not provide the best UI for reviewing a stack of commits. But a sizable contribution like this really needs independent review of its parts. |
I'm more than happy to go through your review process, and look forward to it!
This is exactly the kind of feedback I was hoping for, thank you. I knew writing this that this wouldn't necessarily be the best approach, but it was fairly easy to understand and iterate on. I'd be glad to see this evolve into a better solution.
Of course, I'll get to that as soon as I can. |
040070d
to
187aa18
Compare
@tmikov I've restructured the original giant commit into several more digestible ones, I hope it helps make the review easier. |
Thanks for splitting this up, I think the first thing to do here is to try and reduce the size of the additional data so that we get a more realistic picture of the final size. The idea is to eliminate the need to emit static constructors to populate the data structures you're using, particularly To pack all of the data into
I've demonstrated these transformations on a small snippet in https://godbolt.org/z/Mvr7qKxnT, showing how the current data structures can be changed into this format. Notice that the generated output is dramatically smaller, while encoding the same information. As you make these changes, you can measure the effect they have on the final size. You can configure and build Hermes as follows:
To give you a rough sense of the numbers, the generated binary (on my M1 Mac) goes from 2.4MB before this PR to 2.6MB after it. Hopefully with these changes, the size gap will shrink considerably. |
a7830ed
to
62705ac
Compare
Thanks for the detailed feedback @neildhar, the godbolt output for those two snippets was really interesting! I've made the changes to I ran optimized build you suggested for main, this branch after the changes, and this branch before the changes.
The difference between main and this branch after the changes is 249664 bytes, which is about the difference you mentioned in your own builds. The difference between this branch before and after the changes is 21264 bytes, which is smaller than I expected (although I guess ~20KB of assembly just to initialize static data is a lot). Is this about the size difference you expected? I ran some rough calculations for the size of the generated Unicode data:
Seems like the ~200KB increase in binary size roughly matches up with the amount of raw data that has been added. As far as I can tell the code now matches your suggested structures, but I'd love to find out that I overlooked something obvious. What do you think the next steps are, trying to improve how the data is stored? |
Thanks for making these changes so quickly, and for being so receptive iterating on this! I agree that the improvement is somewhat smaller than I would have expected. The next step is probably to try and cut down dynamic relocations, it looks like they add tens of kilobytes to the size, and will also slow down startup. This will be a slightly more involved transformation, since we're trying to have as few pointers as possible in the constant data. Right now, every string and array pointer in the data needs a dynamic relocation to set the actual pointer at runtime, based on where the library is loaded into memory. The basic idea is to turn the array and string pointers into offsets instead of pointers. We need to merge all of the arrays of a given type into a single giant pool (same for strings). Then in each place where we currently have a pointer (like each This adds some marginal cost when accessing the data, but it represents the data much more efficiently (since the offsets will be smaller than pointers), and avoids the startup and size overhead of the dynamic relocations. I repurposed the same example to demonstrate how this would look: https://godbolt.org/z/novqf4PdK Note that the size win here will be less obvious in godbolt, but we should see an improvement in the final binary. I expect this to be on the same order as the code size reduction you observed from the previous change. In the case of strings, it will also be worth doing some basic deduplication, since many of the strings are repeated. This should get us fairly close to the bare minimum needed to encode this data, which will guide how to include this in the build, and then we can proceed with reviewing the actual logic. |
Thank you again for the excellent feedback! Near the beginning of this change, having a single large pool did occur to me but I didn't know about dynamic reallocations, so I decided to avoid the complexity and would probably not have had the 3-stage lookup anyway. Sorry about the longer turnaround on this iteration, between a busy day job and wrapping my head around how to restructure the Python logic, it just took me longer to get to. I ran the optimized build again and this is the new size:
Looks like a 75KB improvement, I ended up going with 16-bit values since none of the offset+size values were larger than 65535:
Interesting (but maybe not useful) things:
|
@neildhar Are there any other changes you'd like me to make, prior to moving on to reviewing the logic? |
@jonathanj Sorry for the delay, nope, I'll take a look at the actual logic next. Given the size regression, we'll also want to retain the option to disable this. There are two steps to doing this (which you can do in parallel while I review the rest):
|
Thanks @neildhar!
I ended up naming the option Do I need to specify this option in the gradle, podspec, Circle CI config, and other build files, like is done for
Done. There are still quite a few |
@jonathanj I think we will keep this option always enabled for OSS releases (we are much more sensitive to binary size internally). If we set it on to default, there is no need to worry about gradle, podspec, etc. About CircleCI: I think we should also keep it on (though I am curious what @neildhar thinks). |
I agree, it seems simplest to turn this on by default, and not override it in any of our build configs. We can specify different defaults when building internally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My apologies for the delay in getting to this.
I've reviewed the C++ changes, which mostly look good. It's awesome that you were able to familiarise yourself with our RegExp implementation and the nuances of this feature!
I have one correctness concern, and some organisational suggestions that should hopefully simplify things.
} | ||
|
||
// Canonicalize the property name. | ||
auto canonicalNameEntry = findNameMapEntry(nameMapStart, nameMapEnd, key); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm finding it a little hard to follow the naming here between name
, value
, and key
. Are there more precise terms we can use? Otherwise, a comment describing what each thing refers to would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't say I blame you, I think it's confusing too and would really appreciate finding some better names. Maybe I can explain the terms, and you can decide if there are better names or if perhaps a comment is more useful.
The RegExp Unicode property syntax accepts:
- Binary properties, where the presence of the property implies a value of true, e.g.
ASCII_Hex
,Emoji
, etc.- Additionally, all of
General_Category
is also allowed to be specified as a binary property, e.g.Lowercase_Letter
,N
,Symbol
, etc.
- Additionally, all of
- Properties with a value, where the value is not a binary choice, e.g.
Scripts=Latin
,Script_Extensions=Katakana
,General_Category=Symbol
, etc.
The "property name" and "property value" terminology comes from ES262.
The "key" terminology is my own, and is imprecise because it is used to resolve a name to its canonical form, by looking it up in canonicalPropertyNameMap_<CATEGORY>
, where CATEGORY
is determined as follows:
- If there is no
propertyValue
(in thePropertyName=PropertyValue
sense) given, thenCATEGORY
assumed to beBinaryProperty
- If that assumption is incorrect, and it's not in the binary property category, then try
GeneralCategory
- If that assumption is incorrect, and it's not in the binary property category, then try
- If there is a
propertyValue
, thenCATEGORY
is determined bypropertyName
(e.g.canonicalPropertyNameMap_Script
)
In both cases key
needs to be resolved according to the canonical name of the value, but in the binary property case the "value" can be thought of something like General_Category=VALUE
. I chose key
in the sense of value = dict[key]
but maybe a term like needle
better conveys the "we need to find this" idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Perhaps the best thing to do is to compute canonicalNameEntry
in each of the cases, so the logic is easier to follow. It would also avoid duplicating the lookup for binary properties. So for example:
NameMapEntry *canonicalNameEntry;
llvh::ArrayRef<RangeMapEntry> rangeMap;
if (propertyValue.empty()) {
// There was no property value, this is either a binary property or a value
// from General_Category, as per `LoneUnicodePropertyNameOrValue`.
if(canonicalNameEntry = findNameMapEntry(canonicalPropertyNameMap_BinaryProperty, propertyName))
rangeMap = unicodePropertyRangeMap_BinaryProperty;
else if(canonicalNameEntry = findNameMapEntry(canonicalPropertyNameMap_GeneralCategory, propertyName))
rangeMap = unicodePropertyRangeMap_GeneralCategory;
}
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great suggestion, thanks! It definitely helps localise the initial setup, before moving onto the range lookups.
I did have to do findMapEntry(llvh::ArrayRef(…), …);
to appease compiler, let me know if you have any thoughts on that.
I'm working through the generation script now, and have some suggestions that would make it much easier to review:
|
Download a significant portion of the Unicode database in order to generate codepoint tables, and property name/value canonicalization, for the Unicode 15.1.0 binary and non-binary property names/aliases supported by ES262. Certain properties that are not explicitly defined in the Unicode database are derived here, such as "Assigned", and "Any".
Resolves the property name/value to a Unicode range table, then directly adds all entries from that table to a CodePointSet. It also handles the inverted case (i.e. `\P`) where all ranges _not_ in a table are added to the CodePointSet.
This uses the newly named Unicode range tables for the same functionality, tables such as `UNICODE_LETTER` are left as-is, since they are more efficient as they are for their purpose.
In Unicode mode, RegExp is now able to parse `\p{...}` and `\P{...}` atom escapes, both in and out of character classes. Outside of Unicode mode, `\p` and `\P` are unnecessary escapings of literal `p` and `P` respectively. A new error type `InvalidPropertyName` was added, which is returned for all property-name related parsing/resolving issues. This seems to be consistent with other JS engines.
This means a script extension range can be specified as a single `ArrayRef` that includes the range for the original script.
Now, the class atom contains the ArrayRef for the codepoint ranges, and the range is validated, and that range is then used separately to add the codepoints to the bracket. This simplifies a lot of the validation logic, and requires fewer conditional returns as a result.
d2e98a1
to
339f5c9
Compare
/// Find a matching entry (such as \p NameMapEntry or \p RangeMapEntry) by | ||
/// matching a string \p name against the entry's \p name field. | ||
template <class T> | ||
const T *findNameMapEntry( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rename this to match the description, and mark it as static.
const T *findNameMapEntry( | |
static const T *findMapEntry( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I've done that with the commit that fixes up the logic flow in unicodePropertyRanges
.
} | ||
|
||
// Canonicalize the property name. | ||
auto canonicalNameEntry = findNameMapEntry(nameMapStart, nameMapEnd, key); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Perhaps the best thing to do is to compute canonicalNameEntry
in each of the cases, so the logic is easier to follow. It would also avoid duplicating the lookup for binary properties. So for example:
NameMapEntry *canonicalNameEntry;
llvh::ArrayRef<RangeMapEntry> rangeMap;
if (propertyValue.empty()) {
// There was no property value, this is either a binary property or a value
// from General_Category, as per `LoneUnicodePropertyNameOrValue`.
if(canonicalNameEntry = findNameMapEntry(canonicalPropertyNameMap_BinaryProperty, propertyName))
rangeMap = unicodePropertyRangeMap_BinaryProperty;
else if(canonicalNameEntry = findNameMapEntry(canonicalPropertyNameMap_GeneralCategory, propertyName))
rangeMap = unicodePropertyRangeMap_GeneralCategory;
}
...
This helps localise the logic/assignment before proceeding to the general range lookup part.
@neildhar Just checking in if there's anything I still need to address, or can help with for this review. |
@jonathanj Thanks for the ping, there isn't anything outstanding, let me import it so we can more easily do a quick review of the python changes |
@neildhar has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks good, apart from the minor comments, the only real ask is to add a file-level doc-comment in genUnicodeTable.py
explaining what the steps are in producing the final output. Once that's done, this should be good to merge.
Thank you for your patience with this process!
utils/genUnicodeTable.py
Outdated
unsigned offset:16; | ||
unsigned size:16; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't need a bitfield here and below, we can just use uint16_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I've updated all of these to uint16_t
.
utils/genUnicodeTable.py
Outdated
""" | ||
binary_property_aliases = { | ||
canonical_name: [] | ||
for canonical_name in "ASCII ASCII_Hex_Digit Alphabetic Bidi_Control Bidi_Mirrored " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be preferable to just make this a list to begin with instead of creating and splitting a string. It can also be assigned to a separate variable beforehand to make the loop easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem, I've gone ahead and made that change.
@jonathanj has updated the pull request. You must reimport the pull request before landing. |
I added instructions to the end of the existing file-level comment, that explain invoking the script from the project working root, use with clang-format, and the redirect path. I think I sort of just figured this out, so I actually have no idea if this was the intended process! Let me know if I need to adjust it.
Thank you for your patience with my submission, I'm very appreciative of the insightful comments and explanations along the way, I learned some interesting new things! It feels good to contribute to an exciting project like Hermes, with supportive maintainers. |
@neildhar has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Hello @neildhar 👋 Just checking in if there's anything else I should do, perhaps rebase on main? The |
Hey @jonathanj, nothing at the moment, I've rebased it internally and it should get merged soon. |
Summary: Original Author: [email protected] Original Git: dcf8e7b Original Reviewed By: avp Original Revision: D56493540 Extends RegExp to handle `\p` and `\P` Unicode property escapes, standalone and within character classes, for Unicode mode. The implementation extends the existing idea of tables of codepoint ranges to cover all of the properties explicitly required by ES262. The `genUnicodeTable.py` script was updated to generate code related to the binary and non-binary Unicode properties explicitly mentioned in ES262 ([ref](https://tc39.es/ecma262/multipage/text-processing.html#sec-runtime-semantics-unicodematchproperty-p)); based on the Unicode 15.1.0 data files. Closes #1027 Pull Request resolved: #1295 Pulled By: neildhar Reviewed By: neildhar Differential Revision: D60361619 fbshipit-source-id: 70d7293114576d4b05f1dbc2d9b22e553bec089f
Summary
Extends RegExp to handle
\p
and\P
Unicode property escapes, standalone and within character classes, for Unicode mode. The implementation extends the existing idea of tables of codepoint ranges to cover all of the properties explicitly required by ES262.The
genUnicodeTable.py
script was updated to generate code related to the binary and non-binary Unicode properties explicitly mentioned in ES262 (ref); based on the Unicode 15.1.0 data files.Closes #1027
Test Plan
regexp_unicode_properties.js
to the Hermes test suite\p{…}
and\P{…}
hermes/utils/testsuite/run_testsuite.py
--test-skiplist
otherwise around 417 tests (intest262/test/built-ins/RegExp/property-escapes
) are skipped with the reason "Skipping test with 'const'". This was confusing to me since most of these do pass, but I couldn't find an explainer for this skip.