Normative: Update Unicode property lists per Unicode v13 #1896

Merged
merged 1 commit into from
May 21, 2020

Conversation

mathiasbynens
Member

@mathiasbynens mathiasbynens commented Mar 11, 2020

@mathiasbynens mathiasbynens added the unicode Relates to upstream Unicode updates. label Mar 11, 2020
@ljharb ljharb added the normative change Affects behavior required to correctly evaluate some ECMAScript source text label Mar 11, 2020
@ljharb ljharb requested review from michaelficarra, syg, ljharb, bakkot and a team March 11, 2020 17:40
Member

@ljharb ljharb left a comment

(altho "has tests" usually means the tests are merged, so we'll wait til they are before merging this)

@ljharb
Member

ljharb commented Mar 11, 2020

Actually can this update Annex E as well?

@leobalter
Member

(altho "has tests" usually means the tests are merged, so we'll wait til they are before merging this)

You're now excused to no longer wait for tests. tc39/test262#2526

@michaelficarra michaelficarra self-assigned this Mar 11, 2020
@syg syg removed their request for review March 13, 2020 20:23
@syg
Contributor

syg commented Mar 13, 2020

I defer to the other editors here. I don't feel comfortable with Unicode.

@ljharb
Member

ljharb commented May 11, 2020

actually before i land this; @mathiasbynens, would you please update Annex E to describe the observable changes in v13?

@mathiasbynens
Member Author

@ljharb Added a note. PTAL.

Review thread on spec.html (outdated, resolved)
@michaelficarra michaelficarra added the editor call to be discussed in the next editor call label May 13, 2020
@ljharb
Member

ljharb commented May 21, 2020

@michaelficarra would you or @mathiasbynens mind adding an item to the agenda to discuss it?

@michaelficarra
Member

Added: tc39/agendas@d15454b

@mathiasbynens if you'd like to co-present or create supporting materials, feel free to add yourself.

@ljharb ljharb merged commit e1c7179 into tc39:master May 21, 2020
ljharb pushed a commit to mathiasbynens/ecma262 that referenced this pull request May 21, 2020
@mathiasbynens mathiasbynens deleted the unicode-13-scripts branch May 24, 2020 10:58
@mathiasbynens
Member Author

I'm curious what you're planning to present/propose.

AFAICT, we need to refer to some list of Unicode properties (and values, and aliases for both) that must be supported in ECMAScript regular expressions, regardless of where that list is maintained. The upstream Unicode Standard does not currently have such a list, and it seems unlikely it ever will, given that it already publishes property lists that differ from what we explicitly decided to support in ECMAScript.

Feel free to ping me on email before the meeting: [email protected]

@mathiasbynens
Member Author

Re-reading some of the above comments, I think I now know where the confusion comes from:

If we can get to 100% "always match latest Unicode", that's great! If we can't, however, I'd very much prefer a snapshot that we have to manually update.

We are already there! There are two separate dimensions here:

  1. Data: Using the latest Unicode data for whatever properties the spec refers to (which is observable in identifiers through e.g. ID_Start and in property escapes through which characters e.g. /\p{Script=Greek}/u matches)
  2. Implementation: Extending the set of supported properties in regular expressions (e.g. Unicode 9001 adds a property named Foo and so /\p{Foo}/u now suddenly matches stuff instead of throwing a SyntaxError)
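
The two dimensions can be observed separately from script code. The following is a minimal sketch, not part of the PR: `supportsPropertyEscape` is a hypothetical helper name, and `Foo` is the made-up property from the comment above.

```javascript
// Hypothetical helper: true if the engine accepts \p{<body>} in a /u regex.
function supportsPropertyEscape(body) {
  try {
    new RegExp(`\\p{${body}}`, 'u');
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// Dimension 1 (data): which characters a supported property matches.
// U+03B1 GREEK SMALL LETTER ALPHA has Script=Greek.
const matchesGreek = /\p{Script=Greek}/u.test('\u03B1');

// Dimension 2 (implementation): which property names are recognized at all.
const knowsScript = supportsPropertyEscape('Script=Greek');
const knowsFoo = supportsPropertyEscape('Foo'); // not in the spec's list

console.log({ matchesGreek, knowsScript, knowsFoo });
```

A Unicode data update can change the first result for newly assigned characters without any change to the list of recognized names; only a spec change to that list should change the last one.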

Point 1 was addressed by #620 after https://github.com/tc39/notes/blob/master/meetings/2016-07/jul-27.md#10ia-require-unicode-900 (and I hope you're not trying to relitigate that). It sends a very important signal to implementers that they should decouple the implementation from the Unicode data they're using, i.e. have a mechanism in place to quickly update their implementation to the latest Unicode data.

Point 2 is addressed by these annual PRs that add new properties introduced by new Unicode versions. If, hypothetically, such a PR were rejected, then all the pre-existing properties in ECMAScript's list would continue to be supported, and they would still start matching new characters if the new Unicode version changes their definition, which is exactly the intention.

@michaelficarra
Member

@mathiasbynens You've not helped me understand why these concepts are separate. Are you saying that the set of supported properties in regular expressions is not derived from a Unicode data set directly? Are there any properties that are defined by Unicode that we choose not to pull in to ECMA262 regular expressions?

@mathiasbynens
Member Author

Are you saying that the set of supported properties in regular expressions is not derived from a Unicode data set directly?

Exactly. There is a large list of properties that Unicode defines, including so-called Binary properties, Enumerated properties, Catalog properties, and so on. Some of those categories of properties we cannot yet support in ECMAScript (e.g. https://github.com/tc39/proposal-regexp-unicode-sequence-properties), others were deemed not useful enough to support in ECMAScript (e.g. \p{Block=...}).

PRs like this one based on a new Unicode release can cover a few types of changes without going through the proposal process:

  • ECMAScript aims to support all binary properties, so when a Unicode release introduces new binary properties we add them (+ their canonical aliases) to the relevant list in the ECMAScript spec.
  • ECMAScript aims to support all values for the Script, Script_Extensions, and General_Category properties, so when a Unicode release introduces new values we add them (+ their canonical aliases) to the relevant list in the ECMAScript spec.
  • When a Unicode release introduces new property or value aliases for one of the properties or values that ECMAScript already supports, we add their canonical forms to the relevant list in the ECMAScript spec.
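
As an illustration of what "property, value, and canonical alias" means in practice, here is a small sketch. The sample characters and the grouping are chosen for illustration; the spellings are ones that have been in the ECMAScript lists since property escapes were introduced.

```javascript
// Each group lists spellings that the spec's tables treat as the same
// property/value, together with a sample character that should match.
const groups = [
  { sample: '\u03B1', forms: ['Script=Greek', 'Script=Grek', 'sc=Greek', 'sc=Grek'] },
  { sample: '\u03B1', forms: ['Script_Extensions=Greek', 'scx=Grek'] },
  { sample: 'A', forms: ['General_Category=Uppercase_Letter', 'gc=Lu', 'Lu', 'Uppercase_Letter'] },
  { sample: '\u03B1', forms: ['Alphabetic', 'Alpha'] }, // a binary property and its alias
];

// Build each regex at runtime and test it against the sample.
const results = groups.map(({ sample, forms }) =>
  forms.map((form) => new RegExp(`\\p{${form}}`, 'u').test(sample))
);

console.log(results);
```

Every entry is expected to be `true` in an engine that implements property escapes; a new Unicode release that adds, say, a new Script value only becomes usable this way once both the spec list and the engine's list include it.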

Are there any properties that are defined by Unicode that we choose not to pull in to ECMA262 regular expressions?

There are plenty. Blindly pulling in everything would bloat JS engine binary size for little to no gain. \p{Block=...} is a good example of something that would not be useful in practice.
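
A short sketch of the Block case, under the assumption that the engine follows the spec's list: `\p{Block=...}` is rejected at parse time, and a block can still be written as an explicit code point range. The value name `Greek` and the helper `isRejected` are chosen for illustration only.

```javascript
// Hypothetical helper: true if the pattern fails to compile with the u flag.
function isRejected(source) {
  try {
    new RegExp(source, 'u');
    return false;
  } catch (e) {
    return e instanceof SyntaxError;
  }
}

const blockRejected = isRejected('\\p{Block=Greek}');   // Block is not in the list
const scriptRejected = isRejected('\\p{Script=Greek}'); // Script is

// The "Greek and Coptic" block is the fixed range U+0370..U+03FF.
const inGreekBlock = /^[\u0370-\u03FF]$/u;

// U+03E2 COPTIC CAPITAL LETTER SHEI sits in that block but has Script=Coptic,
// so block membership and script membership give different answers.
const sheiInBlock = inGreekBlock.test('\u03E2');
const sheiIsGreekScript = /\p{Script=Greek}/u.test('\u03E2');

console.log({ blockRejected, scriptRejected, sheiInBlock, sheiIsGreekScript });
```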

@ljharb
Member

ljharb commented May 27, 2020

Could those bullet points not be explicitly delegated to Unicode, rather than explicitly enumerated?

@mathiasbynens
Member Author

I suppose we could make the spec less explicit in this way, but it seems strictly worse than listing the properties/values and their aliases explicitly. Unicode doesn't list them in any one place, and since Unicode docs assume loose matching (which we deliberately decided against for ECMAScript), there's no clear overview of what is a canonical property/value name/alias vs. what isn't. This is a massive interoperability footgun.
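
To make the loose-matching point concrete, here is a minimal sketch of which spellings an ECMAScript engine is expected to accept. The candidate list and the helper `accepts` are illustrative; the loosely spelled variants are the kind that Unicode-style loose matching (ignoring case and whitespace) would tolerate.

```javascript
// Hypothetical helper: true if \p{<body>} compiles with the u flag.
function accepts(body) {
  try {
    new RegExp(`\\p{${body}}`, 'u');
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

const candidates = [
  'Script=Greek',   // canonical property name and value
  'Script=Grek',    // canonical value alias
  'script=Greek',   // wrong case in the property name
  'Script=greek',   // wrong case in the value
  'SCRIPT=GREEK',   // all caps
  'Script = Greek', // extra whitespace
];

const accepted = candidates.map((body) => [body, accepts(body)]);
console.log(accepted);
```

Only the first two entries should be accepted. Without an explicit list, each implementer would have to work out independently which spellings count as canonical.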

It's also not what the committee agreed previously. I don't understand the desire to relitigate this.

@mathiasbynens
Member Author

cc @littledan

@ljharb
Member

ljharb commented May 27, 2020

It wasn’t clear to any of the editors that this is what we were agreeing to; it’s possible others on the committee were unclear as well.

We all like the idea, i expect, of implicitly matching latest Unicode at all times - but i suspect many of us would not like the idea of the current half-measure.

@mathiasbynens
Member Author

i suspect many of us would not like the idea of the current half-measure.

What half-measure?

@michaelficarra
Member

@mathiasbynens The half-measure is the window between a Unicode release and our update of these tables. During that window, certain Unicode properties or Script names/aliases are unavailable, which is inconsistent with the properties of code points observable in other contexts. For example, I can observe the ID_Start property of a code point by using it as an identifier (using eval to do a runtime test, if you prefer). From this, I can infer that the supported Unicode version is >= some version number. Yet I could then also try to use the name of a Script in a RegExp that I know to be present in that same Unicode version, and see that it is unsupported. As a programmer, I've observed an inconsistency in the Unicode support of my host.
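
The probe described above could look roughly like this. It is a sketch under assumptions: it uses `new Function` rather than `eval`, the helper names are hypothetical, and the choice of U+10E80 (a Yezidi letter) and the Script name `Yezidi` as Unicode 13 additions is mine, not taken from the thread.

```javascript
// Data side: can this code point start an identifier (i.e. does it have ID_Start)?
function isValidIdentifierStart(codePoint) {
  const ch = String.fromCodePoint(codePoint);
  try {
    new Function(`var ${ch};`); // parses the declaration without running anything useful
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// List side: does the regex engine recognize this Script name?
function supportsScript(name) {
  try {
    new RegExp(`\\p{Script=${name}}`, 'u');
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

const probe = {
  idStart: isValidIdentifierStart(0x10E80), // assumed: letter added in Unicode 13
  scriptName: supportsScript('Yezidi'),     // assumed: Script value added in Unicode 13
};

// true would be the inconsistency described above: one probe says
// "Unicode 13", the other does not.
const inconsistent = probe.idStart !== probe.scriptName;
console.log(probe, { inconsistent });
```

The outcome depends on the host, so no particular result is claimed here. Environments that forbid dynamic code evaluation would need a different way to run the identifier probe.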

@mathiasbynens
Member Author

@michaelficarra This time delta is always going to exist regardless of how quickly we update the spec, since implementing and shipping things takes time. (Also see Can I Unicode.)

It boils down to data vs. implementation. In practice, updating the version of the Unicode data an engine is using is a separate task from updating the lists of properties/values and their aliases RegExps should recognize. In V8 for example, the former is done by updating ICU, whereas the latter is done by updating the hardcoded list of properties in the regular expression engine.
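
The split can be pictured with a toy model. This is not V8 or ICU code: `SUPPORTED_SCRIPTS`, `toyUnicodeData`, and `matchesScript` are hypothetical names, and the data covers only a few code point ranges for illustration.

```javascript
// Implementation side: a hardcoded list of recognized names and aliases,
// mapped to a canonical value name (tiny illustrative subset).
const SUPPORTED_SCRIPTS = new Map([
  ['Greek', 'Greek'], ['Grek', 'Greek'],
  ['Latin', 'Latin'], ['Latn', 'Latin'],
]);

// Data side: a stand-in for the Unicode data library, replaceable as a unit.
const toyUnicodeData = {
  scriptOf(codePoint) {
    if ((codePoint >= 0x41 && codePoint <= 0x5A) ||
        (codePoint >= 0x61 && codePoint <= 0x7A)) return 'Latin';
    if (codePoint >= 0x3B1 && codePoint <= 0x3C9) return 'Greek';
    return 'Unknown';
  },
};

// Name recognition is decided by the list; membership is decided by the data.
function matchesScript(name, codePoint, data) {
  const canonical = SUPPORTED_SCRIPTS.get(name);
  if (canonical === undefined) {
    throw new SyntaxError(`Invalid property value: ${name}`);
  }
  return data.scriptOf(codePoint) === canonical;
}

console.log(matchesScript('Grek', 0x3B1, toyUnicodeData)); // alpha via alias
console.log(matchesScript('Latn', 0x3B1, toyUnicodeData)); // alpha is not Latin
```

In this model, swapping `toyUnicodeData` for newer data changes which characters match without touching the list, while a new name only becomes usable when `SUPPORTED_SCRIPTS` is edited.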

The current ECMAScript spec provides the complete list of supported properties, values, and aliases. Implementers need these lists. Making this information harder to find is an interoperability footgun.

@michaelficarra
Member

@mathiasbynens If we feel it is important for implementers, we can continue to maintain the lists as non-normative. But the normative text should reference the Unicode data sets and describe a way to derive the supported Scripts and other properties. That way we never normatively specify mixed Unicode support.

@mathiasbynens
Member Author

It's important that the lists continue to be included in the spec.

Precisely describing a way to derive the supported properties seems tricky and error-prone. I wrote a rough summary in an earlier comment, but there are exceptions; see tc39/proposal-regexp-unicode-property-escapes#27 for some binary properties that were explicitly excluded. (Other things contribute to the trickiness too, e.g. Any, Assigned, and ASCII are technically not "properties" in the Unicode definition of the word, although they behave like properties.) If Unicode adds new binary properties that are similarly special, we might want to decide not to support those in ECMAScript. With the current approach of having normative lists in the spec, we're keeping that option open. If we move away from normative lists, the new property would technically be supported as soon as the new Unicode Standard is released, at which point engines could hypothetically ship support for it, at which point removing that property might become impossible due to web compat. It seems better to avoid this risk by sticking to what we have.
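
For reference, the three special names mentioned above behave like binary properties in a pattern. A small sketch, with sample strings chosen for illustration:

```javascript
// Any: every code point, U+0000..U+10FFFF.
const anyMatchesMax = /\p{Any}/u.test('\u{10FFFF}');

// ASCII: U+0000..U+007F.
const asciiOnly = /^\p{ASCII}+$/u.test('regexp');
const asciiMatchesEAcute = /\p{ASCII}/u.test('\u00E9'); // U+00E9 is outside ASCII

// Assigned: everything except General_Category=Unassigned (Cn).
const assignedLetter = /\p{Assigned}/u.test('a');
// U+FFFF is a noncharacter, which has General_Category Cn.
const assignedNonchar = /\p{Assigned}/u.test('\uFFFF');

console.log({ anyMatchesMax, asciiOnly, asciiMatchesEAcute, assignedLetter, assignedNonchar });
```

A derivation rule written purely in terms of Unicode's binary properties would have to special-case these names, which is part of why an explicit list is easier to get right.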

ECMAScript should control what properties ECMAScript supports. Unicode should control the data.

@mathiasbynens
Member Author

To be clear, I like the idea of describing the way to derive the supported properties, but given that a) it seems difficult to get 100% right, and b) that we probably want to retain our freedom to exclude new properties when needed, I'd suggest doing it non-normatively.

@michaelficarra
Member

Well I think that's not for us to decide on our own, but a matter to be brought to the committee.

@michaelficarra
Member

@mathiasbynens Something I failed to bring up during the plenary discussion is that, if we oppose automatically pulling in new property names/aliases, these kinds of PRs are not strictly editorial and must reach consensus. I think we need to back out this and #1939 until we get that consensus.

@michaelficarra michaelficarra added the editor call to be discussed in the next editor call label Jun 5, 2020
@mathiasbynens
Member Author

@michaelficarra The commit message for this PR is marked "Normative", not "Editorial". Note that these changes are already shipping in V8, so I'd advise against backing this out at this point. I'd be happy to let you discuss these PRs in plenary before landing in the future.

@michaelficarra
Member

@mathiasbynens Result of the editor call is that we'd like these kinds of PRs to get committee consensus before merging. Since the Unicode 13 PRs have already landed, we will leave them in and ask for retroactive consensus at the next meeting.

@michaelficarra michaelficarra removed the editor call to be discussed in the next editor call label Jun 10, 2020
@mathiasbynens
Member Author

Closing the loop — the notes for the TC39 discussion mentioned in this thread can be found here: https://github.com/tc39/notes/blob/master/meetings/2020-06/june-2.md#introducing-unicode-support

Labels
has test262 tests normative change Affects behavior required to correctly evaluate some ECMAScript source text unicode Relates to upstream Unicode updates.
6 participants