Normative: Update Unicode property lists per Unicode v13 #1896

mathiasbynens · 2020-03-11T17:36:28Z

https://unicode.org/versions/Unicode13.0.0/

Tests: tc39/test262#2526

Ref. #1897.

ljharb

(altho "has tests" usually means the tests are merged, so we'll wait til they are before merging this)

ljharb · 2020-03-11T20:51:29Z

Actually can this update Annex E as well?

leobalter · 2020-03-11T20:54:39Z

> (altho "has tests" usually means the tests are merged, so we'll wait til they are before merging this)

You're now excused to no longer wait for tests. tc39/test262#2526

syg · 2020-03-13T20:23:45Z

I defer to the other editors here. I don't feel comfortable with Unicode.

ljharb · 2020-05-11T21:19:08Z

actually before i land this; @mathiasbynens, would you please update Annex E to describe the observable changes in v13?

mathiasbynens · 2020-05-12T07:04:44Z

@ljharb Added a note. PTAL.

ljharb · 2020-05-21T20:50:20Z

@michaelficarra would you or @mathiasbynens mind adding an item to the agenda to discuss it?

michaelficarra · 2020-05-21T20:58:56Z

Added: tc39/agendas@d15454b

@mathiasbynens if you'd like to co-present or create supporting materials, feel free to add yourself.

mathiasbynens · 2020-05-24T11:16:20Z

I'm curious what you're planning to present/propose.

AFAICT, we need to refer to some list of Unicode properties (and values, and aliases for both) that must be supported in ECMAScript regular expressions, regardless of where that list is maintained. The upstream Unicode Standard does not currently have such a list, and it seems unlikely it ever will, given that it already publishes property lists that differ from what we explicitly decided to support in ECMAScript.

Feel free to ping me by email before the meeting.

mathiasbynens · 2020-05-24T11:31:56Z

Re-reading some of the above comments, I think I now know where the confusion comes from:

> If we can get to 100% "always match latest Unicode", that's great! If we can't, however, I'd very much prefer a snapshot that we have to manually update.

We are already there! There are two separate dimensions here:

1. Data: Using the latest Unicode data for whatever properties the spec refers to (which is observable in identifiers through e.g. ID_Start and in property escapes through which characters e.g. /\p{Script=Greek}/u matches)
2. Implementation: Extending the set of supported properties in regular expressions (e.g. Unicode 9001 adds a property named Foo and so /\p{Foo}/u now suddenly matches stuff instead of throwing a SyntaxError)
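The two dimensions described above can be seen directly from script. The following is a minimal sketch, not part of the original comment; the helper name `isSupportedProperty` is hypothetical, and `Foo` stands in for any property name the engine does not recognize.

```javascript
// Data dimension: which characters a supported property matches is decided
// by the Unicode data the engine ships with.
const greekPattern = /\p{Script=Greek}/u;
console.log(greekPattern.test('α')); // true
console.log(greekPattern.test('a')); // false

// Implementation dimension: whether a property name is recognized at all.
// An unrecognized name makes the pattern a SyntaxError.
// `isSupportedProperty` is a hypothetical helper for this illustration.
function isSupportedProperty(expression) {
  try {
    new RegExp(`\\p{${expression}}`, 'u');
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

console.log(isSupportedProperty('Script=Greek')); // true
console.log(isSupportedProperty('Foo'));          // false
```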

Point 1 was addressed by #620 after https://github.com/tc39/notes/blob/master/meetings/2016-07/jul-27.md#10ia-require-unicode-900 (and I hope you're not trying to relitigate that). It sends a very important signal to implementers that they should decouple the implementation from the Unicode data they're using, i.e. have a mechanism in place to quickly update their implementation to the latest Unicode data.

Point 2 is addressed by these annual PRs that add new properties introduced by new Unicode versions. If, hypothetically, such a PR were rejected, then all the pre-existing properties in ECMAScript's list would continue to be supported, and they would still start matching new characters if the new Unicode version changes their definition, which is exactly the intention.

michaelficarra · 2020-05-26T16:54:45Z

@mathiasbynens You've not helped me understand why these concepts are separate. Are you saying that the set of supported properties in regular expressions is not derived from a Unicode data set directly? Are there any properties that are defined by Unicode that we choose not to pull in to ECMA262 regular expressions?

mathiasbynens · 2020-05-27T06:58:20Z

> Are you saying that the set of supported properties in regular expressions is not derived from a Unicode data set directly?

Exactly. There is a large list of properties that Unicode defines, including so-called Binary properties, Enumerated properties, Catalog properties, and so on. Some of those categories of properties we cannot yet support in ECMAScript (e.g. https://github.com/tc39/proposal-regexp-unicode-sequence-properties), others were deemed not useful enough to support in ECMAScript (e.g. \p{Block=...}).

PRs like this one based on a new Unicode release can cover a few types of changes without going through the proposal process:

- ECMAScript aims to support all binary properties, so when a Unicode release introduces new binary properties we add them (+ their canonical aliases) to the relevant list in the ECMAScript spec.
- ECMAScript aims to support all values for the Script, Script_Extensions, and General_Category properties, so when a Unicode release introduces new values we add them (+ their canonical aliases) to the relevant list in the ECMAScript spec.
- When a Unicode release introduces new property or value aliases for one of the properties or values that ECMAScript already supports, we add their canonical forms to the relevant list in the ECMAScript spec.

> Are there any properties that are defined by Unicode that we choose not to pull in to ECMA262 regular expressions?

There's plenty. Blindly pulling in everything would bloat JS engine binary size for little to no gain. \p{Block=...} is a good example of something that would not be useful in practice.
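As an illustration of a Unicode property that ECMAScript deliberately leaves out, the sketch below checks that a `Block` escape is rejected and shows the explicit range that can be written instead. The helper name `compilesWithUnicodeFlag` is hypothetical; the range U+0370 to U+03FF is the Greek and Coptic block.

```javascript
// `compilesWithUnicodeFlag` is a hypothetical helper for this illustration.
function compilesWithUnicodeFlag(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Script is in ECMAScript's list of supported properties; Block is not.
console.log(compilesWithUnicodeFlag('\\p{Script=Greek}'));           // true
console.log(compilesWithUnicodeFlag('\\p{Block=Greek_And_Coptic}')); // false

// A block is a contiguous code point range, so it can be spelled out by hand.
const greekAndCopticBlock = /[\u0370-\u03FF]/u;
console.log(greekAndCopticBlock.test('α')); // true
```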

ljharb · 2020-05-27T07:01:29Z

Could those bullet points not be explicitly delegated to Unicode, rather than explicitly enumerated?

mathiasbynens · 2020-05-27T07:46:26Z

I suppose we could make the spec less explicit in this way, but it seems strictly worse than listing the properties/values and their aliases explicitly. Unicode doesn't list them in any one place, and since Unicode docs assume loose matching (which we deliberately decided against for ECMAScript), there's no clear overview of what is a canonical property/value name/alias vs. what isn't. This is a massive interoperability footgun.

It's also not what the committee agreed previously. I don't understand the desire to relitigate this.
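The strict matching mentioned above is observable from script: only the exact canonical names and aliases are accepted, while spellings that loose matching would tolerate are syntax errors. A minimal sketch, with `acceptsPropertyEscape` as a hypothetical helper name:

```javascript
// `acceptsPropertyEscape` is a hypothetical helper for this illustration.
function acceptsPropertyEscape(expression) {
  try {
    new RegExp(`\\p{${expression}}`, 'u');
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Canonical names and their canonical aliases are accepted.
console.log(acceptsPropertyEscape('Script=Greek'));            // true
console.log(acceptsPropertyEscape('sc=Grek'));                 // true
console.log(acceptsPropertyEscape('Script_Extensions=Greek')); // true
console.log(acceptsPropertyEscape('scx=Grek'));                // true

// Variations in case or whitespace are rejected rather than loosely matched.
console.log(acceptsPropertyEscape('script=greek'));     // false
console.log(acceptsPropertyEscape('Script=greek'));     // false
console.log(acceptsPropertyEscape(' Script = Greek ')); // false
```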

mathiasbynens · 2020-05-27T07:47:16Z

cc @littledan

ljharb · 2020-05-27T07:54:12Z

It wasn’t clear to any of the editors that this is what we were agreeing to; it’s possible others on the committee were unclear as well.

We all like the idea, i expect, of implicitly matching latest Unicode at all times - but i suspect many of us would not like the idea of the current half-measure.

mathiasbynens · 2020-05-27T07:55:56Z

> i suspect many of us would not like the idea of the current half-measure.

What half-measure?

michaelficarra · 2020-05-27T16:24:54Z

@mathiasbynens The time between a Unicode release and us updating these tables, where certain Unicode properties or Script names/aliases are unavailable, inconsistent with the observed properties of code points from other contexts. For example, I can observe the ID_Start property of a code point by using it as an identifier (using eval to do a runtime test, if you prefer). From this, I can infer that the supported Unicode version is >= some version number. Yet I could then also try to use the name of a Script in a RegExp that I know to be present in that same Unicode version, and see that it is unsupported. As a programmer, I've observed an inconsistency in the Unicode support of my host.
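The runtime test described above could be sketched as follows. This is an illustration only: the helper names `isIdentifierStart` and `supportsScript` are hypothetical, the `Function` constructor stands in for `eval`, and the probe values assume that U+10FB0 and the `Chorasmian` script were introduced together in Unicode 13. No expected output is given for the probe because it depends on the host.

```javascript
// Probe ID_Start indirectly: a character that can begin an identifier lets
// `var <char>;` parse. Hypothetical helper; Function is used in place of eval.
function isIdentifierStart(character) {
  try {
    new Function(`var ${character};`);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Probe whether the RegExp engine recognizes a Script value (hypothetical helper).
function supportsScript(name) {
  try {
    new RegExp(`\\p{Script=${name}}`, 'u');
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Assumed Unicode 13 additions; the results vary by host and are not asserted here.
const probeCharacter = '\u{10FB0}';
console.log('identifier data knows the character:', isIdentifierStart(probeCharacter));
console.log('RegExp knows the script name:', supportsScript('Chorasmian'));
// If the first line prints true and the second false, the host shows the
// inconsistency described in the comment above.
```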

mathiasbynens · 2020-05-27T16:53:57Z

@michaelficarra This time delta is always going to exist regardless of how quickly we update the spec, since implementing and shipping things takes time. (Also see Can I Unicode.)

It boils down to data vs. implementation. In practice, updating the version of the Unicode data an engine is using is a separate task from updating the lists of properties/values and their aliases RegExps should recognize. In V8 for example, the former is done by updating ICU, whereas the latter is done by updating the hardcoded list of properties in the regular expression engine.

The current ECMAScript spec provides the complete list of supported properties, values, and aliases. Implementers need these lists. Making this information harder to find is an interoperability footgun.

michaelficarra · 2020-05-27T23:36:46Z

@mathiasbynens If we feel it is important for implementers, we can continue to maintain the lists as non-normative. But the normative text should reference the Unicode data sets and describe a way to derive the supported Scripts and other properties. That way we never normatively specify mixed Unicode support.

mathiasbynens · 2020-05-28T07:17:34Z

It's important that the lists continue to be included in the spec.

Precisely describing a way to derive the supported properties seems tricky and error-prone. I wrote a rough summary in an earlier comment, but there are exceptions — see tc39/proposal-regexp-unicode-property-escapes#27 for some binary properties that were explicitly excluded. (There's other things that contribute to the trickiness, e.g. Any, Assigned, and ASCII are technically not "properties" in the Unicode definition of the word, although they behave like properties.) If Unicode adds new binary properties that are similarly special, we might want to decide not to support those in ECMAScript. With the current approach of having normative lists in the spec, we're keeping that option open. If we move away from normative lists, the new property would technically be supported as soon as the new Unicode Standard is released, at which point engines could hypothetically ship support for it, at which point removing that property might become impossible due to web compat. It seems better to avoid this risk by sticking to what we have.

ECMAScript should control what properties ECMAScript supports. Unicode should control the data.
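For reference, the three special names mentioned above (Any, Assigned, and ASCII) are written like binary properties in a pattern. A minimal sketch; the variable names are chosen only for this illustration:

```javascript
// Any matches every code point; with the u flag an astral character is
// treated as one code point.
const anyCodePoint = /^\p{Any}$/u;
console.log(anyCodePoint.test('\u{1F600}')); // true

// ASCII matches U+0000 through U+007F only.
const asciiOnly = /^\p{ASCII}+$/u;
console.log(asciiOnly.test('abc')); // true
console.log(asciiOnly.test('abć')); // false

// Assigned matches code points that have been assigned a character.
const assignedCodePoint = /\p{Assigned}/u;
console.log(assignedCodePoint.test('a')); // true
```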

mathiasbynens · 2020-05-28T07:22:59Z

To be clear, I like the idea of describing the way to derive the supported properties, but given that a) it seems difficult to get 100% right, and b) that we probably want to retain our freedom to exclude new properties when needed, I'd suggest doing it non-normatively.

michaelficarra · 2020-05-28T16:20:03Z

Well I think that's not for us to decide on our own, but a matter to be brought to the committee.

michaelficarra · 2020-06-05T17:49:53Z

@mathiasbynens Something I failed to bring up during the plenary discussion is that, if we are opposed to automatically pulling in new property names/aliases, these kinds of PRs are not strictly editorial and must reach consensus. I think we need to back out this PR and #1939 until we get that consensus.

mathiasbynens · 2020-06-06T12:05:03Z

@michaelficarra The commit message for this PR is marked "Normative", not "Editorial". Note that these changes are already shipping in V8, so I'd advise against backing this out at this point. I'd be happy to let you discuss these PRs in plenary before landing in the future.

michaelficarra · 2020-06-10T22:30:55Z

@mathiasbynens Result of the editor call is that we'd like these kinds of PRs to get committee consensus before merging. Since the Unicode 13 PRs have already landed, we will leave them in and ask for retroactive consensus at the next meeting.

mathiasbynens · 2021-09-14T11:56:24Z

Closing the loop — the notes for the TC39 discussion mentioned in this thread can be found here: https://github.com/tc39/notes/blob/master/meetings/2020-06/june-2.md#introducing-unicode-support

mathiasbynens added the unicode label Mar 11, 2020

This was referenced Mar 11, 2020

[Informative] Observable changes because of Unicode 13 #1897

Closed

Update RegExp property escape tests per Unicode v13.0.0 tc39/test262#2526

Merged

mathiasbynens added the has test262 tests label Mar 11, 2020

ljharb added the normative change label Mar 11, 2020

ljharb requested review from michaelficarra, syg, ljharb, bakkot and a team March 11, 2020 17:40

ljharb approved these changes Mar 11, 2020

michaelficarra self-assigned this Mar 11, 2020

syg removed their request for review March 13, 2020 20:23

mathiasbynens added a commit to mathiasbynens/ecma262 that referenced this pull request Apr 7, 2020

Normative: Update RegExp property aliases per Unicode v13

84a9699

Ref. tc39#1897 & tc39#1896.

mathiasbynens mentioned this pull request Apr 7, 2020

Normative: Update RegExp property aliases per Unicode v13 #1939

Merged

bakkot mentioned this pull request Apr 26, 2020

Support automatically generating Unicode property lists tc39/ecmarkup#193

Open

bakkot approved these changes May 11, 2020

michaelficarra approved these changes May 11, 2020

ljharb assigned ljharb and unassigned michaelficarra May 11, 2020

ljharb force-pushed the unicode-13-scripts branch from 2725eb3 to d63241e Compare May 11, 2020 21:17

ljharb pushed a commit to mathiasbynens/ecma262 that referenced this pull request May 11, 2020

Normative: Update Unicode property lists per Unicode v13 (tc39#1896)

d63241e

https://unicode.org/versions/Unicode13.0.0/

mathiasbynens force-pushed the unicode-13-scripts branch from d63241e to 97e2e48 Compare May 12, 2020 07:03

michaelficarra added the editor call label May 13, 2020

ljharb force-pushed the unicode-13-scripts branch from 14d4a76 to e1c7179 Compare May 21, 2020 21:04

Normative: Update Unicode property lists per Unicode v13 (tc39#1896)

e1c7179

https://unicode.org/versions/Unicode13.0.0/

ljharb merged commit e1c7179 into tc39:master May 21, 2020

ljharb pushed a commit to mathiasbynens/ecma262 that referenced this pull request May 21, 2020

Normative: Update RegExp property aliases per Unicode v13 (tc39#1939)

fc8964f

Ref. tc39#1897 & tc39#1896.

mathiasbynens deleted the unicode-13-scripts branch May 24, 2020 10:58

michaelficarra added the editor call label Jun 5, 2020

michaelficarra removed the editor call label Jun 10, 2020

mysticatea mentioned this pull request Aug 4, 2020

fix for test262 acornjs/acorn#973

Merged

bakkot mentioned this pull request Nov 10, 2021

Normative: List new Unicode v14 Script/Script_Extensions values #2515

Merged
