Normative: Reference UTS 35 Unicode BCP 47 Locale Identifiers #289

littledan · 2018-10-29T21:55:06Z

UTS 35, rather than RFC 5646, provides a more modern and regular
normalization algorithm for locales. This standard definition will
be implementable in ICU and then shared among implementations,
rather than relying on buggy, implementation-specific normalization
algorithms. It also provides a more regular and easier-to-manipulate
form for Intl.Locale.

cf. tc39/proposal-intl-locale#63

Based on Unicode BCP 47 Locale Identifier conversion, adopted in tc39/ecma402#289, there will be no privateuse or grandfathered locales, allowing this specification's definition of ApplyOptionsToTag and other algorithms to be simplified.

littledan · 2018-11-26T18:19:58Z

As pointed out by @FrankYFTang, we also need to change the "may" to "must" with respect to the RFC 6067 reference.

littledan · 2018-12-01T04:25:59Z

We agreed on this change in the November 2018 Intl meeting, modulo the above correction.

FrankYFTang · 2018-12-02T05:00:53Z

LGTM

anba · 2018-12-04T17:07:51Z

Hmm, I still have some questions about this change.

Canonicalize the language tag (afterwards, there will be no extlang subtags).

I assume "canonicalize" here refers to RFC 5646 canonicalization, right? Assuming it does, I have the following questions:

UTS 35 doesn't require normalizing the case of the subtags, whereas ECMA-402 did require case normalization. (Case normalization is optional in RFC 5646!) I think it makes sense to still require case normalization. For example, CanonicalizeLanguageTag("Fr-lAtN-fR") should still return "fr-Latn-FR" even after switching to UTS 35.
It looks like UTS 35 expects that RFC 5646 canonicalization strips away all extlang subtags, but from what I can tell RFC 5646 is a bit ambiguous about this point. RFC 5646 does contain this statement in section 4.5:

The canonical form contains no 'extlang' subtags.

But no actual process is defined for removing extraneous, unregistered extlang subtags, because RFC 5646 only describes how to replace extlang subtags which have a "Preferred-Value" in the IANA Language Subtag Registry. For example, let's take the input language tag "sgn-sgn-DE", which is a well-formed but not valid language tag (cf. RFC 5646, section 2.2.9, Classes of Conformance). I don't really know whether its canonical form is "sgn-sgn-DE", "sgn-DE", or "gsg":

If only subtags with a Preferred-Value are replaced, the canonical form is "sgn-sgn-DE", but that'd contradict the statement that canonical language tags don't contain extlang subtags.
If the canonical form is "sgn-DE" (either by removing the primary language subtag or removing the (first?) extlang subtag), we'd end up with a canonical form which isn't actually canonical, because "sgn-DE" itself is a redundant language tag whose canonical form is "gsg".
If "gsg" is the canonical form, we'd need to apply RFC 5646 canonicalization two times, but RFC 5646 section 4.5 Canonicalization of Language Tags states that the canonicalization steps are applied in order and doesn't state that steps possibly need to be repeated:

[...] it has been canonicalized by applying each of the following steps in order [...]
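To make the case-normalization question above concrete, here is a minimal TypeScript sketch of the RFC 5646 section 2.1.1 case conventions: everything is lowercase, except that two-letter subtags are uppercase and four-letter subtags are title case when they are neither the first subtag nor after a singleton. `normalizeSubtagCase` is a hypothetical helper name, not the specification's CanonicalizeLanguageTag, and it assumes the input is already structurally valid.

```typescript
// Hypothetical helper illustrating the RFC 5646 section 2.1.1 case conventions.
// Assumes `tag` is already a structurally valid language tag.
function normalizeSubtagCase(tag: string): string {
  let afterSingleton = false;
  const out: string[] = [];
  tag.split("-").forEach((subtag, index) => {
    const lower = subtag.toLowerCase();
    if (lower.length === 1) {
      // Singleton: it and everything after it stay lowercase.
      afterSingleton = true;
      out.push(lower);
    } else if (index > 0 && !afterSingleton && lower.length === 2) {
      // Two-letter subtag in region position: uppercase.
      out.push(lower.toUpperCase());
    } else if (index > 0 && !afterSingleton && lower.length === 4) {
      // Four-letter subtag in script position: title case.
      out.push(lower.charAt(0).toUpperCase() + lower.slice(1));
    } else {
      out.push(lower);
    }
  });
  return out.join("-");
}

console.log(normalizeSubtagCase("Fr-lAtN-fR")); // fr-Latn-FR
```

The sketch works purely on subtag length and position, so it needs no registry data.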

This is not the only issue with extlang subtags in RFC 5646 and the language tag registry. For example section 2.2.2 Extended Language Subtags:

Extended language subtag records MUST include a 'Preferred-Value'. The 'Preferred-Value' and 'Subtag' fields MUST be identical.

But the registry has these entries for deprecated extlang subtags which don't have Preferred-Values:

Type: extlang
Subtag: lsg
Description: Lyons Sign Language
Added: 2009-07-29
Deprecated: 2018-03-08
Prefix: sgn

Type: extlang
Subtag: rsi
Description: Rennellese Sign Language
Added: 2009-07-29
Deprecated: 2017-02-23
Prefix: sgn

Type: extlang
Subtag: yds
Description: Yiddish Sign Language
Added: 2009-07-29
Deprecated: 2015-02-12
Prefix: sgn

FWIW, ICU simply uses the first extlang subtag as the replacement for the primary language subtag.

That means for the "sgn-sgn-DE" case from above, ICU returns the canonicalization "sgn-DE".
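The behaviour described here can be sketched as follows. This is only an illustration of "the first extlang subtag replaces the primary language subtag", not ICU's actual code; `promoteFirstExtlang` is a hypothetical name, and the sketch assumes a well-formed RFC 5646 tag, where a three-letter alphabetic subtag in second position can only be an extlang.

```typescript
// Hypothetical helper: mimics the described behaviour, not ICU's implementation.
function promoteFirstExtlang(tag: string): string {
  const subtags = tag.split("-");
  const isAlpha = (s: string, min: number, max: number): boolean =>
    new RegExp(`^[A-Za-z]{${min},${max}}$`).test(s);
  // An extlang is a 3-letter subtag directly after a 2-3 letter primary language.
  if (subtags.length >= 2 && isAlpha(subtags[0], 2, 3) && isAlpha(subtags[1], 3, 3)) {
    // Drop the primary language; the first extlang takes its place.
    return subtags.slice(1).join("-");
  }
  return tag;
}

console.log(promoteFirstExtlang("sgn-sgn-DE")); // sgn-DE
```

Note that the result "sgn-DE" is not run through any further replacement step, which is exactly the gap the second question above points at.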

Back to UTS 35.

If the BCP 47 primary language subtag matches the type attribute of a languageAlias element in Supplemental Data, replace the language subtag with the replacement value.

"supplementalMetadata.xml" contains some <languageAlias> entries for languages which are not registered in the IANA language tag registry, for example:

<languageAlias type="scc" replacement="sr" reason="deprecated"/>

and

<languageAlias type="deu" replacement="de" reason="overlong"/>

So if we simply refer to UTS 35, it looks like we need to start to support these entries. Are we okay with that? Do we know where these entries come from? (Probably just ISO 639-2 and/or ISO 639-3 codes which already have corresponding ISO 639-1 codes?)

<languageAlias> also contains macro-language replacements, so "zh-cmn" will then be canonicalized to "zh" and no longer to "cmn". Correct? Is this actually implemented in ICU?
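As a rough illustration of what supporting these entries would involve, the sketch below applies a languageAlias lookup to the primary language subtag. The table contains only the two entries quoted above; `languageAliases` and `replaceLanguageAlias` are hypothetical names, and the sketch assumes every replacement value is a single language subtag, which may not hold for all entries in the real data.

```typescript
// Only the two <languageAlias> entries quoted above; real CLDR data has many more.
const languageAliases: ReadonlyMap<string, string> = new Map([
  ["scc", "sr"],
  ["deu", "de"],
]);

// Hypothetical helper: replaces the primary language subtag if it has an alias.
function replaceLanguageAlias(tag: string): string {
  const [language, ...rest] = tag.split("-");
  const replacement = languageAliases.get(language.toLowerCase());
  return replacement === undefined ? tag : [replacement, ...rest].join("-");
}

console.log(replaceLanguageAlias("scc-Latn")); // sr-Latn
console.log(replaceLanguageAlias("deu")); // de
```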

If the BCP 47 region subtag matches the type attribute of a territoryAlias element in Supplemental Data, replace the language subtag with the replacement value, as follows:

<territoryAlias> contains many entries to replace 3-digit region codes with their 2-alpha replacements. Is this actually implemented in ICU?

If there are multiple territories:

Look up the most likely territory for the base language code (and script, if there is one).

If that likely territory is in the list, use it.

Does "likely territory" mean we need to incorporate the data from "likelySubtags.xml"? So, for example "ru-SU" will now be canonicalized to "ru-RU", whereas "hy-SU" will be canonicalized to "hy-AM"? Also here: Is this actually implemented in ICU?
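If the answer is yes, the lookup would look roughly like the sketch below. Both tables are illustrative stand-ins written for this example, not copies of "supplementalMetadata.xml" or "likelySubtags.xml"; the function name is hypothetical, and falling back to the first listed territory when the likely one is not in the list is an assumption of this sketch.

```typescript
// Illustrative data only; not copied from the CLDR files.
const territoryAliases: ReadonlyMap<string, string[]> = new Map([
  ["SU", ["RU", "AM", "AZ", "BY"]],
]);
const likelyTerritory: ReadonlyMap<string, string> = new Map([
  ["ru", "RU"],
  ["hy", "AM"],
]);

// Hypothetical helper: picks a replacement for a deprecated region subtag.
function replaceTerritory(language: string, region: string): string {
  const candidates = territoryAliases.get(region);
  if (candidates === undefined) return region; // no alias, keep the region
  const likely = likelyTerritory.get(language);
  // Prefer the language's likely territory when it is one of the candidates.
  if (likely !== undefined && candidates.includes(likely)) return likely;
  // Assumption: otherwise fall back to the first candidate.
  return candidates[0];
}

console.log(replaceTerritory("ru", "SU")); // RU
console.log(replaceTerritory("hy", "SU")); // AM
```

The point of the sketch is that the result depends on the language subtag, so region replacement cannot be done from the alias table alone.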

FrankYFTang

I think we also need to change IsStructurallyValidLanguageTag

FrankYFTang · 2019-01-08T20:16:55Z

spec/locales-currencies-tz.html

@@ -60,11 +60,7 @@ <h1>CanonicalizeLanguageTag ( _locale_ )</h1>

      <p>
        The CanonicalizeLanguageTag abstract operation returns the canonical and case-regularized form of the _locale_ argument (which must be a String value that is a structurally valid BCP 47 language tag as verified by the IsStructurallyValidLanguageTag abstract operation).


"BCP 47 language tag" => "Unicode BCP 47 Locale Identifier"

FrankYFTang · 2019-01-08T20:20:35Z

spec/locales-currencies-tz.html

@@ -21,7 +21,7 @@ <h1>Case Sensitivity and Case Mapping</h1>
    <h1>Language Tags</h1>

    <p>
-      The ECMAScript 2019 Internationalization API Specification identifies locales using language tags as defined by IETF BCP 47 (RFCs 5646 and 4647 or their successors), which may include extensions such as those registered through RFC 6067. Their canonical form is specified in RFC 5646 section 4.5 or its successor.
+      The ECMAScript 2019 Internationalization API Specification identifies locales using language tags as defined by IETF BCP 47 (RFCs 5646 and 4647 or their successors), which may include extensions such as those registered through RFC 6067. Their canonical form is that of a Unicode BCP 47 Locale Identifier, as specified in <a href="http://unicode.org/reports/tr35/#BCP_47_Conformance">Unicode Technical Standard #35 LDML § 3.3 BCP 47 Conformance</a>.
    </p>

    <p>


Suggested Changes in IsStructurallyValidLanguageTag

  represents a well-formed
- BCP 47 language tag as specified in RFC 5646 section 2.1,
+ Unicode BCP 47 Locale Identifier as specified in Unicode Technical Standard 35 section 3.2,
  or successor,

FrankYFTang · 2019-01-08T20:24:44Z

spec/locales-currencies-tz.html

@@ -60,11 +60,7 @@ <h1>CanonicalizeLanguageTag ( _locale_ )</h1>



Suggested changes in IsStructurallyValidLanguageTag

"The abstract operation returns true if _locale_ can be generated from the ABNF grammar
- in section 2.1 of the RFC, starting with Language-Tag,
+ in section 3.2 of the UTS35, starting with unicode_locale_id
and does not contain duplicate variant or singleton subtags (other than as a private use subtag). It returns false otherwise. Terminal value characters in the grammar are interpreted as the Unicode equivalents of the ASCII octet values given."
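The duplicate check in that sentence is independent of which grammar is referenced. A minimal sketch of it, assuming the tag has already been matched against the grammar, could look like this; `hasDuplicateVariantOrSingleton` is a hypothetical helper name, not an abstract operation from the specification.

```typescript
// Variant shape: 5-8 alphanumerics, or a digit followed by 3 alphanumerics.
const variantPattern = /^(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3})$/;

// Hypothetical helper; assumes `tag` already matches the grammar.
function hasDuplicateVariantOrSingleton(tag: string): boolean {
  const subtags = tag.toLowerCase().split("-");
  const seenVariants = new Set<string>();
  const seenSingletons = new Set<string>();
  let inExtensions = false;
  // Start at 1: the first subtag is always the language.
  for (let i = 1; i < subtags.length; i++) {
    const subtag = subtags[i];
    if (subtag.length === 1) {
      if (subtag === "x") return false; // the rest is private use, which is exempt
      if (seenSingletons.has(subtag)) return true;
      seenSingletons.add(subtag);
      inExtensions = true;
    } else if (!inExtensions && variantPattern.test(subtag)) {
      if (seenVariants.has(subtag)) return true;
      seenVariants.add(subtag);
    }
  }
  return false;
}

console.log(hasDuplicateVariantOrSingleton("de-1996-1996")); // true
console.log(hasDuplicateVariantOrSingleton("en-a-bbb-x-a-ccc")); // false
```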

littledan · 2019-01-22T23:24:14Z

@anba These sound like issues among various Unicode standards. Could you file them in those corresponding repositories?

littledan · 2019-01-22T23:31:35Z

@FrankYFTang Thanks, good catch. I believe I've updated the places which had accidentally been omitted previously. Do you think you could double-check to see if things look correct now?

FrankYFTang · 2019-01-23T00:24:04Z

@FrankYFTang Thanks, good catch. I believe I've updated the places which had accidentally been omitted previously. Do you think you could double-check to see if things look correct now?

LGTM

FrankYFTang · 2019-01-23T00:25:20Z

I think this PR is ready to be merged.

tc39/ecma402#289 changed ECMA-402 to use Unicode BCP47 locale identifiers instead of BCP47 language tags for language tags. That means extlang subtags are no longer supported in language tags. Differential Revision: https://phabricator.services.mozilla.com/D23536

littledan mentioned this pull request Oct 29, 2018

Normative: Simplify algorithms, without privateuse/grandfathered tags tc39/proposal-intl-locale#66

Merged

FrankYFTang reviewed Jan 8, 2019

FrankYFTang mentioned this pull request Jan 9, 2019

Remove unsupported irregular grandfathered tags. tc39/test262#2029

Merged

Fix-ups for references to the definition of language tag

a11cf25

littledan merged commit ac60837 into tc39:master Jan 23, 2019

FrankYFTang mentioned this pull request Jan 23, 2019

Ref UTS35 in IsStructurallyValidLanguageTag #317

Merged

anba mentioned this pull request Mar 11, 2019

Fix various test issues (Was: Unicode BCP 47 Locale Identifier changes) tc39/test262#2097

Merged

iamstolis mentioned this pull request Apr 30, 2020

'accepts valid language tags' tests are incorrect compat-table/compat-table#1615

Closed
