Normative: Reference UTS 35 Unicode BCP 47 Locale Identifiers #289
UTS 35, rather than RFC 5646, provides a more modern and regular normalization algorithm for locales. This standard definition will be implementable in ICU and then shared among implementations, rather than relying on buggy, implementation-specific normalization algorithms. It also provides a more regular and easier-to-manipulate form for Intl.Locale. cf. tc39/proposal-intl-locale#63
Based on Unicode BCP 47 Locale Identifier conversion, adopted in tc39/ecma402#289, there will be no privateuse or grandfathered locales, allowing this specification's definition of ApplyOptionsToTag and other algorithms to be simplified.
As pointed out by @FrankYFTang, we also need to change the "may" to "must" with respect to the RFC 6067 reference.
We agreed on this change in the November 2018 Intl meeting, modulo the above correction.
LGTM
Hmm, I still have some questions about this change.
I assume "canonicalize" here refers to RFC 5646 canonicalization, right? Assuming it does, I have the following questions:
But there is no actual process defined how to remove extraneous, unregistered extlang subtags, because RFC 5646 only describes how to replace extlang subtags which have a "Preferred-Value" in the IANA Language Subtag Registry. For example let's take the input language tag "sgn-sgn-DE", which is a well-formed, but not valid language tag, cf. RFC 5646, section 2.2.9. Classes of Conformance. I don't really know if its canonical form is either "sgn-sgn-DE", "sgn-DE", or "gsg":
This is not the only issue with extlang subtags in RFC 5646 and the language tag registry. For example section 2.2.2 Extended Language Subtags:
But the registry has these entries for deprecated extlang subtags which don't have Preferred-Values:
FWIW, ICU simply uses the first extlang subtag as the replacement for the primary language subtag.
That means for the "sgn-sgn-DE" case from above, ICU returns the canonicalization "sgn-DE". Back to UTS 35.
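As an aside on the ICU behaviour described just above, here is a minimal Python sketch of "use the first extlang subtag as the replacement for the primary language subtag". It is not ICU's code; the function name and the shape checks (a 2-3 letter primary language followed by a 3-letter subtag) are assumptions made for illustration, and what happens to any further extlang subtags is not stated in this thread, so the sketch leaves them untouched.

```python
import re

# Shapes taken from the RFC 5646 grammar: an extlang is 3 letters and
# may only follow a 2-3 letter primary language subtag.
_PRIMARY = re.compile(r"[A-Za-z]{2,3}")
_EXTLANG = re.compile(r"[A-Za-z]{3}")


def replace_language_with_first_extlang(tag: str) -> str:
    """Hypothetical helper: drop the primary language subtag in favour
    of the first extlang subtag, as the comment above describes for ICU."""
    subtags = tag.split("-")
    if (
        len(subtags) >= 2
        and _PRIMARY.fullmatch(subtags[0])
        and _EXTLANG.fullmatch(subtags[1])
    ):
        # The extlang takes the place of the primary language subtag.
        return "-".join([subtags[1].lower()] + subtags[2:])
    # No extlang-shaped subtag in second position: leave the tag alone.
    return tag


print(replace_language_with_first_extlang("sgn-sgn-DE"))
print(replace_language_with_first_extlang("zh-cmn"))
```

Under these assumptions "sgn-sgn-DE" becomes "sgn-DE" and "zh-cmn" becomes "cmn", matching the two results mentioned in this comment.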
"supplementalMetadata.xml" contains some <languageAlias> entries for languages which are not registered in the IANA language tag registry, for example:
and
So if we simply refer to UTS 35, it looks like we need to start to support these entries. Are we okay with that? Do we know where these entries come from? (Probably just ISO 639-2 and/or ISO 639-3 codes which already have corresponding ISO 639-1 codes?) <languageAlias> also contains macro-language replacements, so "zh-cmn" will then be canonicalized to "zh" and no longer to "cmn". Correct? Is this actually implemented in ICU?
<territoryAlias> contains many entries to replace 3-digit region codes with their 2-alpha replacements. Is this actually implemented in ICU?
Does "likely territory" mean we need to incorporate the data from "likelySubtags.xml"? So, for example "ru-SU" will now be canonicalized to "ru-RU", whereas "hy-SU" will be canonicalized to "hy-AM"? Also here: Is this actually implemented in ICU?
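To make the "likely territory" question concrete, here is a small Python sketch of one possible reading: when a deprecated region has several replacements, pick the one that is the likely region for the language, and otherwise fall back to the first replacement. The two tables are tiny illustrative excerpts written for this example, not copies of "supplementalMetadata.xml" or "likelySubtags.xml", and the function name and the fallback rule are assumptions.

```python
# Illustrative excerpt only: a deprecated region mapped to an ordered
# list of candidate replacements (assumed shape of <territoryAlias>).
TERRITORY_ALIAS = {
    "SU": ["RU", "AM"],
}

# Illustrative excerpt only: the likely region for a language
# (assumed to be derivable from likelySubtags.xml).
LIKELY_REGION = {
    "ru": "RU",
    "hy": "AM",
}


def canonical_region(language: str, region: str) -> str:
    """Hypothetical helper: resolve a deprecated region code."""
    candidates = TERRITORY_ALIAS.get(region.upper())
    if not candidates:
        # Region is not deprecated in this toy table: keep it.
        return region.upper()
    likely = LIKELY_REGION.get(language.lower())
    if likely in candidates:
        # The language's likely region is one of the replacements.
        return likely
    # Assumed fallback: take the first listed replacement.
    return candidates[0]


for lang, reg in [("ru", "SU"), ("hy", "SU")]:
    print(f"{lang}-{reg} -> {lang}-{canonical_region(lang, reg)}")
```

With these toy tables the sketch reproduces the two outcomes asked about above ("ru-SU" to "ru-RU", "hy-SU" to "hy-AM"); whether ICU actually behaves this way is exactly the open question.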
I think we also need to change IsStructurallyValidLanguageTag
spec/locales-currencies-tz.html
@@ -60,11 +60,7 @@ <h1>CanonicalizeLanguageTag ( _locale_ )</h1>

<p>
The CanonicalizeLanguageTag abstract operation returns the canonical and case-regularized form of the _locale_ argument (which must be a String value that is a structurally valid BCP 47 language tag as verified by the IsStructurallyValidLanguageTag abstract operation).
"BCP 47 language tag" => "Unicode BCP 47 Locale Identifier"
Done
@@ -21,7 +21,7 @@ <h1>Case Sensitivity and Case Mapping</h1>
<h1>Language Tags</h1>

<p>
- The ECMAScript 2019 Internationalization API Specification identifies locales using language tags as defined by IETF BCP 47 (RFCs 5646 and 4647 or their successors), which may include extensions such as those registered through RFC 6067. Their canonical form is specified in RFC 5646 section 4.5 or its successor.
+ The ECMAScript 2019 Internationalization API Specification identifies locales using language tags as defined by IETF BCP 47 (RFCs 5646 and 4647 or their successors), which may include extensions such as those registered through RFC 6067. Their canonical form is that of a Unicode BCP 47 Locale Identifier, as specified in <a href="http://unicode.org/reports/tr35/#BCP_47_Conformance">Unicode Technical Standard #35 LDML § 3.3 BCP 47 Conformance</a>.
</p>

<p>
Suggested Changes in IsStructurallyValidLanguageTag
represents a well-formed
- BCP 47 language tag as specified in RFC 5646 section 2.1,
+ Unicode BCP 47 Locale Identifier as specified in Unicode Technical Standard 35 section 3.2,
or successor,
Done
@@ -60,11 +60,7 @@ <h1>CanonicalizeLanguageTag ( _locale_ )</h1>
Suggested changes in IsStructurallyValidLanguageTag
"The abstract operation returns true if _locale_ can be generated from the ABNF grammar
- in section 2.1 of the RFC, starting with Language-Tag,
+ in section 3.2 of the UTS35, starting with unicode_locale_id
and does not contain duplicate variant or singleton subtags (other than as a private use subtag). It returns false otherwise. Terminal value characters in the grammar are interpreted as the Unicode equivalents of the ASCII octet values given."
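A rough Python sketch of what the suggested check could look like: it walks the subtags following the unicode_locale_id shape (language, optional script, optional region, variants, extensions, private use) and rejects duplicate variants and duplicate singletons outside the private use part. This is a simplification, not the specification text: the function name is hypothetical, the "u" and "t" extensions are treated like generic extensions (their keyword and field structure is not checked), and "root", script-leading identifiers, and the "_" separator are deliberately not accepted.

```python
import re

# Subtag shapes, transcribed by hand from the UTS 35 grammar.
_LANGUAGE = re.compile(r"[A-Za-z]{2,3}|[A-Za-z]{5,8}")
_SCRIPT = re.compile(r"[A-Za-z]{4}")
_REGION = re.compile(r"[A-Za-z]{2}|[0-9]{3}")
_VARIANT = re.compile(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}")
_SINGLETON = re.compile(r"[A-Za-z0-9]")
_EXT_SUBTAG = re.compile(r"[A-Za-z0-9]{2,8}")
_PU_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def is_structurally_valid(tag: str) -> bool:
    """Hypothetical, simplified structural check (see caveats above)."""
    subtags = tag.split("-")
    if not _LANGUAGE.fullmatch(subtags[0]):
        return False
    i = 1
    if i < len(subtags) and _SCRIPT.fullmatch(subtags[i]):
        i += 1
    if i < len(subtags) and _REGION.fullmatch(subtags[i]):
        i += 1
    seen_variants = set()
    while i < len(subtags) and _VARIANT.fullmatch(subtags[i]):
        variant = subtags[i].lower()
        if variant in seen_variants:
            return False  # duplicate variant subtag
        seen_variants.add(variant)
        i += 1
    seen_singletons = set()
    while i < len(subtags):
        if not _SINGLETON.fullmatch(subtags[i]):
            return False  # e.g. an extlang-shaped subtag lands here
        singleton = subtags[i].lower()
        i += 1
        if singleton == "x":
            # Private use: everything that follows is opaque, so
            # duplicates are allowed, but at least one subtag is needed.
            rest = subtags[i:]
            return bool(rest) and all(_PU_SUBTAG.fullmatch(s) for s in rest)
        if singleton in seen_singletons:
            return False  # duplicate singleton subtag
        seen_singletons.add(singleton)
        start = i
        while i < len(subtags) and _EXT_SUBTAG.fullmatch(subtags[i]):
            i += 1
        if i == start:
            return False  # a singleton must be followed by a subtag
    return True


for example in ["de-DE-1996", "de-1996-1996", "sgn-sgn-DE", "en-x-u-foo-u-bar"]:
    print(example, is_structurally_valid(example))
```

Note that under this grammar an extlang-style tag such as "sgn-sgn-DE" is simply not well-formed, which sidesteps the canonicalization question raised earlier in the thread.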
Done
@anba These sound like issues among various Unicode standards. Could you file them in those corresponding repositories?
@FrankYFTang Thanks, good catch. I believe I've updated the places which had accidentally been omitted previously. Do you think you could double-check to see if things look correct now?
LGTM
I think this PR is ready to be merged.
tc39/ecma402#289 changed ECMA-402 to use Unicode BCP47 locale identifiers instead of BCP47 language tags for language tags. That means extlang subtags are no longer supported in language tags. Differential Revision: https://phabricator.services.mozilla.com/D23536 --HG-- extra : moz-landing-system : lando
tc39/ecma402#289 changed ECMA-402 to use Unicode BCP47 locale identifiers instead of BCP47 language tags for language tags. That means extlang subtags are no longer supported in language tags. Differential Revision: https://phabricator.services.mozilla.com/D23536
tc39/ecma402#289 changed ECMA-402 to use Unicode BCP47 locale identifiers instead of BCP47 language tags for language tags. That means extlang subtags are no longer supported in language tags. Differential Revision: https://phabricator.services.mozilla.com/D23536 UltraBlame original commit: 113a287cfb7f8badb75d17bcc51731cedb64e03a