Normative: Reference UTS 35 Unicode BCP 47 Locale Identifiers #289

Merged
merged 2 commits into from
Jan 23, 2019

Conversation

littledan
Member

UTS 35, rather than RFC 5646, provides a more modern and regular
normalization algorithm for locales. This standard definition will
be implementable in ICU and then shared among implementations,
rather than relying on buggy, implementation-specific normalization
algorithms. It also provides a more regular and easier-to-manipulate
form for Intl.Locale.

cf. tc39/proposal-intl-locale#63

littledan added a commit to tc39/proposal-intl-locale that referenced this pull request Oct 29, 2018
Based on Unicode BCP 47 Locale Identifier conversion, adopted in
tc39/ecma402#289 , there will be no
privateuse or grandfathered locales, allowing this specification's
definition of ApplyOptionsToTag and other algorithms to be simplified.
@littledan
Member Author

As pointed out by @FrankYFTang , we also need to change the "may" to "must" with respect to the RFC 6067 reference.

littledan added a commit to tc39/proposal-intl-locale that referenced this pull request Nov 30, 2018
Based on Unicode BCP 47 Locale Identifier conversion, adopted in
tc39/ecma402#289 , there will be no
privateuse or grandfathered locales, allowing this specification's
definition of ApplyOptionsToTag and other algorithms to be simplified.
@littledan
Member Author

We agreed on this change in the November 2018 Intl meeting, modulo the above correction.

@FrankYFTang
Contributor

LGTM

@anba
Contributor

anba commented Dec 4, 2018

Hmm, I still have some questions about this change.

  1. Canonicalize the language tag (afterwards, there will be no extlang subtags).

I assume "canonicalize" here refers to RFC 5646 canonicalization, right? Assuming it does, I have the following questions:

  • UTS 35 doesn't require case normalization of subtags, whereas ECMA-402 did. (Case normalization is optional in RFC 5646!) I think it makes sense to still require it. For example, CanonicalizeLanguageTag("Fr-lAtN-fR") should still return "fr-Latn-FR" even after switching to UTS 35.
  • It looks like UTS 35 expects RFC 5646 canonicalization to strip away all extlang subtags, but from what I can tell RFC 5646 is a bit ambiguous on this point. RFC 5646 does contain this statement in section 4.5:

The canonical form contains no 'extlang' subtags.

But no actual process is defined for removing extraneous, unregistered extlang subtags, because RFC 5646 only describes how to replace extlang subtags that have a "Preferred-Value" in the IANA Language Subtag Registry. For example, take the input language tag "sgn-sgn-DE", which is well-formed but not valid (cf. RFC 5646, section 2.2.9, Classes of Conformance). I don't really know whether its canonical form is "sgn-sgn-DE", "sgn-DE", or "gsg":

  1. If only subtags with a Preferred-Value are replaced, the canonical form is "sgn-sgn-DE", but that'd contradict the statement that canonical language tags don't contain extlang subtags.
  2. If the canonical form is "sgn-DE" (either by removing the primary language subtag or removing the (first?) extlang subtag), we'd end up with a canonical form which isn't actually canonical, because "sgn-DE" itself is a redundant language tag whose canonical form is "gsg".
  3. If "gsg" is the canonical form, we'd need to apply RFC 5646 canonicalization twice, but RFC 5646 section 4.5 (Canonicalization of Language Tags) states that the canonicalization steps are applied in order and doesn't say that steps may need to be repeated:

[...] it has been canonicalized by applying each of the following steps in order [...]

This is not the only issue with extlang subtags in RFC 5646 and the language tag registry. For example, section 2.2.2 (Extended Language Subtags) states:

  1. Extended language subtag records MUST include a 'Preferred-Value'. The 'Preferred-Value' and 'Subtag' fields MUST be identical.

But the registry has these entries for deprecated extlang subtags which don't have Preferred-Values:

Type: extlang
Subtag: lsg
Description: Lyons Sign Language
Added: 2009-07-29
Deprecated: 2018-03-08
Prefix: sgn

Type: extlang
Subtag: rsi
Description: Rennellese Sign Language
Added: 2009-07-29
Deprecated: 2017-02-23
Prefix: sgn

Type: extlang
Subtag: yds
Description: Yiddish Sign Language
Added: 2009-07-29
Deprecated: 2015-02-12
Prefix: sgn

FWIW, ICU simply uses the first extlang subtag as the replacement for the primary language subtag.

That means for the "sgn-sgn-DE" case from above, ICU returns the canonicalization "sgn-DE".
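
To make the candidate outcomes for "sgn-sgn-DE" concrete, here is a minimal TypeScript sketch. It is not ICU's or any engine's implementation: `REDUNDANT_PREFERRED` is a hypothetical one-entry table holding only the "sgn-DE" to "gsg" mapping mentioned above, and `dropLeadingExtlang` merely imitates the ICU behaviour described in this comment (the first extlang subtag replaces the primary language subtag). All function names are made up for illustration.

```typescript
// Hypothetical one-entry excerpt of redundant-tag Preferred-Values
// (keys lower-cased); only the mapping named in the comment is included.
const REDUNDANT_PREFERRED = new Map<string, string>([["sgn-de", "gsg"]]);

// Imitates the described ICU behaviour: if the second subtag looks like an
// extlang (exactly three letters), it replaces the primary language subtag.
function dropLeadingExtlang(tag: string): string {
  const subtags = tag.split("-");
  const second = subtags[1];
  if (second !== undefined && /^[a-z]{3}$/i.test(second)) {
    return subtags.slice(1).join("-");
  }
  return tag;
}

// One pass, steps applied in order: redundant-tag replacement first,
// then extlang removal.
function canonicalizeOnce(tag: string): string {
  const afterRedundant = REDUNDANT_PREFERRED.get(tag.toLowerCase()) ?? tag;
  return dropLeadingExtlang(afterRedundant);
}

// Repeats the single pass until the tag stops changing.
function canonicalizeToFixedPoint(tag: string): string {
  let current = tag;
  for (;;) {
    const next = canonicalizeOnce(current);
    if (next === current) return current;
    current = next;
  }
}

console.log(canonicalizeOnce("sgn-sgn-DE"));         // "sgn-DE"
console.log(canonicalizeToFixedPoint("sgn-sgn-DE")); // "gsg"
```

Under these assumptions a single in-order pass stops at "sgn-DE" (option 2), and only repeating the pass reaches "gsg" (option 3).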


Back to UTS 35.

  1. If the BCP 47 primary language subtag matches the type attribute of a languageAlias element in Supplemental Data, replace the language subtag with the replacement value.

"supplementalMetadata.xml" contains some <languageAlias> entries for languages which are not registered in the IANA language tag registry, for example:

<languageAlias type="scc" replacement="sr" reason="deprecated"/> <!-- Serbian -->

and

<languageAlias type="deu" replacement="de" reason="overlong"/> <!-- [German] -->

So if we simply refer to UTS 35, it looks like we need to start to support these entries. Are we okay with that? Do we know where these entries come from? (Probably just ISO 639-2 and/or ISO 639-3 codes which already have corresponding ISO 639-1 codes?)

<languageAlias> also contains macro-language replacements, so "zh-cmn" will then be canonicalized to "zh" and no longer to "cmn". Correct? Is this actually implemented in ICU?
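
As a rough illustration of the languageAlias step quoted above, the following TypeScript sketch replaces the primary language subtag using an alias table. `LANGUAGE_ALIAS` is a hypothetical excerpt containing only the two entries quoted in this comment, and `replaceLanguageAlias` is an invented helper name; replacement values that span more than one subtag, and the macro-language entries asked about above, are deliberately not modelled.

```typescript
// Hypothetical excerpt of <languageAlias> data: only the two entries
// quoted in the comment (type -> replacement).
const LANGUAGE_ALIAS = new Map<string, string>([
  ["scc", "sr"], // reason="deprecated"
  ["deu", "de"], // reason="overlong"
]);

// Replaces the primary language subtag when it matches an alias type.
// Assumes single-subtag replacement values only.
function replaceLanguageAlias(tag: string): string {
  const subtags = tag.split("-");
  const language = subtags[0];
  if (language === undefined) return tag;
  const replacement = LANGUAGE_ALIAS.get(language.toLowerCase());
  if (replacement === undefined) return tag;
  return [replacement, ...subtags.slice(1)].join("-");
}

console.log(replaceLanguageAlias("deu-DE"));   // "de-DE"
console.log(replaceLanguageAlias("scc-Cyrl")); // "sr-Cyrl"
console.log(replaceLanguageAlias("fr-FR"));    // "fr-FR"
```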


  1. If the BCP 47 region subtag matches the type attribute of a territoryAlias element in Supplemental Data, replace the language subtag with the replacement value, as follows:

<territoryAlias> contains many entries to replace 3-digit region codes with their 2-alpha replacements. Is this actually implemented in ICU?

  1. If there are multiple territories:
    1. Look up the most likely territory for the base language code (and script, if there is one).
    2. If that likely territory is in the list, use it.

Does "likely territory" mean we need to incorporate the data from "likelySubtags.xml"? So, for example "ru-SU" will now be canonicalized to "ru-RU", whereas "hy-SU" will be canonicalized to "hy-AM"? Also here: Is this actually implemented in ICU?
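
The multiple-territory steps quoted above could be sketched in TypeScript as follows. Everything in the data tables is an assumption for illustration: `TERRITORY_ALIAS` holds an abbreviated, made-up replacement list for "SU", `LIKELY_TERRITORY` stands in for a lookup derived from "likelySubtags.xml", and the final fallback to the first listed territory is assumed rather than taken from the steps quoted here.

```typescript
// Assumed, abbreviated stand-ins for <territoryAlias> and likely-subtags data.
const TERRITORY_ALIAS = new Map<string, string[]>([
  ["SU", ["RU", "AM", "AZ", "BY"]],
]);
const LIKELY_TERRITORY = new Map<string, string>([
  ["ru", "RU"],
  ["hy", "AM"],
]);

// Picks a replacement region: the likely territory for the language if it
// is among the candidates, otherwise (assumed) the first candidate.
function replaceTerritory(language: string, region: string): string {
  const candidates = TERRITORY_ALIAS.get(region.toUpperCase());
  const first = candidates?.[0];
  if (candidates === undefined || first === undefined) return region;
  const likely = LIKELY_TERRITORY.get(language.toLowerCase());
  if (likely !== undefined && candidates.indexOf(likely) !== -1) {
    return likely;
  }
  return first; // assumed fallback, not part of the quoted steps
}

console.log(replaceTerritory("ru", "SU")); // "RU"
console.log(replaceTerritory("hy", "SU")); // "AM"
console.log(replaceTerritory("de", "DE")); // "DE"
```

With this reading, "ru-SU" and "hy-SU" end up with different regions, which is exactly the data dependency the question above is asking about.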

Contributor

@FrankYFTang FrankYFTang left a comment

I think we also need to change IsStructurallyValidLanguageTag

@@ -60,11 +60,7 @@ <h1>CanonicalizeLanguageTag ( _locale_ )</h1>

<p>
The CanonicalizeLanguageTag abstract operation returns the canonical and case-regularized form of the _locale_ argument (which must be a String value that is a structurally valid BCP 47 language tag as verified by the IsStructurallyValidLanguageTag abstract operation).
Contributor

"BCP 47 language tag" => "Unicode BCP 47 Locale Identifier"

Member Author

Done

@@ -21,7 +21,7 @@ <h1>Case Sensitivity and Case Mapping</h1>
<h1>Language Tags</h1>

<p>
- The ECMAScript 2019 Internationalization API Specification identifies locales using language tags as defined by IETF BCP 47 (RFCs 5646 and 4647 or their successors), which may include extensions such as those registered through RFC 6067. Their canonical form is specified in RFC 5646 section 4.5 or its successor.
+ The ECMAScript 2019 Internationalization API Specification identifies locales using language tags as defined by IETF BCP 47 (RFCs 5646 and 4647 or their successors), which may include extensions such as those registered through RFC 6067. Their canonical form is that of a Unicode BCP 47 Locale Identifier, as specified in <a href="http://unicode.org/reports/tr35/#BCP_47_Conformance">Unicode Technical Standard #35 LDML § 3.3 BCP 47 Conformance</a>.
</p>

<p>
Contributor

Suggested Changes in IsStructurallyValidLanguageTag

  represents a well-formed 
- BCP 47 language tag as specified in RFC 5646 section 2.1,
+ Unicode BCP 47 Locale Identifier as specified in Unicode Technical Standard 35 section 3.2,
  or successor,

Member Author

Done

@@ -60,11 +60,7 @@ <h1>CanonicalizeLanguageTag ( _locale_ )</h1>

Contributor

Suggested changes in IsStructurallyValidLanguageTag

"The abstract operation returns true if _locale_ can be generated from the ABNF grammar
 -  in section 2.1 of the RFC, starting with Language-Tag, 
 +  in section 3.2 of the UTS35, starting with unicode_locale_id
and does not contain duplicate variant or singleton subtags (other than as a private use subtag). It returns false otherwise. Terminal value characters in the grammar are interpreted as the Unicode equivalents of the ASCII octet values given."
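
The duplicate-subtag condition in the text above can be sketched roughly as follows. This is a simplified TypeScript illustration, not the specification's algorithm: it assumes the tag has already been matched against the grammar, `hasDuplicateVariantOrSingleton` is an invented name, and the variant pattern (5 to 8 alphanumerics, or a digit followed by 3 alphanumerics) is applied only before the first singleton.

```typescript
// Returns true if a variant or singleton subtag occurs more than once
// outside of private use. Assumes the tag is otherwise well-formed.
function hasDuplicateVariantOrSingleton(tag: string): boolean {
  const subtags = tag.toLowerCase().split("-");
  const seenVariants = new Set<string>();
  const seenSingletons = new Set<string>();
  let inExtension = false;

  // Index 0 is the language subtag, so scanning starts at index 1.
  for (const subtag of subtags.slice(1)) {
    if (subtag === "x") break; // private use: the rest is not checked
    if (subtag.length === 1) {
      if (seenSingletons.has(subtag)) return true;
      seenSingletons.add(subtag);
      inExtension = true;
      continue;
    }
    // Subtags after a singleton belong to that extension, not to variants.
    const looksLikeVariant = /^([a-z0-9]{5,8}|[0-9][a-z0-9]{3})$/.test(subtag);
    if (!inExtension && looksLikeVariant) {
      if (seenVariants.has(subtag)) return true;
      seenVariants.add(subtag);
    }
  }
  return false;
}

console.log(hasDuplicateVariantOrSingleton("de-1996-1996"));              // true
console.log(hasDuplicateVariantOrSingleton("en-u-ca-gregory-u-nu-latn")); // true
console.log(hasDuplicateVariantOrSingleton("en-x-a-a"));                  // false
```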

Member Author

Done

@littledan
Member Author

@anba These sound like issues among various Unicode standards. Could you file them in those corresponding repositories?

@littledan
Member Author

@FrankYFTang Thanks, good catch. I believe I've updated the places which had accidentally been omitted previously. Do you think you could double-check to see if things look correct now?

@FrankYFTang
Contributor

@FrankYFTang Thanks, good catch. I believe I've updated the places which had accidentally been omitted previously. Do you think you could double-check to see if things look correct now?

LGTM

@FrankYFTang
Contributor

I think this PR is ready to be merged.

@littledan littledan merged commit ac60837 into tc39:master Jan 23, 2019
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request Apr 9, 2019
tc39/ecma402#289 changed ECMA-402 to use Unicode BCP47
locale identifiers instead of BCP47 language tags for language tags. That means
extlang subtags are no longer supported in language tags.

Differential Revision: https://phabricator.services.mozilla.com/D23536
