This repository has been archived by the owner on Jan 25, 2022. It is now read-only.

Using Unicode locale ID vs BCP 47 in our spec #63

Closed
nciric opened this issue Oct 2, 2018 · 51 comments

Comments

@nciric

nciric commented Oct 2, 2018

@littledan this is a proposal we could work into our Locale spec, if we can get the group to agree on the change.

The current spec (and most of the constructors) expects a BCP 47 locale ID. A cleaner approach would be to use a Unicode locale ID; see here for the differences:

http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html#BCP_47_Conformance

It does not allow for the full syntax of [BCP47]:

  • No irregular or BCP47 grandfathered tags are allowed
  • No extlang subtags are allowed
  • A tag must not start with the subtag "x". Thus a privateuse (eg x-abc) can only be after a language subtag like "und"

It allows for certain additions:

  • For field separator characters, the "_" character can be used as well as the "-" used in [BCP47].
  • "root" to indicate the generic locale used as the parent of all languages in the CLDR data model.
  • Certain codes that are private-use in BCP-47 and ISO are given semantics by LDML.
  • Each macrolanguage has an identified primary encompassed language. That encompassed language is treated as an alias for the macrolanguage, and thus is replaced when canonicalizing.
  • The language tag may begin with a script rather than a language (specialized use only).
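
As an illustration of the list above, here is a minimal sketch of two of the restrictions (no leading "x", no extlang) and one of the additions ("_" as a separator). The helper names `normalizeSeparators` and `checkRestrictions` are hypothetical and do not come from UTS 35, ECMA-402, or ICU; grandfathered tags would need a lookup table and are not covered.

```javascript
// Hypothetical helpers for illustration only.

// Addition: "_" is accepted as a field separator alongside "-".
function normalizeSeparators(tag) {
  return tag.replace(/_/g, "-");
}

// Restrictions: report why a tag falls outside the Unicode locale ID syntax.
// Grandfathered tags are deliberately not handled here.
function checkRestrictions(tag) {
  const subtags = normalizeSeparators(tag).toLowerCase().split("-");

  // A tag must not start with the private-use singleton "x".
  if (subtags[0] === "x") {
    return { ok: false, reason: "starts with private-use subtag" };
  }

  // In BCP 47 syntax, a 3-letter alphabetic subtag directly after a
  // 2-3 letter primary language can only be an extlang.
  const hasExtlang =
    /^[a-z]{2,3}$/.test(subtags[0]) &&
    subtags.length > 1 &&
    /^[a-z]{3}$/.test(subtags[1]);
  if (hasExtlang) {
    return { ok: false, reason: "contains extlang subtag" };
  }

  return { ok: true, reason: "" };
}

console.log(normalizeSeparators("en_US"));
console.log(checkRestrictions("x-abc"));
console.log(checkRestrictions("und-x-abc"));
console.log(checkRestrictions("zh-yue"));
```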

There are multiple problems with BCP 47 tags, from slightly annoying grandfathered tags (the source of most Locale bugs in V8) to script mapping.

For example:

@zbraniecki
Member

Wow. It seems like the differences are pretty much in line with our pain points (irregular, grandfathered, privateuse without langtag) or things we already talked about supporting on input (_ -> -).

I'm cautiously excited about this proposal.

@aphillips

@zbraniecki Their pain points and yours exist for equivalent reasons: these are notable problems when dealing with language tags in the context of locale-based APIs. What's more, CLDR provides the basis of many underlying implementations, so it makes sense to arrive at very similar choices when dealing with these issues. As such, I support this proposal.

Some minor nits: the excluded tags at least potentially exist in content. They need to be addressed, even if that means mapping them all to und or root.

It is a bit incorrect to say that all grandfathered tags are not allowed. The regular grandfathered tags all canonicalize to their modern equivalents (there is no round trip).

Similarly, extlang is permitted as input, but implementations strip either the primary (zh-yue -> yue) or map away the primary enclosed language (zh-cmn -> zh).
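
A minimal sketch of the two extlang cases just described, assuming a tiny hand-written table. The names `PRIMARY_ENCOMPASSED` and `removeExtlang` are hypothetical, and the table holds only the one entry mentioned in this thread; real data would come from CLDR or the language subtag registry.

```javascript
// Illustrative subset only. Maps an extlang to the macrolanguage for which
// it is the primary encompassed language.
const PRIMARY_ENCOMPASSED = new Map([["cmn", "zh"]]);

function removeExtlang(tag) {
  const [language, second, ...rest] = tag.split("-");
  const isExtlang =
    /^[A-Za-z]{2,3}$/.test(language) && /^[A-Za-z]{3}$/.test(second || "");
  if (!isExtlang) {
    return tag; // nothing to strip
  }
  const extlang = second.toLowerCase();
  // Either map the encompassed language back to its macrolanguage
  // (zh-cmn -> zh), or let the extlang replace the primary (zh-yue -> yue).
  const replacement = PRIMARY_ENCOMPASSED.get(extlang) || extlang;
  return [replacement, ...rest].join("-");
}

console.log(removeExtlang("zh-yue"));
console.log(removeExtlang("zh-cmn"));
console.log(removeExtlang("zh-cmn-Hans-CN"));
```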

@markusicu

+1

Some of BCP 47 + language-subtag-registry seems more geared towards bibliographic use.
Unicode language identifiers reflect computer industry practice for tagging of messages (translations) and contents (web pages, emails, ...).

FYI, the CLDR spec link above is for the latest draft (which will soon be released for CLDR 34).
For reference to a stable URL for the differences I would use http://www.unicode.org/reports/tr35/#BCP_47_Conformance

For the definition of Unicode Language Identifier: http://www.unicode.org/reports/tr35/#Unicode_language_identifier

@macchiati

+1, for reasons already stated. Can follow up tomorrow.

@macchiati

Well, not quite "tomorrow"...

For the reasons stated, it is much cleaner to use the Unicode locale identifiers — the cleanest being the "Unicode BCP 47 locale identifiers" as in Unicode BCP 47 Conformance (draft, but soon to be released). Those are all conformant BCP 47 language tags, but with some additional semantic restrictions and semantic additions.

In case it is useful, note that Addison and I are the editors of the main RFC of BCP 47.

@aphillips

My one caveat/concern with this thread and related ones is: there is a universe of tags, including rubbish ones, that can't be overlooked by Intl. There needs to be a clearly defined mapping or method of handling them, given that someone is finding utility out there for using said tags. Unicode's mapping is helpful, but not round trip. The constraints provided may not be enough: say what happens with the other tags, even the inconvenient ones. (Saying that rubbish things happen with rubbish tags is fine).

I guess my objection could be summed up as: I don't like the gap UTS35 leaves in grandfathered tags. Just say they all turn into root or something innocuous or useless (tlh-Cyrl-AQ !). Ditto private use tags. Further, specify that the tag in may not be recoverable later, at least in these cases.

Otherwise, +1 to @macchiati

Sorry for brevity: (Tablet, airplane)

@markusicu

I guess my objection could be summed up as: I don't like the gap UTS35 leaves in grandfathered tags. Just say they all turn into root or something innocuous or useless (tlh-Cyrl-AQ !).

I agree that it would be useful to specify what to do with them, rather than "cannot be converted".

Simplest: Turn them into und or root, depending on whether root makes sense (the spec already has conditionals for that).

Ditto private use tags.

CLDR does say to prepend "und-" in conversion to Unicode lang IDs. (At least in the draft for CLDR 34.)

The conversion to BCP 47 could turn an initial "und-x-" into just "x-" to make all-privateuse tags round-trip, but then tags that are "und-x-..." to begin with won't round-trip. You have to choose one or the other. I think it's fair to leave the "und-" prefix alone, especially looking at what a pain it is trying to support privateuse tags "properly". (They are the only case where conceivably a getLanguageSubtag() API would return a string of arbitrary length for a valid tag, rather than a single subtag of at most 8 characters.)
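
To make the trade-off concrete, here is a small sketch of the two choices. The function names are hypothetical and this is not CLDR's or ICU's code; it only models the "prepend und-" rule and the optional stripping on the way back.

```javascript
// BCP 47 -> Unicode language ID: an all-private-use tag gets "und-" prepended.
function toUnicodeId(bcp47Tag) {
  return /^x-/i.test(bcp47Tag) ? "und-" + bcp47Tag : bcp47Tag;
}

// Choice 1: leave the "und-" prefix alone when converting back.
function toBcp47KeepUnd(unicodeId) {
  return unicodeId;
}

// Choice 2: turn an initial "und-x-" back into "x-".
function toBcp47StripUnd(unicodeId) {
  return unicodeId.replace(/^und-(?=x-)/i, "");
}

for (const input of ["x-abc", "und-x-abc"]) {
  const id = toUnicodeId(input);
  console.log(input, "->", id, "->", toBcp47KeepUnd(id), "|", toBcp47StripUnd(id));
}
```

With choice 2 the tag "x-abc" round-trips but "und-x-abc" comes back as "x-abc"; with choice 1 it is the other way around.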

Further, specify that the tag in may not be recoverable later, at least in these cases.

SGTM

@aphillips

@markusicu You could turn x- into und-x-x- for round trip: once you see the X singleton, further subtag checking is turned off (save for 1*8alphanum). It still doesn't produce a useful locale, but I thought I'd point it out...
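
A sketch of that "und-x-x-" idea, with hypothetical function names. It relies only on the fact that "x" is itself a legal private-use subtag (1*8alphanum).

```javascript
// Encode an all-private-use tag by prefixing "und-x-", so that the original
// "x" singleton survives as the first private-use subtag.
function encodePrivateUseOnly(tag) {
  return /^x-/i.test(tag) ? "und-x-" + tag : tag;
}

// Decode: only strip the prefix when it is followed by another "x-".
function decodePrivateUseOnly(id) {
  return id.replace(/^und-x-(?=x-)/i, "");
}

console.log(decodePrivateUseOnly(encodePrivateUseOnly("x-abc")));
console.log(decodePrivateUseOnly(encodePrivateUseOnly("und-x-abc")));
```

Both "x-abc" and "und-x-abc" survive the trip in this sketch. The remaining collision is an input that already reads "und-x-x-abc", which would decode to "x-abc".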

@littledan
Member

Good to see the above discussion. I think this is a really important issue.

Switching to referencing Unicode locale identifiers sounds good to me at a high level, but we've discussed some aspects of Unicode locale identifiers and come to different conclusions. For example, we explicitly decided to not support some of the allowed features in Unicode language tags, such as _ instead of -. In Intl v1, it was a particular design decision to not expose the root locale, to avoid misuse. But we can reconsider these things.

I have definitely heard feature requests from web developers about accepting various different kinds of tags, as @aphillips mentions, but it's not clear what the API, definition or data sources should be. For some of these tags, we were considering a potential future separate API for their processing.

@macchiati

macchiati commented Oct 6, 2018 via email

@jungshik

Good to see the above discussion. I think this is a really important issue.

Switching to referencing Unicode locale identifiers sounds good to me at a high level, but we've discussed some aspects of Unicode locale identifiers and come to different conclusions. For example, we explicitly decided to not support some of the allowed features in Unicode language tags, such as _ instead of -. In Intl v1, it was a particular design decision to not expose the root locale, to avoid misuse. But we can reconsider these things.

+1 I'd add to the list disallowing a language tag starting with a script subtag.

@jungshik

Using Unicode locale ID vs BCP 47 in our spec

What's used in the current spec is not BCP 47 alone but "BCP 47 + RFC 6067 + IANA Language subtag registry".

@macchiati

macchiati commented Oct 11, 2018 via email

@littledan
Member

Related issue: tc39/ecma402#212

@littledan
Member

@macchiati This looks great--if we stick to Unicode BCP 47 locale identifiers, it seems like many annoying edge cases that we've spent a lot of time working through are simply defined away.

@jungshik

@macchiati : With 'Unicode BCP 47 locale identifier', how are variants like 'preeuro', 'stroke', 'cyrillic', 'direct' and 'pinyin' handled? (see tc39/ecma402#273 ). I hope they're not given any special treatment/mapping.

The current ICU implementation results in the following mapping and many others : (after going through forLanguageTag and toLanguageTag)

zh-pinyin ==>  zh-u-co-pinyin
es-ES-preeuro => es-ES-u-cu-esp
uz-UZ-CYRILLIC =>  uz-Cyrl-UZ

@macchiati

macchiati commented Oct 15, 2018 via email

@littledan
Member

@macchiati That helps, thanks.

So, if we say that Intl.Locale (and all of ECMA-402's constructors) supports only Unicode BCP 47 locale identifiers, those would throw a RangeError. Would folks be happy with those semantics?

In a follow-on proposal, we could create alternate factory functions on Intl.Locale for various more tolerant/legacy locale identifiers.
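
For anyone who wants to probe what a given engine does today, a minimal sketch follows. `isAcceptedByIntlLocale` is a hypothetical helper; it only reports whether the `Intl.Locale` constructor throws a RangeError. Results for legacy tags such as the ones discussed above depend on the engine and its ICU version, so none are asserted here.

```javascript
// Returns true if Intl.Locale accepts the tag, false if it throws RangeError.
function isAcceptedByIntlLocale(tag) {
  try {
    new Intl.Locale(tag);
    return true;
  } catch (e) {
    if (e instanceof RangeError) {
      return false;
    }
    throw e; // anything else is unexpected
  }
}

// "_" is not a valid separator for the constructor, so "en_US" is rejected.
for (const tag of ["en-US", "en_US", "zh-pinyin", "es-ES-preeuro"]) {
  console.log(tag, isAcceptedByIntlLocale(tag));
}
```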

@macchiati

macchiati commented Oct 15, 2018 via email

@jungshik

@macchiati Thank you for the clarification. My question was whether the canonicalization of bogus/legacy variant subtags currently done by ICU (such as mapping zh-pinyin to zh-u-co-pinyin) is allowed/required by Unicode BCP 47 locale identifier handling. Good to hear that it's not.

@littledan wrote:

In a follow-on proposal, we could create alternate factory functions on Intl.Locale for various more tolerant/legacy locale identifiers.

Why do you want to do that? What would we gain from this?

if we say that Intl.Locale (and all of ECMA-402's constructors) supports only Unicode BCP 47 locale identifiers, those would throw a RangeError. Would folks be happy with those semantics?

Well, zh-pinyin, es-ES-preeuro etc. are still structurally valid per BCP 47, although pinyin and preeuro are NOT registered, so they are not valid variant subtags per BCP 47.

The current spec does not throw a RangeError for language tags that are structurally valid but (partly) made of unregistered subtags. Instead, it just passes them through. Changing that behavior would lead to a significant (?) burden on the implementation. Cf. ICU, which does not go beyond the structural validity check (+ canonicalization) either, although it may in the future.

BTW, ECMA-402 does require that a given timezone ID is checked to see if it's in the list of allowed tz IDs. The SpiderMonkey implementation has a rather large set of mapping/exception lists on top of ICU's list. For timezone IDs, it's a lot more manageable than for lang tags.

@jungshik

One more clarifying question: What part of the IANA language tag registry's deprecated/preferred value mapping has to be followed and what part should not in "Unicode BCP 47 locale identifiers" ?

'Unicode BCP 47 locale identifier' has its own mapping entries for language and regions. For some subtags, it's more comprehensive (e.g. treatment of region subtag 'SU' in a context dependent manner). For others, it's less or different.

@macchiati

macchiati commented Oct 18, 2018 via email

@jungshik

Thank you for a long reply with details. The current ICU implementation does the first two (structure check and mechanical canonicalization along with mapping deprecated subtags to preferred values). So do the spec and implementations of Ecma Intl.Locale and locale parameter handling in other Intl APIs.

What is not done is checking against the list of valid subtags.
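
A sketch of the difference between the two checks, for variant subtags only. The function names and `REGISTERED_VARIANTS` are hypothetical, and the set is a tiny illustrative subset rather than the real registry.

```javascript
// Illustrative subset of registered variants; a real validity check would
// load the full list of valid subtags.
const REGISTERED_VARIANTS = new Set(["valencia", "1996", "fonipa"]);

// Structural check: 5-8 alphanumerics, or a digit followed by 3 alphanumerics.
function isStructurallyValidVariant(subtag) {
  return /^(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})$/.test(subtag);
}

// Validity check: the subtag must also appear in the list of valid subtags.
function isRegisteredVariant(subtag) {
  return REGISTERED_VARIANTS.has(subtag.toLowerCase());
}

for (const v of ["valencia", "preeuro", "abc"]) {
  console.log(v, isStructurallyValidVariant(v), isRegisteredVariant(v));
}
```

The second check is the one that needs data, which is why it is the more expensive of the two to require.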

@macchiati

macchiati commented Oct 21, 2018 via email

@littledan
Member

But in the actual world, general purpose systems should give the choice as to whether to validate or not.

Do we want to start doing this checking? Given that mobile phones are a key use case for us, and we have a long history of not checking in ECMA-402 on the web, maybe we should leave that in the "follow-on proposal" bucket.

Why do you want to do that? What would we gain from this?

I'm not sure if it would be so high priority, but the goal would be to help JS programs deal with legacy/platform-specific locale identifiers. Separating into a separate API keeps the core simple.

@littledan
Member

Well, if we don't barf on them or "canonicalize" them to root, it becomes difficult to do things like apply additional tags to them. The current Intl.Locale algorithm is full of special cases for this particular purpose.

@aphillips

I'm mostly in violent agreement with @macchiati. I guess my position boils down to: don't barf, canonicalize to root to save all the attempts to extract "meaning" from the meaningless.

@littledan
Member

littledan commented Oct 23, 2018

@aphillips and I talked in the W3C i18n meeting about this topic further, in particular about the few grandfathered tags that don't canonicalize to anything. @aphillips suggested that CLDR add canonicalizations for them (possibly matching what ICU outputs), and we move our reference for this data from IANA to CLDR.

Would anyone be interested in filing these CLDR tickets? @anba wrote up the list of the exceptions in #12 (comment) .

@macchiati

macchiati commented Oct 23, 2018 via email

@littledan
Member

littledan commented Oct 23, 2018

@macchiati Thanks! I missed that change (not sure how, the text is very straightforward). Seems like there's nothing to change in CLDR, just for the spec text in this proposal to be updated.

@macchiati

macchiati commented Oct 23, 2018 via email

@aphillips

@littledan I drew the action to follow up on this, so thanks for doing this.
@macchiati I thought you had done this---and you had. I agree about not bothering mapping the cel-gaulish's of the world.

@macchiati

macchiati commented Oct 23, 2018 via email

@littledan
Member

littledan commented Oct 29, 2018

Clarifying question: Is the canonicalization in step 1 of the BCP 47 Language Tag to Unicode BCP 47 Locale Identifier algorithm intended to sort the -u- extensions? (My guess: yes?)

littledan added a commit to littledan/ecma402 that referenced this issue Oct 29, 2018
UTS 35, rather than RFC 5646, provides a more modern and regular
normalization algorithm for locales. This standard definition will
be implementable in ICU and then shared among implementations,
rather than relying on buggy, implementation-specific normalization
algorithms. It also provides a more regular and easier-to-manipulate
form for Intl.Locale.

c.f. tc39/proposal-intl-locale#63
@littledan
Member

We've concluded that we will reference Unicode BCP 47 Locale Identifiers, which resolves this issue. Thanks for suggesting the simplification here!

littledan added a commit to tc39/ecma402 that referenced this issue Jan 23, 2019
UTS 35, rather than RFC 5646, provides a more modern and regular
normalization algorithm for locales. This standard definition will
be implementable in ICU and then shared among implementations,
rather than relying on buggy, implementation-specific normalization
algorithms. It also provides a more regular and easier-to-manipulate
form for Intl.Locale.

c.f. tc39/proposal-intl-locale#63
@FrankYFTang
Contributor

We could also add some specific aliases (such as cel-gaulish → xtg-x-cel-gaulish) although since these are essentially never used, it hardly seems worth the effort.

Mark, I have one problem related to this in the test.
Could you explain to me what xtg means in xtg-x-cel-gaulish?

Currently https://github.com/tc39/test262/blob/master/test/intl402/Locale/extensions-grandfathered.js fails because of this.

cel-gaulish gets turned into xtg-x-cel-gaulish first, then we try to build the locale by replacing subtags with the
options: {
  language: "fr",
  script: "Cyrl",
  region: "FR",
  numberingSystem: "latn",
},
The current expectation in the test is
"fr-Cyrl-FR-u-nu-latn"
but my implementation produces "fr-Cyrl-FR-u-nu-latn-x-cel-gaulish" because cel-gaulish first became xtg-x-cel-gaulish.

I am currently using icu::Locale to parse the language/script/region/variant/other, but that causes not only parsing but also canonicalization. Maybe I should just build my own simple parser to do the replacement instead, so I can avoid such "early canonicalization".
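
A rough sketch of what such a "simple parser" might look like, assuming well-formed input of the shape language[-script][-region][-rest]. The names `parseLanguageId` and `applyOverrides` are hypothetical. It does no canonicalization, does not handle the numberingSystem option, and does not special-case grandfathered tags like cel-gaulish, so it is not a claim about what the test should expect.

```javascript
// Split a tag into language, optional script, optional region, and the rest,
// without canonicalizing anything.
function parseLanguageId(tag) {
  const subtags = tag.split("-");
  const parsed = { language: subtags.shift(), script: undefined, region: undefined };
  if (subtags.length > 0 && /^[A-Za-z]{4}$/.test(subtags[0])) {
    parsed.script = subtags.shift();
  }
  if (subtags.length > 0 && /^(?:[A-Za-z]{2}|[0-9]{3})$/.test(subtags[0])) {
    parsed.region = subtags.shift();
  }
  parsed.rest = subtags; // variants, extensions, private use
  return parsed;
}

// Replace language/script/region where an override is given.
function applyOverrides(tag, overrides) {
  const parsed = parseLanguageId(tag);
  const language = overrides.language || parsed.language;
  const script = overrides.script || parsed.script;
  const region = overrides.region || parsed.region;
  return [language, script, region, ...parsed.rest].filter(Boolean).join("-");
}

console.log(
  applyOverrides("en-US-u-nu-arab", { language: "fr", script: "Cyrl", region: "FR" })
);
```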

@FrankYFTang
Contributor

zh-pinyin ==>  zh-u-co-pinyin
es-ES-preeuro => es-ES-u-cu-esp
uz-UZ-CYRILLIC =>  uz-Cyrl-UZ

@macchiati wrote
"The variants on the left are not allowed in BCP 47 (and thus not in Unicode BCP 47 locale identifiers), while those on the right are Unicode BCP 47 locale identifiers."

@macchiati - Do you mean "pinyin", "preeuro" and "CYRILLIC" are not registered under https://tools.ietf.org/html/bcp47#section-3.5 so they are not allowed in BCP 47? Because these are structurally valid variants, right?

@macchiati

macchiati commented Feb 8, 2019 via email

@macchiati

macchiati commented Feb 8, 2019 via email

@FrankYFTang
Contributor

FrankYFTang commented Feb 8, 2019 via email

@littledan
Member

I'm a bit lost on the technical details here. Is there a change we need to follow up with for tests or the specification?

@FrankYFTang
Contributor

I'm a bit lost on the technical details here. Is there a change we need to follow up with for tests or the specification?

that is what I am trying to figure out.

@littledan reopened this Feb 11, 2019
@macchiati

macchiati commented Feb 11, 2019 via email

@FrankYFTang
Contributor

Here is the issue. Those 3 mappings were added for compatibility with pre-BCP 47 versions of Unicode. I don't know whether it is necessary for ICU to continue to support them (as far as I'm concerned they could be dropped). So I see the following options:

1. No change to the ECMA spec, thus follows LDML for canonicalization.
   1. File a ticket in ICU to drop those three mappings
   2. OR Use ICU, but special case those 3 cases (ugly but doable). (If #1 is going to be done, this could just be a temporary workaround).
2. OR Modify the ECMA spec to allow these 3 mappings for backwards compatibility.


@macchiati Thanks for your reply. Now I understand that it is not that I missed something in UTS 35 or LDML, but that the ICU behavior is simply out of sync with the spec. I have already filed bugs in ICU; I just wanted to make sure this is a bug rather than a "feature". See https://unicode-org.atlassian.net/browse/ICU-20187 and https://unicode-org.atlassian.net/browse/ICU-20411.

@littledan
Member

OK, sounds like there is nothing to do at the specification level then, right?

@littledan
Member

We've switched the ECMA-402 spec to Unicode BCP 47 Locale Identifiers, so this issue should be resolved.

8 participants