diff --git a/docs/site/development/development-process/design-proposals/language-distance-data.md b/docs/site/development/development-process/design-proposals/language-distance-data.md new file mode 100644 index 00000000000..9db853fce55 --- /dev/null +++ b/docs/site/development/development-process/design-proposals/language-distance-data.md @@ -0,0 +1,199 @@ +--- +title: Language Distance Data +--- + +# Language Distance Data + +The purpose is to provide a way to match language/locales according to "closeness" rather than just a truncation algorithm, and to allow the user to specify multiple acceptable languages. The data is designed to allow for an algorithm that can account for the closeness of the relations between, say, tl and fil, or en-US and en-CA. This is based on code that we already use, but expanded to be more data-driven. + +For example, if I understand written American English, German, French, Swiss German, and Italian, and the product has {ja-JP, de, zh-TW}, then de would be the best match; if I understand only {zh}, then zh-TW would be the best match. This represents a superset of the capabilities of locale fallback. Stated in those terms, it can have the effect of a more complex fallback, such as: + +sr-Cyrl-RS + +sr-Cyrl + +sr-Latn-RS + +sr-Latn + +sr + +hr-Latn + +hr + +Note that the goal, as with the rest of CLDR, is for matching written languages. Should we find in the future that it is also important to support spoken language matching in the same way, variant weights could be supplied. + +This is related to the current aliasing mechanism, which is used to equate he and iw, for example. It is used to find the best locale ID for a given request, but does not interact with the fallback of resources *within the locale-parent chain.* It subsumes and replaces the current \ element (we'd take the current information in those elements and apply it). + +## Expected Input + +1. a weighted list of desired languages D (like AcceptLanguage) +2. 
a weighted list of available languages A (eg supported languages) + +In the examples, the weights are given in AcceptLanguage syntax, eg ";" + number in (0.0 to 1.0). The weight 0.0 means don't match at all. Unlike AcceptLanguage, however, the relations among variants like "en" and "en-CA" are taken into account. + +In very many cases, the weights will all be identical (1.0). Some exceptions might be: + +- For desired languages, to indicate a preference. For example, I happen to prefer English to German to French to Swiss German to Italian. So the desired list for me might be {"en-US;q=1", "de;q=0.9", "fr;q=0.85", "gsw;q=0.8", "it;q=0.6"} +- For available languages, it can be used to indicate the "quality" of the user experience. Thus if it is known that the German version of a product or site is quite good, but the Danish is substandard, that could be reflected in the weightings. In most cases, however, the available language weights would be the same. + +## Expected Output + +1. A "best fit" language from A +2. A measure of how good the fit is + +## Examples + +Input: + +desired: {"en-CA;q=1", "fr;q=1"} + +available: {"en-GB;q=1", "en-US;q=1"} + +threshold: script + +Output: + +en-US + +good + +Input: + +desired: {"en-ZA;q=1", "fr;q=1"} + +available: {"en-GB;q=1", "en-US;q=1", "fr-CA;q=0.9"} + +threshold: script + +Output: + +en-GB + +good + +Input: + +desired: {"de"} + +available: {"en-GB;q=1", "en-US;q=1", "fr-CA;q=0.9"} + +threshold: script + +Output: + +en-GB + +bad + +## Internals + +The following is a logical expression of how this data can be used. + +The lists are processed, with each Q value being inverted (x = 1/x) to derive a weight. There is a small progressive cost as well, so {x;q=1 y;q=1} turns into x;w=0 y;w=0.0001. Because AcceptLanguage is fatally underspecified, we also have to normalize the Q values. 
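The list preprocessing described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not say how Q values are normalized, so this sketch treats a missing Q as 1.0, caps Q at 1.0, and drops entries with Q of 0.0 ("don't match at all"). It also keeps the inverted weight and the progressive cost as two separate fields, because the text does not specify how they are combined. The names `WeightedTag`, `parse_weighted_list`, and `PROGRESSIVE_COST` are hypothetical.

```python
from typing import List, NamedTuple

# Per-position cost, taken from the "x;w=0 y;w=0.0001" example in the text.
PROGRESSIVE_COST = 0.0001


class WeightedTag(NamedTuple):
    tag: str
    weight: float  # inverted Q value (1/q)
    cost: float    # small progressive cost based on list position


def parse_weighted_list(entries: List[str]) -> List[WeightedTag]:
    """Parse entries such as "de;q=0.9" into tags with inverted weights."""
    result: List[WeightedTag] = []
    for entry in entries:
        tag, _, params = entry.partition(";")
        params = params.strip()
        # Assumption: a missing Q value means 1.0.
        q = float(params[2:]) if params.startswith("q=") else 1.0
        # Assumption: "normalizing" here just caps Q at 1.0.
        q = min(q, 1.0)
        if q <= 0.0:
            # A weight of 0.0 means "don't match at all", so skip the entry.
            continue
        cost = len(result) * PROGRESSIVE_COST
        result.append(WeightedTag(tag.strip(), 1.0 / q, cost))
    return result
```

With this sketch, `parse_weighted_list(["x;q=1", "y;q=1"])` gives both entries a weight of 1.0 and costs of 0 and 0.0001 respectively.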
+ +For each pair (d,a) in D and A: + +The base distance between d and a is computed by canonicalizing both languages and maximizing, using likely subtags, then computing the following. + +baseDistance = diff(d.language, a.language) + diff(d.script, a.script) + diff(d.region, a.region) + diff(d.variants, a.variants) + +There is also a small distance allotted for the maximization. That is, "en-Latn-CA" vs "en-Latn-CA" where the second "Latn" was added by maximization, will have a non-zero distance. Variants are handled as a sorted set, and the distance is variantDistance \* (count(variants1-variants2) + count(variants2-variants1)). As yet, there is no distance for extensions, but that may come in the future. + +We then compute: + +weight(d,a) = weight(d) \* weight(a) \* baseDistance(d,a) + +The weight of each a is then computed as the min(weight(d,a)) for all d. The a with the smallest such weight is the winner. The "goodness" of the match is given as a scale from 0.0(perfect) to 1.0 (awful). Constants are provided for a Script-only difference and a Region-only difference, for comparison. + +If, however, the winning language has too low a threshold, then the default locale (first in the available languages list) is returned. + +Note that the distance metric is *not* symmetric: the distance from zh to yue may be different than the distance from yue to zh. That happens when it is more likely that a reader of yue would understand zh than the reverse. + +Note that this doesn't have to be an N x M algorithm. Because there is a minimum threshold (otherwise returning the default locale), we can precompute the possible base language subtags that could be returned; anything else can be discarded. + +## Data Sample + +The data is designed to be relatively simple to understand. It would typically be processed into an internal format for fast processing. The data does not need to be exact; only the relative computed values are important. 
However, to keep the types of fields apart, they are given very different values. TODO: add values for [ISO 636 Deprecation Requests - DRAFT](https://cldr.unicode.org/development/development-process/design-proposals/iso-636-deprecation-requests-draft) + +\ + +\ + +\8\ + +\1\ + +\1\ + +\ + +\64\ + +\64\ + +\96\ + +\96\ + +\128\ + +\ + +\64\ + +\64\ + +\64\ + +\64\ + +\64\ + +\64\ + +\ + +\128\ + +\ + +\8\ \ + +\64\ \ + +\8\ \ + +\ + +\1024\ \ + +\256\ \ + +\64\ \ + +\16\ \ + +\ + +## Interpreting the Format + +1. The list is ordered, so the first match for a given type wins. That is, logically, you walk through the list looking for language matches. At the first one, you record the distance. Then you walk through for script differences, and so on. +2. The attributes desired and available both take language tags, and are assumed to be maximized for matching. +3. The Unknown subtags (und, Zzzz, ZZ, UNKNOWN) match any subtag of the same type. Trailing unknown values can be omitted. "\*" is a special value, used for the default distances. The macro regions (eg, 019 = Americas) match any region in them. So und-155 matches any language in Western Europe (155). + 1. As we expand, we may find out that we want more expressive power, like regex. +4. The attribute oneWay="true" indicates that the distance applies in only one direction. + +Issues + +- Should we have the values be symbolic rather than literal numbers? eg: L, S, R, ... instead of 1024, 256, 64,... +- The "\*" is a bit of a hack. Other thoughts for syntax? 
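The "first match wins" walk from Interpreting the Format can be sketched in Python as below. This is a minimal sketch, not the CLDR implementation: it handles only single subtags, exact matches, the "\*" default, and oneWay, and leaves out unknown subtags and macro regions. The `Rule` type, the `lookup_distance` function, and the rule values for tl/fil and yue/zh are hypothetical and are not taken from the data sample; only the default of 1024 for a language difference echoes the numbers mentioned under Issues.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    desired: str
    available: str
    distance: int
    one_way: bool = False


def _matches(pattern: str, subtag: str) -> bool:
    # "*" is the special value used for default distances.
    return pattern == "*" or pattern == subtag


def lookup_distance(rules: List[Rule], desired: str, available: str) -> int:
    """Return the distance from the first rule matching the subtag pair."""
    if desired == available:
        return 0
    for rule in rules:
        if _matches(rule.desired, desired) and _matches(rule.available, available):
            return rule.distance
        # Unless the rule is one-way, it also applies in the reverse direction.
        if (not rule.one_way
                and _matches(rule.desired, available)
                and _matches(rule.available, desired)):
            return rule.distance
    raise LookupError(f"no rule matches {desired!r} -> {available!r}")


# Illustrative values only; these are NOT the values from the CLDR data.
LANGUAGE_RULES = [
    Rule("tl", "fil", 8),
    Rule("yue", "zh", 64, one_way=True),
    Rule("*", "*", 1024),  # default distance for a language difference
]
```

Because the yue/zh rule in this sketch is one-way, `lookup_distance(LANGUAGE_RULES, "yue", "zh")` uses it, while the reverse direction falls through to the "\*" default. The same walk would then be repeated with separate rule lists for script and region.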
+ +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/development-process/design-proposals/list-formatting.md b/docs/site/development/development-process/design-proposals/list-formatting.md new file mode 100644 index 00000000000..18ff38972e6 --- /dev/null +++ b/docs/site/development/development-process/design-proposals/list-formatting.md @@ -0,0 +1,66 @@ +--- +title: List Formatting +--- + +# List Formatting + +We add a set of patterns used for formatting variable-length lists, such as "A, B, C, and D" as follows: + +\ + + \ + +  \{0}, {0}\ + +  \{0}, {1}\ + +  \{0}, {1}\ + +  \{0}, {1}\ + + \ + +\ + +The way this works is that you format with type = exact number if there is one (eg type="2"). If not: + +1. Format the last two elements with the "end" format. +2. Then use middle format to add on subsequent elements working towards the front, all but the very first element. That is, {1} is what you've already done, and {0} is the previous element. +3. Then use "start" to add the front element, again with {1} as what you've done so far, and {0} is the first element. + +Thus a list (a,b,c,...m, n) is formatted as + +start(a,middle(b,middle(c,middle(...end(m, n))...))) + +By using start, middle, and end, we have the possibility of doing something special between the first two and last two elements. So here's how it would work for English. + +\ + + \ + +  \{0} and {1}\ + +  \{0}, and {1}\ + +  \{0}, {1}\ + +  \{0}, {1}\ + + \ + +\ + +Thus a list (a,b,c,d) is formatted as "a, b, c, and d" using this. + +Note that a higher level needs to handle the cases of zero and one element. Typically one element would just be that element; for zero elements a different structure might be substituted. Example: + +- zero: There are no meetings scheduled. +- one: There is a meeting scheduled on Wednesday. +- other: There are meetings scheduled on Wednesday, Friday, and Saturday. 
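The start/middle/end procedure described above can be sketched in Python as follows. This is a sketch under assumptions: the element names in the pattern examples did not survive in this text, so the assignment of the four English patterns to the types "2", "end", "middle", and "start" is inferred from the stated result "a, b, c, and d". The function name `format_list` and the dictionary layout are hypothetical.

```python
from typing import Dict, List

# Assumed assignment of the English patterns to their types.
ENGLISH_PATTERNS: Dict[str, str] = {
    "2": "{0} and {1}",
    "end": "{0}, and {1}",
    "middle": "{0}, {1}",
    "start": "{0}, {1}",
}


def format_list(items: List[str], patterns: Dict[str, str]) -> str:
    """Format two or more items; zero and one are left to a higher level."""
    if len(items) < 2:
        raise ValueError("lists of zero or one element are handled by the caller")
    # Use the exact-number pattern if there is one (eg type="2").
    exact = patterns.get(str(len(items)))
    if exact is not None:
        return exact.format(*items)
    # 1. Format the last two elements with the "end" pattern.
    result = patterns["end"].format(items[-2], items[-1])
    # 2. Work towards the front with "middle", skipping the very first element.
    for item in reversed(items[1:-2]):
        result = patterns["middle"].format(item, result)
    # 3. Add the first element with "start".
    return patterns["start"].format(items[0], result)
```

For example, `format_list(["a", "b", "c", "d"], ENGLISH_PATTERNS)` yields "a, b, c, and d", and a two-element list uses the exact pattern to give "a and b".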
+ +(The grammar of the rest of the zero/one/other sentences, aside from the list, can be handled with plural formatting.) + +To account for the issue Philip raises, we might want to have alt values for a semicolon-like variant. + + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/development-process/design-proposals/locale-format.md b/docs/site/development/development-process/design-proposals/locale-format.md new file mode 100644 index 00000000000..6bcb30c0ad5 --- /dev/null +++ b/docs/site/development/development-process/design-proposals/locale-format.md @@ -0,0 +1,87 @@ +--- +title: Locale Format +--- + +# Locale Format + +**Problem:** + +Currently, we can get formats like the following: + +Chinese (Simplified Han) + +Simplified Chinese (Singapore) + +Chinese (Simplified Han, Singapore) + +English (United States) + +American English + +English (United States, Variant) + +American English (Variant) + +But we want to be able to have formats like: + +**Chinese (Simplified, Singapore)** + +**Chinese (Simplified)** + +**English (US), or English (American)** + +**English (UK), or English (British)** + +**English (US, Variant)** + +Here is a proposal for how to do this: + +Our current data looks like this (English): + +\ + +\{0} ({1})\ + +\, \ + +\ + +1. \Simplified Chinese\ +2. \Traditional Chinese\ +3. \U.S. English\ +4. \ +5. \ + +What happens is that in formatting, the fields that are not present in the type are put into {1} in the localePattern, separated by the localeSeparator (if there is more than one). + +We would change it slightly so that we could have patterns like: + +1. \Chinese (Simplified{SEP\_LEFT})\ +2. 
\English (US{SEP\_LEFT})\ + +{SEP\_LEFT} is whatever is left: separated by localeSeparator, and with localeSeparator in front + +{LEFT} is whatever is left: separated by localeSeparator, but with no initial localeSeparator + +Then we get: + +en\_US\_VARIANT => English (US, Variant) + +If there is no placeholder in the pattern, it works the old way. + +### Issue: + +1. Add context="", "standalone", "short", "short standalone" +2. If you have type="en\_US", then it will get one of the following: + 1. "": English (American) *or* English (United States) + 2. "short": English (US) + 3. "standalone": American English + 4. "short standalone": US English +3. We would also add context="short" on Regions, to get "US", and use it if there wasn't a short form of en\_US context="short" or "short standalone" + +Fallbacks: + +- short standalone => standalone => "" +- short => "" + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/development-process/design-proposals/localized-gmt-format.md b/docs/site/development/development-process/design-proposals/localized-gmt-format.md new file mode 100644 index 00000000000..a21d71261af --- /dev/null +++ b/docs/site/development/development-process/design-proposals/localized-gmt-format.md @@ -0,0 +1,100 @@ +--- +title: Localized GMT Format +--- + +# Localized GMT Format + +### Tickets + +[#3665](http://unicode.org/cldr/trac/ticket/3665) Additional time zone offset second field + +[#5382](http://unicode.org/cldr/trac/ticket/5382) Add short localized GMT format + +### Requirements + +- Many zones in the IANA time zone database use LMT as the initial rule. LMT is calculated from longitude of each location and has non-zero seconds offset. For example, America/Los\_Angeles uses -7:52:58 as the initial UTC offset. At this moment, CLDR does not have a pattern including seconds field. +- Localized GMT format is used as the final fallback of other name types. 
Other name types have short/long variations, but the localized GMT format does not. In many cases, UTC offsets can be represented in whole hours, and the offset minutes field would be redundant when a shorter format is desired. + +### Current Implementation + +In CLDR 22, the elements used for the localized GMT format are listed below: + +- \ Format patterns used for representing UTC offset. This item is a single string containing two patterns, one for positive offset and another for negative offset, separated by semicolon (;). For example, "+HH:mm;-HH:mm". Each pattern must contain "H" (0-based 24 hours field) and "m" (minutes field). +- \ Message format pattern such as "GMT{0}" used for localized GMT format. The variable part is replaced with UTC offset representation created by \ above. +- \ The string used for UTC (GMT) itself, such as "GMT". The string is used only when UTC offset is 0. + +### Proposed Changes + +Below is a high-level overview of the changes in this proposal: + +- Deprecate \ element +- Introduce new \ element, with type attribute representing combinations - ( h | hm | hms ). For example, \+H:m;-H:m\ +- Introduce new \ to store a locale-specific separator used for offset patterns. The colon character (:) is reserved in \ patterns for the locale-specific separator, and the actual pattern is produced by replacing the colon (:) with the separator character specified by the \ element. + +With the above change, root.xml would be changed from + +**Old:** + + \+HH:mm;-HH:mm\ + +**New:** + + \+H;-H\ + + \+H:m;-H:m\ + + \+H:m:s;-H:m:s\ + + \:\ + +The table below illustrates the behavior of long / short format, with the root data above. 
+ +| UTC Offset | Width | Output | Comment | +|---|---|---|---| +| -8:00:00 | long | GMT-08:00 | The negative pattern from <gmtOffsetPattern type="hm">, interpret 'H' as fixed 2 digits hour, replace ':' with <gmtOffsetSeparator> | +| | short | GMT-8 | The negative pattern from <gmtOffsetPattern type="h"> | +| -8:30:00 | long | GMT-08:30 | The negative pattern from <gmtOffsetPattern type="hm">, interpret 'H' as fixed 2 digits hour, replace ':' with <gmtOffsetSeparator> | +| | short | GMT-8:30 | The negative pattern from <gmtOffsetPattern type="hm">, interpret 'H' as variable width hour, replace ':' with <gmtOffsetSeparator> | +| -8:23:45 | long | GMT-08:23:45 | The negative pattern from <gmtOffsetPattern type="hms">, interpret 'H' as fixed 2 digits hour, replace ':' with <gmtOffsetSeparator> | +| | short | GMT-8:23:45 | The negative pattern from <gmtOffsetPattern type="hms">, interpret 'H' as variable width hour, replace ':' with <gmtOffsetSeparator> | + +- "long" format always uses zero-padded 2 digits for offset hours field, such as "00", "08", "11". +- "long" format does not use \. +- With the above two, "long" format is expected to generate fixed-length output in practice (non-zero seconds offset is not used for modern dates). +- "short" format always uses shortest offset hours field, such as "0", "8", "11". +- "short" format uses the shortest pattern (h \< hm \< hms) without offset data loss. + +Design considerations + +- \ uses single "H", "m", "s", because they just indicate the disposition of these fields. "H" is interpreted as date format pattern "HH" or "H" depending on the width context. +- Many locales simply use the root data. Some locales may override only \. +- Some existing locale data do not use any separators (e.g. zh.xml). This can be represented by setting \ to the empty string. However, an empty string as data does not fit well with the CLDR structure, so such locale data is required to provide at least \ and \. 
+- Some existing locale data uses U+2212 MINUS SIGN instead of U+002D HYPHEN-MINUS. These locales need to provide all of \ types. + +### Open Issues + +1. Distinction of H and HH in the current locale data. + +Some locales currently use single "H" in \. + + \+H:mm;-H:mm\ (cs.xml) + + \+H.mm;-H.mm\ (fi.xml) + +This proposal sets a new assumption that the "long" format is practically fixed length, and that the long format always places a leading zero before a single-digit hour value. If this assumption is not acceptable for these locales, then the design must be changed to allow "HH" or "H" in the \ element, with the "short" format interpreting "HH" as "H" (the opposite way). However, I don't think this is the case. + +2. Need for long/short message patterns? + +Although [ticket #5382](http://unicode.org/cldr/trac/ticket/5382) only mentions the offset part of the localized GMT format, some locales may want to have two \ data items of different lengths. For example, + + \Гриинуич{0}\ (bg.xml) + + \গ্রীনিচ মান সময় {0}\ (bn.xml) + +Some locales use relatively long patterns. If a long/short distinction is provided, these locales may want to provide a shorter format such as "UTC{0}". + +3. Impact on CLDR ST + +Because of another level of abstraction (separator, actual pattern width by context), this proposal may need a little bit more work on CLDR ST. 
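The long/short behavior shown in the table above can be sketched in Python as follows. This is an illustration of the proposal rather than an implementation from CLDR or ICU: the function name `format_localized_gmt`, its parameters, and the dictionary keyed by "h", "hm", and "hms" are hypothetical. It assumes the zero-offset string is still used when the offset is 0, as in the current implementation, and that minutes and seconds are always two digits.

```python
from typing import Dict

# Root data from the proposal, keyed by the pattern type.
ROOT_PATTERNS: Dict[str, str] = {
    "h": "+H;-H",
    "hm": "+H:m;-H:m",
    "hms": "+H:m:s;-H:m:s",
}


def format_localized_gmt(offset_seconds: int, width: str,
                         patterns: Dict[str, str] = ROOT_PATTERNS,
                         separator: str = ":",
                         gmt_format: str = "GMT{0}",
                         gmt_zero: str = "GMT") -> str:
    """Format a UTC offset in seconds as a long or short localized GMT string."""
    if width not in ("long", "short"):
        raise ValueError("width must be 'long' or 'short'")
    if offset_seconds == 0:
        return gmt_zero
    hours, remainder = divmod(abs(offset_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    # Pick the shortest pattern without data loss; "long" never uses "h".
    if seconds:
        pattern_type = "hms"
    elif minutes or width == "long":
        pattern_type = "hm"
    else:
        pattern_type = "h"
    positive, negative = patterns[pattern_type].split(";")
    pattern = negative if offset_seconds < 0 else positive
    pieces = []
    for ch in pattern:
        if ch == "H":
            # Fixed two digits for "long", variable width for "short".
            pieces.append(f"{hours:02d}" if width == "long" else str(hours))
        elif ch == "m":
            pieces.append(f"{minutes:02d}")
        elif ch == "s":
            pieces.append(f"{seconds:02d}")
        elif ch == ":":
            # The colon is reserved and replaced by the locale's separator.
            pieces.append(separator)
        else:
            pieces.append(ch)
    return gmt_format.format("".join(pieces))
```

Called with the offsets from the table, for example `format_localized_gmt(-8 * 3600, "short")`, this sketch reproduces the outputs listed there; passing `separator="."` shows how a locale-specific separator would replace the colon.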
+ +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/development/development-process/design-proposals/math-formula-preferences.md b/docs/site/development/development-process/design-proposals/math-formula-preferences.md new file mode 100644 index 00000000000..967e69b6fca --- /dev/null +++ b/docs/site/development/development-process/design-proposals/math-formula-preferences.md @@ -0,0 +1,49 @@ +--- +title: Math Formula Preferences +--- + +# Math Formula Preferences + +As outlined in [http://unicode.org/cldr/trac/ticket/10100](http://unicode.org/cldr/trac/ticket/10100), + +the directional flow for mathematical formulas can differ across bidirectional locales, as does the numbering system used for writing mathematical formulas. As we can see in the map below, some Arabic-speaking locales prefer a left-to-right presentation for mathematical formulas, while others prefer right-to-left. Furthermore, the numbering system used in mathematical formulas may differ from the numbering system used for formatting regular numbers. + +![image](../../../images/design-proposals/MathFormularDir_GoogleMap.PNG) + +This proposal adds some additional structure to CLDR's DTD to allow applications to format mathematical formulas properly: + +1. Under the "numbers" element, add a new element mathFormulaDirection as follows: + +\ + +\ + +\ + + \ + + \ + +2. Under the "otherNumberingSystems" element, add an additional numbering system called "math", that can be used to handle those situations where the default numbering system for the locale differs from the numbering system used in mathematical formulas. 
We can and should document in [tr35](http://www.unicode.org/reports/tr35/tr35-numbers.html#otherNumberingSystems) that if the "math" numbering system is not defined for a locale, then the default numbering system should be used. + +\ + +\ + +\ + +\ + + \ + + \ + +Given the information we have currently, the amount of new data needed in CLDR is fairly minimal. In root.xml, we would have: + +mathFormulaDirection = "left-to-right", while ar.xml would have "right-to-left", which would cover the majority of Arabic-speaking locales, but we would then add "left-to-right" in ar\_MA.xml to cover Morocco. + +Similarly, the vast majority of Arabic-speaking locales would simply inherit their "math" numbering system from the default numbering system for the locale, and we would only need to explicitly specify a "math" numbering system where it differs from the default, for example, Yemen, Oman, Iraq. + +![Unicode copyright](https://www.unicode.org/img/hb_notice.gif) \ No newline at end of file diff --git a/docs/site/images/design-proposals/MathFormularDir_GoogleMap.PNG b/docs/site/images/design-proposals/MathFormularDir_GoogleMap.PNG new file mode 100644 index 00000000000..d41cab5b163 Binary files /dev/null and b/docs/site/images/design-proposals/MathFormularDir_GoogleMap.PNG differ