Commit 94942a8

CLDR-15948 Clean up well-formedness and/or validity constraints (unicode-org#4179)
1 parent 7e21046 commit 94942a8

File tree

3 files changed

+185
-163
lines changed

docs/ldml/tr35-general.md

+156-153
@@ -68,7 +68,7 @@ The LDML specification is divided into the following parts:
6868
* [Unit Preference and Conversion Data](#Unit_Preference_and_Conversion)
6969
* [Unit Identifiers](#Unit_Identifiers)
7070
* [Nomenclature](#nomenclature)
71-
* [Syntax](#syntax)
71+
* [Unit Syntax](#unit-syntax)
7272
* [Unit Identifier Uniqueness](#Unit_Identifier_Uniqueness)
7373
* [Example Units](#Example_Units)
7474
* [Compound Units](#compound-units)
@@ -902,157 +902,160 @@ As with other identifiers in CLDR, the American English spelling is used for uni
902902

903903
> In keeping with U.S. and International practice (see Sec. C.2), this Guide uses the dot on the line as the decimal marker. In addition this Guide utilizes the American spellings “meter,” “liter,” and “deka” rather than “metre,” “litre,” and “deca,” and the name “metric ton” rather than “tonne.”
904904
905-
#### Syntax
906-
907-
The formal syntax for identifiers is provided below.
908-
Some of the constraints reference data from the unitIdComponents in [Unit_Conversion](tr35-info.md#Unit_Conversion).
909-
910-
<!-- HTML: no header -->
911-
912-
<table><tbody>
913-
<tr><td><a name='unit_identifier' href='#unit_identifier'>unit_identifier</a></td><td>:=</td>
914-
<td>core_unit_identifier<br/>
915-
| mixed_unit_identifier<br/>
916-
| long_unit_identifier</td></tr>
917-
918-
<tr><td><a name='core_unit_identifier' href='#core_unit_identifier'>core_unit_identifier</a></td><td>:=</td>
919-
<td>product_unit ("-" per "-" product_unit)*<br/>
920-
| per "-" product_unit ("-" per "-" product_unit)*
921-
<ul><li><em>Examples:</em>
922-
<ul><li>foot-per-second-per-second</li>
923-
<li>per-second</li>
924-
</ul></li>
925-
<li><em>Note:</em> The normalized form will have only one "per"</li>
926-
</ul></td></tr>
927-
928-
<tr><td>per</td><td>:=</td>
929-
<td>"per"
930-
<ul>
931-
<li><em>Constraint:</em> The token 'per' is the single value in &lt;unitIdComponent type="per"&gt;</li>
932-
</ul></td></tr>
933-
934-
<tr><td><a name='product_unit' href='#product_unit'>product_unit</a></td><td>:=</td>
935-
<td>single_unit ("-" single_unit)* ("-" pu_single_unit)*<br/>
936-
| pu_single_unit ("-" pu_single_unit)*
937-
<ul><li><em>Example:</em> foot-pound-force</li>
938-
<li><em>Constraint:</em> No pu_single_unit may precede a single unit</li>
939-
</ul></td></tr>
940-
941-
<tr><td><a name='single_unit' href='#single_unit'>single_unit</a></td><td>:=</td>
942-
<td>dimensionality_prefix? simple_unit | unit_constant
943-
<ul><li><em>Examples: </em>square-kilometer, or 100</li></ul></td></tr>
944-
945-
<tr><td><a name='pu_single_unit' href='#pu_single_unit'>pu_single_unit</a></td><td>:=</td>
946-
<td>"xxx-" single_unit | "x-" single_unit
947-
<ul><li><em>Example:</em> xxx-square-knuts (a Harry Potter unit)</li>
948-
<li><em>Note:</em> "x-" is only for backwards compatibility</li>
949-
<li>See <a href="#Private_Use_Units">Private-Use Units</a></li>
950-
</ul></td></tr>
951-
952-
<tr><td><a name='unit_constant' href='#unit_constant'>unit_constant</a></td><td>:=</td>
953-
<td>[1-9][0-9]* ("e" [1-9][0-9]*)?
954-
<ul><li><em>Examples:</em>
955-
<ul><li>kilowatt-hour-per-100-kilometer</li>
956-
<li>gallon-per-100-mile</li>
957-
<li>per-200-pound</li>
958-
<li>per-12</li>
959-
</ul></li>
960-
<li><em>Constraint:</em> The numeric value of the unit constant must be an integer greater than one.</li>
961-
<li><em>Note:</em> The normal interpretation of <code>e</code> is used, where 2e6 = 2×10⁶.</li>
962-
<li><em>Note:</em> The <code>e</code> notation is optional: per-100-kilometer and per-1e2-kilometer are equivalent unit_identifiers.</li>
963-
<li><em>Note:</em> When constructing identifiers, exponents should be greater than 3 and multiples of 3, even though parsers must accept the wider range.</li>
964-
</ul></td></tr>
965-
966-
<tr><td><a name='dimensionality_prefix' href='#dimensionality_prefix'>dimensionality_prefix</a></td><td>:=</td>
967-
<td>"square-"<p>| "cubic-"<p>| "pow" ([2-9]|1[0-5]) "-"
968-
<ul>
969-
<li><em>Constraint:</em> must be value in: &lt;unitIdComponent type="power"&gt;.</li>
970-
<li><em>Note:</em> "pow2-" and "pow3-" canonicalize to "square-" and "cubic-"</li>
971-
<li><em>Note:</em> These are values in &lt;unitIdComponent type="power"&gt;</li>
972-
</ul></td></tr>
973-
974-
<tr><td><a name='simple_unit' href='#simple_unit'>simple_unit</a></td><td>:=</td>
975-
<td>(prefix_component "-")* (prefixed_unit | base_component) ("-" suffix_component)*<br/>
976-
| currency_unit<br/>
977-
| "em" | "g" | "us" | "hg" | "of"
978-
<ul>
979-
<li><em>Examples:</em> kilometer, meter, cup-metric, fluid-ounce, curr-chf, em</li>
980-
<li><em>Note:</em> Three simple units are currently allowed as legacy usage, for tokens that wouldn’t otherwise be a base_component due to length (eg, "<strong>g</strong>-force").
981-
We will likely deprecate those and add conformant aliases in the future: the "hg" and "of" are already only in deprecated simple_units.</li>
982-
</ul></td></tr>
983-
984-
<tr><td><a name='prefixed_unit' href='#prefixed_unit'>prefixed_unit</a></td><td></td>
985-
<td>prefix base_component<ul><li><em>Example: </em>kilometer</li></ul></td></tr>
986-
987-
<tr><td><a name='prefix' href='#prefix'>prefix</a></td><td></td>
988-
<td>si_prefix | binary_prefix</td></tr>
989-
990-
<tr><td><a name='si_prefix' href='#si_prefix'>si_prefix</a></td><td>:=</td>
991-
<td>"deka" | "hecto" | "kilo", …
992-
<ul><li><em>Constraint:</em> Must be an attribute value of the <code>type</code> in: &lt;unitPrefix type='…' … power10='…'&gt;.
993-
See also <a href="https://www.nist.gov/pml/special-publication-811">NIST special publication 811</a></li></ul></td></tr>
994-
995-
<tr><td><a name='binary_prefix' href='#binary_prefix'>binary_prefix</a></td><td>:=</td>
996-
<td>"kibi", "mebi", …
997-
<ul><li><em>Constraint:</em> Must be an attribute value of the <code>type</code> in: &lt;unitPrefix type='…' … power2='…'&gt;.
998-
See also <a href="https://physics.nist.gov/cuu/Units/binary.html">Prefixes for binary multiples</a></li></ul></td></tr>
999-
1000-
<tr><td><a name='prefix_component' href='#prefix_component'>prefix_component</a></td><td>:=</td>
1001-
<td>[a-z]{3,∞}
1002-
<ul><li><em>Constraint:</em> must be value in: &lt;unitIdComponent type="prefix"&gt;.</li></ul></td></tr>
1003-
1004-
<tr><td><a name='base_component' href='#base_component'>base_component</a></td><td>:=</td>
1005-
<td>[a-z]{3,∞}
1006-
<ul><li><em>Constraint:</em> must not be a value in any of the following:<br>
1007-
&lt;unitIdComponent type="prefix"&gt;<br>
1008-
or &lt;unitIdComponent type="suffix"&gt; <br>
1009-
or &lt;unitIdComponent type="power"&gt;<br>
1010-
or &lt;unitIdComponent type="and"&gt;<br>
1011-
or &lt;unitIdComponent type="per"&gt;.
1012-
</li>
1013-
<li><em>Constraint:</em> must not have a prefix as an initial segment.</li>
1014-
<li><em>Constraint:</em> no two different base_components will share the first 8 letters.
1015-
(<b>For more information, see <a href="#Unit_Identifier_Uniqueness">Unit Identifier Uniqueness</a>.)</b>
1016-
</li>
1017-
</ul>
1018-
</td></tr>
1019-
1020-
<tr><td><a name='suffix_component' href='#suffix_component'>suffix_component</a></td><td>:=</td>
1021-
<td>[a-z]{3,∞}
1022-
<ul>
1023-
<li><em>Constraint:</em> must be value in: &lt;unitIdComponent type="suffix"&gt;</li>
1024-
</ul></td></tr>
1025-
1026-
<tr><td><a name='mixed_unit_identifier' href='#mixed_unit_identifier'></a></td><td>:=</td>
1027-
<td>(single_unit | pu_single_unit) ("-" and "-" (single_unit | pu_single_unit ))*
1028-
<ul><li><em>Example: foot-and-inch</em></li>
1029-
</ul></td></tr>
1030-
1031-
<tr><td>and</td><td>:=</td>
1032-
<td>"and"
1033-
<ul>
1034-
<li><em>Constraint:</em> The token 'and' is the single value in &lt;unitIdComponent type="and"&gt;</li>
1035-
</ul></td></tr>
1036-
1037-
<tr><td><a name='long_unit_identifier' href='#long_unit_identifier'>long_unit_identifier</a></td><td>:=</td>
1038-
<td>grouping "-" core_unit_identifier</td></tr>
1039-
1040-
<tr><td>grouping</td><td>:=</td>
1041-
<td>[a-z]{3,∞}</td></tr>
1042-
1043-
<tr><td><a name='currency_unit' href='#currency_unit'>currency_unit</a></td><td>:=</td>
1044-
<td>"curr-" [a-z]{3}
1045-
<ul>
1046-
<li><em>Constraint:</em> The first part of the currency_unit is a standard prefix; the second part of the currency unit must be a valid <a href="tr35.md#UnicodeCurrencyIdentifier">Unicode currency identifier</a>.</li>
1047-
</ul>
1048-
<ul>
1049-
<li><em>Examples:</em> <b>curr-eur</b>-per-square-meter, or pound-per-<b>curr-usd</b></li>
1050-
<li><em>Note:</em> CLDR does not provide conversions for currencies; this is only intended for formatting.
1051-
The locale data for currencies is supplied in the <code>currencies</code> element, not in the <code>units</code> element.</li>
1052-
</ul>
1053-
</td></tr>
1054-
1055-
</tbody></table>
905+
<a name="syntax"></a>
906+
#### Unit Syntax
907+
908+
The formal syntax for identifiers is provided below, in [EBNF](tr35.md#ebnf).
909+
Some of the constraints reference data from various elements in the unit conversion data [units.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/units.xml).
910+
These may be either element values or element attribute values.
911+
See [Unit_Conversion](tr35-info.md#Unit_Conversion).
912+
913+
<a name='unit_identifier' href='#unit_identifier'>unit_identifier</a>
914+
<br/>:= core_unit_identifier
915+
<br/>   | mixed_unit_identifier
916+
<br/>   | long_unit_identifier
917+
918+
<a name='core_unit_identifier' href='#core_unit_identifier'>core_unit_identifier</a>
919+
<br/>:= product_unit ("-" per "-" product_unit)\*
920+
<br/>   | per "-" product_unit ("-" per "-" product_unit)\*
921+
* *Examples:*
922+
* foot-per-second-per-second
923+
* per-second
924+
* *Notes:*
925+
* The normalized form will have only one "per"
926+
927+
per
928+
<br/>:= "per"
929+
* [ wfc: The token 'per' is the single value in \<unitIdComponent type="per"\> ]
930+
931+
<a name='product_unit' href='#product_unit'>product_unit</a>
932+
<br/>:= single_unit ("-" single_unit)* ("-" pu_single_unit)*
933+
<br/>   | pu_single_unit ("-" pu_single_unit)*
934+
* [ wfc: No pu\_single\_unit may precede a single unit ]
935+
* *Examples:*
936+
* foot-pound-force
937+
938+
<a name='single_unit' href='#single_unit'>single_unit</a>
939+
<br/>:= dimensionality_prefix? simple_unit
940+
<br/>   | unit_constant
941+
* *Examples:*
942+
* square-kilometer
943+
* 100
944+
945+
<a name='pu_single_unit' href='#pu_single_unit'>pu_single_unit</a>
946+
<br/>:= "xxx-" single_unit
947+
<br/>   | "x-" single_unit
948+
* *Examples:*
949+
* xxx-square-knuts (a Harry Potter unit)
950+
* *Notes:*
951+
* "x-" is only for backwards compatibility; it is deprecated and should not be generated
952+
* See [Private-Use Units](https://github.com/unicode-org/cldr/edit/main/docs/ldml/tr35-general.md#Private_Use_Units)
953+
954+
<a name='unit_constant' href='#unit_constant'>unit_constant</a>
955+
<br/>:= [1-9][0-9]* ("e" [1-9][0-9]*)?
956+
* *Examples:*
957+
* kilowatt-hour-per-100-kilometer
958+
* gallon-per-100-mile
959+
* per-200-pound
960+
* per-12
961+
* [ wfc: The numeric value of the unit constant must be an integer greater than one. ]
962+
* *Notes:*
963+
* The normal interpretation of `e` is used, where 2e6 \= 2×10⁶.
964+
* The `e` notation is optional: per-100-kilometer and per-1e2-kilometer are equivalent unit\_identifiers.
965+
* When constructing identifiers, exponents should be greater than 3 and multiples of 3, even though parsers must accept the wider range.
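The unit_constant production and its well-formedness constraint can be sketched in Python as below. This is only an illustration of the grammar above, not CLDR's implementation; the function name `parse_unit_constant` is hypothetical.

```python
import re

# Grammar from the production above: [1-9][0-9]* ("e" [1-9][0-9]*)?
UNIT_CONSTANT = re.compile(r"([1-9][0-9]*)(?:e([1-9][0-9]*))?")


def parse_unit_constant(token: str) -> int:
    """Return the integer value of a unit_constant token, or raise ValueError.

    Hypothetical helper for illustration only.
    """
    match = UNIT_CONSTANT.fullmatch(token)
    if match is None:
        raise ValueError(f"not a unit_constant: {token!r}")
    mantissa, exponent = match.group(1), match.group(2)
    # Normal interpretation of "e": 2e6 = 2 * 10**6. A missing exponent means 10**0.
    value = int(mantissa) * 10 ** int(exponent or "0")
    # wfc: the numeric value must be an integer greater than one.
    if value <= 1:
        raise ValueError(f"unit_constant must be greater than one: {token!r}")
    return value
```

Under this sketch, `100` and `1e2` parse to the same value, which matches the note that per-100-kilometer and per-1e2-kilometer are equivalent.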
966+
967+
<a name='dimensionality_prefix' href='#dimensionality_prefix'>dimensionality_prefix</a>
968+
<br/>:= "square-"
969+
<br/>   | "cubic-"
970+
<br/>   | "pow" ([2-9]|1[0-5]) "-"
971+
* [ wfc: Must be a value in: \<unitIdComponent type="power"\> ]
972+
* *Notes:*
973+
* "pow2-" and "pow3-" canonicalize to "square-" and "cubic-"
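A minimal Python sketch of the dimensionality_prefix production and the canonicalization note above follows. It checks only the syntax shown here and does not consult the `<unitIdComponent type="power">` data; the function name is hypothetical.

```python
import re

# Syntax from the production above: "pow" ([2-9]|1[0-5]) "-"
POW_PREFIX = re.compile(r"pow([2-9]|1[0-5])-")

# Per the note above, "pow2-" and "pow3-" canonicalize to these forms.
CANONICAL = {"pow2-": "square-", "pow3-": "cubic-"}


def canonical_dimensionality_prefix(prefix: str) -> str:
    """Return the canonical form of a dimensionality_prefix (hypothetical helper)."""
    if prefix in ("square-", "cubic-"):
        return prefix
    if POW_PREFIX.fullmatch(prefix) is None:
        raise ValueError(f"not a dimensionality_prefix: {prefix!r}")
    # Powers other than 2 and 3 have no shorter spelling and are returned unchanged.
    return CANONICAL.get(prefix, prefix)
```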
974+
975+
<a name='simple_unit' href='#simple_unit'>simple_unit</a>
976+
<br/>:= (prefix_component "-")* (prefixed_unit
977+
<br/>   | base_component) ("-" suffix_component)*
978+
<br/>   | currency_unit
979+
<br/>   | ("em" | "g" | "us" | "hg" | "of")
980+
* *Examples:*
981+
* kilometer
982+
* meter
983+
* cup-metric
984+
* fluid-ounce
985+
* curr-chf
986+
* em
987+
* *Notes:*
988+
* Five simple units are currently allowed as legacy usage, for tokens that wouldn’t otherwise be a base\_component due to length (e.g., "g-force"). Those are likely to be deprecated in the future, with conformant aliases added: the "hg" and "of" are already only in deprecated simple\_units.
989+
990+
<a name='prefixed_unit' href='#prefixed_unit'>prefixed_unit</a>
991+
<br/>:= prefix base_component
992+
* *Examples:*
993+
* kilometer
994+
995+
<a name='prefix' href='#prefix'>prefix</a>
996+
<br/>:= si_prefix
997+
<br/>   | binary_prefix
998+
999+
<a name='si_prefix' href='#si_prefix'>si_prefix</a>
1000+
<br/>:= "deka"
1001+
<br/>   | "hecto"
1002+
<br/>   | "kilo", …
1003+
* [ wfc: Must be an attribute value of the `type` in: \<unitPrefix type='…' … power10='…'\> ]
1004+
* *Notes:*
1005+
* See also [NIST special publication 811](https://www.nist.gov/pml/special-publication-811)
1006+
1007+
<a name='binary_prefix' href='#binary_prefix'>binary_prefix</a>
1008+
<br/>:= "kibi", "mebi", …
1009+
* [ wfc: Must be an attribute value of the `type` in: \<unitPrefix type='…' … power2='…'\>]
1010+
* *Notes:*
1011+
* See also [Prefixes for binary multiples](https://physics.nist.gov/cuu/Units/binary.html)
1012+
1013+
<a name='prefix_component' href='#prefix_component'>prefix_component</a>
1014+
<br/>:= [a-z]{3,}
1015+
* [ vc: must be a value in: \<unitIdComponent type="prefix"\> ]
1016+
* *Notes:*
1017+
* The set of prefix components often expands in new releases, so the requirement to be one of these attribute values is a validity constraint, not a well-formedness constraint.
1018+
1019+
<a name='base_component' href='#base_component'>base_component</a>
1020+
<br/>:= [a-z]{3,}
1021+
* [ wfc: must not have a prefix as an initial segment. ]
1022+
* [ wfc: must not be a value in \<unitIdComponent type="X"\> for X in \{prefix, suffix, power, and, per} ]
1023+
* [ vc: Must be an attribute value of the `source` in: \<convertUnit source='…' …\> or the `type` in \<unitAlias type="…" replacement="…" …\> ]
1024+
* *Notes:*
1025+
* The set of base components typically expands in new releases, so the requirement to be one of these attribute values is a validity constraint, not a well-formedness constraint.
1026+
* The base\_components in unitAlias `type` are deprecated and should be converted to their replacement values.
1027+
* No two different base\_components will share the first 8 letters; see [Unit Identifier Uniqueness](https://github.com/unicode-org/cldr/edit/main/docs/ldml/tr35-general.md#Unit_Identifier_Uniqueness).
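The two well-formedness constraints on base_component can be sketched in Python as below. The sets are small illustrative samples inferred from the examples in this section (such as fluid-ounce, cup-metric and foot-pound-force), not the actual contents of units.xml. Two assumptions are made: "prefix" in the first constraint is read as the `prefix` nonterminal (si_prefix or binary_prefix), and "initial segment" is read as a leading substring. The function name is hypothetical.

```python
import re

# Illustrative samples only (assumption); real values come from <unitPrefix type='…'>.
UNIT_PREFIXES = {"deka", "hecto", "kilo", "kibi", "mebi"}

# Illustrative samples only (assumption); real values come from <unitIdComponent type='…'>.
RESERVED_COMPONENTS = {
    "prefix": {"fluid"},
    "suffix": {"metric", "force"},
    "power": {"square", "cubic"},
    "and": {"and"},
    "per": {"per"},
}

# Syntax from the production above: [a-z]{3,}
BASE_COMPONENT = re.compile(r"[a-z]{3,}")


def is_well_formed_base_component(token: str) -> bool:
    """Check the syntax and the two wfc rules against the sample data above."""
    if BASE_COMPONENT.fullmatch(token) is None:
        return False
    # wfc: must not be a value in any of the reserved unitIdComponent types.
    if any(token in values for values in RESERVED_COMPONENTS.values()):
        return False
    # wfc: must not have a prefix as an initial segment.
    if any(token.startswith(prefix) for prefix in UNIT_PREFIXES):
        return False
    return True
```

The validity constraint (membership in the `convertUnit` or `unitAlias` data) is left out of this sketch because it depends on the release-specific data file.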
1028+
1029+
<a name='suffix_component' href='#suffix_component'>suffix_component</a>
1030+
<br/>:= [a-z]{3,}
1031+
* [ vc: must be a value in: \<unitIdComponent type="suffix"\> ]
1032+
* *Notes:*
1033+
* The set of suffix components often expands in new releases, so the requirement to be one of these attribute values is a validity constraint, not a well-formedness constraint.
1034+
1035+
<a name='mixed_unit_identifier' href='#mixed_unit_identifier'>mixed_unit_identifier</a>
1036+
<br/>:= (single_unit | pu_single_unit) ("-" and "-" (single_unit | pu_single_unit ))*
1037+
* *Examples:*
1038+
* foot-and-inch
1039+
1040+
and
1041+
<br/>:= "and"
1042+
* [ wfc: The token 'and' is the single value in \<unitIdComponent type="and"\> ]
1043+
1044+
<a name='long_unit_identifier' href='#long_unit_identifier'>long_unit_identifier</a>
1045+
<br/>:= grouping "-" core_unit_identifier
1046+
1047+
grouping
1048+
<br/>:= [a-z]{3,}
1049+
1050+
<a name='currency_unit' href='#currency_unit'>currency_unit</a>
1051+
<br/>:= "curr-" [a-z]{3}
1052+
* [ wfc: The first part of the currency\_unit is a standard prefix; the second part of the currency unit must be a valid [Unicode currency identifier](https://github.com/unicode-org/cldr/blob/main/docs/ldml/tr35.md#UnicodeCurrencyIdentifier)]
1053+
* *Examples:*
1054+
* curr-eur-per-square-meter
1055+
* pound-per-curr-usd
1056+
* *Notes:*
1057+
* CLDR does not provide conversions for currencies; this is only intended for formatting.
1058+
* The locale data for currency display names is supplied in the `currencies` element, not in the `units` element.
10561059

10571060
Note that while the syntax allows for unit_constants in multiple places, the typical use case is only one instance, after a "-per-".
10581061
The normalized form of a unit identifier has at most one unit_constant in the numerator and one in the denominator.
@@ -3143,4 +3146,4 @@ The authors, contributors, and publishers have taken care in the preparation of
31433146
but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom.
31443147
This publication is provided “AS-IS” without charge as a convenience to users.
31453148

3146-
Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.
3149+
Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.

docs/ldml/tr35-info.md

+12-2
Original file line numberDiff line numberDiff line change
@@ -1208,9 +1208,19 @@ Instructions for use are supplied in the header of the file.
12081208

12091209
Different locales have different preferences for which unit or combination of units is used for a particular usage, such as measuring a person’s height. This is more fine-grained than merely a preference for metric versus US or UK measurement systems. For example, one locale may use meters alone, while another may use centimeters alone or a combination of meters and centimeters; a third may use inches alone, or (informally) a combination of feet and inches.
12101210

1211+
The determination of preferred units uses the user preference data in [units.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/units.xml) together with the **input unit**, the **input unit usage**, and the **input locale identifier**.
1212+
* The _well-formed_ and _valid_ **units** are defined according to [Unit Syntax](tr35-general.html#unit-syntax).
1213+
* The _well-formed_ **unit usages** are of the form [a-z0-9]{3,8}("-" [a-z0-9]{3,8})*.
1214+
The _valid_ **unit usages** are the union of the set of `NMTOKENS` in the `usage` attribute value for the `unitPreferences` element in [units.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/units.xml).
1215+
For example, the following `unitPreferences` elements produce the set {default, floor, geograph, land}.
1216+
* \<unitPreferences category="area" usage="default">
1217+
* \<unitPreferences category="area" usage="geograph land">
1218+
* \<unitPreferences category="area" usage="floor">
1219+
* There are currently no deprecated **unit usages**.
1220+
Should there be any in the future, for backwards compatibility the above definition would be expanded to include unitUsageAlias elements.
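The definitions of well-formed and valid unit usages can be sketched in Python as below. The sketch reads the length bound as three to eight characters per subtag, and the XML fragment is a hand-written sample wrapping the three `unitPreferences` elements from the example above, not an excerpt of units.xml. The wrapper element `sample` and both function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Well-formedness: subtags of 3 to 8 characters from [a-z0-9], joined by "-".
USAGE = re.compile(r"[a-z0-9]{3,8}(?:-[a-z0-9]{3,8})*")

# Hand-written sample (assumption); the real data lives in units.xml.
SAMPLE = """
<sample>
  <unitPreferences category="area" usage="default"/>
  <unitPreferences category="area" usage="geograph land"/>
  <unitPreferences category="area" usage="floor"/>
</sample>
"""


def is_well_formed_usage(usage: str) -> bool:
    """Return True if the usage matches the well-formedness pattern."""
    return USAGE.fullmatch(usage) is not None


def valid_usages(xml_text: str) -> set[str]:
    """Union of the whitespace-separated tokens in every usage attribute."""
    root = ET.fromstring(xml_text)
    usages: set[str] = set()
    for element in root.iter("unitPreferences"):
        usages.update(element.get("usage", "").split())
    return usages
```

Applied to the sample, `valid_usages(SAMPLE)` yields the same four usages as the example above: default, floor, geograph and land.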
1221+
12111222
### <a name="Unit_Preferences_Overrides" href="#Unit_Preferences_Overrides">Unit Preferences Overrides</a>
12121223

1213-
The determination of preferred units uses the user preference data together with **input unit**, the **input usage**, and the **input locale identifer**.
12141224
Within the locale identifier, the subtags that can affect the result are:
12151225
* the value of the keys mu, ms, and rg
12161226
* the region in the locale identifier (if there is one)
@@ -1473,4 +1483,4 @@ The authors, contributors, and publishers have taken care in the preparation of
14731483
but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom.
14741484
This publication is provided “AS-IS” without charge as a convenience to users.
14751485

1476-
Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.
1486+
Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.
