CLDR-16720 json: add transforms (#4036)
srl295 authored and conradarcturus committed Sep 25, 2024
1 parent 891be0f commit 12d4847
Showing 7 changed files with 184 additions and 36 deletions.
48 changes: 25 additions & 23 deletions docs/site/downloads/cldr-46.md
@@ -15,15 +15,15 @@ It only covers the data, which is available at [release-46-alpha3](https://githu
## Overview

Unicode CLDR provides key building blocks for software supporting the world's languages.
CLDR data is used by all [major software systems](https://cldr.unicode.org/index#TOC-Who-uses-CLDR-)
(including all mobile phones) for their software internationalization and localization,
adapting software to the conventions of different languages.

The most significant changes in this release were:

- Updates to Unicode 16.0 (including major changes to collation),
- Further revisions to the Message Format 2.0 tech preview,
- Substantial additions and modifications of Emoji search keyword data,
- ‘Upleveling’ the locale coverage.

### Locale Coverage Status
@@ -127,7 +127,7 @@ Full localization will await the next submission phase for CLDR.
For a full listing, see [Delta Data](https://unicode.org/cldr/charts/46/delta/index.html)

### Emoji Search Keywords
The usage model for emoji search keywords is as follows (illustrated by the code sketch after the list):
- The user types one or more words in an emoji search field. The order of words doesn't matter; nor does upper- versus lowercase.
- Each word successively narrows the set of emoji shown in a results box
- heart → 🥰 😘 😻 💌 💘 💝 💖 💗 💓 💞 💕 💟 ❣️ 💔 ❤️‍🔥 ❤️‍🩹 ❤️ 🩷 🧡 💛 💚 💙 🩵 💜 🤎 🖤 🩶 🤍 💋 🫰 🫶 🫀 💏 💑 🏠 🏡 ♥️ 🩺
@@ -139,41 +139,41 @@ The usage model for emoji search keywords is that
Thus in the following, the user would just click on 🎉 if that works for them.
- celebrate → 🥳 🥂 🎈 🎉 🎊 🪅
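
The sketch below is a minimal, purely illustrative Java example of this usage model, assuming keywords are stored lowercased; the class and the `keywordsByEmoji` map are hypothetical stand-ins, since CLDR supplies only the keyword data and leaves matching behavior to implementations.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class EmojiKeywordSearchSketch {

    /**
     * Keeps the emoji whose keyword set contains every typed word, ignoring
     * word order and letter case (keywords are assumed to be stored lowercased).
     * The keywordsByEmoji map is a hypothetical stand-in for the CLDR data.
     */
    static List<String> narrow(Map<String, Set<String>> keywordsByEmoji, List<String> typedWords) {
        return keywordsByEmoji.entrySet().stream()
                .filter(e -> typedWords.stream()
                        .allMatch(word -> e.getValue().contains(word.toLowerCase(Locale.ROOT))))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> data = Map.of(
                "🎉", Set.of("party", "celebrate", "popper"),
                "🥂", Set.of("celebrate", "toast", "glass"),
                "💘", Set.of("heart", "arrow", "cupid"));
        System.out.println(narrow(data, List.of("celebrate")));          // 🎉 and 🥂, in no particular order
        System.out.println(narrow(data, List.of("Celebrate", "toast"))); // 🥂 only
    }
}
```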

In this release, WhatsApp emoji search keyword data has been incorporated.
In the process of doing that, the maximum number of search keywords per emoji has been increased,
and the keywords have been simplified in most locales by breaking up multi-word keywords.
An example is white flag (🏳️), which formerly had the 3 keyword phrases [white waving flag | white flag | waving flag] and now has the 3 simpler single keywords [white | waving | flag].
The simpler version typically works as well or better in practice.

### Collation Data Changes
There are two significant changes to the CLDR root collation (CLDR default sort order).

#### Realigned With DUCET
The [DUCET](https://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table) is the Unicode Collation Algorithm default sort order.
The [CLDR root collation](https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation) is a tailoring of the DUCET.
These sort orders have differed in the relative order of groups of characters including extenders, currency symbols, and non-decimal-digit numeric characters.

Starting with CLDR 46 and Unicode 16.0, the order of these groups is the same.
In both sort orders, non-decimal-digit numeric characters now sort after decimal digits,
and the CLDR root collation no longer tailors any currency symbols (making some of them sort like letter sequences, as in the DUCET).

These changes eliminate sort order differences among almost all regular characters between the CLDR root collation and the DUCET.
See the [CLDR root collation](https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation) documentation for details.

#### Improved Han Radical-Stroke Order
CLDR includes [data for sorting Han (CJK) characters in radical-stroke order](https://cldr-smoke.unicode.org/spec/main/ldml/tr35-collation.md#File_Format_FractionalUCA_txt).
It used to distinguish traditional and simplified forms of radicals on a higher level than sorting by the number of residual strokes.
Starting with CLDR 46, the CLDR radical-stroke order matches that of the [Unicode Radical-Stroke Index (large PDF)](https://www.unicode.org/Public/UCD/latest/charts/RSIndex.pdf).
[Its sorting algorithm is defined in UAX #38](https://www.unicode.org/reports/tr38/#SortingAlgorithm).
Traditional vs. simplified forms of radicals are distinguished on a lower level than the number of residual strokes.
This also has an effect on [alphabetic indexes](tr35-collation.md#Collation_Indexes) for radical-stroke sort orders,
where only the traditional forms of radicals are now available as index characters.
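
To make the "higher level" versus "lower level" wording concrete, the change can be pictured as reordering a comparator's tie-breakers, as in the purely illustrative Java sketch below; the `HanChar` class and its fields are hypothetical, and the authoritative definition is the UAX #38 algorithm together with the FractionalUCA data.

```java
import java.util.Comparator;

/** Hypothetical illustration only; not CLDR or ICU API. */
final class HanChar {
    int radicalNumber;     // e.g. 85 for the water radical
    int residualStrokes;   // strokes beyond the radical
    int radicalFormOrder;  // 0 = traditional radical form, 1 = simplified radical form

    /** Pre-CLDR-46 picture: the radical form was distinguished above the residual stroke count. */
    static final Comparator<HanChar> OLD_ORDER =
            Comparator.<HanChar>comparingInt(c -> c.radicalNumber)
                    .thenComparingInt(c -> c.radicalFormOrder)
                    .thenComparingInt(c -> c.residualStrokes);

    /** CLDR 46 picture (matching UAX #38): the radical form is only a lower-level tie-breaker. */
    static final Comparator<HanChar> NEW_ORDER =
            Comparator.<HanChar>comparingInt(c -> c.radicalNumber)
                    .thenComparingInt(c -> c.residualStrokes)
                    .thenComparingInt(c -> c.radicalFormOrder);
}
```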

### JSON Data Changes

1. Separate modern packages were dropped [CLDR-16465]
2. Transliteration (transform) data is now available in the `cldr-transforms` package. The JSON file contains transform metadata, and the `_rulesFile` key indicates an external (`.txt`) file containing the actual rules. [CLDR-16720][].
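
As a rough sketch of how a consumer might use the new package (an assumption about usage, not a documented API), the metadata JSON can be read with any JSON library and the `_rulesFile` entry followed to the external rules text. Only the `_rulesFile` key comes from the description above; the class, method, directory, and file names below are hypothetical.

```java
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;

public class TransformRulesLoaderSketch {

    /**
     * Reads one cldr-transforms metadata file and returns the contents of the
     * external rules file named by its _rulesFile entry. Layout details other
     * than the _rulesFile key are assumptions made for illustration.
     */
    static String loadRules(Path transformsDir, String metadataJsonName) throws IOException {
        Path metadataPath = transformsDir.resolve(metadataJsonName);
        try (Reader reader = Files.newBufferedReader(metadataPath)) {
            JsonObject metadata = JsonParser.parseReader(reader).getAsJsonObject();
            String rulesFileName = metadata.get("_rulesFile").getAsString();
            // The rules file is assumed to be a sibling .txt file in the same package directory.
            return Files.readString(transformsDir.resolve(rulesFileName));
        }
    }
}
```

Calling `loadRules` with the package directory and a transform's metadata file name would then return the raw rules text that previously appeared inline under `tRules`.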

### Markdown

Expand All @@ -185,7 +185,7 @@ This process should be completed before release.
### File Changes

Most files added in this release were for new locales.
There were the following new test files:

**TBD**

@@ -215,3 +215,5 @@ Many people have made significant contributions to CLDR and LDML; see the [Ackno
The Unicode [Terms of Use](https://unicode.org/copyright.html) apply to CLDR data; in particular, see [Exhibit 1](https://unicode.org/copyright.html#Exhibit1).

For web pages with different views of CLDR data, see [http://cldr.unicode.org/index/charts](https://cldr.unicode.org/index/charts).

[CLDR-16720]: https://unicode-org.atlassian.net/issues/CLDR-16720
@@ -23,7 +23,15 @@ public static CldrNode createNode(
String fullTrunk = extractAttrs(fullPathSegment, node.nondistinguishingAttributes);
if (!node.name.equals(fullTrunk)) {
throw new ParseException(
"Error in parsing \"" + pathSegment + " \":\"" + fullPathSegment, 0);
"Error in parsing \""
+ pathSegment
+ "\":\""
+ fullPathSegment
+ " - "
+ node.name
+ " != "
+ fullTrunk,
0);
}

for (String key : node.distinguishingAttributes.keySet()) {
@@ -23,6 +23,7 @@
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
@@ -49,6 +50,7 @@
import org.unicode.cldr.util.CLDRLocale;
import org.unicode.cldr.util.CLDRPaths;
import org.unicode.cldr.util.CLDRTool;
import org.unicode.cldr.util.CLDRTransforms;
import org.unicode.cldr.util.CLDRURLS;
import org.unicode.cldr.util.CalculatedCoverageLevels;
import org.unicode.cldr.util.CldrUtility;
@@ -88,6 +90,7 @@ public class Ldml2JsonConverter {
private static final String CLDR_PKG_PREFIX = "cldr-";
private static final String FULL_TIER_SUFFIX = "-full";
private static final String MODERN_TIER_SUFFIX = "-modern";
private static final String TRANSFORM_RAW_SUFFIX = ".txt";
private static Logger logger = Logger.getLogger(Ldml2JsonConverter.class.getName());

enum RunType {
@@ -98,7 +101,8 @@ enum RunType {
rbnf(false, true),
annotations,
annotationsDerived,
bcp47(false, false);
bcp47(false, false),
transforms(false, false);

private final boolean isTiered;
private final boolean hasLocales;
@@ -739,6 +743,8 @@ private int convertCldrItems(
outFilename = filenameAsLangTag + ".json";
} else if (type == RunType.bcp47) {
outFilename = filename + ".json";
} else if (type == RunType.transforms) {
outFilename = filename + ".json";
} else if (js.section.equals("other")) {
// If you see other-___.json, it means items that were missing from
// JSON_config_*.txt
@@ -775,11 +781,11 @@ if (type == RunType.main) {
if (type == RunType.main) {
avl.full.add(filenameAsLangTag);
}
} else if (type == RunType.rbnf
|| type == RunType.bcp47
|| type == RunType.transforms) {
// untiered, just use the name
js.packageName = type.name();
tier = "";
}
if (js.packageName != null) {
@@ -884,6 +890,24 @@
}
}

if (item.getUntransformedPath()
.startsWith("//supplementalData/transforms")) {
// here, write the raw data
final String rawTransformFile = filename + TRANSFORM_RAW_SUFFIX;
try (PrintWriter outf =
FileUtilities.openUTF8Writer(outputDir, rawTransformFile)) {
outf.println(item.getValue());
// note: not logging the write here- it will be logged when the
// .json file is written.
}
final String path = item.getPath();
item.setPath(fixTransformPath(path));
final String fullPath = item.getFullPath();
item.setFullPath(fixTransformPath(fullPath));
// the value is now the raw filename
item.setValue(rawTransformFile);
}

// some items need to be split to multiple item before processing. None
// of those items need to be sorted.
// Applies to SPLITTABLE_ATTRS attributes.
@@ -943,7 +967,31 @@ private int convertCldrItems(
outputUnitPreferenceData(js, theItems, out, nodesForLastItem);
}

// closeNodes(out, nodesForLastItem.size() - 2, 0);
// Special processing for transforms.
if (type == RunType.transforms) {
final JsonObject jo = out.getAsJsonObject("transforms");
if (jo == null || jo.isEmpty()) {
throw new RuntimeException(
"Could not get transforms object in " + filename);
}
@SuppressWarnings("unchecked")
final Entry<String, JsonElement>[] s = jo.entrySet().toArray(new Entry[0]);
if (s == null || s.length != 1) {
throw new RuntimeException(
"Could not get 1 subelement of transforms in " + filename);
}
// key doesn't matter.
// move subitem up
out = s[0].getValue().getAsJsonObject();
final Entry<String, JsonElement>[] s2 =
out.entrySet().toArray(new Entry[0]);
if (s2 == null || s2.length != 1) {
throw new RuntimeException(
"Could not get 1 sub-subelement of transforms in " + filename);
}
// move sub-subitem up.
out = s2[0].getValue().getAsJsonObject();
}

// write JSON
try (PrintWriter outf = FileUtilities.openUTF8Writer(outputDir, outFilename)) {
@@ -990,6 +1038,51 @@ private int convertCldrItems(
return totalItemsInFile;
}

/**
* Fixup an XPathParts with a specific transform element
*
* @param xpp the XPathParts to modify
* @param attribute the attribute name, such as "alias"
*/
private static final void fixTransformPath(final XPathParts xpp, final String attribute) {
final String v = xpp.getAttributeValue(-2, attribute); // on penultimate element
if (v == null) return;
final Set<String> aliases = new HashSet<>();
final Set<String> bcpAliases = new HashSet<>();
for (final String s : v.split(" ")) {
final String q = Locale.forLanguageTag(s).toLanguageTag();
if (s.equals(q)) {
// bcp47 round trips- add to bcp list
bcpAliases.add(s);
} else {
// different - add to other aliases.
aliases.add(s);
}
}
if (aliases.isEmpty()) {
xpp.removeAttribute(-2, attribute);
} else {
xpp.setAttribute(-2, attribute, String.join(" ", aliases.toArray(new String[0])));
}
if (bcpAliases.isEmpty()) {
xpp.removeAttribute(-2, attribute + "Bcp47");
} else {
xpp.setAttribute(
-2, attribute + "Bcp47", String.join(" ", bcpAliases.toArray(new String[0])));
}
}

/**
* Fixup a transform path, expanding the alias and backwardAlias into bcp47 and non-bcp47
* attributes.
*/
private static final String fixTransformPath(final String path) {
final XPathParts xpp = XPathParts.getFrozenInstance(path).cloneAsThawed();
fixTransformPath(xpp, "alias");
fixTransformPath(xpp, "backwardAlias");
return xpp.toString();
}

private static String valueSectionsFormat(int values, int sections) {
return MessageFormat.format(
"({0, plural, one {# value} other {# values}} in {1, plural, one {# section} other {# sections}})",
@@ -1453,6 +1546,24 @@ public void writeDefaultContent(String outputDir) throws IOException {
outf.close();
}

public void writeTransformMetadata(String outputDir) throws IOException {
final String dirName = outputDir + "/cldr-" + RunType.transforms.name();
final String fileName = RunType.transforms.name() + ".json";
PrintWriter outf = FileUtilities.openUTF8Writer(dirName, fileName);
System.out.println(
PACKAGE_ICON
+ " Creating packaging file => "
+ dirName
+ File.separator
+ fileName);
JsonObject obj = new JsonObject();
obj.add(
RunType.transforms.name(),
gson.toJsonTree(CLDRTransforms.getInstance().getJsonIndex()));
outf.println(gson.toJson(obj));
outf.close();
}

public void writeCoverageLevels(String outputDir) throws IOException {
try (PrintWriter outf =
FileUtilities.openUTF8Writer(outputDir + "/cldr-core", "coverageLevels.json"); ) {
@@ -2225,6 +2336,8 @@ public void processDirectory(String dirName, DraftStatus minimalDraftStatus)
if (Boolean.parseBoolean(options.get("packagelist").getValue())) {
writePackageList(outputDir);
}
} else if (type == RunType.transforms) {
writeTransformMetadata(outputDir);
}
}
}
@@ -154,7 +154,14 @@ class LdmlConvertRules {
"identity:variant:type",

// in common/bcp47/*.xml
"keyword:key:name");
"keyword:key:name",

// transforms
"transforms:transform:source",
"transforms:transform:target",
"transforms:transform:direction");

/**
* The set of element:attribute pair in which the attribute should be treated as value. All the
@@ -1128,4 +1128,20 @@ static String parseDoubleColon(String x, Set<String> others) {
}
return "";
}

public class CLDRTransformsJsonIndex {
/** raw list of available IDs */
public String[] available =
getAvailableIds().stream()
.map((String id) -> id.replace(".xml", ""))
.sorted()
.collect(Collectors.toList())
.toArray(new String[0]);
}

/** This gets the metadata (index file) exposed as cldr-json/cldr-transforms/transforms.json */
public CLDRTransformsJsonIndex getJsonIndex() {
final CLDRTransformsJsonIndex index = new CLDRTransformsJsonIndex();
return index;
}
}
@@ -0,0 +1,2 @@
section=transforms ; path=//cldr/supplemental/transforms/.* ; package=transforms ; packageDesc=Transform data
dependency=core ; package=transforms
@@ -130,10 +130,6 @@
< (.*(GMT|UTC).*/exemplarCity)(.*)
>

#
< (.*/transforms/transform[^/]*)/(.*)
> $1/tRules/$2

#
< (.*)\[@territories="([^"]*)"\](.*)\[@alt="variant"\](.*)
> $1\[@territories="$2-alt-variant"\]
@@ -173,3 +169,7 @@
# ParentLocales
< (.*/parentLocales)\[@component="([^"]*)"\]/(parentLocale)(.*)$
> $1/$2$4

# Transform - drop terminal tRule element
< //supplementalData/transforms/transform(.*)/tRule.*$
> //supplementalData/transforms/transform$1/_rulesFile
