CLDR-16720 json: add transforms (#4036)
srl295 authored and conradarcturus committed Sep 25, 2024
1 parent 891be0f commit 12d4847
Showing 7 changed files with 184 additions and 36 deletions.
48 changes: 25 additions & 23 deletions docs/site/downloads/cldr-46.md
@@ -15,15 +15,15 @@ It only covers the data, which is available at [release-46-alpha3](https://githu
## Overview

Unicode CLDR provides key building blocks for software supporting the world's languages.
CLDR data is used by all [major software systems](https://cldr.unicode.org/index#TOC-Who-uses-CLDR-)
(including all mobile phones) for their software internationalization and localization,
adapting software to the conventions of different languages.

The most significant changes in this release were:

- Updates to Unicode 16.0 (including major changes to collation),
- Further revisions to the Message Format 2.0 tech preview,
- Substantial additions and modifications of Emoji search keyword data,
- ‘Upleveling’ the locale coverage.

### Locale Coverage Status
@@ -127,7 +127,7 @@ Full localization will await the next submission phase for CLDR.
For a full listing, see [Delta Data](https://unicode.org/cldr/charts/46/delta/index.html)

### Emoji Search Keywords
The usage model for emoji search keywords is as follows (illustrated by the code sketch after the list):
- The user types one or more words in an emoji search field. The order of words doesn't matter; nor does upper- versus lowercase.
- Each word successively narrows the set of emoji shown in a results box
- heart → 🥰 😘 😻 💌 💘 💝 💖 💗 💓 💞 💕 💟 ❣️ 💔 ❤️‍🔥 ❤️‍🩹 ❤️ 🩷 🧡 💛 💚 💙 🩵 💜 🤎 🖤 🩶 🤍 💋 🫰 🫶 🫀 💏 💑 🏠 🏡 ♥️ 🩺
@@ -139,41 +139,41 @@ The usage model for emoji search keywords is that
Thus in the following, the user would just click on 🎉 if that works for them.
- celebrate → 🥳 🥂 🎈 🎉 🎊 🪅
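
The sketch below is a minimal, purely illustrative Java example of this usage model, assuming keywords are stored lowercased; the class and the `keywordsByEmoji` map are hypothetical stand-ins, since CLDR supplies only the keyword data and leaves matching behavior to implementations.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class EmojiKeywordSearchSketch {

    /**
     * Keeps the emoji whose keyword set contains every typed word, ignoring
     * word order and letter case (keywords are assumed to be stored lowercased).
     * The keywordsByEmoji map is a hypothetical stand-in for the CLDR data.
     */
    static List<String> narrow(Map<String, Set<String>> keywordsByEmoji, List<String> typedWords) {
        return keywordsByEmoji.entrySet().stream()
                .filter(e -> typedWords.stream()
                        .allMatch(word -> e.getValue().contains(word.toLowerCase(Locale.ROOT))))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> data = Map.of(
                "🎉", Set.of("party", "celebrate", "popper"),
                "🥂", Set.of("celebrate", "toast", "glass"),
                "💘", Set.of("heart", "arrow", "cupid"));
        System.out.println(narrow(data, List.of("celebrate")));          // 🎉 and 🥂, in no particular order
        System.out.println(narrow(data, List.of("Celebrate", "toast"))); // 🥂 only
    }
}
```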

In this release, WhatsApp emoji search keyword data has been incorporated.
In the process of doing that, the maximum number of search keywords per emoji has been increased,
and the keywords have been simplified in most locales by breaking up multi-word keywords.
An example is white flag (🏳️), which formerly had the 3 keyword phrases [white waving flag | white flag | waving flag] and now has the 3 simpler single keywords [white | waving | flag].
The simpler version typically works as well or better in practice.

### Collation Data Changes
There are two significant changes to the CLDR root collation (CLDR default sort order).

#### Realigned With DUCET
The [DUCET](https://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table) is the Unicode Collation Algorithm default sort order.
The [CLDR root collation](https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation) is a tailoring of the DUCET.
These sort orders have differed in the relative order of groups of characters including extenders, currency symbols, and non-decimal-digit numeric characters.

Starting with CLDR 46 and Unicode 16.0, the order of these groups is the same.
In both sort orders, non-decimal-digit numeric characters now sort after decimal digits,
and the CLDR root collation no longer tailors any currency symbols (making some of them sort like letter sequences, as in the DUCET).

These changes eliminate sort order differences among almost all regular characters between the CLDR root collation and the DUCET.
See the [CLDR root collation](https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation) documentation for details.

#### Improved Han Radical-Stroke Order
CLDR includes [data for sorting Han (CJK) characters in radical-stroke order](https://cldr-smoke.unicode.org/spec/main/ldml/tr35-collation.md#File_Format_FractionalUCA_txt).
It used to distinguish traditional and simplified forms of radicals on a higher level than sorting by the number of residual strokes.
Starting with CLDR 46, the CLDR radical-stroke order matches that of the [Unicode Radical-Stroke Index (large PDF)](https://www.unicode.org/Public/UCD/latest/charts/RSIndex.pdf).
[Its sorting algorithm is defined in UAX #38](https://www.unicode.org/reports/tr38/#SortingAlgorithm).
Traditional vs. simplified forms of radicals are distinguished on a lower level than the number of residual strokes.
This also has an effect on [alphabetic indexes](tr35-collation.md#Collation_Indexes) for radical-stroke sort orders,
where only the traditional forms of radicals are now available as index characters.
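
To make the "higher level" versus "lower level" wording concrete, the change can be pictured as reordering a comparator's tie-breakers, as in the purely illustrative Java sketch below; the `HanChar` class and its fields are hypothetical, and the authoritative definition is the UAX #38 algorithm together with the FractionalUCA data.

```java
import java.util.Comparator;

/** Hypothetical illustration only; not CLDR or ICU API. */
final class HanChar {
    int radicalNumber;     // e.g. 85 for the water radical
    int residualStrokes;   // strokes beyond the radical
    int radicalFormOrder;  // 0 = traditional radical form, 1 = simplified radical form

    /** Pre-CLDR-46 picture: the radical form was distinguished above the residual stroke count. */
    static final Comparator<HanChar> OLD_ORDER =
            Comparator.<HanChar>comparingInt(c -> c.radicalNumber)
                    .thenComparingInt(c -> c.radicalFormOrder)
                    .thenComparingInt(c -> c.residualStrokes);

    /** CLDR 46 picture (matching UAX #38): the radical form is only a lower-level tie-breaker. */
    static final Comparator<HanChar> NEW_ORDER =
            Comparator.<HanChar>comparingInt(c -> c.radicalNumber)
                    .thenComparingInt(c -> c.residualStrokes)
                    .thenComparingInt(c -> c.radicalFormOrder);
}
```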

### JSON Data Changes

1. Separate modern packages were dropped [CLDR-16465]
2. Transliteration (transform) data is now available in the `cldr-transforms` package. The JSON file contains transform metadata, and the `_rulesFile` key indicates an external (`.txt`) file containing the actual rules. [CLDR-16720][].
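
As a rough sketch of how a consumer might use the new package (an assumption about usage, not a documented API), the metadata JSON can be read with any JSON library and the `_rulesFile` entry followed to the external rules text. Only the `_rulesFile` key comes from the description above; the class, method, directory, and file names below are hypothetical.

```java
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;

public class TransformRulesLoaderSketch {

    /**
     * Reads one cldr-transforms metadata file and returns the contents of the
     * external rules file named by its _rulesFile entry. Layout details other
     * than the _rulesFile key are assumptions made for illustration.
     */
    static String loadRules(Path transformsDir, String metadataJsonName) throws IOException {
        Path metadataPath = transformsDir.resolve(metadataJsonName);
        try (Reader reader = Files.newBufferedReader(metadataPath)) {
            JsonObject metadata = JsonParser.parseReader(reader).getAsJsonObject();
            String rulesFileName = metadata.get("_rulesFile").getAsString();
            // The rules file is assumed to be a sibling .txt file in the same package directory.
            return Files.readString(transformsDir.resolve(rulesFileName));
        }
    }
}
```

Calling `loadRules` with the package directory and a transform's metadata file name would then return the raw rules text that previously appeared inline under `tRules`.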

### Markdown

Expand All @@ -185,7 +185,7 @@ This process should be completed before release.
### File Changes

Most files added in this release were for new locales.
There were the following new test files:

**TBD**

@@ -215,3 +215,5 @@ Many people have made significant contributions to CLDR and LDML; see the [Ackno
The Unicode [Terms of Use](https://unicode.org/copyright.html) apply to CLDR data; in particular, see [Exhibit 1](https://unicode.org/copyright.html#Exhibit1).

For web pages with different views of CLDR data, see [http://cldr.unicode.org/index/charts](https://cldr.unicode.org/index/charts).

[CLDR-16720]: https://unicode-org.atlassian.net/issues/CLDR-16720
@@ -23,7 +23,15 @@ public static CldrNode createNode(
String fullTrunk = extractAttrs(fullPathSegment, node.nondistinguishingAttributes);
if (!node.name.equals(fullTrunk)) {
throw new ParseException(
"Error in parsing \"" + pathSegment + " \":\"" + fullPathSegment, 0);
"Error in parsing \""
+ pathSegment
+ "\":\""
+ fullPathSegment
+ " - "
+ node.name
+ " != "
+ fullTrunk,
0);
}

for (String key : node.distinguishingAttributes.keySet()) {
@@ -23,6 +23,7 @@
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
@@ -49,6 +50,7 @@
import org.unicode.cldr.util.CLDRLocale;
import org.unicode.cldr.util.CLDRPaths;
import org.unicode.cldr.util.CLDRTool;
import org.unicode.cldr.util.CLDRTransforms;
import org.unicode.cldr.util.CLDRURLS;
import org.unicode.cldr.util.CalculatedCoverageLevels;
import org.unicode.cldr.util.CldrUtility;
@@ -88,6 +90,7 @@ public class Ldml2JsonConverter {
private static final String CLDR_PKG_PREFIX = "cldr-";
private static final String FULL_TIER_SUFFIX = "-full";
private static final String MODERN_TIER_SUFFIX = "-modern";
private static final String TRANSFORM_RAW_SUFFIX = ".txt";
private static Logger logger = Logger.getLogger(Ldml2JsonConverter.class.getName());

enum RunType {
@@ -98,7 +101,8 @@ enum RunType {
rbnf(false, true),
annotations,
annotationsDerived,
bcp47(false, false);
bcp47(false, false),
transforms(false, false);

private final boolean isTiered;
private final boolean hasLocales;
@@ -739,6 +743,8 @@ private int convertCldrItems(
outFilename = filenameAsLangTag + ".json";
} else if (type == RunType.bcp47) {
outFilename = filename + ".json";
} else if (type == RunType.transforms) {
outFilename = filename + ".json";
} else if (js.section.equals("other")) {
// If you see other-___.json, it means items that were missing from
// JSON_config_*.txt
@@ -775,11 +781,11 @@ if (type == RunType.main) {
if (type == RunType.main) {
avl.full.add(filenameAsLangTag);
}
} else if (type == RunType.rbnf
|| type == RunType.bcp47
|| type == RunType.transforms) {
// untiered, just use the name
js.packageName = type.name();
tier = "";
}
if (js.packageName != null) {
@@ -884,6 +890,24 @@
}
}

if (item.getUntransformedPath()
.startsWith("//supplementalData/transforms")) {
// here, write the raw data
final String rawTransformFile = filename + TRANSFORM_RAW_SUFFIX;
try (PrintWriter outf =
FileUtilities.openUTF8Writer(outputDir, rawTransformFile)) {
outf.println(item.getValue());
// note: not logging the write here- it will be logged when the
// .json file is written.
}
final String path = item.getPath();
item.setPath(fixTransformPath(path));
final String fullPath = item.getFullPath();
item.setFullPath(fixTransformPath(fullPath));
// the value is now the raw filename
item.setValue(rawTransformFile);
}

// some items need to be split to multiple item before processing. None
// of those items need to be sorted.
// Applies to SPLITTABLE_ATTRS attributes.
@@ -943,7 +967,31 @@ private int convertCldrItems(
outputUnitPreferenceData(js, theItems, out, nodesForLastItem);
}

// closeNodes(out, nodesForLastItem.size() - 2, 0);
// Special processing for transforms.
if (type == RunType.transforms) {
final JsonObject jo = out.getAsJsonObject("transforms");
if (jo == null || jo.isEmpty()) {
throw new RuntimeException(
"Could not get transforms object in " + filename);
}
@SuppressWarnings("unchecked")
final Entry<String, JsonElement>[] s = jo.entrySet().toArray(new Entry[0]);
if (s == null || s.length != 1) {
throw new RuntimeException(
"Could not get 1 subelement of transforms in " + filename);
}
// key doesn't matter.
// move subitem up
out = s[0].getValue().getAsJsonObject();
final Entry<String, JsonElement>[] s2 =
out.entrySet().toArray(new Entry[0]);
if (s2 == null || s2.length != 1) {
throw new RuntimeException(
"Could not get 1 sub-subelement of transforms in " + filename);
}
// move sub-subitem up.
out = s2[0].getValue().getAsJsonObject();
}

// write JSON
try (PrintWriter outf = FileUtilities.openUTF8Writer(outputDir, outFilename)) {
@@ -990,6 +1038,51 @@ private int convertCldrItems(
return totalItemsInFile;
}

/**
* Fixup an XPathParts with a specific transform element
*
* @param xpp the XPathParts to modify
* @param attribute the attribute name, such as "alias"
*/
private static final void fixTransformPath(final XPathParts xpp, final String attribute) {
final String v = xpp.getAttributeValue(-2, attribute); // on penultimate element
if (v == null) return;
final Set<String> aliases = new HashSet<>();
final Set<String> bcpAliases = new HashSet<>();
for (final String s : v.split(" ")) {
final String q = Locale.forLanguageTag(s).toLanguageTag();
if (s.equals(q)) {
// bcp47 round trips- add to bcp list
bcpAliases.add(s);
} else {
// different - add to other aliases.
aliases.add(s);
}
}
if (aliases.isEmpty()) {
xpp.removeAttribute(-2, attribute);
} else {
xpp.setAttribute(-2, attribute, String.join(" ", aliases.toArray(new String[0])));
}
if (bcpAliases.isEmpty()) {
xpp.removeAttribute(-2, attribute + "Bcp47");
} else {
xpp.setAttribute(
-2, attribute + "Bcp47", String.join(" ", bcpAliases.toArray(new String[0])));
}
}

/**
* Fixup a transform path, expanding the alias and backwardAlias into bcp47 and non-bcp47
* attributes.
*/
private static final String fixTransformPath(final String path) {
final XPathParts xpp = XPathParts.getFrozenInstance(path).cloneAsThawed();
fixTransformPath(xpp, "alias");
fixTransformPath(xpp, "backwardAlias");
return xpp.toString();
}

private static String valueSectionsFormat(int values, int sections) {
return MessageFormat.format(
"({0, plural, one {# value} other {# values}} in {1, plural, one {# section} other {# sections}})",
@@ -1453,6 +1546,24 @@ public void writeDefaultContent(String outputDir) throws IOException {
outf.close();
}

public void writeTransformMetadata(String outputDir) throws IOException {
final String dirName = outputDir + "/cldr-" + RunType.transforms.name();
final String fileName = RunType.transforms.name() + ".json";
PrintWriter outf = FileUtilities.openUTF8Writer(dirName, fileName);
System.out.println(
PACKAGE_ICON
+ " Creating packaging file => "
+ dirName
+ File.separator
+ fileName);
JsonObject obj = new JsonObject();
obj.add(
RunType.transforms.name(),
gson.toJsonTree(CLDRTransforms.getInstance().getJsonIndex()));
outf.println(gson.toJson(obj));
outf.close();
}

public void writeCoverageLevels(String outputDir) throws IOException {
try (PrintWriter outf =
FileUtilities.openUTF8Writer(outputDir + "/cldr-core", "coverageLevels.json"); ) {
@@ -2225,6 +2336,8 @@ public void processDirectory(String dirName, DraftStatus minimalDraftStatus)
if (Boolean.parseBoolean(options.get("packagelist").getValue())) {
writePackageList(outputDir);
}
} else if (type == RunType.transforms) {
writeTransformMetadata(outputDir);
}
}
}
@@ -154,7 +154,14 @@ class LdmlConvertRules {
"identity:variant:type",

// in common/bcp47/*.xml
"keyword:key:name");
"keyword:key:name",

// transforms
"transforms:transform:source",
"transforms:transform:target",
"transforms:transform:direction");

/**
* The set of element:attribute pair in which the attribute should be treated as value. All the
@@ -1128,4 +1128,20 @@ static String parseDoubleColon(String x, Set<String> others) {
}
return "";
}

public class CLDRTransformsJsonIndex {
/** raw list of available IDs */
public String[] available =
getAvailableIds().stream()
.map((String id) -> id.replace(".xml", ""))
.sorted()
.collect(Collectors.toList())
.toArray(new String[0]);
}

/** This gets the metadata (index file) exposed as cldr-json/cldr-transforms/transforms.json */
public CLDRTransformsJsonIndex getJsonIndex() {
final CLDRTransformsJsonIndex index = new CLDRTransformsJsonIndex();
return index;
}
}
@@ -0,0 +1,2 @@
section=transforms ; path=//cldr/supplemental/transforms/.* ; package=transforms ; packageDesc=Transform data
dependency=core ; package=transforms
@@ -130,10 +130,6 @@
< (.*(GMT|UTC).*/exemplarCity)(.*)
>

#
< (.*/transforms/transform[^/]*)/(.*)
> $1/tRules/$2

#
< (.*)\[@territories="([^"]*)"\](.*)\[@alt="variant"\](.*)
> $1\[@territories="$2-alt-variant"\]
@@ -173,3 +169,7 @@
# ParentLocales
< (.*/parentLocales)\[@component="([^"]*)"\]/(parentLocale)(.*)$
> $1/$2$4

# Transform - drop terminal tRule element
< //supplementalData/transforms/transform(.*)/tRule.*$
> //supplementalData/transforms/transform$1/_rulesFile
