Merged
Changes from 6 commits
Commits
58 commits
cc76245
A test which I expected to fail, but not in this way
eggrobin Dec 1, 2023
e23d1c1
Pre-16 and NFKCQC
eggrobin Dec 1, 2023
24fe8e1
🤪
eggrobin Dec 2, 2023
328c761
Canonical closure tests
eggrobin Dec 29, 2023
2d0ceaf
Generate canonical closures
eggrobin Dec 29, 2023
3880c4f
Some interesting sequences
eggrobin Dec 29, 2023
b3b53c0
Some very crappy code
eggrobin Dec 29, 2023
22dfd8c
Drop Hangul and make sure we have all overlaps
eggrobin Dec 29, 2023
a742327
Split it into its own part and look at chaining compositions, not dec…
eggrobin Jan 3, 2024
182cc3a
despam
eggrobin Jan 3, 2024
53459b0
spots
eggrobin Jan 3, 2024
5f16271
Regenerate UCD
eggrobin Jan 3, 2024
747f982
Some comments.
eggrobin Jan 3, 2024
695c95e
Allow a single non-decomposable starter at either end of the chain
eggrobin Jan 4, 2024
9fea9ea
Deduplicate parts 4 and 5
eggrobin Jan 4, 2024
7362f2d
Remove redundant test cases in NFC (covered by the NFC column of othe…
eggrobin Jan 5, 2024
cdd391a
Clean things up
eggrobin Jan 5, 2024
7bcb9b4
more cleanup
eggrobin Jan 5, 2024
cf4275c
more cleanup
eggrobin Jan 5, 2024
3cb23ac
More testing
eggrobin Jan 7, 2024
e41b3ea
Fix the QC properties
eggrobin Jan 7, 2024
0c312ce
stray import
eggrobin Jan 7, 2024
361a977
factor
eggrobin Jan 7, 2024
0380b27
report all failures
eggrobin Jan 7, 2024
85c2b67
Failing test
eggrobin Jan 10, 2024
6188e19
an attempt at error messages
eggrobin Jan 10, 2024
afc7d8c
comma
eggrobin Jan 10, 2024
29b341f
table and less escaping
eggrobin Jan 10, 2024
0497007
Try to get the errors only once
eggrobin Jan 10, 2024
4ca3390
We have screwed up since the beginning of time.
eggrobin Jan 10, 2024
80acf01
Revert invariant tests
eggrobin Jan 10, 2024
93c9570
Report various kinds of errors
eggrobin Jan 10, 2024
281f70b
report parse errors
eggrobin Jan 10, 2024
70a4ee2
Break everything
eggrobin Jan 10, 2024
03b74f2
make it a bit more readable hopefully
eggrobin Jan 10, 2024
4ac8e84
Revert "Break everything"
eggrobin Jan 10, 2024
3ba16f5
It is only an error if it is not what we expect.
eggrobin Jan 10, 2024
546071e
Revert "Revert invariant tests"
eggrobin Jan 10, 2024
008fa45
Put the condition in the right place
eggrobin Jan 10, 2024
40c4eab
Merge branch 'invariant-test-in-diff' into canonically-consistent-gra…
eggrobin Jan 10, 2024
1c84e41
Document our past mistakes, don’t expect them to go away
eggrobin Jan 10, 2024
c107514
Merge branch 'normalization-woes' into canonically-consistent-graphem…
eggrobin Jan 10, 2024
85eca4e
bad expectations
eggrobin Jan 10, 2024
963735f
fix it
eggrobin Jan 10, 2024
a93e7d1
Regenerate UCD
eggrobin Jan 10, 2024
47ae9c4
Fehlermeldungszeilen (error message lines)
eggrobin Jan 12, 2024
cdb3107
Merge remote-tracking branch 'la-vache/main' into invariant-test-in-diff
eggrobin Jan 12, 2024
bbee45c
Merge branch 'invariant-test-in-diff' into canonically-consistent-gra…
eggrobin Jan 12, 2024
7a6220b
Markus’s suggestions
eggrobin Jan 20, 2024
89cdf7a
Merge remote-tracking branch 'la-vache/main' into normalization-woes
eggrobin Jan 20, 2024
e1a01ed
More honest primaryCompositesByMeowNFDCodePoint maps
eggrobin Jan 20, 2024
b0b4cf6
Regenerate UCD
eggrobin Jan 20, 2024
910039c
Merge branch 'normalization-woes' of https://github.com/eggrobin/unic…
eggrobin Jan 20, 2024
c21622e
spotless
eggrobin Jan 20, 2024
66e7296
Merge remote-tracking branch 'la-vache/main' into normalization-woes
eggrobin Jan 22, 2024
2c57460
Merge branch 'normalization-woes' into canonically-consistent-graphem…
eggrobin Jan 23, 2024
f58c609
Merge remote-tracking branch 'la-vache/main' into canonically-consist…
eggrobin Jan 23, 2024
6830f83
Merge remote-tracking branch 'la-vache/main' into canonically-consist…
eggrobin Jan 26, 2024
2 changes: 1 addition & 1 deletion .github/workflows/cli-build-instructions.yml
@@ -170,7 +170,7 @@ jobs:
- name: Run command - Build and Test
run: |
cd unicodetools/mine/src
MAVEN_OPTS="-ea" mvn -s .github/workflows/mvn-settings.xml package -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd) -DUNICODETOOLS_GEN_DIR=$(cd ../Generated ; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd) -DUVERSION=$CURRENT_UVERSION
MAVEN_OPTS="-ea" mvn -s .github/workflows/mvn-settings.xml package -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd) -DUNICODETOOLS_GEN_DIR=$(cd ../Generated ; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd) -DUVERSION=$CURRENT_UVERSION -DEMIT_GITHUB_ERRORS
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
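
The new `-DEMIT_GITHUB_ERRORS` flag carries no value; the Java side of this PR only tests whether the property is defined. A minimal sketch of that presence check follows, assuming the property reaches the test JVM as a system property. The class and method names are hypothetical, and the property is set programmatically here only to simulate the command-line flag.

```java
final class FlagPresenceSketch {
    // Hypothetical helper: true when the system property is defined at all,
    // regardless of its value (even the empty string counts as "set").
    static boolean isFlagSet(String name) {
        return System.getProperty(name) != null;
    }

    public static void main(String[] args) {
        // Simulate passing -DEMIT_GITHUB_ERRORS with no value.
        System.setProperty("EMIT_GITHUB_ERRORS", "");
        System.out.println(isFlagSet("EMIT_GITHUB_ERRORS"));
        // Simulate omitting the flag.
        System.clearProperty("EMIT_GITHUB_ERRORS");
        System.out.println(isFlagSet("EMIT_GITHUB_ERRORS"));
    }
}
```

Testing for presence rather than parsing a boolean keeps the flag short on the command line, at the cost that `-DEMIT_GITHUB_ERRORS=false` would still enable it.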

@@ -46,6 +46,9 @@ public class TestUnicodeInvariants {
private static int showRangeLimit = 20;
Member:
Nice addition, and cleanup.

Member Author:
Note that I split that out into its own PR #646, which will thus need its own approval.

static boolean doHtml = true;
public static final String DEFAULT_FILE = "UnicodeInvariantTest.txt";
public static final HTMLTabber htmlTabber = new Tabber.HTMLTabber();
public static final boolean EMIT_GITHUB_ERRORS =
System.getProperty("EMIT_GITHUB_ERRORS") != null;

private static final int
// HELP1 = 0,
@@ -171,8 +174,6 @@ public static int testInvariants(String inputFile, boolean doRange) throws IOExc
out3.write('\uFEFF'); // BOM
}
try (final BufferedReader in = getInputReader(inputFile)) {
final HTMLTabber tabber = new Tabber.HTMLTabber();

errorLister =
new BagFormatter()
.setMergeRanges(doRange)
@@ -183,7 +184,7 @@ public static int testInvariants(String inputFile, boolean doRange) throws IOExc
.setFixName(toHTML);
errorLister.setShowTotal(false);
if (doHtml) {
errorLister.setTabber(tabber);
errorLister.setTabber(htmlTabber);
}

showLister =
@@ -198,7 +199,7 @@ public static int testInvariants(String inputFile, boolean doRange) throws IOExc
showLister.setValueSource(LATEST_PROPS.getProperty("script"));
}
if (doHtml) {
showLister.setTabber(tabber);
showLister.setTabber(htmlTabber);
}

// symbolTable = new ChainedSymbolTable();
@@ -207,7 +208,7 @@ public static int testInvariants(String inputFile, boolean doRange) throws IOExc
// ToolUnicodePropertySource.make(UCD.lastVersion).getSymbolTable("\u00D7"),
//
// ToolUnicodePropertySource.make(Default.ucdVersion()).getSymbolTable("")});
while (true) {
for (int lineNumber = 1; ; ++lineNumber) {
String line = in.readLine();
if (line == null) {
break;
@@ -230,24 +231,24 @@ public static int testInvariants(String inputFile, boolean doRange) throws IOExc
} else if (line.startsWith("Let")) {
letLine(pp, line);
} else if (line.startsWith("In")) {
inLine(pp, line);
inLine(pp, line, lineNumber);
} else if (line.startsWith("ShowScript")) {
showScript = true;
} else if (line.startsWith("HideScript")) {
showScript = false;
} else if (line.startsWith("Map")) {
testMapLine(line, pp);
testMapLine(line, pp, lineNumber);
} else if (line.startsWith("ShowMap")) {
showMapLine(line, pp);
} else if (line.startsWith("Show")) {
showLine(line, pp);
} else if (line.startsWith("EquivalencesOf")) {
equivalencesLine(line, pp);
equivalencesLine(line, pp, lineNumber);
} else {
testLine(line, pp);
testLine(line, pp, lineNumber);
}
} catch (final Exception e) {
parseErrorCount = parseError(parseErrorCount, line, e);
parseErrorCount = parseError(parseErrorCount, line, e, lineNumber);
continue;
}
}
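
The switch from `while (true)` to a counting `for` loop means the line number advances even when a handler throws and the `catch` block executes `continue`. A minimal sketch of that pattern, under the assumption that integer parsing stands in for the real per-line handlers (the class and method names are hypothetical):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class LineNumberLoopSketch {
    // Returns the 1-based numbers of the lines that fail to parse as integers.
    // Integer parsing is only a stand-in for the real per-line handlers.
    static List<Integer> failingLines(String input) {
        List<Integer> failures = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(input))) {
            for (int lineNumber = 1; ; ++lineNumber) {
                String line = in.readLine();
                if (line == null) {
                    break;
                }
                try {
                    Integer.parseInt(line.trim());
                } catch (NumberFormatException e) {
                    failures.add(lineNumber);
                    continue; // The for-loop update still increments lineNumber.
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println(failingLines("1\nx\n3\ny\n"));
    }
}
```

With a `while` loop and a manually incremented counter, an early `continue` could skip the increment; placing the increment in the `for` header avoids that.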
@@ -276,7 +277,9 @@ static class PropertyComparison {
UnicodeProperty property2;
}

private static void equivalencesLine(String line, ParsePosition pp) throws ParseException {
private static void equivalencesLine(String line, ParsePosition pp, int lineNumber)
throws ParseException {
// TODO(egg): ::error etc.
pp.setIndex("EquivalencesOf".length());
final UnicodeSet domain = new UnicodeSet(line, pp, symbolTable);
final var leftProperty = CompoundProperty.of(LATEST_PROPS, line, pp);
@@ -457,7 +460,8 @@ private static void equivalencesLine(String line, ParsePosition pp) throws Parse
}
}

private static void inLine(ParsePosition pp, String line) throws ParseException {
private static void inLine(ParsePosition pp, String line, int lineNumber)
throws ParseException {
pp.setIndex(2);
final PropertyComparison propertyComparison = getPropertyComparison(pp, line);
final UnicodeMap<String> failures = new UnicodeMap<>();
@@ -476,6 +480,7 @@ private static void inLine(ParsePosition pp, String line) throws ParseException
if (failureCount != 0) {
testFailureCount++;
printErrorLine("Test Failure", Side.START, testFailureCount);
// TODO(egg): ::error etc.
println(
"## Got unexpected "
+ (propertyComparison.shouldBeEqual ? "differences" : "equalities")
@@ -710,7 +715,8 @@ private static void showMapLine(String line, ParsePosition pp) {
showLister.setMergeRanges(doRange);
}

private static void testLine(String line, ParsePosition pp) throws ParseException {
private static void testLine(String line, ParsePosition pp, int lineNumber)
throws ParseException {
if (line.startsWith("Test")) {
line = line.substring(4).trim();
}
@@ -776,21 +782,24 @@ private static void testLine(String line, ParsePosition pp) throws ParseExceptio
"In",
rightSide,
"But Not In",
leftSide);
leftSide,
lineNumber);
checkExpected(
rightAndLeft,
new UnicodeSet(rightSet).retainAll(leftSet),
"In",
rightSide,
"And In",
leftSide);
leftSide,
lineNumber);
checkExpected(
left_right,
new UnicodeSet(leftSet).removeAll(rightSet),
"In",
leftSide,
"But Not In",
rightSide);
rightSide,
lineNumber);
}

public static void checkRelation(ParsePosition pp, char relation) throws ParseException {
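
The three `checkExpected` calls above examine the segments right − left, right ∩ left, and left − right. A minimal sketch of how a relation such as ⊆ reduces to emptiness of one segment, using `java.util.TreeSet` of code points in place of ICU's `UnicodeSet` (an assumption made to keep the example self-contained; all names are hypothetical):

```java
import java.util.Set;
import java.util.TreeSet;

final class SetSegmentsSketch {
    // Elements of a that are not in b (a − b), without modifying either input.
    static Set<Integer> difference(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    // Elements common to a and b (a ∩ b), without modifying either input.
    static Set<Integer> intersection(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // left ⊆ right holds exactly when left − right is empty.
    static boolean isSubset(Set<Integer> left, Set<Integer> right) {
        return difference(left, right).isEmpty();
    }

    public static void main(String[] args) {
        Set<Integer> left = Set.of(0x300, 0x301);
        Set<Integer> right = Set.of(0x300, 0x301, 0x302);
        System.out.println(isSubset(left, right));
        System.out.println(difference(right, left));
    }
}
```

When the relation fails, the non-empty segment is what a failure report would list, which is why the segment itself is passed along rather than a boolean.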
@@ -810,7 +819,8 @@ private static void checkExpected(
String rightStatus,
String rightSide,
String leftStatus,
String leftSide) {
String leftSide,
int lineNumber) {
switch (expected) {
case empty:
if (segment.size() == 0) {
@@ -829,9 +839,31 @@ private static void checkExpected(
}
testFailureCount++;
printErrorLine("Test Failure", Side.START, testFailureCount);
println("## Expected " + expected + ", got: " + segment.size() + "\t" + segment.toString());
println("## " + rightStatus + "\t" + rightSide);
println("## " + leftStatus + "\t" + leftSide);
final var errorMessage =
new String[] {
"Expected " + expected + ", got: " + segment.size() + "\t" + segment.toString(),
rightStatus + "\t" + rightSide,
leftStatus + "\t" + leftSide
};
var monoTable = new StringWriter();
for (String line : errorMessage) {
println("## " + line);
}
errorLister.setTabber(new Tabber.MonoTabber());
errorLister.setLineSeparator("\n");
errorLister.showSetNames(new PrintWriter(monoTable), segment);
if (EMIT_GITHUB_ERRORS) {
System.err.println(
"::error file=unicodetools/src/main/resources/org/unicode/text/UCD/"
+ DEFAULT_FILE
+ ",line="
+ lineNumber
+ ",title=Invariant test failure::"
+ (String.join("\n", errorMessage) + "\n" + monoTable.toString())
.replace("%", "%25")
.replace("\n", "%0A"));
}
errorLister.setTabber(htmlTabber);
if (doHtml) {
out.println("<table class='e'>");
}
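
The `::error` line written to standard error above is a GitHub Actions workflow command: properties such as `file`, `line`, and `title` come before the second `::`, and the message follows it with `%` and newlines percent-encoded. A minimal sketch of that formatting, with hypothetical class and method names. Unlike the diff, this sketch also escapes carriage returns, and it assumes the property values contain no `:` or `,`, which would need their own escaping.

```java
final class GithubErrorCommandSketch {
    // Escapes message data for a workflow command. '%' must be replaced
    // first, so that the percent signs introduced by the later replacements
    // are not themselves re-escaped.
    static String escapeData(String s) {
        return s.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A");
    }

    // Builds an ::error command annotating the given file and line.
    static String errorCommand(String file, int line, String title, String message) {
        return "::error file=" + file
                + ",line=" + line
                + ",title=" + title
                + "::" + escapeData(message);
    }

    public static void main(String[] args) {
        System.err.println(
                errorCommand("UnicodeInvariantTest.txt", 42, "Invariant test failure",
                        "Expected empty, got: 1\nIn\t[abc]"));
    }
}
```

Encoding the newlines keeps a multi-line failure report inside a single command, since the runner reads workflow commands one output line at a time.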
Expand All @@ -853,7 +885,8 @@ private static void checkExpected(
getProperties(Settings.lastVersion),
IndexUnicodeProperties.make(Settings.lastVersion)));

private static void testMapLine(String line, ParsePosition pp) throws ParseException {
private static void testMapLine(String line, ParsePosition pp, int lineNumber)
throws ParseException {
char relation = 0;
String rightSide = null;
String leftSide = null;
@@ -915,21 +948,24 @@ private static void testMapLine(String line, ParsePosition pp) throws ParseExcep
"In",
rightSide,
"But Not In",
leftSide);
leftSide,
lineNumber);
checkExpected(
rightAndLeft,
UnicodeMapParser.retainAll(new UnicodeMap<String>().putAll(rightSet), leftSet),
"In",
rightSide,
"And In",
leftSide);
leftSide,
lineNumber);
checkExpected(
left_right,
UnicodeMapParser.removeAll(new UnicodeMap<String>().putAll(leftSet), rightSet),
"In",
leftSide,
"But Not In",
rightSide);
rightSide,
lineNumber);
}

private static void checkExpected(
@@ -938,7 +974,8 @@ private static void checkExpected(
String rightStatus,
String rightSide,
String leftStatus,
String leftSide) {
String leftSide,
int lineNumber) {
switch (expected) {
case empty:
if (segment.size() == 0) {
@@ -1015,7 +1052,7 @@ private static void showSet(ParsePosition pp, final String value) {
println();
}

private static int parseError(int parseErrorCount, String line, Exception e) {
private static int parseError(int parseErrorCount, String line, Exception e, int lineNumber) {
parseErrorCount++;
if (e instanceof ParseException) {
final int index = ((ParseException) e).getErrorOffset();
@@ -746,6 +746,37 @@ Let $PostBaseSpacingMarks_Tweak = [\u103B \u1056 \u1057 \u1A57 \u1A6D]
Let $PostBaseSpacingMarks_Missed = []
[$PostBaseSpacingMarks_All - $PostBaseSpacingMarks_Tweak - $PostBaseSpacingMarks_Missed] ⊂ [:GCB=XX:]

# Check the consistency of grapheme cluster segmentation (both legacy and
# extended) with canonical equivalence.
# Non-starters are GCB=Extend or GCB=SpacingMark, so that GB9 and GB9a keep
# together any sequences that may be reordered by the Canonical Ordering
# Algorithm. This has been true ever since Extended Grapheme Clusters were
# added.
\P{U5.1.0:ccc=0} ⊆ [\p{U5.1.0:GCB=Extend}\p{U5.1.0:GCB=SpacingMark}]
\P{ccc=0} ⊆ [\p{GCB=Extend}\p{GCB=SpacingMark}]
# Non-starters are actually GCB=Extend, so that GB9 alone does the job, since
# there is no GB9a in legacy grapheme clusters.
# But not before Unicode Version 16.0, even though we were saying so since
# Unicode Version 4.0 (https://www.unicode.org/reports/tr29/tr29-4.html#Implementation_Notes),
# oops (see L2/24-009).
\P{U4.0.0:ccc=0} ⊆ \p{U4.0.0:Grapheme_Extend}
\P{U4.1.0:ccc=0} ⊆ \p{U4.1.0:GCB=Extend}
\P{U15.1.0:ccc=0} ⊆ \p{U15.1.0:GCB=Extend}
\P{ccc=0} ⊆ \p{GCB=Extend}

# Characters that appear in non-initial position in the canonical decomposition
# of another character are either Extend, V, or T, so that sequences that are
# equivalent to a canonical composite are kept together by GB6..GB9.
# We only look at the starters, since we dealt with non-starters above.
# Characters that appear in non-initial position in the canonical decomposition
# of a primary composite are NFC_QC=Maybe. We would need to separately check
# the characters that appear in non-initial position in the canonical
# decomposition of a full composition exclusion.
# We would also need to separately check that the characters that are T or V
# only appear in canonical decompositions where they follow an LV, LVT, V, or
# T, or an LV or V, respectively.
[\p{NFC_QC=Maybe}&\p{ccc=0}] ⊆ [\p{GCB=Extend}\p{GCB=T}\p{GCB=V}]
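
To make "characters that appear in non-initial position in the canonical decomposition" concrete, here is a minimal sketch using the JDK's `java.text.Normalizer`. It uses the full NFD form as a stand-in for the single-step canonical decomposition mapping, which is an approximation, and it does not compute Grapheme_Cluster_Break values; those come from the UCD. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

final class NonInitialDecompositionSketch {
    // Returns the code points that follow the first one in the NFD form of
    // the given code point (empty if the character does not decompose).
    static List<Integer> nonInitial(int codePoint) {
        String nfd = Normalizer.normalize(
                new String(Character.toChars(codePoint)), Normalizer.Form.NFD);
        List<Integer> result = new ArrayList<>();
        int first = nfd.codePointAt(0);
        for (int i = Character.charCount(first); i < nfd.length(); ) {
            int cp = nfd.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp);
        }
        return result;
    }

    public static void main(String[] args) {
        // U+00E9 decomposes to U+0065 U+0301; U+0301 is GCB=Extend.
        System.out.println(nonInitial(0x00E9));
        // U+AC01 (an LVT syllable) decomposes to U+1100 U+1161 U+11A8;
        // U+1161 is GCB=V and U+11A8 is GCB=T.
        System.out.println(nonInitial(0xAC01));
    }
}
```

In both cases the trailing code points fall in the Extend, V, or T classes, which is the property the invariant above asserts for starters with NFC_QC=Maybe.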

##########################
# Emoji
##########################