
Update toCase tests to use latest SpecialCasing.txt, add update script #5026

Open — wants to merge 1 commit into master
Conversation

jackhorton (Contributor):

Slight detour from my regularly scheduled programming of making toCase faster/safer.

```python
""".format(os.path.basename(__file__), args.url))
    output.write("")
    output.write("var SpecialCasings = " + json.dumps(casings_dict_list, indent = 4, sort_keys = True).replace("\\\\", "\\") + ";")
elif args.json is not None:
```
Contributor:

Might change this to a plain `if`, in case someone specifies both output types.

jackhorton (Author):

I suppose, but the JSON mode isn't useful for anything that I am aware of; I only made the option available because I was already generating JSON output for the JS version. In the future there may be an additional output mode that emits a C struct initializer, because I think UnifiedRegex uses this data too.

```python
    output.write("var SpecialCasings = " + json.dumps(casings_dict_list, indent = 4, sort_keys = True).replace("\\\\", "\\") + ";")
elif args.json is not None:
    with open(args.json, "w") as output:
        output.write(json.dumps(casings_dict_list, indent = 4, sort_keys = True))
```
Contributor:

Why does this one not need removal of double-escapes?

jackhorton (Author):

Because we really want it to be 6 characters, literally `"\uFFFF"`, but the `json.dumps` output escapes the backslash to make it 7 characters, `"\\uFFFF"`. Since this is a JS object and a JS string, we don't want to escape the backslash; the backslash is there to escape the `u`.
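A minimal standalone sketch of the mismatch being discussed (not the PR's exact code; the example string is illustrative):

```python
import json

# The script builds strings containing a literal backslash, e.g. \u03A9 as 6 characters.
s = "\\u03A9"
assert len(s) == 6

# json.dumps escapes that backslash, yielding 7 characters inside the quotes: \\u03A9
dumped = json.dumps(s)
print(dumped)  # "\\u03A9"

# For the JS output we want the single backslash back, hence the .replace("\\\\", "\\") step.
fixed = dumped.replace("\\\\", "\\")
print(fixed)   # "\u03A9"
```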

Contributor:

Right, but why not do that on the second output type option?

jackhorton (Author):

Because I thought JSON strings didn't actually have or care about the `\u` escape sequence, so the string itself should contain `\u`. Perhaps that is wrong?

Contributor:

JSON definitely supports escaped characters, or it would never have gotten so ubiquitous (we'd all be using some JSON-plus-encoding format instead). As you mentioned elsewhere, it all really depends on the use case, and if nobody is depending on this JSON file for anything then it doesn't really matter.
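To make the point concrete, JSON's `\uXXXX` escapes round-trip through Python's `json` module (a small sketch; the omega codepoint is just an example):

```python
import json

# Parsing a JSON \uXXXX escape yields the real character, and dumping a
# non-ASCII string (with the default ensure_ascii=True) emits the escape again.
omega = json.loads('"\\u03a9"')
print(omega)              # the single character U+03A9
print(json.dumps(omega))  # "\u03a9"
```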



```python
from argparse import ArgumentParser

def to_unicode_escape(codepoints):
    return "".join(map(lambda codepoint: "\\u" + codepoint, codepoints.split(" ")))
```
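For context, here is what this helper produces, assuming the SpecialCasing.txt input format of space-separated 4-digit hex codepoints (the sample input is illustrative):

```python
def to_unicode_escape(codepoints):
    # "03A9 0342 0399" -> a string of literal backslash-u escapes, not real characters
    return "".join(map(lambda codepoint: "\\u" + codepoint, codepoints.split(" ")))

result = to_unicode_escape("03A9 0342 0399")
print(result)       # \u03A9\u0342\u0399
print(len(result))  # 18 -- six characters per codepoint, no actual Greek letters
```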
Contributor:

Why do we need to do this manually? Does the json module not escape correctly?

Contributor:

To clarify: could we build up an actual Unicode string using `unichr` and do the escaping upon output?

jackhorton (Author):

The source file, SpecialCasing.txt, only uses the 4-digit hex codepoint, and we want to turn it into a `\u` Unicode literal.

sethbrenith (Contributor) commented Apr 23, 2018:

Sorry for the terse and unhelpful comments, I was on my phone. I was just brainstorming ways we wouldn't have to do the step at the end where we change double-backslash to single-backslash, because that feels mildly dangerous: we have valid JSON and then put it through a text filter that could in some cases produce invalid JSON. I know that the current content is fine with that transformation, but maybe we could build something that is more robust against future modification.

I thought we could do something here like `"".join(map(lambda codepoint: unichr(int(codepoint, 16)), codepoints.split(" ")))`, which would build up an actual Unicode Python string instead of one containing backslash characters. Then, during the output step, we just trust `json.dumps` to put backslash characters in the right places. The current version obviously works fine, so feel free to leave it as-is if you prefer.
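The suggested alternative can be sketched as follows (using Python 3's `chr`; Python 2 would use `unichr` as in the comment above — the sample input is illustrative):

```python
import json

def to_unicode_string(codepoints):
    # Build a real Unicode string from the hex codepoints, then let json.dumps
    # handle all escaping on output -- no double-backslash fixup step needed.
    return "".join(chr(int(codepoint, 16)) for codepoint in codepoints.split(" "))

s = to_unicode_string("03A9 0342 0399")
print(len(s))         # 3 -- three actual characters, not 18 escape characters
print(json.dumps(s))  # "\u03a9\u0342\u0399"
```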



jackhorton (Author):

My Python knowledge is enough to produce something that works, but not necessarily something that is good -- I had no idea about that `unichr` function. I can switch to that.

Contributor:

On second thought, I might be steering us straight off the cliffs of insanity with this suggestion, because of the following sentence in the Python docs:

> The valid range for the argument depends how Python was configured – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF].
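For what it's worth, the caveat is specific to Python 2 narrow (UCS2) builds, where `unichr` rejects codepoints above 0xFFFF; Python 3's `chr` always accepts the full Unicode range (a hedged sketch demonstrating the Python 3 behavior only):

```python
# Python 3's chr covers the full Unicode range regardless of build options.
print(chr(0x10FFFF) == "\U0010FFFF")  # True

# Anything outside [0, 0x10FFFF] raises ValueError, which is easy to guard against:
try:
    chr(0x110000)
except ValueError:
    print("out of range")
```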



```js
{
    name: "SpecialCasing.txt",
    body() {
        const hasICU = WScript.Platform.INTL_LIBRARY === "icu";
```
Collaborator:

Just as an FYI: you might also see some failures on really old versions of ICU that have older Unicode data in them.

jackhorton (Author):

That's OK. It looks like all of the tests are passing, and we only support ICU 55 at the earliest; even that is hopefully on its way out soon as 16.04 fades compared to 18.04.

jackhorton (Author):

Worst case, since we are now getting the CLDR version from ICU, I could plumb that info through to WScript.Platform in order to do one-off exclusions based on data version.

```js
        "name": "GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI",
        "upper": "\u03A9\u0342\u0399"
    }
];
```
Contributor:

nit: might as well output a newline here.

dilijev (Contributor) left a review:

LGTM

dilijev (Contributor) commented Jun 22, 2018:

@jackhorton was this PR blocked on anything besides my review?

jackhorton (Author):

Nope, it just fell by the wayside. We talked about it, and my memory tells me you said this script would not be useful for the RegExp tables; in that case I will probably drop the logic surrounding output formats entirely and make this just a script for creating the JS file.

dilijev (Contributor) commented Jun 22, 2018:

@jackhorton that's what I remember, too. Maybe keep a copy of that logic around in a separate branch so we can investigate unifying the logic, if applicable.

4 participants