
Update toCase tests to use latest SpecialCasing.txt, add update script #5026

Open — wants to merge 1 commit into master
Conversation

jackhorton (Contributor):

Slight detour from my regularly scheduled programming of making toCase faster/safer.

```python
""".format(os.path.basename(__file__), args.url))
    output.write("")
    output.write("var SpecialCasings = " + json.dumps(casings_dict_list, indent = 4, sort_keys = True).replace("\\\\", "\\") + ";")
elif args.json is not None:
```
Contributor:

Might change this to a plain `if`, in case someone specifies both output types.

jackhorton (Author):

I suppose, but the JSON mode isn't useful for anything that I am aware of; I only made the option available because I was already generating JSON output for the JS version. In the future there may be an additional output mode that emits a C struct initializer, because I think UnifiedRegex uses this data too.

```python
    output.write("var SpecialCasings = " + json.dumps(casings_dict_list, indent = 4, sort_keys = True).replace("\\\\", "\\") + ";")
elif args.json is not None:
    with open(args.json, "w") as output:
        output.write(json.dumps(casings_dict_list, indent = 4, sort_keys = True))
```
Contributor:

Why does this one not need removal of double-escapes?

jackhorton (Author):

Because we really want it to be 6 characters, literally `"\uFFFF"`, but the `json.dumps` output escapes the backslash to make it 7 characters, `"\\uFFFF"`. Since this is a JS object and a JS string, we don't want to escape the backslash; the backslash is there to escape the `u`.
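A minimal standalone sketch of the mismatch being discussed (not the PR's exact code; the example string is illustrative):

```python
import json

# The script builds strings containing a literal backslash, e.g. \u03A9 as 6 characters.
s = "\\u03A9"
assert len(s) == 6

# json.dumps escapes that backslash, yielding 7 characters inside the quotes: \\u03A9
dumped = json.dumps(s)
print(dumped)  # "\\u03A9"

# For the JS output we want the single backslash back, hence the .replace("\\\\", "\\") step.
fixed = dumped.replace("\\\\", "\\")
print(fixed)   # "\u03A9"
```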

Contributor:

Right, but why not do that on the second output type option?

jackhorton (Author):

Because I thought JSON strings didn't actually have or care about the `\u` escape sequence, so the string itself should contain `\u`. Perhaps that is wrong?

Contributor:

JSON definitely supports escaped characters, or it would never have gotten so ubiquitous (we'd all be using some JSON-plus-encoding format instead). As you mentioned elsewhere, it all really depends on the use case, and if nobody is depending on this JSON file for anything then it doesn't really matter.
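To make the point concrete, JSON's `\uXXXX` escapes round-trip through Python's `json` module (a small sketch; the omega codepoint is just an example):

```python
import json

# Parsing a JSON \uXXXX escape yields the real character, and dumping a
# non-ASCII string (with the default ensure_ascii=True) emits the escape again.
omega = json.loads('"\\u03a9"')
print(omega)              # the single character U+03A9
print(json.dumps(omega))  # "\u03a9"
```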



```python
from argparse import ArgumentParser

def to_unicode_escape(codepoints):
    return "".join(map(lambda codepoint: "\\u" + codepoint, codepoints.split(" ")))
```
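For context, here is what this helper produces, assuming the SpecialCasing.txt input format of space-separated 4-digit hex codepoints (the sample input is illustrative):

```python
def to_unicode_escape(codepoints):
    # "03A9 0342 0399" -> a string of literal backslash-u escapes, not real characters
    return "".join(map(lambda codepoint: "\\u" + codepoint, codepoints.split(" ")))

result = to_unicode_escape("03A9 0342 0399")
print(result)       # \u03A9\u0342\u0399
print(len(result))  # 18 -- six characters per codepoint, no actual Greek letters
```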
Contributor:

Why do we need to do this manually? Does the json module not escape correctly?

Contributor:

To clarify: could we build up an actual Unicode string using `unichr` and do the escaping upon output?

jackhorton (Author):

The source file, SpecialCasing.txt, only uses the 4-digit hex codepoint, and we want to turn it into a `\u` Unicode literal.

sethbrenith (Contributor) commented Apr 23, 2018:

Sorry for the terse and unhelpful comments, I was on my phone. I was just brainstorming ways we wouldn't have to do the step at the end where we change double-backslash to single-backslash, because that feels mildly dangerous: we have valid JSON and then put it through a text filter that could in some cases produce invalid JSON. I know that the current content is fine with that transformation, but maybe we could build something that is more robust against future modification.

I thought we could do something here like `"".join(map(lambda codepoint: unichr(int(codepoint, 16)), codepoints.split(" ")))`, which would build up an actual Unicode Python string instead of one containing backslash characters. Then, during the output step, we just trust `json.dumps` to put backslash characters in the right places. The current version obviously works fine, so feel free to leave it as-is if you prefer.
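The suggested alternative can be sketched as follows (using Python 3's `chr`; Python 2 would use `unichr` as in the comment above — the sample input is illustrative):

```python
import json

def to_unicode_string(codepoints):
    # Build a real Unicode string from the hex codepoints, then let json.dumps
    # handle all escaping on output -- no double-backslash fixup step needed.
    return "".join(chr(int(codepoint, 16)) for codepoint in codepoints.split(" "))

s = to_unicode_string("03A9 0342 0399")
print(len(s))         # 3 -- three actual characters, not 18 escape characters
print(json.dumps(s))  # "\u03a9\u0342\u0399"
```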



jackhorton (Author):

My Python knowledge is enough to produce something that works, but not necessarily something that is good -- I had no idea about that `unichr` function. I can switch to that.

Contributor:

On second thought, I might be steering us straight off the cliffs of insanity with this suggestion, because of the following sentence in the Python docs:

> The valid range for the argument depends how Python was configured – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF].
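For what it's worth, the caveat is specific to Python 2 narrow (UCS2) builds, where `unichr` rejects codepoints above 0xFFFF; Python 3's `chr` always accepts the full Unicode range (a hedged sketch demonstrating the Python 3 behavior only):

```python
# Python 3's chr covers the full Unicode range regardless of build options.
print(chr(0x10FFFF) == "\U0010FFFF")  # True

# Anything outside [0, 0x10FFFF] raises ValueError, which is easy to guard against:
try:
    chr(0x110000)
except ValueError:
    print("out of range")
```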



```js
{
    name: "SpecialCasing.txt",
    body() {
        const hasICU = WScript.Platform.INTL_LIBRARY === "icu";
```
Collaborator:

Just as an FYI: you might also see some failures on really old versions of ICU that have older Unicode data in them.

jackhorton (Author):

That's OK. It looks like all of the tests are passing, and we only support ICU 55 at the earliest; even that is hopefully on its way out soon as 16.04 fades compared to 18.04.

jackhorton (Author):

Worst case, since we are now getting the CLDR version from ICU, I could plumb that info through to WScript.Platform in order to do one-off exclusions based on data version.

```js
        "name": "GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI",
        "upper": "\u03A9\u0342\u0399"
    }
];
```
Contributor:

nit: might as well output a newline here.

dilijev (Contributor) left a review:

LGTM

dilijev (Contributor) commented Jun 22, 2018:

@jackhorton was this PR blocked on anything besides my review?

jackhorton (Author):

Nope, it just fell by the wayside. We talked about it, and my memory tells me you said this script would not be useful for the RegExp tables; in that case I will probably drop the logic surrounding output formats entirely and make this just a script for creating the JS file.

dilijev (Contributor) commented Jun 22, 2018:

@jackhorton that's what I remember, too. Maybe keep a copy of that logic around in a separate branch so we can investigate unifying the logic, if applicable.

4 participants