Update toCase tests to use latest SpecialCasing.txt, add update script #5026
base: master
Conversation
""".format(os.path.basename(__file__), args.url))
        output.write("")
        output.write("var SpecialCasings = " + json.dumps(casings_dict_list, indent = 4, sort_keys = True).replace("\\\\", "\\") + ";")
elif args.json is not None:
Might change this to a plain `if`, in case someone specifies both output types.
I suppose, but the JSON mode isn't useful for anything that I am aware of; I only made the option available because I was already producing JSON output for the JS version. In the future there may be an additional output mode that emits a C struct initializer, because I think UnifiedRegex uses this data too.
output.write("var SpecialCasings = " + json.dumps(casings_dict_list, indent = 4, sort_keys = True).replace("\\\\", "\\") + ";")
elif args.json is not None:
    with open(args.json, "w") as output:
        output.write(json.dumps(casings_dict_list, indent = 4, sort_keys = True))
Why does this one not need removal of double-escapes?
Because we really want it to be 6 characters, literally "\uFFFF", but the json.dumps output escapes the backslash to make it 7 characters, "\\uFFFF". Since this is a JS object and a JS string, we don't want to escape the backslash; the backslash is there to escape the "u".
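The effect of that post-hoc `.replace` can be seen in isolation; a minimal sketch (the codepoint is illustrative):

```python
import json

s = "\\u03A9"  # six characters: backslash, u, 0, 3, A, 9
dumped = json.dumps(s)
# json.dumps escapes the backslash, producing 7 characters inside the quotes
print(dumped)                        # "\\u03A9"
# collapsing the double backslash restores the JS-style escape
print(dumped.replace("\\\\", "\\"))  # "\u03A9"
```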
Right, but why not do that on the second output type option?
Because I thought JSON strings didn't actually have or care about the "\u" escape sequence, so the string itself should be "\u". Perhaps that is wrong?
JSON definitely supports escaped characters, or it would never have gotten so ubiquitous (we'd all be using some JSON-plus-encoding format instead). As you mentioned elsewhere, it all really depends on the use case, and if nobody is depending on this JSON file for anything then it doesn't really matter.
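For what it's worth, JSON's `\uXXXX` escapes round-trip cleanly through a standard parser; a quick sketch:

```python
import json

# An 8-character JSON document, "\u03a9", decodes to the single character Ω
assert json.loads('"\\u03a9"') == "\u03a9"
# Dumping it back re-escapes non-ASCII by default (ensure_ascii=True)
assert json.dumps("\u03a9") == '"\\u03a9"'
```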
from argparse import ArgumentParser


def to_unicode_escape(codepoints):
    return "".join(map(lambda codepoint: "\\u" + codepoint, codepoints.split(" ")))
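For context, the helper above turns the space-separated hex fields of SpecialCasing.txt into literal backslash-u escape text rather than real characters; a sketch of its behavior:

```python
def to_unicode_escape(codepoints):
    # Produces escape *text* ("\u0053..."), not actual Unicode characters
    return "".join(map(lambda codepoint: "\\u" + codepoint, codepoints.split(" ")))

# "0053 0053" is the SpecialCasing.txt uppercase field for LATIN SMALL LETTER SHARP S
print(to_unicode_escape("0053 0053"))  # \u0053\u0053
```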
Why do we need to do this manually? Does the json module not escape correctly?
To clarify: could we build up an actual Unicode string using `unichr` and do escaping upon output?
The source file, SpecialCasing.txt, only uses the 4-digit hex codepoint, and we want to turn it into a \u unicode literal.
Sorry for the terse and unhelpful comments, I was on my phone. I was just brainstorming ways we wouldn't have to do the step at the end where we change double-backslash to single-backslash, because that feels mildly dangerous: we have valid JSON and then put it through a text filter that could in some cases produce invalid JSON. I know the current content is fine with that transformation, but maybe we could build something more robust against future modification. I thought we could do something here like `"".join(map(lambda codepoint: unichr(int(codepoint, 16)), codepoints.split(" ")))`, which would build up an actual Unicode Python string instead of one containing backslash characters. Then during the output step we just trust `json.dumps` to put backslash characters in the right places. The current version obviously works fine, so feel free to leave it as-is if you prefer.
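A sketch of that alternative in Python 3, where `chr` subsumes Python 2's `unichr`, trusting `json.dumps` to handle the escaping:

```python
import json

def to_unicode_string(codepoints):
    # Build an actual Unicode string from the hex fields
    return "".join(chr(int(cp, 16)) for cp in codepoints.split(" "))

s = to_unicode_string("03A9 0342 0399")
# json.dumps emits the \u escapes itself (ensure_ascii=True is the default),
# so no post-hoc double-backslash replacement is needed
print(json.dumps(s))  # "\u03a9\u0342\u0399"
```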
My Python knowledge is enough to produce something that works, but not necessarily something that is good -- I had no idea about that `unichr` function. I can switch to that.
On second thought, I might be steering us straight off the cliffs of insanity with this suggestion, because of the following sentence in the Python docs:
The valid range for the argument depends how Python was configured – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF].
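That caveat applies to Python 2's `unichr`, which raises ValueError for codepoints above 0xFFFF on a narrow (UCS2) build. In Python 3, `chr` accepts the full Unicode range regardless of how the interpreter was built (PEP 393 removed narrow builds), so the concern is moot there. A quick check:

```python
# Python 3: chr covers the whole Unicode range on any build
c = chr(0x10400)  # DESERET CAPITAL LETTER LONG I, above the BMP
assert ord(c) == 0x10400
assert len(c) == 1  # one character, not a surrogate pair
```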
{
    name: "SpecialCasing.txt",
    body() {
        const hasICU = WScript.Platform.INTL_LIBRARY === "icu";
Just as an FYI, you might also see some failures on really old versions of ICU that ship older Unicode data.
That's OK. It looks like all of the tests are passing, and we only support ICU 55 at the earliest; even that is hopefully on its way out soon as 16.04 fades relative to 18.04.
Worst case, since we are now getting the CLDR version from ICU, I could plumb that info through to WScript.Platform in order to do one-off exclusions based on data version.
    "name": "GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI",
    "upper": "\u03A9\u0342\u0399"
}
];
nit: might as well output a newline here.
LGTM
@jackhorton was this PR blocked on anything besides my review?
Nope, it just fell by the wayside. We talked about it, and my memory is telling me that you said this script would not be useful for RegExp tables, so in that case I will probably drop the logic surrounding output formats entirely and make it just a script for creating the JS file.
@jackhorton that's what I remember, too. Maybe keep a copy of that logic around in a separate branch so we can investigate unifying the logic, if applicable.
Slight detour from my regularly scheduled programming of making toCase faster/safer.