Support Unicode escape sequences #48

Open
hollasch opened this issue Oct 2, 2024 · 4 comments

@hollasch
Owner

hollasch commented Oct 2, 2024

Model after C++:

  • \unnnn — code point U+nnnn (4 hexadecimal digits)
  • \u{nnnn} — code point U+n... (arbitrary number of hexadecimal digits)
  • \Unnnnnnnn — code point U+nnnnnnnn (8 hexadecimal digits)
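For illustration, the three forms listed above could be decoded along these lines. This is only a sketch: the function name `decode_backslash_escapes` and the regular expression are assumptions made for this example, not code from the project.

```python
import re

# One alternative per form in the list above.
_ESCAPE = re.compile(
    r"\\u\{([0-9A-Fa-f]+)\}"   # \u{n...}: any number of hex digits
    r"|\\u([0-9A-Fa-f]{4})"    # \unnnn: exactly 4 hex digits
    r"|\\U([0-9A-Fa-f]{8})"    # \Unnnnnnnn: exactly 8 hex digits
)


def decode_backslash_escapes(text: str) -> str:
    """Replace each recognized escape with its code point (hypothetical helper)."""

    def substitute(match: re.Match) -> str:
        # Exactly one of the three groups is set for any match.
        digits = match.group(1) or match.group(2) or match.group(3)
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(digits, 16))

    return _ESCAPE.sub(substitute, text)
```

With this sketch, `decode_backslash_escapes(r"foo\u0278bar")` yields `foo`, then U+0278, then `bar`. Text that does not match one of the three forms is passed through unchanged.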

hollasch self-assigned this Oct 7, 2024

@hollasch
Owner Author

Well, duh: you can't use backslash escapes, as they would collide with Windows-style directory separators. Instead, let's go with this:

  • %uXXXX: XXXX is four hex digits. Unicode code point in the BMP: 0x0000–0xD7FF, 0xE000–0xFFFF.
  • %UXXXXXXXX: XXXXXXXX is eight hex digits. Unicode code points above the BMP: 0x10000–0x10FFFF.

@hollasch
Owner Author

Of course, with the % escape sequence, we need %% to represent a literal %.
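A minimal sketch of how the %uXXXX, %UXXXXXXXX, and %% forms might be decoded is shown here. The function name `decode_percent_escapes` and the error handling are assumptions; the thread does not say how malformed input should be treated. The sketch also accepts any non-surrogate code point through either form, which is looser than the ranges given above.

```python
import string


def decode_percent_escapes(text: str) -> str:
    """Decode %uXXXX, %UXXXXXXXX and %% in text (hypothetical helper)."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "%":
            out.append(ch)
            i += 1
            continue
        marker = text[i + 1 : i + 2]  # empty string when % is the last character
        if marker == "%":
            out.append("%")  # %% is a literal percent sign
            i += 2
            continue
        width = {"u": 4, "U": 8}.get(marker)
        if width is None:
            # Assumption: unknown or dangling escapes are errors.
            raise ValueError(f"unknown escape at index {i}")
        digits = text[i + 2 : i + 2 + width]
        if len(digits) != width or any(d not in string.hexdigits for d in digits):
            raise ValueError(f"expected {width} hex digits at index {i}")
        code = int(digits, 16)
        if 0xD800 <= code <= 0xDFFF or code > 0x10FFFF:
            raise ValueError(f"not a valid code point at index {i}")
        out.append(chr(code))
        i += 2 + width
    return "".join(out)
```

Because the digit count is fixed, no closing delimiter is needed: `foo%u0278bar` decodes to `foo`, U+0278, `bar`, and `100%%` decodes to `100%`.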

@hollasch
Owner Author

hollasch commented Oct 20, 2024

Could also use #, as in foo#u0278bar versus foo%u0278bar.

Just thinking out loud: we could also use the special character as a symmetric delimiter, for example foo%278%bar. Or we could specify a separate character as the ending delimiter, like foo%278)bar or foo#278;bar. In each case, a doubled beginning delimiter represents the literal character, so %% denotes % and ## denotes #.

Rats — using ; or ) (for the end delimiter) is problematic on Windows. You'd have to escape these on the command line. I'd rather avoid having to escape special characters at the command line, as that gets you into double-escaping for scripts and such.

@hollasch
Owner Author

All right. I think the best option for now is to go with #AB…C# to denote Unicode code point AB…C, and ## to denote a literal #.
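As a rough sketch of that final form, a decoder could scan left to right, treating ## as a literal # and anything between a single # and the next # as hex digits. The name `decode_hash_escapes`, the left-to-right resolution of adjacent # characters, and the choice to raise on malformed input are all assumptions, not behavior stated in this thread.

```python
import string


def decode_hash_escapes(text: str) -> str:
    """Decode #AB...C# code points and ## literals (hypothetical helper)."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != "#":
            out.append(text[i])
            i += 1
            continue
        if text[i + 1 : i + 2] == "#":
            out.append("#")  # ## is a literal hash
            i += 2
            continue
        end = text.find("#", i + 1)
        if end == -1:
            # Assumption: a lone # with no closing delimiter is an error.
            raise ValueError(f"unterminated escape at index {i}")
        digits = text[i + 1 : end]
        if any(d not in string.hexdigits for d in digits):
            raise ValueError(f"non-hex digit in escape at index {i}")
        code = int(digits, 16)
        if 0xD800 <= code <= 0xDFFF or code > 0x10FFFF:
            raise ValueError(f"not a valid code point at index {i}")
        out.append(chr(code))
        i = end + 1
    return "".join(out)
```

Under these assumptions, `foo#278#bar` decodes to `foo`, U+0278, `bar`, and `a##b` decodes to `a#b`. Leading zeros are optional because the closing # marks where the digits end.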
