-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Unicode #1
Comments
This would be great. How painful did this change look to be? I might be able to contribute if it's not huge. |
My plan was to avoid having a dependency on ICU -- store everything I suggest 1) modifying the internal string representation in state.h 2) This would be great. How painful did this change look to be? I might be — |
Great, thanks for the info @sparkprime. I'll update here if I get a chance to try it; I need more emoji in my json 🍻 I really love Jsonnet BTW. My team is using it along with ApiDoc to create API documentation that doubles as a mock API server for developing apps against APIs that aren't finished yet. |
Glad you like it! I did some reading and it seems wstring is not what we want because it has UTF16 behavior on windows. So we probably need to do something like
with functions to convert from UTF8-encoded std::string to that and back. There are a bunch of places where the HeapString internal representation leaks out into other places as well, e.g. field names, std.extVar() keys, filenames (from std.thisFile) etc. |
@hotdog929 you may be interested |
I'm going to have a go at this because I think it's probably harder / more work than I originally thought. |
That was a productive 4 hours ;) |
Wow @sparkprime, way to kill it!! |
Nice! :D Perhaps I should also add a |
Looks like normal unicode characters (like I'm suspicious of the encode_utf8 method, but I'm struggling to understand what all the bit masking and shifting is doing. |
I think I have a fix, looks like a typo on this line:
Changing that to the following seems more right
|
Submitted a fix as #78. I didn't see an easy way to test this as the
One solution for testing could be to add support for the ECMAScript6 code point escapes (like If you have another idea for testing, I'd love to hear it! |
Thanks for tracking this down! I suppose you can do things like "🚀🚀🚀"[1] which should == "🚀". \u{XXX} should be a no-brainer though, it could be added in the lexer quite easily. I have been worried for a long time about the limitation of \u and whether it's necessary to support e.g. things like this as well https://bugs.launchpad.net/zorba/+bug/1024448 |
No problem! It was enlightening to learn more about the inner workings of unicode |
Fix build errors, implement parseCommaList
Currently only ASCII is supported in strings. It should not be too hard to accept UTF-8 (raising an error for invalid input), and adjust internal string routines to support unparsing those strings correctly, as well as routines for iterating over codepoints, correctly determining the length (in codepoints), etc.
The text was updated successfully, but these errors were encountered: