IPython mangles unicode characters #5618
Comments
Python suggests that those two are both valid UTF-8 representations of the same code point:
|
This is beginning to look suspiciously like CESU-8. It's nice that it's being decoded to the right Unicode character in Python, and maybe we should do that in Julia too, but I think it would still be nice if we got proper UTF-8 over the wire. EDIT: I was thinking CESU-8 because the hex of the two characters in the six-byte sequence is (UTF-8 encoded) d835 and df12, which are the UTF-16 surrogates here. |
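To make the CESU-8 observation concrete, here is a minimal Python 3 sketch comparing the two byte sequences for U+1D712. The variable names are illustrative, and the `surrogatepass` error handler is used only to reproduce the six-byte form; this is not a claim about what any layer of the stack actually does.

```python
# Sketch (Python 3): UTF-8 versus CESU-8-style bytes for U+1D712.
ch = "\U0001d712"

# Proper UTF-8: a single four-byte sequence.
utf8_bytes = ch.encode("utf-8")

# UTF-16 surrogate pair for the same code point.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)    # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate

# CESU-8 style: each surrogate is encoded on its own as three bytes.
# Strict UTF-8 rejects lone surrogates, so "surrogatepass" is needed here.
cesu8_bytes = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")

print(utf8_bytes.hex())      # f09d9c92
print(hex(high), hex(low))   # 0xd835 0xdf12
print(cesu8_bytes.hex())     # eda0b5edbc92
```

The surrogate values match the d835 and df12 noted above.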
Update: I used Wireshark to look at the websocket traffic, and the UTF-8 is transmitted properly as
(emphasis mine) |
Looks like it's the Python stdlib:
In [23]: json.dumps(u'\U0001d712', ensure_ascii=True)
Out[23]: '"\\ud835\\udf12"'
In [24]: json.dumps(u'\U0001d712', ensure_ascii=False)
Out[24]: u'"\U0001d712"' |
Awesome detective work. Any problem with just turning that off? |
It should be okay to do that, but I don't think anything incorrect is currently happening. The JSON spec says any character may be escaped, and that unicode escapes must use four hex characters, so high code points must be split. Both byte sequences are valid utf8 representations of the same unicode code point. |
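As a rough illustration of the point about JSON escapes, the following Python 3 sketch checks that the escaped and unescaped forms decode to the same string with the standard `json` module. Only the bytes on the wire differ; the variable names are made up for the example.

```python
import json

ch = "\U0001d712"

# With ensure_ascii=True the code point is written as two \uXXXX escapes.
escaped = json.dumps(ch, ensure_ascii=True)
# With ensure_ascii=False the character is emitted as-is.
raw = json.dumps(ch, ensure_ascii=False)

# A conforming JSON parser recombines the surrogate pair into one code point.
decoded_escaped = json.loads(escaped)
decoded_raw = json.loads(raw)

# Sizes of the two JSON documents once encoded as UTF-8.
escaped_wire = escaped.encode("utf-8")   # pure ASCII
raw_wire = raw.encode("utf-8")           # contains the 4-byte UTF-8 sequence

print(escaped, len(escaped_wire))
print(decoded_escaped == decoded_raw == ch)
```

A decoder that treats each `\uXXXX` escape independently, without pairing surrogates, would end up with the six-byte form instead.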
#5628 should stop applying ensure_ascii between the server and the kernel. It is still applied between the server and javascript, but js has never indicated a problem with that (and it would require centralizing a bunch of things to make that happen). |
Ok, I didn't know about that JSON quirk. I'll fix this on the Julia JSON side as well. |
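For reference, the pairing step a JSON decoder needs can be sketched in Python 3 as below. The helper names `combine_surrogates` and `unescape_surrogate_pairs` are hypothetical and not taken from any JSON library, and the regex deliberately ignores other escapes (including escaped backslashes), so treat it as an outline of the arithmetic, not a complete string decoder.

```python
import re

# Matches a high-surrogate escape immediately followed by a low-surrogate escape.
_PAIR = re.compile(
    r"\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})"
)


def combine_surrogates(high: int, low: int) -> str:
    """Combine a UTF-16 surrogate pair into the single character it encodes."""
    return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))


def unescape_surrogate_pairs(text: str) -> str:
    """Replace each escaped surrogate pair in `text` with the real character."""
    return _PAIR.sub(
        lambda m: combine_surrogates(int(m.group(1), 16), int(m.group(2), 16)),
        text,
    )


result = unescape_surrogate_pairs(r"\ud835\udf12")
print(result == "\U0001d712")
print(result.encode("utf-8").hex())
```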
This occurs in IJulia, but I've checked the ZMQ message boundary, so I think it happens somewhere between the browser and the kernel. When entering a high Unicode character (such as U+1D712), it arrives at the kernel as
rather than the expected
.
Now, in Julia we do our best to retain invalid Unicode data, so if you print the above again it displays correctly in the browser. I assume there is an encoding somewhere in the software stack that is not quite UTF-8, though it does convert back, so that's nice. I also checked that printing the proper UTF-8 character shows up correctly in the browser. This might be a browser issue, but I checked that the browser is set to UTF-8, it happens in Firefox and Safari as well, and when I paste the output into a text editor I get the proper UTF-8 character.
Let me know if there's anything I can do to help investigate.