IPython mangles unicode characters #5618

Closed
Keno opened this issue Apr 14, 2014 · 8 comments · Fixed by #5628

Keno commented Apr 14, 2014

This occurs in IJulia, but I've checked the ZMQ message boundary, so I think it probably happens somewhere between the browser and the kernel. When entering a high Unicode character (such as U+1D712), it arrives at the kernel as

6-element Array{Uint8,1}:
 0xed
 0xa0
 0xb5
 0xed
 0xbc
 0x92

rather than the expected:

4-element Array{Uint8,1}:
 0xf0
 0x9d
 0x9c
 0x92

Now, in Julia we do our best to retain invalid Unicode data, which has the effect that if you print the bytes above again, they display correctly in the browser. I assume somewhere in the software stack there's an encoding that's not quite UTF-8, though it does convert back, so that's nice. I also checked that printing the proper UTF-8 character shows up correctly in the browser. This might be a browser issue, but the browser encoding is set to UTF-8, it happens in Firefox and Safari as well, and when I paste the character into a text editor I get the proper UTF-8 character.
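For reference, a minimal sketch in Python 3 (the report itself is from a Julia session) showing that the 4-byte form is what a plain UTF-8 encoder produces for U+1D712:

# Sketch: plain UTF-8 encoding of U+1D712 gives the expected 4 bytes.
expected = "\U0001d712".encode("utf-8")
assert expected == b"\xf0\x9d\x9c\x92"
assert expected != b"\xed\xa0\xb5\xed\xbc\x92"  # the 6 bytes actually observed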

Let me know if there's anything I can do to help investigate.

minrk commented Apr 14, 2014

Python suggests that those two are both valid UTF-8 representations of the same code point:

In [1]: b'\xed\xa0\xb5\xed\xbc\x92'.decode('utf-8') == b'\xf0\x9d\x9c\x92'.decode('utf-8')
Out[1]: True
In [2]: b'\xed\xa0\xb5\xed\xbc\x92'.decode('utf-8') == u'\U0001d712'
Out[2]: True
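
A note on Python versions (a sketch, assuming the session above is Python 2): Python 3's strict UTF-8 codec rejects the surrogate-encoded 6-byte form unless the 'surrogatepass' error handler is used:

# Sketch (Python 3): the strict UTF-8 codec refuses surrogate-encoded bytes.
try:
    b"\xed\xa0\xb5\xed\xbc\x92".decode("utf-8")
except UnicodeDecodeError:
    pass  # raised on Python 3; the Python 2 session above decodes it permissively
# 'surrogatepass' decodes them, but as two unpaired surrogate code points:
assert b"\xed\xa0\xb5\xed\xbc\x92".decode("utf-8", "surrogatepass") == "\ud835\udf12"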

Keno commented Apr 14, 2014

This is beginning to look suspiciously like CESU-8. It's nice that it's being decoded to the right Unicode character in Python, and maybe we should do that in Julia too, but I think it would still be nice if we got proper UTF-8 over the wire.

EDIT: I was thinking CESU-8 because the six-byte sequence is the UTF-8 encoding of the two code units d835 and df12, which are the UTF-16 surrogates for this character.
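
A minimal sketch (Python 3) of that scheme: split U+1D712 into its UTF-16 surrogates, UTF-8-encode each surrogate on its own, and you get exactly the six bytes seen at the kernel:

# Sketch: CESU-8-style encoding of U+1D712 via its UTF-16 surrogate pair.
cp = 0x1D712
hi = 0xD800 + ((cp - 0x10000) >> 10)    # high surrogate: 0xD835
lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)  # low surrogate:  0xDF12
cesu8 = b"".join(chr(u).encode("utf-8", "surrogatepass") for u in (hi, lo))
assert cesu8 == b"\xed\xa0\xb5\xed\xbc\x92"            # the observed 6 bytes
assert chr(cp).encode("utf-8") == b"\xf0\x9d\x9c\x92"  # plain UTF-8: 4 bytes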

Keno commented Apr 15, 2014

Update: I used Wireshark to look at the WebSocket traffic, and the UTF-8 is transmitted properly as

7b 22 63 6f 64 65 22 3a 22 5c 22 *f0 9d 9c 92* 5c  {"code":"\"....\

(emphasis mine)
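
Decoding that captured fragment (a sketch, Python 3; the frame is truncated exactly as shown above) confirms the raw 4-byte UTF-8 sequence sits verbatim in the JSON payload:

# Sketch: the captured (truncated) frame bytes contain f0 9d 9c 92 verbatim.
frame = bytes.fromhex("7b22636f6465223a225c22f09d9c925c")
print(frame.decode("utf-8"))  # {"code":"\"𝜒\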

minrk commented Apr 15, 2014

Looks like it's the ensure_ascii flag of the Python stdlib's json module, which defaults to True:

In [23]: json.dumps(u'\U0001d712', ensure_ascii=True)
Out[23]: '"\\ud835\\udf12"'

In [24]: json.dumps(u'\U0001d712', ensure_ascii=False)
Out[24]: u'"\U0001d712"'

Keno commented Apr 15, 2014

Awesome detective work. Any problem with just turning that off?

minrk commented Apr 15, 2014

It should be okay to do that, but I don't think anything incorrect is currently happening. The JSON spec says any character may be escaped, and that Unicode escapes must use four hex characters, so high code points must be split into surrogate pairs. Both byte sequences are valid UTF-8 representations of the same Unicode code point.
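
A minimal illustration of that split (a sketch, assuming Python 3's json module): the four-hex-digit \u escapes carry the surrogate pair, and a conforming parser reassembles them into the single code point:

import json

# The escaped surrogate pair round-trips to the single code point U+1D712.
assert json.loads('"\\ud835\\udf12"') == "\U0001d712"
assert json.loads(json.dumps("\U0001d712", ensure_ascii=True)) == "\U0001d712"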

minrk commented Apr 15, 2014

#5628 should stop applying ensure_ascii between the server and the kernel. It is still applied between the server and JavaScript, but JS has never indicated a problem with that (and removing it there would require centralizing a bunch of things).

Keno commented Apr 15, 2014

Ok, I didn't know about that JSON quirk. I'll fix this on the Julia JSON side as well.
