IPython mangles unicode characters #5618

Closed
Keno opened this issue Apr 14, 2014 · 8 comments · Fixed by #5628

Keno commented Apr 14, 2014

This occurs in IJulia, but I've checked the ZMQ message boundary, so I think it probably happens somewhere between the browser and the kernel. When entering a high Unicode character (such as U+1D712), it arrives at the kernel as

6-element Array{Uint8,1}:
 0xed
 0xa0
 0xb5
 0xed
 0xbc
 0x92

rather than the expected:

4-element Array{Uint8,1}:
 0xf0
 0x9d
 0x9c
 0x92

Now, in Julia we do our best to retain invalid Unicode data, which has the effect that if you print the bytes above again, they display correctly in the browser. I assume somewhere in the software stack there's an encoding that's not quite UTF-8, though it does convert back, so that's nice. I also checked that printing the proper UTF-8 character shows up correctly in the browser. This might be a browser issue, but the browser encoding is set to UTF-8, it happens in Firefox and Safari as well, and when I paste the character into a text editor I get the proper UTF-8 character.
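For reference, a minimal sketch in Python 3 (the report itself is from a Julia session) showing that the 4-byte form is what a plain UTF-8 encoder produces for U+1D712:

# Sketch: plain UTF-8 encoding of U+1D712 gives the expected 4 bytes.
expected = "\U0001d712".encode("utf-8")
assert expected == b"\xf0\x9d\x9c\x92"
assert expected != b"\xed\xa0\xb5\xed\xbc\x92"  # the 6 bytes actually observed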

Let me know if there's anything I can do to help investigate.

minrk commented Apr 14, 2014

Python suggests that those two are both valid UTF-8 representations of the same code point:

In [1]: b'\xed\xa0\xb5\xed\xbc\x92'.decode('utf-8') == b'\xf0\x9d\x9c\x92'.decode('utf-8')
Out[1]: True
In [2]: b'\xed\xa0\xb5\xed\xbc\x92'.decode('utf-8') == u'\U0001d712'
Out[2]: True
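
A note on Python versions (a sketch, assuming the session above is Python 2): Python 3's strict UTF-8 codec rejects the surrogate-encoded 6-byte form unless the 'surrogatepass' error handler is used:

# Sketch (Python 3): the strict UTF-8 codec refuses surrogate-encoded bytes.
try:
    b"\xed\xa0\xb5\xed\xbc\x92".decode("utf-8")
except UnicodeDecodeError:
    pass  # raised on Python 3; the Python 2 session above decodes it permissively
# 'surrogatepass' decodes them, but as two unpaired surrogate code points:
assert b"\xed\xa0\xb5\xed\xbc\x92".decode("utf-8", "surrogatepass") == "\ud835\udf12"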

Keno commented Apr 14, 2014

This is beginning to look suspiciously like CESU-8. It's nice that it's being decoded to the right Unicode character in Python, and maybe we should do that in Julia too, but I think it would still be nice if we got proper UTF-8 over the wire.

EDIT: I was thinking CESU-8 because the six-byte sequence is the UTF-8 encoding of the two code units d835 and df12, which are the UTF-16 surrogates for this character.
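
A minimal sketch (Python 3) of that scheme: split U+1D712 into its UTF-16 surrogates, UTF-8-encode each surrogate on its own, and you get exactly the six bytes seen at the kernel:

# Sketch: CESU-8-style encoding of U+1D712 via its UTF-16 surrogate pair.
cp = 0x1D712
hi = 0xD800 + ((cp - 0x10000) >> 10)    # high surrogate: 0xD835
lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)  # low surrogate:  0xDF12
cesu8 = b"".join(chr(u).encode("utf-8", "surrogatepass") for u in (hi, lo))
assert cesu8 == b"\xed\xa0\xb5\xed\xbc\x92"            # the observed 6 bytes
assert chr(cp).encode("utf-8") == b"\xf0\x9d\x9c\x92"  # plain UTF-8: 4 bytes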

Keno commented Apr 15, 2014

Update: I used Wireshark to look at the WebSocket traffic, and the UTF-8 is transmitted properly as

7b 22 63 6f 64 65 22 3a 22 5c 22 *f0 9d 9c 92* 5c  {"code":"\"....\

(emphasis mine)
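
Decoding that captured fragment (a sketch, Python 3; the frame is truncated exactly as shown above) confirms the raw 4-byte UTF-8 sequence sits verbatim in the JSON payload:

# Sketch: the captured (truncated) frame bytes contain f0 9d 9c 92 verbatim.
frame = bytes.fromhex("7b22636f6465223a225c22f09d9c925c")
print(frame.decode("utf-8"))  # {"code":"\"𝜒\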

minrk commented Apr 15, 2014

Looks like it's the ensure_ascii flag of the Python stdlib's json module, which defaults to True:

In [23]: json.dumps(u'\U0001d712', ensure_ascii=True)
Out[23]: '"\\ud835\\udf12"'

In [24]: json.dumps(u'\U0001d712', ensure_ascii=False)
Out[24]: u'"\U0001d712"'

Keno commented Apr 15, 2014

Awesome detective work. Any problem with just turning that off?

minrk commented Apr 15, 2014

It should be okay to do that, but I don't think anything incorrect is currently happening. The JSON spec says any character may be escaped, and that Unicode escapes must use four hex characters, so high code points must be split into surrogate pairs. Both byte sequences are valid UTF-8 representations of the same Unicode code point.
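
A minimal illustration of that split (a sketch, assuming Python 3's json module): the four-hex-digit \u escapes carry the surrogate pair, and a conforming parser reassembles them into the single code point:

import json

# The escaped surrogate pair round-trips to the single code point U+1D712.
assert json.loads('"\\ud835\\udf12"') == "\U0001d712"
assert json.loads(json.dumps("\U0001d712", ensure_ascii=True)) == "\U0001d712"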

minrk commented Apr 15, 2014

#5628 should stop applying ensure_ascii between the server and the kernel. It is still applied between the server and JavaScript, but JS has never indicated a problem with that (and removing it there would require centralizing a bunch of things).

Keno commented Apr 15, 2014

Ok, I didn't know about that JSON quirk. I'll fix this on the Julia JSON side as well.
