|
| 1 | +Unicode for ASCII folks |
| 2 | + |
| 3 | +* Why do you need to learn about unicode |
| 4 | + |
| 5 | +In today's world, everything is in unicode. |
| 6 | + |
| 7 | +You might not think you need unicode, but you do. |
| 8 | + |
| 9 | +- What happens when someone tries to add an emoji in your comment box? |
| 10 | +- What happens if you have users who have spécial characters in their names? |
| 11 | +- What about all the people who don't speak English? |
| 12 | + |
| 13 | +* What the hell is unicode |
| 14 | +** History |
| 15 | + |
| 16 | +- Wild west |
| 17 | +- ASCII |
| 18 | +- 256 characters are enough for everyone, right? |
| 19 | +- Code pages. Extended code pages |
| 20 | +- Unicode |
| 21 | + |
| 22 | +** Unicode |
| 23 | + |
| 24 | +Unicode just defines these things: |
| 25 | + |
| 26 | +- A code point: think of it as an array index |
| 27 | +- A character name |
| 28 | +- A reference glyph (how it /should/ look) |
| 29 | + |
| 30 | +Example: |
| 31 | + |
| 32 | +| Code Point | Character Name | Glyph | Hex Code | |
| 33 | +|------------+------------------------+---------+----------| |
| 34 | +| 65 | LATIN CAPITAL LETTER A | A | 0x41 | |
| 35 | +| 181 | MICRO SIGN | µ | 0xb5 | |
| 36 | +| 8377 | INDIAN RUPEE SIGN | ₹ | 0x20b9 | |
| 37 | +| 128542 | DISAPPOINTED FACE | 😞 | 0x1f61e | |
| 38 | + |
| 39 | +That's basically it. |
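
The table above can be reproduced from Python's standard library. This is a minimal sketch: ~describe~ is a helper name made up for this example, while ~ord~ and ~unicodedata.name~ are standard.

```python
import unicodedata


def describe(ch):
    """Return (code point, U+ notation, official Unicode name) for one character."""
    cp = ord(ch)  # the "array index" of the character
    return cp, f"U+{cp:04X}", unicodedata.name(ch)


for ch in ["A", "\u00b5", "\u20b9", "\U0001F61E"]:  # A, micro sign, rupee sign, disappointed face
    print(describe(ch))
```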
| 40 | + |
| 41 | +* Unicode Encodings |
| 42 | + |
| 43 | +This is the fun part. Encoding is basically a way to represent a unicode |
| 44 | +character in memory or on disk. |
| 45 | + |
| 46 | +So, think of encoding as a function that takes a series of characters |
| 47 | +(i.e., a String), and returns a byte array. Decoding does the reverse. |
| 48 | + |
| 49 | +#+BEGIN_SRC java |
| 50 | + |
| 51 | + public interface Encoding { |
| 52 | + byte[] encode(String inputString); |
| 53 | + String decode(byte[] inputBytes); |
| 54 | + } |
| 55 | + |
| 56 | +#+END_SRC |
| 57 | + |
| 58 | +UTF-8 is an encoding. UTF-16 is another encoding. There are many types of encodings. |
| 59 | + |
| 60 | +Please note that the byte array itself may not have any indication of which encoding is being used. We will come back to this point later. |
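
To make the encode/decode idea concrete, here is a small Python sketch. The sample string is arbitrary. It shows that the same text becomes different bytes under different encodings, and that the bytes alone do not tell you which encoding to use when decoding.

```python
text = "\u20b9100"  # rupee sign followed by "100"

utf8_bytes = text.encode("utf-8")       # 3 bytes for the rupee sign, 1 byte per digit
utf16_bytes = text.encode("utf-16-le")  # 2 bytes for every character in this string

print(utf8_bytes)
print(utf16_bytes)

# Decoding with the same encoding gives the original string back.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with a different encoding "works" but silently produces garbage.
wrong = utf8_bytes.decode("latin-1")
print(repr(wrong))
```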
| 61 | + |
| 62 | + |
| 63 | +** UTF-8 |
| 64 | + |
| 65 | +UTF-8 is a great hack. I love UTF-8. |
| 66 | + |
| 67 | +- UTF-8 is a variable length encoding |
| 68 | + |
| 69 | + What this means is that each unicode character may take 1 byte to represent, or it may take 2, 3 or even 4 bytes. |
| 70 | + |
| 71 | + This means that the length of the byte array is not the length of the string! |
| 72 | + |
| 73 | +- UTF-8 is compatible with ASCII. |
| 74 | + |
| 75 | + So, all ASCII characters are represented in 1 byte, with the same byte value. |
| 76 | + |
| 77 | + This makes it backwards compatible with a lot of systems, and also saves memory / disk space if your text is predominantly English. |
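
Both properties are easy to check in Python. This is a sketch using the four characters from the earlier table; ~utf8_len~ is a helper name made up for this example.

```python
def utf8_len(ch):
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


samples = ["A", "\u00b5", "\u20b9", "\U0001F61E"]  # A, micro sign, rupee sign, emoji
sizes = [utf8_len(ch) for ch in samples]
print(sizes)

# ASCII text produces exactly the same bytes in ASCII and in UTF-8.
ascii_compatible = "hello".encode("utf-8") == "hello".encode("ascii")
print(ascii_compatible)
```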
| 78 | + |
| 79 | +** UTF-16 (and UCS-2) |
| 80 | + |
| 81 | +When they introduced unicode, they thought 2 bytes (65,536 characters) |
| 82 | +would be more than enough. This was wrong. There are more than 65,536 |
| 83 | +characters in unicode right now. |
| 84 | + |
| 85 | +The initial unicode encoding, UCS-2, only supports characters from |
| 86 | +0x0000 to 0xFFFF. It does not support full unicode. |
| 87 | + |
| 88 | +UTF-16 was developed as a replacement for UCS-2. |
| 89 | + |
| 90 | +UTF-16 is variable length. Most characters take 2 bytes, but |
| 91 | +characters whose code point is larger than 0xFFFF are represented |
| 92 | +using four bytes (a pair of 2-byte "surrogates"). |
| 93 | + |
| 94 | +Java strings are sequences of UTF-16 code units, so ~String.length()~ counts 2-byte code units, not characters. |
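
The four-byte case works by splitting the code point into two 16-bit "surrogate" code units. Below is a sketch of that arithmetic in Python. ~utf16_code_units~ is a name made up for this example, and it assumes the input is a single valid character.

```python
def utf16_code_units(ch):
    """Return the list of 16-bit code units UTF-16 uses for one character."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        return [cp]  # fits in a single 2-byte unit
    offset = cp - 0x10000               # 20 bits remain after removing the base
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return [high, low]


units = utf16_code_units("\U0001F61E")
print([hex(u) for u in units])

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
as_bytes = b"".join(u.to_bytes(2, "big") for u in units)
print(as_bytes == "\U0001F61E".encode("utf-16-be"))
```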
| 95 | + |
| 96 | +* Fonts |
| 97 | + |
| 98 | +Fonts are used to render characters to screen (or print). |
| 99 | + |
| 100 | +You might have done the right thing in your files (used the right |
| 101 | +encoding and right characters). There is still no guarantee that the |
| 102 | +text will be displayed properly: the font defines how a character is |
| 103 | +rendered, and not all fonts support all characters. |
| 104 | + |
| 105 | +For example, this used to be a problem with the Indian Rupee Sign ₹. |
| 106 | + |
| 107 | +When it was introduced, most fonts did not support it, so people used |
| 108 | +images and other hacks to get around it. The situation is pretty good now, |
| 109 | +but users on very old operating systems (XP :P) may still not be able |
| 110 | +to see this symbol. |
| 111 | + |
| 112 | +* Some implications |
| 113 | + |
| 114 | +** Wrong encoding being picked up |
| 115 | + |
| 116 | +This is the absolute most common issue you'll see. Open a page, and you see random characters. |
| 117 | + |
| 118 | +[[./bad-encoding.png]] |
| 119 | + |
| 120 | +This is because there was no encoding defined (or the wrong encoding was defined). |
| 121 | + |
| 122 | +Example files: [[file:bad-encoding.html][bad-encoding.html]], [[file:good-encoding.html][good-encoding.html]] |
| 123 | + |
| 124 | +To prevent: |
| 125 | + |
| 126 | +- Be very clear when you read data from the user. Whether it's a form submit, a REST API, or reading files, you need to know what the input encoding is. |
| 127 | +- Always, always use utf-8. Normalize user input to utf-8 and store it. Respond to user with utf-8 always. |
| 128 | +- Servers should declare the charset in the ~Content-Type~ header of the HTTP response, e.g. ~text/html; charset=utf-8~ (most frameworks will do this automatically) |
| 129 | +- HTML pages should declare the encoding near the top of ~<head>~, e.g. ~<meta charset="utf-8">~. Declare it even when the page is UTF-8: without a declaration, browsers may guess a legacy encoding. |
| 130 | +- Be careful about mixing encodings! *Just use utf-8*. |
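
The "random characters" effect is easy to reproduce. This Python sketch encodes a string as UTF-8, decodes it with a legacy Windows code page, and then reads and writes a file with the encoding stated explicitly. The file name and sample text are made up for illustration.

```python
import os
import tempfile

original = "fianc\u00e9"  # "fiancé"
data = original.encode("utf-8")

# A reader that assumes the wrong encoding gets mojibake instead of an error.
garbled = data.decode("cp1252")
print(garbled)

# Prevention: always name the encoding when writing and reading.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "name.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(original)
    with open(path, encoding="utf-8") as f:
        read_back = f.read()

print(read_back == original)
```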
| 131 | + |
| 132 | +** Language specific quirks |
| 133 | + |
| 134 | +This is a fun little exercise. What should happen if I run this piece of JS code? |
| 135 | + |
| 136 | +#+BEGIN_SRC js |
| 137 | + |
| 138 | + console.log("Length of µ is ", 'µ'.length); |
| 139 | + console.log('Length of 😞 is ', '😞'.length); |
| 140 | + |
| 141 | +#+END_SRC |
| 142 | + |
| 143 | +How about Python? |
| 144 | + |
| 145 | +#+BEGIN_SRC python |
| 146 | + # coding=utf-8 |
| 147 | + print("Len of µ is %d" % (len("µ"))) |
| 148 | + print("Len of 😞 is %d" % (len("😞"))) |
| 149 | +#+END_SRC |
| 150 | + |
| 151 | +#+RESULTS: |
| 152 | + |
| 153 | +How about F#? |
| 154 | + |
| 155 | +#+BEGIN_SRC fsharp |
| 156 | + printfn "Len of µ is %d" "µ".Length |
| 157 | + printfn "Len of 😞 is %d" "😞".Length |
| 158 | +#+END_SRC |
| 159 | + |
| 160 | +Something as simple as getting the length of a string will give you different results in different languages. |
| 161 | + |
| 162 | +There are tons of different quirks like this. You need to understand how the language you work with deals with unicode. |
| 163 | + |
| 164 | +It's a complex enough topic that Python made backwards incompatible changes (Python 3) in order to properly support unicode! |
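
The different answers come from counting different units. Python 3 counts code points, while JavaScript and .NET (which F# runs on) count UTF-16 code units, so the emoji comes out as 2 there. The sketch below computes all three common counts from Python 3; ~lengths~ is a helper name made up for this example.

```python
def lengths(s):
    """Measure one string in the three units that languages commonly report."""
    return {
        "code_points": len(s),                            # what Python 3 reports
        "utf16_units": len(s.encode("utf-16-le")) // 2,   # what JS / .NET report
        "utf8_bytes": len(s.encode("utf-8")),             # size on disk as UTF-8
    }


micro = lengths("\u00b5")      # micro sign
face = lengths("\U0001F61E")   # disappointed face
print(micro)
print(face)
```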
| 165 | + |
| 166 | +** Language and platform specific quirks |
| 167 | + |
| 168 | +One example here is the [[https://en.wikipedia.org/wiki/Byte_order_mark][Byte Order Mark]] (BOM). |
| 169 | + |
| 170 | +It's added to the start of files by some programs (and some operating systems) to indicate that the file contains unicode data, in either UTF-8 or UTF-16 format. |
| 171 | + |
| 172 | +It's the character U+FEFF. |
| 173 | + |
| 174 | +- In UTF-8, it's represented by 3 bytes: 0xEF, 0xBB, 0xBF |
| 175 | +- In UTF-16, it's represented on-disk as 0xFE 0xFF (big-endian) or 0xFF 0xFE (little-endian) |
| 176 | + |
| 177 | +Unfortunately, it is often missing, or present but with wrong values. Some tools do not understand it, some tools add it without asking, etc. |
| 178 | + |
| 179 | +It's not recommended to use the BOM, but you may need to deal with it. |
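
If you do have to deal with a BOM, Python's standard ~codecs~ module has the byte sequences as constants, and the ~utf-8-sig~ codec strips a leading BOM while decoding. A short sketch with made-up sample data:

```python
import codecs

# The BOM is just U+FEFF encoded in whichever encoding the file uses.
bom_utf8 = "\ufeff".encode("utf-8")
bom_utf16_be = "\ufeff".encode("utf-16-be")
bom_utf16_le = "\ufeff".encode("utf-16-le")
print(bom_utf8, bom_utf16_be, bom_utf16_le)

# A UTF-8 file that some editor saved "with BOM".
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

plain = data.decode("utf-8")         # BOM survives as an invisible first character
stripped = data.decode("utf-8-sig")  # BOM is removed if present

print(repr(plain))
print(repr(stripped))
```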
| 180 | + |
| 181 | +** Regexes |
| 182 | + |
| 183 | +A regex written with only ASCII in mind will reject valid input. Try to use [[https://msdn.microsoft.com/en-us/library/20bw873z(v%3Dvs.110).aspx][pre-defined character classes]] |
| 184 | +instead of enumerating characters yourself. But be aware that different |
| 185 | +languages have different ways to deal with this, and unicode support in |
| 186 | +regex has historically been weak in languages like JavaScript. |
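
Here is the difference in Python 3, where ~\w~ is Unicode-aware for string patterns. The names are sample data, and the two patterns are only illustrations, not a recommended name validator.

```python
import re

ascii_only = re.compile(r"^[A-Za-z]+$")  # enumerates characters by hand
unicode_aware = re.compile(r"^\w+$")     # pre-defined "word character" class

names = ["Jose", "Jos\u00e9", "\u042e\u043b\u0438\u044f"]  # Jose, José, Юлия

ascii_results = [bool(ascii_only.match(n)) for n in names]
unicode_results = [bool(unicode_aware.match(n)) for n in names]

print(ascii_results)
print(unicode_results)
```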
| 187 | + |
| 188 | +** Security |
| 189 | + |
| 190 | +Example: You might think that you're clicking on a link to |
| 191 | +wikipedia.org, but you're actually clicking on a link to wikipediа.org. |
| 192 | + |
| 193 | +Thankfully, most browsers mitigate this type of attack, for example by showing the Punycode form of suspicious domain names. |
| 194 | + |
| 195 | +Any place where there is a manual judgement involved is vulnerable to |
| 196 | +this type of attack though. |
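
One rough way to flag such lookalikes in your own tools is to check whether a label mixes scripts. The sketch below uses the first word of each character's Unicode name as a stand-in for its script. That is a heuristic assumed for this example, not an official script property, and ~scripts_used~ is a made-up helper name. The second test string assumes the lookalike swaps in a Cyrillic "а".

```python
import unicodedata


def scripts_used(text):
    """Approximate the set of scripts in text from Unicode character names."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER A" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


genuine = scripts_used("wikipedia")
lookalike = scripts_used("wikipedi\u0430")  # final letter is CYRILLIC SMALL LETTER A

print(genuine)
print(lookalike)
print(len(lookalike) > 1)  # mixed scripts: worth a closer look
```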
| 197 | + |
| 198 | +** Practical example |
| 199 | +*** Normalization |
| 200 | + |
| 201 | +While parsing form-16, we need to normalize all unicode spaces, hyphens, etc. to make it easy to parse. |
| 202 | + |
| 203 | +We chose to convert unicode characters to their ASCII equivalents before parsing, instead of making the entire parser aware of unicode. |
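
A minimal sketch of that approach in Python, using ~str.translate~. The mapping table is an illustrative assumption, not the actual list used in the parser.

```python
# Hypothetical mapping from Unicode punctuation to ASCII equivalents.
ASCII_EQUIVALENTS = {
    0x00A0: " ",  # NO-BREAK SPACE
    0x2009: " ",  # THIN SPACE
    0x2010: "-",  # HYPHEN
    0x2013: "-",  # EN DASH
    0x2014: "-",  # EM DASH
    0x2212: "-",  # MINUS SIGN
}


def to_ascii_punctuation(text):
    """Replace known Unicode spaces and dashes; leave everything else alone."""
    return text.translate(ASCII_EQUIVALENTS)


cleaned = to_ascii_punctuation("2016\u20132017\u00a0total")
print(cleaned)
```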
| 204 | + |
| 205 | +*** Full text search |
| 206 | + |
| 207 | +If you search for 'fiance', you should also get results if the text contains 'fiancé'. |
| 208 | + |
| 209 | +This is a hard problem. |
| 210 | + |
| 211 | +Full-text search databases have to deal with unicode and they have to |
| 212 | +normalize text in order to give you good search results. Naïve |
| 213 | +implementations will fail. |
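
One common trick is to fold both the query and the text to an accent-free, case-free form before comparing. The Python sketch below does this with NFD normalization. ~fold~ is a made-up helper name, and real search engines do considerably more than this.

```python
import unicodedata


def fold(text):
    """Strip combining accents and case so that 'fiancé' matches 'fiance'."""
    decomposed = unicodedata.normalize("NFD", text)  # é -> e + combining acute
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


query = fold("fiance")
document = fold("Fianc\u00e9")
print(query == document)
```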
| 214 | + |
| 215 | +*** Sorting & Collation |
| 216 | + Another hard problem. There are standards to deal with this. |
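
A quick Python illustration of why the default sort is not enough: it compares code points, so capitals and accented letters land in surprising places. The accent-stripping key below is only a rough approximation assumed for this example. Real collation follows locale-specific rules (the Unicode Collation Algorithm is the relevant standard), and ~rough_key~ is a made-up helper name.

```python
import unicodedata


def rough_key(word):
    """Crude sort key: drop accents and case. Not real collation."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


words = ["zebra", "\u00e9clair", "apple", "Banana"]  # includes "éclair"

by_code_point = sorted(words)            # default: compares code points
by_rough_key = sorted(words, key=rough_key)

print(by_code_point)
print(by_rough_key)
```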
| 217 | + |
| 218 | +* Some +weird+ interesting topics |
| 219 | + |
| 220 | +** Ligatures |
| 221 | + |
| 222 | + A ligature combines several characters into a single displayed glyph. |
| 223 | + This is required for scripts like Devanagari (used for Hindi), but even |
| 224 | + English has ligatures (for typographic & stylistic purposes). |
| 225 | + |
| 226 | + Example: |
| 227 | + |
| 228 | + | Letters | स ् क ू ल | |
| 229 | + | Letters without space | स्कूल | |
| 230 | + |
| 231 | + What would this show? |
| 232 | + |
| 233 | + #+BEGIN_SRC js |
| 234 | + return "स्कूल".length; |
| 235 | + #+END_SRC |
| 236 | + |
| 237 | + (Whether it's correct or not, I leave it to you to discuss!) |
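
To see what is being counted, here is a Python sketch that lists the code points in the word. The word is written with escapes so that the exact sequence is unambiguous. All five code points are below 0xFFFF, so counting UTF-16 code units gives the same number.

```python
import unicodedata

word = "\u0938\u094d\u0915\u0942\u0932"  # स्कूल ("school")

names = [unicodedata.name(ch) for ch in word]
for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X}", name)

code_points = len(word)
utf16_units = len(word.encode("utf-16-le")) // 2
print(code_points, utf16_units)
```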
| 238 | + |
| 239 | +** Flags |
| 240 | + |
| 241 | + Unicode has flags. |
| 242 | + |
| 243 | + 🇮🇳 |
| 244 | + |
| 245 | + This flag is created from two "regional indicator" characters: 🇮 🇳 (India's country code). |
| 246 | + When taken together, this becomes the flag. |
| 247 | + |
| 248 | + This is an interesting trade-off: there is no single character for the |
| 249 | + Indian flag, but fonts define ligatures for IN (in that unicode sequence) |
| 250 | + to map it to the Indian flag. |
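
The regional indicator letters sit in one contiguous block starting at U+1F1E6 (for "A"), so a flag sequence can be built from a two-letter country code with simple arithmetic. A sketch in Python; ~flag~ is a made-up helper name, and it assumes the input is two ASCII letters.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code):
    """Turn a two-letter ASCII country code into its regional indicator pair."""
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )


india = flag("IN")
print(india)
print([f"U+{ord(ch):04X}" for ch in india])
print(len(india), len(india.encode("utf-8")))  # code points, UTF-8 bytes
```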
| 251 | + |
| 252 | +** Box Drawing |
| 253 | + |
| 254 | + There are a bunch of [[https://en.wikipedia.org/wiki/Box_Drawing][characters]] that are designed for drawing boxes. |
| 255 | + |
| 256 | + Here's an example drawing (from [[https://en.wikipedia.org/wiki/Box-drawing_character][wikipedia]]) |
| 257 | + |
| 258 | + #+BEGIN_EXAMPLE |
| 259 | +┌─┬┐ ╔═╦╗ ╓─╥╖ ╒═╤╕ |
| 260 | +│ ││ ║ ║║ ║ ║║ │ ││ |
| 261 | +├─┼┤ ╠═╬╣ ╟─╫╢ ╞═╪╡ |
| 262 | +└─┴┘ ╚═╩╝ ╙─╨╜ ╘═╧╛ |
| 263 | +┌───────────────────┐ |
| 264 | +│ ╔═══╗ Some Text │▒ |
| 265 | +│ ╚═╦═╝ in the box │▒ |
| 266 | +╞═╤══╩══╤═══════════╡▒ |
| 267 | +│ ├──┬──┤ │▒ |
| 268 | +│ └──┴──┘ │▒ |
| 269 | +└───────────────────┘▒ |
| 270 | + ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ |
| 271 | + #+END_EXAMPLE |
| 272 | + |
| 273 | +* References |
| 274 | + |
| 275 | +Unicode is crazy, but it works. That it works at all is a miracle. |
| 276 | + |
| 277 | +This talk was just a very brief overview. If you're curious, there |
| 278 | +are tons of resources on the internet. |
| 279 | + |
| 280 | +- https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ |
| 281 | + |
| 282 | + Classic blog post |
| 283 | + |
| 284 | +- http://www.copypastecharacter.com/ |
| 285 | + |
| 286 | + Fun website where you can see random unicode characters |
| 287 | + |
| 288 | +- http://chardet.readthedocs.io/en/latest/faq.html |
| 289 | + |
| 290 | + chardet is a python library that can 'guess' the encoding of an input stream of bytes. |
| 291 | + |
| 292 | + There are ports for other languages. Use only when dealing with unstructured data or third party sources! |
| 293 | + |
| 294 | +- https://speakerdeck.com/mathiasbynens/hacking-with-unicode-in-2016 |
| 295 | + |
| 296 | + Very interesting presentation on unicode related security implications |
| 297 | + |
| 298 | +- http://blogs.technet.com/b/mmpc/archive/2011/08/10/can-we-believe-our-eyes.aspx |
| 299 | + |
| 300 | + More security stuff with unicode |
| 301 | + |
| 302 | +- https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/ |
| 303 | + |
| 304 | + Everything you know about text is wrong. |
| 305 | + |
| 306 | +- https://xkcd.com/1726/ |
| 307 | + |