Commit 0f1f29e

Initial commit. Adding presentation
0 parents  commit 0f1f29e

4 files changed, +325 -0 lines changed

bad-encoding.html

+9
@@ -0,0 +1,9 @@
<html>
<head>
<meta charset="latin1">
</head>

<body>
<h2>Hello from “unicode” land⁈</h2>
</body>
</html>

bad-encoding.png

26 KB

good-encoding.html

+9
@@ -0,0 +1,9 @@
<html>
<head>
<meta charset="utf-8">
</head>

<body>
<h2>Hello from “unicode” land⁈</h2>
</body>
</html>

presentation.org

+307
@@ -0,0 +1,307 @@
Unicode for ASCII folks

* Why do you need to learn about unicode

In today's world, everything is in unicode.

You might not think you need unicode, but you do.

- What happens when someone tries to add an emoji in your comment box?
- What happens if you have users who have spécial characters in their names?
- What about all the people who don't speak English?

* What the hell is unicode
** History

- Wild west
- ASCII
- 256 characters are enough for everyone, right?
- Code pages. Extended code pages
- Unicode

** Unicode

Unicode just defines these things:

- A code point: think of it as an array index
- A character name
- A reference glyph (how it /should/ look)

Example:

| Code Point | Character Name         | Example | Hex Code |
|------------+------------------------+---------+----------|
|         65 | LATIN CAPITAL LETTER A | A       | 0x41     |
|        181 | MICRO SIGN             | µ       | 0xb5     |
|       8377 | INDIAN RUPEE SIGN      | ₹       | 0x20b9   |
|     128542 | DISAPPOINTED FACE      | 😞      | 0x1f61e  |

That's basically it.
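
The rows of the table can be looked up programmatically. This is a small sketch, assuming Python 3 and its bundled =unicodedata= module; =describe= is a name made up for this example.

#+BEGIN_SRC python
import unicodedata

def describe(ch):
    """Return (code point, character name, hex code) for a single character."""
    code_point = ord(ch)  # the "array index" of the character
    return code_point, unicodedata.name(ch), hex(code_point)

for ch in "Aµ₹😞":
    print(describe(ch))
#+END_SRC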

* Unicode Encodings

This is the fun part. An encoding is basically a way to represent unicode
characters as bytes, in memory or on disk.

So, think of encoding as a function that takes a series of characters
(i.e., a String), and returns a byte array. Decoding does the reverse.

#+BEGIN_SRC java

public interface Encoding {
    // Characters in, bytes out.
    byte[] encode(String inputString);
    // Bytes in, characters out.
    String decode(byte[] inputBytes);
}

#+END_SRC

UTF-8 is an encoding. UTF-16 is another encoding. There are many types of encodings.

Please note that the byte array itself may not have any indication of which encoding is being used. We will come back to this point later.
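
For comparison, Python 3 ships this pair of functions as =str.encode= and =bytes.decode=. The sketch below (variable names are only for illustration) shows that the same character becomes different bytes under different encodings, and that you get it back only if you decode with the encoding that was used.

#+BEGIN_SRC python
text = "µ"

utf8_bytes = text.encode("utf-8")        # two bytes: 0xC2 0xB5
utf16_bytes = text.encode("utf-16-be")   # two bytes: 0x00 0xB5

# Decoding with the matching encoding gives the original string back.
from_utf8 = utf8_bytes.decode("utf-8")
from_utf16 = utf16_bytes.decode("utf-16-be")

print(utf8_bytes.hex(), utf16_bytes.hex(), from_utf8, from_utf16)
#+END_SRC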


** UTF-8

UTF-8 is a great hack. I love UTF-8.

- UTF-8 is a variable length encoding

  What this means is that each unicode character may take 1 byte to represent, or it may take 2, 3 or even 4 bytes.

  This means that the length of the byte array is not necessarily the length of the string!

- UTF-8 is compatible with ASCII.

  So, all ASCII characters are represented in 1 byte, with the same byte value.

  This makes it backwards compatible with a lot of systems, and also saves memory / disk space if your text contains English predominantly.
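
Both points can be checked with the four characters from the earlier table. A minimal sketch, assuming Python 3; =utf8_len= is a helper invented for this example.

#+BEGIN_SRC python
def utf8_len(ch):
    """Number of bytes the given text occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

for ch in "Aµ₹😞":
    print(ch, utf8_len(ch), ch.encode("utf-8").hex())

# ASCII compatibility: the UTF-8 bytes of ASCII text are the ASCII bytes.
ascii_text = "Hello"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
#+END_SRC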

** UTF-16 (and UCS-2)

When they introduced unicode, they thought 2 bytes (65,536 characters)
would be more than enough. This was wrong. There are more than 65,536
characters in unicode right now.

The initial unicode encoding, UCS-2, only supports characters from
0x0000 to 0xFFFF. It does not support full unicode.

UTF-16 was developed as a replacement for UCS-2.

UTF-16 is variable length. By default, each character is 2 bytes. But
characters whose code point is larger than 0xFFFF can still be
represented using four bytes (two 2-byte units, called a surrogate pair).

Java uses UTF-16 internally (in-memory) to represent all strings.
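
The four-byte case follows a fixed recipe: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. Here is a sketch of that arithmetic in Python 3; =to_surrogate_pair= is a made-up name, not a library function.

#+BEGIN_SRC python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into its two 16-bit UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(ord("😞"))
print(hex(high), hex(low))
#+END_SRC

The result should match what the built-in codec produces for ="😞".encode("utf-16-be")=.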

* Fonts

Fonts are used to render characters to screen (or print).

You might have done the right thing in your files (used the right
encoding and right characters). There is no guarantee that it will be
displayed properly though: the font defines how a character will be
rendered, and not all fonts support all characters.

For example, this used to be a problem with the Indian Rupee Sign ₹.

When it was introduced, most fonts did not support it, so people used
images and other hacks to get around it. The situation is pretty good now,
but users who are still on very old operating systems (XP :P) may not
be able to see this symbol.

* Some implications

** Wrong encoding being picked up

This is the absolute most common issue you'll see. Open a page, and you see random characters.

[[./bad-encoding.png]]

This is because there was no encoding defined (or the wrong encoding was defined).

Example files: [[file:bad-encoding.html][bad-encoding.html]], [[file:good-encoding.html][good-encoding.html]]

To prevent:

- Be very clear when you read data from the user. Whether it's a form submit, a REST API, or reading files, you need to know what the input encoding is.
- Always, always use utf-8. Normalize user input to utf-8 and store it. Respond to the user with utf-8 always.
- Servers should define the response content type (including the charset) in the HTTP response (most frameworks will do this automatically)
- HTML pages should declare their encoding in the HTML head, e.g. =<meta charset="utf-8">=. Don't rely on a default: without a declaration, the browser may guess a legacy encoding.
- Be careful about mixing encodings! *Just use utf-8*.
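
The garbled page can be imitated in a few lines. This is a sketch, assuming Python 3; Python's =latin-1= codec stands in for whatever legacy encoding the browser actually picks, so the exact junk characters may differ from the screenshot.

#+BEGIN_SRC python
original = "Hello from “unicode” land⁈"

raw = original.encode("utf-8")     # what is actually stored in the file
garbled = raw.decode("latin-1")    # what a reader sees if told "latin1"
print(repr(garbled))

# If nothing was lost on the way, undoing the wrong step repairs the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
#+END_SRC

The repair only works because every byte survived the wrong decode. Once garbled text has been edited or re-saved in yet another encoding, it is usually not recoverable, which is one more reason to stay on utf-8 everywhere.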

** Language specific quirks

This is a fun little exercise. What should happen if I run this piece of JS code?

#+BEGIN_SRC js

console.log("Length of µ is ", 'µ'.length);
console.log('Length of 😞 is ', '😞'.length);

#+END_SRC

How about Python?

#+BEGIN_SRC python
# coding=utf-8
print("Len of µ is %d" % (len("µ")))
print("Len of 😞 is %d" % (len("😞")))
#+END_SRC

#+RESULTS:

How about F#?

#+BEGIN_SRC fsharp
printfn "Len of µ is %d" "µ".Length
printfn "Len of 😞 is %d" "😞".Length
#+END_SRC

Something as simple as getting the length of a string will give you different results in different languages.

There are tons of different quirks like this. You need to understand how the language you work with deals with unicode.

It's a complex enough topic that Python made backwards incompatible changes (Python 3) in order to properly support unicode!
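
The different answers come from counting different things. Python 3 counts code points, while JavaScript and .NET strings count UTF-16 code units. The sketch below computes the candidates side by side; =lengths= is a helper made up for this example.

#+BEGIN_SRC python
def lengths(s):
    """Three plausible answers to 'how long is this string?'."""
    return {
        "code_points": len(s),                          # Python 3 len()
        "utf16_units": len(s.encode("utf-16-le")) // 2, # JS / .NET length
        "utf8_bytes": len(s.encode("utf-8")),           # size on disk in UTF-8
    }

for s in ("µ", "😞"):
    print(s, lengths(s))
#+END_SRC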

** Language and platform specific quirks

One example here is the UTF-8 [[https://en.wikipedia.org/wiki/Byte_order_mark][Byte Order Mark]].

It's added to the start of files by some programs (and some operating systems) to indicate that the file contains unicode data, in either UTF-8 or UTF-16 format.

It's the character U+FEFF.

- In UTF-8, it's represented by 3 bytes: 0xEF, 0xBB, 0xBF
- In UTF-16, it's represented on-disk as 0xFE 0xFF (big-endian) or 0xFF 0xFE (little-endian)

Unfortunately, it is often missing, or present but with wrong values. Some tools do not understand it, some tools add it without asking, etc.

It's not recommended to use the BOM, but you may need to deal with it.
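
As one way of dealing with it, Python 3 has a =utf-8-sig= codec that drops a leading BOM when decoding, and the =codecs= module exposes the BOM byte sequences as constants. A short sketch:

#+BEGIN_SRC python
import codecs

# A file that some tool saved as "UTF-8 with BOM".
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

with_bom = data.decode("utf-8")          # BOM survives as U+FEFF
without_bom = data.decode("utf-8-sig")   # BOM is stripped if present

print(repr(with_bom), repr(without_bom))
#+END_SRC

=utf-8-sig= also accepts input that has no BOM, so it is a reasonable choice when reading files of unknown origin.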

** Regexes

Your regex may not always work. Try to use [[https://msdn.microsoft.com/en-us/library/20bw873z(v%3Dvs.110).aspx][pre-defined character classes]]
instead of enumerating characters yourself. But be aware that different
languages have different ways to deal with this, and unicode support in
regex is not great for languages like JavaScript.
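
Here is the difference in Python 3, where =\w= is unicode-aware by default for =str= patterns. This is only an illustration of the advice above; other regex engines behave differently.

#+BEGIN_SRC python
import re

text = "na\u00efve caf\u00e9"  # "naïve café"

# Enumerating characters by hand silently drops the accented letters.
enumerated = re.findall(r"[a-zA-Z]+", text)

# The pre-defined word class matches them.
predefined = re.findall(r"\w+", text)

print(enumerated, predefined)
#+END_SRC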

** Security

Example: You might think that you're clicking on a link to
wikipedia.org, but you're actually clicking on a link to wikipediа.org
(the last "а" is a Cyrillic letter that looks like the Latin "a").

Thankfully, browsers have defenses against this type of attack.

Any place where there is a manual judgement involved is vulnerable to
this type of attack though.
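
A program can spot what the eye cannot. The sketch below lists the letters in a hostname whose unicode name does not start with LATIN. It is a deliberately crude check, assuming Python 3, and =non_latin_letters= is a made-up helper, not how browsers actually do it.

#+BEGIN_SRC python
import unicodedata

def non_latin_letters(hostname):
    """Return (character, name) for every letter that is not a Latin letter."""
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"))
        for ch in hostname
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
    ]

real = "wikipedia.org"
spoofed = "wikipedi\u0430.org"  # last letter is U+0430

print(non_latin_letters(real))
print(non_latin_letters(spoofed))
#+END_SRC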

** Practical example
*** Normalization

While parsing form-16, we need to normalize all unicode spaces, hyphens, etc. to make it easy to parse.

We chose the option of converting unicode characters to equivalent ASCII and then parsing, instead of making the entire parser aware of unicode.
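
A conversion step of this kind could look roughly like the following. This is a hypothetical sketch in Python 3, not the parser's actual code: the table lists only a handful of characters, and the names =ASCII_EQUIVALENTS= and =to_ascii_punctuation= are invented here.

#+BEGIN_SRC python
# Hypothetical, deliberately incomplete mapping of look-alikes to ASCII.
ASCII_EQUIVALENTS = {
    "\u00a0": " ",  # NO-BREAK SPACE
    "\u2003": " ",  # EM SPACE
    "\u2009": " ",  # THIN SPACE
    "\u2010": "-",  # HYPHEN
    "\u2011": "-",  # NON-BREAKING HYPHEN
    "\u2013": "-",  # EN DASH
    "\u2212": "-",  # MINUS SIGN
}
TRANSLATION_TABLE = str.maketrans(ASCII_EQUIVALENTS)

def to_ascii_punctuation(text):
    """Replace known unicode spaces and hyphens with ASCII ones."""
    return text.translate(TRANSLATION_TABLE)

print(to_ascii_punctuation("2016\u20132017\u00a0total"))
#+END_SRC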

*** Full text search

If you search for 'fiance', you should also get results if the text contains 'fiancé'.

This is a hard problem.

Full-text search databases have to deal with unicode and they have to
normalize text in order to give you good search results. Naïve
implementations will fail.
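
One common building block is accent folding: decompose each character, drop the combining marks, and compare what is left. The sketch below uses Python 3's =unicodedata= module; =fold= is a made-up name, and real search engines do considerably more than this.

#+BEGIN_SRC python
import unicodedata

def fold(text):
    """Lowercase the text and strip accents, for comparison purposes only."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()

print(fold("fianc\u00e9") == fold("fiance"))
#+END_SRC

Folding is lossy and language dependent, so it is normally applied to a search key and never to the stored text.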

*** Sorting & Collation
Another hard problem. There are standards to deal with this.

* Some +weird+ interesting topics

** Ligatures

Ligatures combine several characters into a single displayed glyph.
Obviously, this is required for languages like Hindi. But there are
places where even English has ligatures (for typographic & stylistic purposes).

Example:

| Letters               | स ् क ू ल |
| Letters without space | स्कूल     |

What would this show?

#+BEGIN_SRC js
return "स्कूल".length;
#+END_SRC

(Whether it's correct or not, I leave it to you to discuss!)
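
One way to investigate is to list the code points that make up the word. A sketch in Python 3, with the word written as escapes so that the individual characters stay visible in the source:

#+BEGIN_SRC python
import unicodedata

word = "\u0938\u094d\u0915\u0942\u0932"  # the word from the table above

names = [unicodedata.name(ch) for ch in word]
for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X}", name)

print(len(word))
#+END_SRC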

** Flags

Unicode has flags.

🇮🇳

This flag is created from two characters: 🇮 🇳 (the regional indicator
symbols for I and N, India's country code). When taken together, this
becomes the flag.

This is an interesting trade-off: there is no character for the Indian
flag, but fonts define ligatures for IN (in that unicode sequence) to
map it to the Indian flag.
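
The mapping from country code to flag sequence is plain arithmetic on code points: the regional indicator symbols sit in one contiguous block starting at U+1F1E6 for A. A sketch in Python 3, where =flag= is a made-up helper that assumes a two-letter ASCII country code:

#+BEGIN_SRC python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A

def flag(country_code):
    """Map a two-letter ASCII country code to its regional indicator pair."""
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )

india = flag("IN")
print(india, len(india))
#+END_SRC

Whether the pair is drawn as a flag or as two boxed letters is up to the font, as described above.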

** Box Drawing

There are a bunch of [[https://en.wikipedia.org/wiki/Box_Drawing][characters]] that are designed for drawing boxes.

Here's an example drawing (from [[https://en.wikipedia.org/wiki/Box-drawing_character][wikipedia]])

#+BEGIN_EXAMPLE
┌─┬┐  ╔═╦╗  ╓─╥╖  ╒═╤╕
│ ││  ║ ║║  ║ ║║  │ ││
├─┼┤  ╠═╬╣  ╟─╫╢  ╞═╪╡
└─┴┘  ╚═╩╝  ╙─╨╜  ╘═╧╛
┌───────────────────┐
│  ╔═══╗ Some Text  │▒
│  ╚═╦═╝ in the box │▒
╞═╤══╩══╤═══════════╡▒
│ ├──┬──┤           │▒
│ └──┴──┘           │▒
└───────────────────┘▒
 ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
#+END_EXAMPLE

* References

Unicode is crazy, but it works. That it works at all is a miracle.

This talk was just a very very brief overview. If you're curious, there
are tons of resources on the internet.

- https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

  Classic blog post

- http://www.copypastecharacter.com/

  Fun website where you can see random unicode characters

- http://chardet.readthedocs.io/en/latest/faq.html

  chardet is a Python library that can 'guess' the encoding of an input stream of bytes.

  There are ports for other languages. Use only when dealing with unstructured data or third party sources!

- https://speakerdeck.com/mathiasbynens/hacking-with-unicode-in-2016

  Very interesting presentation on unicode related security implications

- http://blogs.technet.com/b/mmpc/archive/2011/08/10/can-we-believe-our-eyes.aspx

  More security stuff with unicode

- https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/

  Everything you know about text is wrong.

- https://xkcd.com/1726/
