Zooko O'Whielacronx
November 2002
human-oriented base-32 encoding
INTRO
The base-32 encoding implemented in this library differs from that described in
draft-josefsson-base-encoding-04.txt [1] in several ways. This document
describes why we made each different choice, and also includes a section at the
end on "COMPATIBILITY AND INTEROPERATION".
This encoding is implemented in a project named libbase32 [2].
This is version 0.9.4.4 of this document. The latest version should always be
available at:
http://cvs.sf.net/cgi-bin/viewcvs.cgi/libbase32/libbase32/DESIGN?rev=HEAD
RATIONALE
The rationale for base-32 encoding in [1] is as written therein: "The Base 32
encoding is designed to represent arbitrary sequences of octets in a form that
needs to be case insensitive but need not be humanly readable.".
The rationale for libbase32 is different -- it is to represent arbitrary
sequences of octets in a form that is as convenient as possible for human users
to manipulate. In particular, libbase32 was created in order to serve the Mnet
project [3], where 30-octet cryptographic values are encoded into URIs for
humans to manipulate. Anticipated uses of these URIs include cut-and-paste,
text editing (e.g. in HTML files), manual transcription via a keyboard, manual
transcription via pen-and-paper, vocal transcription over phone or radio, etc.
The desiderata for such an encoding are:
* minimizing transcription errors -- e.g. the well-known problem of confusing
`0' with `O'
* encoding into other structures -- e.g. search engines, structured or marked-
up text, file systems, command shells
* brevity -- Shorter URLs are better than longer ones.
* ergonomics -- Human users (especially non-technical ones) should find the
URIs as easy and pleasant as possible. The uglier the URI looks, the worse.
DESIGN
Base
The first decision we made was to use base-32 instead of base-64. An earlier
version of this project used base-64, but a discussion on the p2p-hackers
mailing list [4] convinced us that the added length of base-32 encoding was
worth the added clarity provided by: case-insensitivity, the absence of non-
alphanumeric characters, and the ability to omit a few of the most troublesome
alphanumeric characters.
In particular, we realized that it would probably be faster and more comfortable
to vocally transcribe a base-32 encoded 30-octet string (48 characters, case-
insensitive, no non-alphanumeric characters) than a base-64 encoded one
(40 characters, case-sensitive, plus two non-alphanumeric characters).
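For concreteness, here is the arithmetic behind those two figures, as a small
Python sketch (illustrative only, not part of libbase32):

    import math

    # Back-of-the-envelope arithmetic for the 30-octet example above.
    bits = 30 * 8                       # 240 bits of input
    base32_chars = math.ceil(bits / 5)  # 5 bits per base-32 character
    base64_chars = math.ceil(bits / 6)  # 6 bits per base-64 character

    print(base32_chars, base64_chars)   # prints: 48 40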
Alphabet
There are 26 alphabet characters and 10 digits, for a total of 36 characters
available. We need only 32 characters for our base-32 alphabet, so we can
choose four characters to exclude. This is where we part company with
traditional base-32 encodings. For example [1] eliminates `0', `1', `8', and
`9'. This choice eliminates two characters that are relatively unambiguous
(`8' and `9') while retaining others that are potentially confusing. Others
have suggested eliminating `0', `1', `O', and `L', which is likewise suboptimal.
Our choice of confusing characters to eliminate is: `0', `l', `v', and `2'. Our
reasoning is that `0' is potentially mistaken for `o', that `l' is potentially
mistaken for `1' or `i', that `v' is potentially mistaken for `u' or `r'
(especially in handwriting) and that `2' is potentially mistaken for `z'
(especially in handwriting).
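As a minimal illustration (a Python sketch, not part of libbase32), removing
those four characters from the 36 lowercase alphanumerics leaves exactly the
32 characters needed:

    import string

    # Start from the 36 lowercase alphanumerics and drop the four characters
    # judged most confusable: 0, l, v, and 2.
    candidates = set(string.ascii_lowercase + string.digits)
    chosen = candidates - set("0lv2")

    assert len(chosen) == 32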
Note that we choose to focus on typed and written transcription more than on
vocal, since humans already have a well-established system of disambiguating
spoken alphanumerics, such as the United States military's "Alpha Bravo Charlie
Delta" and telephone operators' "Is that 'd' as in 'dog'?".
Order of Alphabet
Some of the alphabet characters will appear more frequently than others in the
final position of the encoded strings, assuming an even distribution of binary
inputs. (This is true whether the lengths of your inputs are evenly distributed
over all integer numbers of bits *or* evenly distributed over all integer
numbers of octets.) Here is a table showing which length-in-bits (modulo 5)
results in which possible trailing characters in the
"abcdefghijklmnopqrstuvwxyz234567" encoding:
1: aq
2: aqiy
3: aqiyemu4
4: aqiyemu4cgkosw26
0: abcdefghijklmnopqrstuvwxyz234567
Here is the same table for our alphabet:
1: yo
2: yoea
3: yoearcwh
4: yoearcwhngkq1s46
0: ybndrfg8ejkmcpqxot1uwisza345h769 (the whole alphabet)
We have permuted the alphabet so that the more commonly occurring characters are
also those that we think are easier to read, write, speak, and remember.
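The tables above can be reproduced with a short sketch. This is a Python
illustration, not part of libbase32, and it assumes (as the tables imply) that
bits are packed most-significant-bit first, so the final character's index is
the r = length-in-bits mod 5 leftover bits followed by 5-r zero bits, and only
2^r of the 32 characters can appear in that position:

    def trailing_chars(alphabet, r):
        # Characters that can appear in the final position when
        # r = length-in-bits mod 5 (0 < r < 5).
        return {alphabet[k << (5 - r)] for k in range(2 ** r)}

    zbase32 = "ybndrfg8ejkmcpqxot1uwisza345h769"
    for r in (1, 2, 3, 4):
        print(r, "".join(sorted(trailing_chars(zbase32, r))))
    # Prints the same character sets as the table above, in sorted order.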
Length Encoding and Sub-Octet Data
Suppose you have 10 bits of data to transmit, and the recipient (the decoder) is
expecting 10 bits of data. All previous base-32 encoding schemes assume that
the binary data to be encoded is in 8-bit octets, so you would have to pad the
data out to 2 octets and encode it in base-32, resulting in a string
4 characters long. The decoder will decode that into 2 octets (16 bits),
ignoring the least significant 4 bits of the encoded string, and then ignore the
least significant 6 bits of the decoded data.
In the base-32 encoding described here, if the encoder and decoder both know the
exact length of the data in bits (modulo 40), then they can use this shared
information to optimize the size of the transmitted (encoded) string. In the
example above, where you have 10 bits of data to transmit, libbase32 allows you
transmit the optimal encoded string: two characters.
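Here is a rough sketch of the idea in Python. It is illustrative only -- the
function name and interface are made up here, not the libbase32 API -- and it
assumes the data bits occupy the most significant bits of the padded octets:

    ZBASE32 = "ybndrfg8ejkmcpqxot1uwisza345h769"

    def encode_bits(data, length_in_bits):
        # Number of characters needed: ceiling of length_in_bits / 5.
        nchars = -(-length_in_bits // 5)
        # Treat the octets as one big-endian bit string, keep only the most
        # significant length_in_bits bits, then left-align them on a 5-bit
        # character boundary.
        value = int.from_bytes(data, "big")
        value >>= len(data) * 8 - length_in_bits
        value <<= nchars * 5 - length_in_bits
        return "".join(
            ZBASE32[(value >> (5 * (nchars - 1 - i))) & 31]
            for i in range(nchars)
        )

    print(encode_bits(b"\xff\xc0", 10))  # this sketch prints "99" (2 chars)
    print(encode_bits(b"\xff\xc0", 16))  # this sketch prints "99yy" (4 chars)

Encoding the same two octets with a stated length of 16 bits gives the usual
four characters, which is the ordinary "input is in 8-bit octets" behavior
described next.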
You can always use this encoding the same way you would use the other
encodings -- with an "input is in 8-bit octets" assumption. This would be
appropriate if the length in bits is always a multiple of 8, if the two sides
cannot both be sure of the length in bits modulo 40, or if this encoding is
being used in a setting where saving one or two characters in the encoded
string isn't worth the potential confusion.
Padding
Traditionally base-32 encodings have specified trailing padding to round out the
number of characters to an even multiple of 8. This is apparently intended as
an error detection code, but we do not consider the error detection capabilities
of this code to be worth the increased length of the encoded strings, so we do
not do this.
A NOTE ON COMPATIBILITY AND INTEROPERATION
If your application could possibly interoperate with another application, then
you should consider the risk of precluding such interoperation by encoding
semantically identical objects into syntactically different representations.
For example, many current systems include the SHA-1 hash of the contents of a
file, and this hash value can be represented for user or programmatic sharing in
base-32 encoded form [5, 6, 7, 8]. These four systems all use traditional
base-32 encoding as described in [1]. If your system will expose the SHA-1 hash
of the contents of a file, then you should consider the benefits of having such
hash values be exchangeable with those systems by using the same encoding,
including base, alphabet, permutation of the alphabet, length encoding, padding,
and treatment of illegal characters and line-breaks.
If, however, the semantic meaning of the objects that you are exposing is not
something that another system can use, due to semantic differences, then you
gain nothing with regard to interoperation by using the same ASCII encoding. In
fact, by doing so you may incur *worse* interoperation problems, by making it
impossible for applications to use syntactic features (namely, recognizing the
encoding scheme) to disambiguate between semantic features.
Lucas Gonze has suggested [9] that different schemes could in fact
*deliberately* add characters which would be illegal in another scheme in order
to enable syntactic differentiation. (This would be morally similar to the
"check digit" included in most credit card numbers.)
The author has also suggested [10] encoding schematic compatibility in the
lengths. For example, mnetIds will probably be 48 characters in base-32 encoded
form (encoding 30 octets of data). If it turns out that other strings of that
length and form occur in the wild, then the mnetIds could be redefined to be 47
or 49 characters in order to make them recognizable.
Clearly the best semantic differentiation is an unambiguous one that is
transmitted out-of-band (outside of the ASCII encoding, that is), such as URI
scheme names (e.g.: SHA1:blahblahblah or mnet://blahblahblah). However, users
might not always preserve those.
REFERENCES
[1] http://www.ietf.org/internet-drafts/draft-josefsson-base-encoding-04.txt
[2] http://sf.net/projects/libbase32
[3] http://mnet.sf.net/
[4] http://zgp.org/pipermail/p2p-hackers/2001-October/
[5] Gnutella [need URL for SHA1 and base-32 encoding stuff]
[6] Bitzi [need URL for specification stuff]
[7] CAW [need URL]
[8] http://open-content.net/specs/draft-jchapweske-thex-01.html
[9] http://zgp.org/pipermail/p2p-hackers/2002-November/000924.html
[10] http://zgp.org/pipermail/p2p-hackers/2002-November/000927.html
NEEDED TO ADD
* new design element: check characters!? (e.g. Luhn-like algorithm)
* Full spec, including the issues named by draft-josefsson-base-encoding-04.txt
such as treatment of illegal chars, etc.
+ also issues of length-encoding, scheme-encoding (e.g. URIs), etc.
* Explanation of why we avoid non-alphanumerics.
* Mention of the myriad other clarity issues such as those Gojomo posted?