URI
STANDARDS ==> #RFCs:
# - syntax: 1630, 3986
# - IRI: 3987
# - URI vs URN vs URL: 3305
# - usage: 1736, 1737, 7320
# - well known URI: 5785, 8307
#WHATWG URL spec:
# - focused on browser
# - sometimes contradicts RFCs:
## - lines prefixed with ## mean "in WHATWG spec only"
## - main differences:
## - "URL" means "IRI" in visual representation, "URI" in logical representation
## - HASH can contain any char unescaped
## - SCHEME:PATH (except for HTTP|Websocket|FTP) can contain any [:graph:] unescaped
#W3C:
# - clean URI: "On Linking Alternative Representations To Enable Discovery And Publishing", "Usage Patterns For Client-Side URI parameters"
# - opaque URI: "The use of Metadata in URIs"
# - HASH: "Identifying Application State", "Best Practices for Fragment Identifiers and Media Type Definitions"
# - URI describing confidential info: "Good Practices for Capability URLs"
# - CURIE: "CURIE Syntax 1.0"
SUMMARY ==> #URI (Uniform Resource Identifier):
# - syntax:
# - SCHEME, AUTHORITY, ORIGIN, USERINFO, HOSTNAME, PORT, PATH, QUERY, HASH
# - percent-encoding and Punycode
# - case-insensitive: SCHEME, HOSTNAME. Rest depends
# - max length 255
# - relative references (to base URI): '', 'URI', '/URI', '//URI'
# - should try not to restrict syntax
# - IRIs: like URIs but only percent-encode ASCII chars "to escape"
## - WHATWG use "URL" meaning URI for logical representation, IRI for visual
# - can describe: identity, location, content, metadata
# - describe a resource, which can take many representations (using MIME types)
# - transience should be avoided
# - URI scheme describes interaction
# - cardinality:
# - RESOURCE == RESOURCE2 depends on SCHEME|AUTHORITY
# - URI == URI2 is helped by normalization
# - sameness (URI always returns same resource, anywhere in global namespace)
# - no ambiguity (URI returns only one resource)
# - uniqueness:
# - only 1 URI for 1 resource
# - optional: can use canonical URI, short URI
# - clean URI: accessible, separation of concerns by not inferring implementation nor format
# - data URIs
# - well-known URIs: at /.well-known/
/=+===============================+=\
/ : : \
)==: SYNTAX :==(
\ :_______________________________: /
\=+===============================+=/
PARTS ==> #[SCHEME:][//[USERINFO@]HOSTNAME[:[PORT]]][PATH][?[QUERY]][#[HASH]]:
# - SCHEME
# - AUTHORITY: //[USERINFO@]HOSTNAME[:[PORT]]
# - ORIGIN:
# - SCHEME://HOSTNAME[:[PORT]]
# - can be null, e.g. initial HTTP request
# - ORIGIN null !== ORIGIN null
## - null unless http[s]:, ws[s]:, ftp:, gopher:, file: or blob:
# - USERINFO:
## - USERNAME:PASSWORD
# - should avoid plaintext USERNAME|PASSWORD
# - HOST: HOSTNAME[:PORT]
# - HOSTNAME:
# - can be other things too, e.g. IP address (for IP protocols)
# - max length 255 bytes
#. - with DNS, divided between SUBDOMAIN and DOMAIN (which includes TOP-LEVEL-DOMAIN)
# - PORT:
# - socket number
## - 0-65535
# - PATH:
# - / implies hierarchy
# - when AUTHORITY and PATH both present, must be separated by /
# - can use . or ..
# - must be resolved as early as possible
#. - last part is FILENAME, which can include SUFFIX|EXTENSION
#. - other part is DIRECTORY
# - QUERY:
# - can be any syntax
# - but usually ?VAR=VAL&...
# - sometimes also ;VAR=VAL;...
# - nesting:
# - supported by tools like Express REQ.query
# - but using [ ] in QUERY is not standard
# - HASH|FRAGMENT:
# - can be:
# - part of RESOURCE
# - RESOURCE2 referenced by RESOURCE
# - RESOURCE variant
# - should not be used for state computation: use PATH|QUERY instead
# - semantics depends on MIME type
# - when there is a parent MIME type, should not interfere with it
# - should prefer HASH based on resource semantics rather than syntax
# - processed by client not server
#. - RESOURCE: [PATH][?[QUERY]][#[HASH]]
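#Sketch (Python standard library, not taken from the RFCs): splitting a URI into the parts listed above.
#The example URI is hypothetical. Note urlsplit() follows the RFC grammar loosely, so edge cases may differ from WHATWG parsers.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical URI, chosen only because it contains every part.
uri = "https://user:secret@www.example.com:8443/docs/guide.html?lang=en&page=2#intro"
parts = urlsplit(uri)

scheme = parts.scheme          # SCHEME
authority = parts.netloc       # AUTHORITY: [USERINFO@]HOSTNAME[:PORT]
username = parts.username      # USERINFO, first half
password = parts.password      # USERINFO, second half
hostname = parts.hostname      # HOSTNAME (lowercased by urlsplit)
port = parts.port              # PORT as an int, None when absent
path = parts.path              # PATH
query = parts.query            # QUERY, without the leading ?
fragment = parts.fragment      # HASH, without the leading #

# ORIGIN: SCHEME://HOSTNAME[:PORT], i.e. no USERINFO, PATH, QUERY nor HASH.
origin = f"{scheme}://{hostname}:{port}"

# PATH splits into DIRECTORY and FILENAME at the last /.
directory, _, filename = path.rpartition("/")

# The usual ?VAR=VAL&... convention; values are lists since VAR can repeat.
variables = parse_qs(query)
```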
CHARSET ==> ##Forbidden characters:
## - not even percent-encoded
## - tab, newline or leading|trailing control chars or space
#Characters "to escape":
# - allowed but percent-encoding UTF-8 byte value
# - also known as "URL encoding"
# - HOSTNAME uses Punycode instead
# - any char (including Unicode) except forbidden characters
# - hex digits are case-insensitive, uppercase recommended
## - not in SCHEME
#Characters not "to escape" (beyond general URI delimiters):
# PORT [:digit:]
# SCHEME USERINFO HOSTNAME PATH QUERY HASH [:alnum:] - . +
# USERINFO HOSTNAME PATH QUERY HASH _ ~ ! $ & ' ( ) * ,
# HOSTNAME PATH QUERY HASH ; =
# PATH QUERY HASH : @
# QUERY HASH / ?
## HASH any char, including Unicode, excluding non-[:graph:] ASCII
## PATH [:graph:] if no AUTHORITY and PATH does not start with / and SCHEME not http[s]:, ws[s]:, ftp:, gopher:, file:
##WHATWG adds : ; = as characters "to escape" for USERINFO
#Escaping:
# - JavaScript: encode|decodeURI[Component]()
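#Sketch (Python, as a rough counterpart of the JavaScript functions above): percent-encoding the UTF-8 bytes of characters "to escape".
#quote() and encodeURIComponent() do not escape exactly the same set of characters, so treat the comparison as approximate.

```python
from urllib.parse import quote, unquote

text = "a b/é"

# Escape everything except unreserved chars (roughly encodeURIComponent()).
component = quote(text, safe="")

# Keep / unescaped, e.g. when escaping a whole PATH.
path = quote(text, safe="/")

# Decoding accepts lowercase hex digits too, since they are case-insensitive.
decoded = unquote("a%20b%2f%c3%a9")
```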
CASE ==> # - SCHEME, HOSTNAME: case-insensitive, lowercase recommended
# - rest: depends on SCHEME and AUTHORITY
MAX LENGTH ==> #Depends on URI scheme, but 255 bytes is the recommended limit
#IE11 has limit of 2KB
RELATIVE REFERENCE ==> # - when SCHEME: is omitted
# - relative to another URI2 ("base URI"), which can be (in priority order):
# - defined in RESOURCE (e.g. <base> in HTML)
# - defined in parent RESOURCE2
# - top-level resource's URI2
# - application-specific default URI2
# - if URI:
# - empty or only QUERY|HASH:
# - "same document reference"
# - same as URI2
# - including QUERY2 unless QUERY specified
# - excluding HASH2. HASH is also used.
# - URI:
# - "relative path reference"
# - relative to SCHEME2://AUTHORITY2/PATH2/
# - /PATH2/ excludes filename
# - /URI:
# - "absolute path reference"
# - relative to SCHEME2://AUTHORITY2/
# - //URI:
# - "network path reference"
# - relative to SCHEME2:
# - "URI reference": URI or relative reference
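#Sketch (Python standard library): resolving each kind of relative reference above against a base URI.
#The base URI and references are hypothetical.

```python
from urllib.parse import urljoin

base = "https://example.com/a/b/file.html?x=1#top"

# Same document reference: keeps PATH2 and QUERY2, replaces HASH2.
same_document = urljoin(base, "#section")

# Only QUERY given: keeps PATH2, replaces QUERY2, drops HASH2.
new_query = urljoin(base, "?y=2")

# Relative path reference: relative to /a/b/ (the filename is excluded).
relative_path = urljoin(base, "other.html")
parent_path = urljoin(base, "../up.html")

# Absolute path reference: relative to SCHEME2://AUTHORITY2/
absolute_path = urljoin(base, "/root.html")

# Network path reference: only SCHEME2: is reused.
network_path = urljoin(base, "//cdn.example.net/lib.js")
```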
SYNTAX RESTRICTIONS ==> # - URI syntax can be restricted by specific URI scheme
# - but should be avoided to separate URI (identification) from URI scheme (interaction)
# - should not be restricted by specific AUTHORITY
# - except for PORT
# - otherwise it can:
# - conflict with other technologies also restricting URI syntax
# - make URI more dependent on implementation
# - make it less flexible
# - HASH syntax|semantics are defined by MIME type, not SCHEME nor AUTHORITY
PARSING ==> #See: uri.js, DOM
/=+===============================+=\
/ : : \
)==: IRI :==(
\ :_______________________________: /
\=+===============================+=/
I18N ==> #IRIs (Internationalized Resource Identifiers) are like URIs but i18n'd:
# - do not need percent-encoding:
# - most Unicode chars
# - encoding is unspecified but usually UTF-8
# - if embedded inside a document with a specific encoding, should use that one
# - still need percent-encoding:
# - ASCII chars "to escape" in URIs
# - private-use or noncharacter Unicode code points
# - i.e. U+E000 to U+F8FF, U+FDD0 to U+FDEF, U+F|10XXXX
## - any invisible char
# - bidirectionality:
# - must use ltr for logical representation (i.e. digital format)
# - but can use any for visual representation (i.e. what people see/input)
RELATIONSHIP WITH URIS ==> # - URIs and IRIs are siblings, not subsets
# - can convert from each other
# - specific SCHEME or protocol might add steps to conversion, e.g. Punycode for IDNA (DNS)
# - scheme, protocol or format need to explicitly allow IRIs to be able to use them
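#Sketch (Python): one possible IRI -> URI conversion for an http(s) IRI, with Punycode for the DNS HOSTNAME and percent-encoding elsewhere.
#iri_to_uri() is a hypothetical helper, it assumes no USERINFO, and Python's "idna" codec implements the older IDNA 2003 rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri):
    """Convert an http(s) IRI to a URI. Assumes no USERINFO (simplification)."""
    parts = urlsplit(iri)
    # DNS hostnames use Punycode (IDNA), not percent-encoding.
    host = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Other parts: percent-encode the UTF-8 bytes of non-ASCII chars.
    # "%" stays in `safe` so that existing escapes are not encoded twice.
    path = quote(parts.path, safe="/:@!$&'()*+,;=%")
    query = quote(parts.query, safe="/?:@!$&'()*+,;=%")
    fragment = quote(parts.fragment, safe="/?:@!$&'()*+,;=%")
    return urlunsplit((parts.scheme, host, path, query, fragment))

uri = iri_to_uri("https://bücher.example/café?q=naïve#résumé")
```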
WHATWG "URL" DEFINITION ==> ## - "URL":
## - visual representation: IRI, with USERINFO hidden
## - logical representation: URI
## - "URL" does not imply location, "URI" and "IRI" words deprecated
/=+===============================+=\
/ : : \
)==: CURIE :==(
\ :_______________________________: /
\=+===============================+=/
GOAL ==> #CURIEs (Compact URIs) are like IRIs, but shorter
SYNTAX ==> #[PREFIX:]REFERENCE
# - PREFIX:
# - [:alnum:] . - _
# - cannot be _ alone
# - if optional, must provide way to get default value
# - avoid any SCHEME to avoid any confusion with actual IRI
# - REFERENCE
#CURIE <-> IRI translation:
# - is implementation-specific
# - can be simple IRI reference, with PREFIX mapping to the base IRI
#[CURIE]:
# - "SafeCURIE": CURIE enclosed in [ ]
# - optional but recommended to distinguish from IRI
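#Sketch (Python): since CURIE <-> IRI translation is implementation-specific, this shows only the simple case where PREFIX maps to a base IRI and REFERENCE is appended.
#PREFIXES and expand_curie() are hypothetical names, and the base IRIs are placeholders.

```python
# Hypothetical PREFIX -> base IRI mapping; "" holds the default base IRI.
PREFIXES = {
    "ex": "https://example.org/vocab/",
    "": "https://example.org/default/",
}

def expand_curie(curie, prefixes=PREFIXES):
    """Translate a CURIE or SafeCURIE into an IRI by concatenation."""
    if curie.startswith("[") and curie.endswith("]"):
        curie = curie[1:-1]  # SafeCURIE: drop the enclosing [ ]
    prefix, separator, reference = curie.partition(":")
    if not separator:
        prefix, reference = "", curie  # no PREFIX: use the default
    return prefixes[prefix] + reference

expanded = expand_curie("ex:title")
```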
USAGE ==> #Not where only IRIs are usually expected (e.g. <a href>)
#Do not expose to consumers, i.e. translate CURIE as IRI when consumed
/=+===============================+=\
/ : : \
)==: SEMANTICS :==(
\ :_______________________________: /
\=+===============================+=/
PURPOSE ==> #URIs can describe (not exclusive to each other):
# - identity:
# - resource ID, regardless of location
# - resolution:
# - translating into location/metadata
# - example: Handle System
# - example: URN, DOI
# - location:
# - address where a resource might be located
# - called URL (Uniform Resource Locator)
# - must use one|several network protocols|algorithm ("dereferencing"):
# - URI scheme does not always equate protocol, e.g. data: or file:
# - can be safe (read-only) or unsafe (write)
# - content:
# - including hash
# - example: data URI
# - metadata:
# - e.g.: URLs, owner, access restrictions, encoding
# - example: URC (never implemented), RDF
# - authorization ("capability URI")
#If describes content|metadata|authorization:
# - is opaque|unreflective (as opposed to transparent|reflective)
# - should validate that content|metadata|authorization currently match resource's
# - content|metadata|authorization should not be transient
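#Sketch (Python): a data: URI as an example of a URI describing content, plus a hash check validating that the described content still matches.
#The payload is hypothetical and the hash is kept as a plain value, not as any particular URI scheme.

```python
import base64
import hashlib

content = b"Hello, URI!"

# The data: URI embeds the content itself, so no network protocol is involved.
data_uri = "data:text/plain;base64," + base64.b64encode(content).decode("ascii")

# Reading it back only needs string handling.
header, _, payload = data_uri.partition(",")
decoded = base64.b64decode(payload)

# A hash recorded when the URI was created lets a consumer validate
# that the content it received still matches.
expected_digest = hashlib.sha256(content).hexdigest()
still_matches = hashlib.sha256(decoded).hexdigest() == expected_digest
```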
CAPABILITY URI ==> #Advantage: does not need to store nor enter info separately
#Problem is that URI not designed to be confidential:
# - displayed in browser URL bar or history
# - can use HTML5 history to remove it
# - often cached
# - should require no caching
# - often logged
# - revealed by Referer [C] (see web security doc)
# - might be exposed to consumers like URL shorteners
#Mitigation:
# - specific to capability URI:
# - URI scheme should encrypt capability URI, e.g. https:
# - specific to HTTP:
# - put directory (but not individual URIs) in robots.txt
# - like any authorization token:
# - should be unique and unpredictable
# - should have expiration date
# - should allow revoking
# - unless needs sharing, should be account-wise
# - should only be embedded inside resources that are encrypted and not accessible to untrusted parties
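#Sketch (Python): a capability URI treated like any authorization token, i.e. unpredictable, expiring and revocable.
#The in-memory dict, function names and base URI are assumptions standing in for a real storage layer.

```python
import secrets
from datetime import datetime, timedelta, timezone

# In-memory store standing in for a real database (assumption).
capabilities = {}

def create_capability_uri(base_uri, ttl=timedelta(hours=1)):
    """Return an https: URI ending with an unpredictable token."""
    token = secrets.token_urlsafe(32)  # 32 random bytes as URL-safe text
    capabilities[token] = datetime.now(timezone.utc) + ttl  # expiration date
    return f"{base_uri}/{token}"

def is_valid(token):
    """A token is valid if it is known and not expired."""
    expires = capabilities.get(token)
    return expires is not None and datetime.now(timezone.utc) < expires

def revoke(token):
    """Revoking simply forgets the token."""
    capabilities.pop(token, None)

uri = create_capability_uri("https://example.com/share")
token = uri.rsplit("/", 1)[1]
```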
URI SCHEME ==> #Describes:
# - interaction, i.e. identity resolution, location dereferencing, content retrieval or metadata usage
# - what URI should describe
# - should use URI scheme according to its design, e.g. no content|metadata in http: URI (see clean URI)
#Should promote reuse, i.e. only:
# - use registered ones
# - create new ones if real need
RESOURCE ==> #Can be anything:
# - including real-life objects|beings or concepts.
# - if essence is information, is "information resource".
REPRESENTATION ==> #How a resource is being represented for a specific request, using MIME type
#Resource !== representation:
# - two different resources can have same representation, e.g. copied resource
# - two different representations can have same resource, e.g. variant
TRANSIENCE ==> #Long-term persistent:
# - URI identity description should be
#Transient:
# - URI location|content|metadata description can be, but should be avoided
# - if URL:
# - rotten|dead|broken link
# - can prevent with:
# - permalink or PURL (permanent URL), redirecting to transient URL
# - robustlink:
# - redirects to archived version when dead link
# - can also always provide archived versions as alternatives (e.g. in a dropdown next to link)
# - implementations:
# - mementoweb robustlinks:
# - <a> uses:
# - normal link: href "URI-R", [data-versionurl "URI-M", ]data-versiondate "URI-M-DATE"
# - link to previous version: href "URI-M", data-originalurl "URI-R", data-versiondate "URI-M-DATE"
# - <meta itemprop="datePublished|Modified" content="YYYY-MM-DD">
# - uses Mementoweb Timetravel API
# - robustify.js
/=+===============================+=\
/ : : \
)==: CARDINALITY :==(
\ :_______________________________: /
\=+===============================+=/
RESOURCE IDENTITY ==> #Whether RESOURCE == RESOURCE2:
# - depends on SCHEME and AUTHORITY
# - example of ambiguous points:
# - non-semantic changes (formats, encoding, etc.)
# - different versions of same resource
SAMENESS ==> #Different requests to same URI cannot return different resources
# - at different places (global namespace) or times
#Global namespace:
# - points to same resource in any network
# - but resource might be available or not depending on network
# - sometimes the context is part of the resource itself:
# - e.g.:
# - file:
# - http://localhost
# - time-based protocols
# - should be avoided:
# - if is for security reason, can still use global namespace and add access control
# - implies central authorities:
# - URN NAMESPACE are registered with IANA
# - URL:
# - HOST IP and domain names are delegated to providers registered with IANA
# - PATH is responsibility of server
NO AMBIGUITY ==> #One request to 1 URI cannot return many resources at once ("URI collision")
UNIQUENESS ==> #Cannot bind duplicate URIs ("URI aliases") to 1 resource:
# - this requirement is optional, but can create problems:
# - when updating resource, need to update all URIs
# - if not enforced, can use:
# - canonical URI:
# - the URI which has priority
# - others might redirect to canonical URI (URLs only)
# - short URI: smaller size than others
# - depends on URI identity:
# - whether URI == URI2
# - normalization:
# - safe:
# - lowercase SCHEME, HOST
# - uppercase %-encoded chars
# - %-decode chars that do not need to be
# - resolving . and ..
# - resolving relative references
# - safe (IRIs only):
# - converting to same encoding (e.g. UTF-16 -> UTF-8)
# - Unicode characters normalization (e.g. STR.normalize() in JavaScript)
# - case-sensitivity should be for ASCII only
# - SCHEME-specific (unsafe):
# - removing parts that equal default value (e.g. PORT, PATH, filename)
# - replace with equivalent SCHEME (e.g. http -> https)
# - remove empty parts, e.g. ? or # alone
# - lowercase|uppercase of case-insensitive parts (other than SCHEME, HOST)
# - remove duplicate /
# - IP:
# - IP -> hostname
# - HTTP:
# - prepend www. to HOST
# - AUTHORITY-specific (unsafe):
# - add trailing /
# - sorting QUERY
# - removing unused parts (e.g. undefined query variables)
# - remove #HASH
# - should limit normalization by reducing the number of possible equivalent syntaxes
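#Sketch (Python): the "safe" normalizations only, i.e. lowercase SCHEME and HOST, uppercase %-encoded chars, %-decode unreserved chars, resolve . and ..
#Helper names are hypothetical; it assumes an absolute PATH and no USERINFO, and does none of the SCHEME|AUTHORITY-specific steps.

```python
import re
import string
from urllib.parse import urlsplit, urlunsplit

# Chars that never need percent-encoding.
UNRESERVED = string.ascii_letters + string.digits + "-._~"
ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")

def fix_escapes(text):
    """Decode escapes of unreserved chars, uppercase the hex of the others."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0).upper()
    return ESCAPE.sub(replace, text)

def remove_dot_segments(path):
    """Resolve . and .. in an absolute PATH (simplified)."""
    output = []
    for segment in path.split("/"):
        if segment == "..":
            if len(output) > 1:  # never pop the leading empty segment
                output.pop()
        elif segment != ".":
            output.append(segment)
    if path.endswith(("/.", "/..")):
        output.append("")  # a trailing dot segment leaves a trailing /
    return "/".join(output)

def normalize(uri):
    parts = urlsplit(uri)
    host = parts.hostname or ""  # urlsplit lowercases the hostname
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    path = remove_dot_segments(fix_escapes(parts.path))
    return urlunsplit((parts.scheme.lower(), host, path,
                       fix_escapes(parts.query), fix_escapes(parts.fragment)))

normalized = normalize("HTTP://Example.COM:8080/a/./b/../%7euser/caf%c3%a9?q=%3d#Frag")
```

#Two URIs can then be compared by comparing their normalized forms; a mismatch does not prove the resources differ.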
/=+===============================+=\
/ : : \
)==: CLEAN URI :==(
\ :_______________________________: /
\=+===============================+=/
GOAL ==> #Semantic/clean/RESTful URI
# - accessibility
# - separation of concerns, e.g. does not need to change URL when implementation|representation changes
HOW ==> #Should not infer implementation:
# - no file|script names or anything linked to the technical stack
# - only use abstract, human-friendly identifiers
# - using slugs, i.e. short [[:alpha:][:lower:]-]
# - e.g. several /RESOURCE and /ID
#Should not infer representation (e.g. variants):
# - use protocol instead of URI:
# - e.g. HTTP request headers|body
# - use format instead of URI:
# - e.g. media|hreflang|type|sizes links HTML attribute
# - problems:
# - sometimes not allowed (e.g. request body in HTTP GET)
# - harder to set non-programmatically (e.g. as opposed to using browser URL bar)
# - avoid QUERY
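#Sketch (Python): turning a human-readable title into a slug, as mentioned above for abstract identifiers.
#slugify() is a hypothetical helper; keeping digits and dropping accents are assumptions, not rules from any spec.

```python
import re
import unicodedata

def slugify(title):
    """Lowercase ASCII words joined by single hyphens."""
    # Decompose accented letters, then drop the non-ASCII combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace every run of other chars with one hyphen, trim the ends.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")

slug = slugify("Crème Brûlée: The Recipe!")
clean_uri = f"https://example.com/recipes/{slug}"
```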
HELPERS ==> #Technologies that can help keep URI clean:
# - HTTP headers
# - links HTML attributes: media|hreflang|type|sizes
# - URI templates
# - Link [S]
# - well known URIs
/=+===============================+=\
/ : : \
)==: URI TEMPLATES :==(
\ :_______________________________: /
\=+===============================+=/
DOC ==> #See URI templates doc
/=+===============================+=\
/ : : \
)==: WELL-KNOWN URIS :==(
\ :_______________________________: /
\=+===============================+=/
WELL-KNOWN URIS ==> #URIs:
# - that need to be accessed before any other resource (e.g. resource access policy)
# - whose PATH is common among different authorities:
# - registered with IANA
# - PATH starts with /.well-known/
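#Sketch (Python): deriving a well-known URI from any URI of the same authority, by keeping SCHEME and AUTHORITY and replacing the rest.
#well_known_uri() is a hypothetical helper; "security.txt" is used only as an example suffix.

```python
from urllib.parse import urlsplit, urlunsplit

def well_known_uri(any_uri, name):
    """Build SCHEME://AUTHORITY/.well-known/NAME from any URI on that authority."""
    parts = urlsplit(any_uri)
    return urlunsplit((parts.scheme, parts.netloc, f"/.well-known/{name}", "", ""))

policy_uri = well_known_uri("https://example.com/some/page?x=1#top", "security.txt")
```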
/=+===============================+=\
/ : : \
)==: ACCESS CONTROL :==(
\ :_______________________________: /
\=+===============================+=/
TYPES ==> #Content access (i.e. dereferencing a URI) !== providing URI
#Content access can be:
# - normal origin server hosting
# - redirection and caching, e.g. proxies, internet archives, search engines
# - including, e.g. embedding
HOW TO RESTRICT ==> #Should prefer restricting content access over restricting URI itself
#Should prefer technological ways to restrict content access:
# - authentication
# - robots.txt
# - X-Frame-Options [S]
# - Cache-Control: no-store [S]
# - Cache-Control: no-transform [S]
#Should avoid putting content access in legal ways like terms of use:
# - but when using, should provide Link: <URI>;rel="license" [S]