Proposal for human ID rules. #3
Changes from 10 commits
4f3ee12
408a051
f2422ea
37a7f21
0131543
3d5ec5e
0ab2d66
c9f6534
87f656e
ee3fe98
aebfcda
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,81 +1,132 @@ | ||
This document outlines the format for human-readable IDs within matrix. | ||
Abstract | ||
======== | ||
|
||
Overview | ||
-------- | ||
UTF-8 is quickly becoming the standard character encoding set on the web. As | ||
such, Matrix requires that all strings MUST be encoded as UTF-8. However, | ||
This document outlines the format for human-readable IDs within Matrix. | ||
|
||
Background | ||
---------- | ||
UTF-8 is the dominant character encoding for Unicode on the web. However, | ||
using Unicode as the character set for human-readable IDs is troublesome. There | ||
are many different characters which appear identical to each other, but would | ||
identify different users. In addition, there are non-printable characters which | ||
cannot be rendered by the end-user. This opens up a security vulnerability with | ||
produce different IDs. In addition, there are non-printable characters which | ||
cannot be rendered by the end-user. This creates an opportunity for | ||
phishing/spoofing of IDs, commonly known as a homograph attack. | ||
|
||
Web browers encountered this problem when International Domain Names were | ||
Web browsers encountered this problem when International Domain Names were | ||
introduced. A variety of checks were put in place in order to protect users. If | ||
an address failed the check, the raw punycode would be displayed to | ||
disambiguate the address. Similar checks are performed by home servers in | ||
Matrix. However, Matrix does not use punycode representations, and so does not | ||
show raw punycode on a failed check. Instead, home servers must outright reject | ||
these misleading IDs. | ||
disambiguate the address. | ||
|
||
Types of human-readable IDs | ||
--------------------------- | ||
There are two main human-readable IDs in question: | ||
The human-readable IDs in Matrix are Room Aliases and User IDs. | ||
Room aliases look like ``#localpart:domain``. These aliases point to opaque | ||
non human-readable room IDs. These pointers can change to point at a different | ||
room ID at any time. User IDs look like ``@localpart:domain``. These represent | ||
actual end-users (there is no indirection). | ||
|
||
- Room aliases | ||
- User IDs | ||
Proposal | ||
======== | ||
|
||
Room aliases look like ``#localpart:domain``. These aliases point to opaque | ||
non human-readable room IDs. These pointers can change, so there is already an | ||
issue present with the same ID pointing to a different destination at a later | ||
date. | ||
|
||
User IDs look like ``@localpart:domain``. These represent actual end-users, and | ||
unlike room aliases, there is no layer of indirection. This presents a much | ||
greater concern with homograph attacks. | ||
|
||
Checks | ||
------ | ||
- Similar to web browsers. | ||
- blacklisted chars (e.g. non-printable characters) | ||
- mix of language sets from 'preferred' language not allowed. | ||
- Language sets from CLDR dataset. | ||
- Treated in segments (localpart, domain) | ||
- Additional restrictions for ease of processing IDs. | ||
|
||
- Room alias localparts MUST NOT have ``#`` or ``:``. | ||
- User ID localparts MUST NOT have ``@`` or ``:``. | ||
|
||
Rejecting | ||
--------- | ||
- Home servers MUST reject room aliases which do not pass the check, both on | ||
GETs and PUTs. | ||
- Home servers MUST reject user ID localparts which do not pass the check, both | ||
on creation and on events. | ||
- Any home server whose domain does not pass this check, MUST use their punycode | ||
domain name instead of the IDN, to prevent other home servers rejecting you. | ||
- Error code is ``M_FAILED_HUMAN_ID_CHECK``. (generic enough for both failing | ||
due to homograph attacks, and failing due to including ``:`` s, etc) | ||
- Error message MAY go into further information about which characters were | ||
rejected and why. | ||
- Error message SHOULD contain a ``failed_keys`` key which contains an array | ||
of strings which represent the keys which failed the check e.g:: | ||
|
||
failed_keys: [ user_id, room_alias ] | ||
|
||
Other considerations | ||
-------------------- | ||
- Basic security: Informational key on the event attached by HS to say "unsafe | ||
User IDs and Room Aliases MUST be Unicode as UTF-8. Checks are performed on | ||
these IDs by homeservers to protect users from phishing/spoofing attacks. | ||
These checks are: | ||
|
||
User ID Localparts: | ||
- MUST NOT contain a ``:`` or start with a ``@`` or ``.`` | ||
- MUST NOT contain one of the 107 blacklisted characters on this list: | ||
Review comment: "MUST not contain any" surely? Ditto later. |
||
http://kb.mozillazine.org/Network.IDN.blacklist_chars | ||
Review comment: I'm not quite sure if this is the right restriction. I share Matthew's disquiet that this list is 7 years old, but I think it's not a bad match for what we want. At the opposite extreme, the unicode consortium have a lot to say about what should be allowed in an 'identifier': http://www.unicode.org/reports/tr31/. For them, an identifier seems to be more about how you might name a variable in a programming language, but they also hint that it might be appropriate for domain names. (The TL;DR of that document is that you look at this file and allow things which are 'Allowed', plus/minus minor tweaks for your specific case). This is highly restrictive and disallows a lot of things like emoji and mathematical symbols. I feel like this is too restrictive for us? |
||
- After stripping " 0-9, +, -, [, ], _, and the space character it MUST NOT | ||
Review comment: I think this is the wrong test - or at least misworded. 'A' is in the exemplar characters for both English and French, for example... I think what you're after is that it only contains characters from one script, after ignoring Common and Inherited script characters, as per http://unicode.org/reports/tr39/#Mixed_Script_Detection. That may be overly restrictive, but it's probably easier to have something restrictive and relax it later, than vice versa. (ditto for room aliases) |
||
contain characters from >1 language, defined by the `exemplar characters`_ | ||
on http://cldr.unicode.org/ | ||
|
||
.. _exemplar characters: http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters | ||
|
||
Room Alias Localparts: | ||
- MUST NOT contain a ``:`` | ||
- MUST NOT contain one of the 107 blacklisted characters on this list: | ||
http://kb.mozillazine.org/Network.IDN.blacklist_chars | ||
- After stripping " 0-9, +, -, [, ], _, and the space character it MUST NOT | ||
contain characters from >1 language, defined by the `exemplar characters`_ | ||
on http://cldr.unicode.org/ | ||
|
||
.. _exemplar characters: http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters | ||
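The localpart checks listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `scripts_in` and `check_localpart` are hypothetical, the 107-character blacklist is left out rather than reproduced from memory, and the "more than one language" rule is approximated by the script-based test suggested in the review comment, using the first word of each character's Unicode name as a stand-in for its script property (Python's standard library does not expose that property directly).

```python
import unicodedata


def scripts_in(text: str) -> set:
    """Approximate the set of scripts used by the letters in text.

    Assumption: the first word of the Unicode character name
    ("LATIN", "CYRILLIC", ...) is used as a rough proxy for the script.
    Digits, punctuation and spaces are skipped, which loosely mirrors
    the "strip 0-9, +, -, [, ], _ and space" step in the proposal.
    """
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


def check_localpart(localpart: str, sigil: str) -> list:
    """Return a list of human-readable problems; empty means it passed.

    sigil is "@" for user IDs and "#" for room aliases.
    The blacklist check from the proposal is intentionally omitted here.
    """
    problems = []
    if ":" in localpart:
        problems.append("contains ':'")
    if sigil == "@" and localpart[:1] in ("@", "."):
        problems.append("starts with '@' or '.'")
    if len(scripts_in(localpart)) > 1:
        problems.append("mixes scripts")
    return problems
```

With this sketch, `check_localpart("eb\u0430y", "@")` reports mixed scripts, while an all-Cyrillic localpart passes. The name-prefix heuristic is crude (for instance, it would treat Hiragana and Katakana as different scripts), so a real implementation would more likely use proper script data.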
|
||
In the event of a failed user ID check, well behaved homeservers MUST: | ||
- Rewrite user IDs in the offending events to be punycode with an additional ``@`` | ||
prefix **before** delivering them to clients. There are no guarantees for | ||
consistency between homeserver ID checking implementations. As a result, user | ||
IDs MUST be sent in their *original* form over federation. This can be done in | ||
a stateless manner as the punycode form has no information loss. | ||
|
||
In the event of a failed room alias check, well behaved homeservers MUST: | ||
- Send an HTTP status code 400 with an ``errcode`` of ``M_FAILED_HUMAN_ID_CHECK`` | ||
to the client if the client is attempting to *create* this alias. | ||
- Send an HTTP status code 400 with an ``errcode`` of ``M_FAILED_HUMAN_ID_CHECK`` | ||
to the client if the client is attempting to *join* a room via this alias. | ||
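As a rough illustration of the rejection described above, a homeserver might build the response along these lines. The function name `failed_human_id_response` is hypothetical, and the `error` field is assumed here only because Matrix error bodies conventionally pair `errcode` with a human-readable `error` string; the proposal itself only fixes the status code and `errcode`.

```python
import json


def failed_human_id_response(reason: str) -> tuple:
    """Build an (HTTP status, JSON body) pair for a failed room alias check.

    Assumption: the body shape {"errcode", "error"} follows the usual
    Matrix error convention; only the errcode and the 400 status come
    from the proposal text.
    """
    body = {
        "errcode": "M_FAILED_HUMAN_ID_CHECK",
        "error": reason,
    }
    return 400, json.dumps(body)
```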
|
||
Examples:: | ||
|
||
@ebаy:domain.com (Cyrillic 'a', everything else English) | ||
@@xn--eby-7cd:domain.com (Punycode with additional '@') | ||
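The rewrite shown in the example above can be reproduced with Python's built-in `punycode` codec. This is a minimal sketch, not the proposal's reference implementation: the function names are hypothetical, and the `xn--` prefix is added by hand to match the example.

```python
def rewrite_failed_user_id(user_id: str) -> str:
    """Rewrite a user ID that failed the check into its '@@xn--' form.

    Assumes user_id looks like '@localpart:domain'. Only the localpart
    is encoded; the domain is passed through untouched.
    """
    localpart, domain = user_id[1:].split(":", 1)
    encoded = localpart.encode("punycode").decode("ascii")
    return "@@xn--" + encoded + ":" + domain


def restore_user_id(rewritten: str) -> str:
    """Invert rewrite_failed_user_id, showing that no information is lost."""
    localpart, domain = rewritten[1:].split(":", 1)
    encoded = localpart[len("@xn--"):]
    original = encoded.encode("ascii").decode("punycode")
    return "@" + original + ":" + domain


# "\u0430" is the Cyrillic 'a' from the example above.
spoofed = "@eb\u0430y:domain.com"
print(rewrite_failed_user_id(spoofed))  # @@xn--eby-7cd:domain.com
```

Because decoding recovers the original string exactly, a homeserver could apply the rewrite just before delivery to clients and still send the original form over federation, as the text requires.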
|
||
Homeservers SHOULD NOT allow two user IDs that differ only by case. This | ||
SHOULD be applied based on the capitalisation rules in the CLDR dataset: | ||
http://cldr.unicode.org/ | ||
Review comment: Where specifically? I couldn't find them...
Review comment: I don't think we want to use the CLDR datasets here - this operation can and should be done irrespective of language. I think what we want here is to check for ids which are the same when "casefolded". This is defined in Section 3.13 of the unicode spec. Before casefolding, we need to normalise the input (probably with NFKC) to remove ambiguity between "Ç" and "C+◌̧", etc. |
||
|
||
This check SHOULD be applied when the user ID is created, in order to prevent | ||
registration with the same name and different capitalisations, e.g. | ||
``@foo:bar`` vs ``@Foo:bar`` vs ``@FOO:bar``. Home servers MAY canonicalise | ||
the user ID to be completely lower-case if desired. | ||
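One way to detect IDs that differ only by case is the language-independent approach raised in review: normalise with NFKC, then casefold, and compare the results. The sketch below assumes that approach rather than the CLDR rules named in the text, and the function names are hypothetical.

```python
import unicodedata


def canonical_key(localpart: str) -> str:
    """Return a comparison key: NFKC normalisation followed by casefolding.

    Assumption: this follows the reviewer's suggestion, not the CLDR
    capitalisation rules mentioned in the proposal text.
    """
    return unicodedata.normalize("NFKC", localpart).casefold()


def differ_only_by_case(a: str, b: str) -> bool:
    """True if two distinct localparts collapse to the same key."""
    return a != b and canonical_key(a) == canonical_key(b)
```

A homeserver could store `canonical_key(localpart)` alongside each registered user and refuse a registration whose key already exists, which would cover the ``@foo:bar`` vs ``@Foo:bar`` vs ``@FOO:bar`` case.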
|
||
Rationale | ||
========= | ||
|
||
Each ID is split into segments (localpart/domain) around the ``:``. For | ||
this reason, ``:`` is a reserved character and cannot be a localpart character. | ||
The 107 blacklisted characters are used to prevent non-printable characters and | ||
spaces from being used. The decision to ban characters from more than 1 language | ||
matches the behaviour of `Google Chrome for IDN handling`_. This is to protect | ||
against common homograph attacks such as ebаy.com (Cyrillic "a", rest is | ||
English). This would always result in a failed check. Even with this though | ||
there are limitations. For example, сахар is entirely Cyrillic, whereas caxap is | ||
Review comment: We may want to specify that homeservers have to check for 'confusables': both whole-script (which is this 'сахар' vs 'caxap' case), and single-script ('l' vs 'I'). |
||
entirely Latin. | ||
|
||
.. _Google Chrome for IDN handling: https://www.chromium.org/developers/design-documents/idn-in-google-chrome | ||
|
||
User ID localparts cannot start with ``@`` so that a namespace of localparts | ||
beginning with ``@`` can be created. This namespace is used for user IDs which | ||
fail the ID checks. A failed ID could look like ``@@xn--c1yn36f:domain.com``. | ||
|
||
If a user ID fails the check, the user ID on the event is renamed. This doesn't | ||
require extra work for clients, and users will see an odd user ID rather than a | ||
spoofed name. Renaming is done in order to protect users of a given HS, so if a | ||
malicious HS doesn't rename their IDs, it doesn't affect any other HS. | ||
|
||
Room aliases cannot be rewritten as punycode and sent to the HS the alias is | ||
referring to as the HS will not necessarily understand the rewritten alias. | ||
|
||
Other rejected solutions for failed checks | ||
------------------------------------------ | ||
- Additional key: Informational key on the event attached by HS to say "unsafe | ||
ID". Problem: clients can just ignore it, and since it will appear only very | ||
rarely, easy to forget when implementing clients. | ||
- Moderate security: Requires client handshake. Forces clients to implement | ||
- Require client handshake: Forces clients to implement | ||
a check, else they cannot communicate with the misleading ID. However, this | ||
is extra overhead in both client implementations and round-trips. | ||
- High security: Outright rejection of the ID at the point of creation / | ||
- Reject event: Outright rejection of the ID at the point of creation / | ||
Review comment: I'm not sure why this isn't still a valid policy; surely every properly behaving server is just rejecting the event as soon as they see it?
Review comment: Don't make me :sadpanda: That very bullet point contains:
So even if HSes are on their best behaviour they may think something is invalid because they're using an older dataset. |
||
receiving event. Point of creation rejection is preferable to avoid the ID | ||
entering the system in the first place. However, malicious HSes can just | ||
allow the ID. Hence, other home servers must reject them if they see them in | ||
events. Client never sees the problem ID, provided the HS is correctly | ||
implemented. | ||
- High security decided; client doesn't need to worry about it, no additional | ||
protocol complexity aside from rejection of an event. | ||
implemented. However, it is difficult to ensure that ALL HSes will come to the | ||
same conclusion (given the CLDR dataset does come out with new versions). | ||
|
||
Outstanding Problems | ||
==================== | ||
|
||
Capitalisation | ||
-------------- | ||
|
||
The capitalisation rules outlined above are nice but do not fully resolve issues | ||
where ``@alice:example.com`` tries to speak with ``@bob:domain.com`` using | ||
``@Bob:domain.com``. It is up to ``domain.com`` to map ``Bob`` to ``bob`` in | ||
a sensible way. |
Review comment: Why can't they start with @s or .s?

Review comment: Oh, the @ is described below (an arbitrary choice I guess) - I still don't know about the .

Review comment: The rationale for forbidding a . prefix was because at one point we were
going to namespace gateway'd user IDs as
@.irc.whatever.nickname:foo.com. This has been lost as we namespace
bridges any old way nowadays. Personally I still think it'd be useful
to reserve some prefixes as secondary sigils (twigils) if needed.
On 19/10/2015 16:14, Daniel Wagner-Hall wrote:

Review comment: I 100% agree with Ara on this: some future-proofing by reserving prefixes we may want in the future is a Good Thing imo.