Proposal for human ID rules. #3

kegsay · 2014-12-22T10:50:17Z

Includes handling of capitalisation, spoof checks and escape sequences.

Rendered: https://github.com/matrix-org/matrix-doc/blob/human-id-rules/drafts/human-id-rules.rst

Includes handling of namespaces for bots, handling of capitalisation, spoof checks and escape sequences.

Clarify position on capitalisation.

Moar clarify.

Mention case canonicalisation on registration.

ara4n · 2014-12-23T22:28:50Z

Thoughts:

What about whitespace? (I suggest we veto whitespace entirely in aliases & user IDs, as it's too ambiguous and hard to parse)
Should we be preemptively banning other sigils than @ and .?
Shouldn't room aliases also have sigil restrictions (e.g. not starting with a ! or #?)
Why should user IDs be case-insensitive but not room aliases? What does case-insensitive actually mean in this context? Case sensitivity is locale specific - whose locale are we comparing in?
"Checks SHOULD be applied to room aliases" -- how and when are the checks actually applied? If it's just that they should be enforced on creation, it should be clearer. Why isn't this a MUST?
Should we be looking more at stringprep-style normalisation of the unicode (e.g. http://xmpp.org/extensions/xep-0328.html), to ensure we do accurate comparisons between strings? If so, where does this happen - at creation, or doing a canonicalisation lookup when discovering a contact, or whenever an event is sent?
I think we should be specifying how to handle ambiguous strings (e.g. @matthew:matrix.org v. @matthew:matrix.org) in the same document, alongside punycoding malicious-looking IDs. Whilst this is a slightly different problem, it's close enough that it'd be good to cover them all off. It sounds like the plan for doing this is a lookup API on the target HS that can be queried to find out the canonical name for an ID?
Why can't we rename room aliases if they fail the check (and change the client-server API on creating them to accept alternative proposed aliases if the one submitted is inappropriate)?
I'm not sure we should encourage the @.irc.freenode.matrix.:domain style namespacing - it'd be better if 3rd party users were bridged into their real matrix equivalents (e.g. Arathorn being bridged into @matthew:matrix.org rather than creating the virtual @.irc.freenode.arathorn:matrix.org user). I guess namespacing is worthwhile if we need it, though. Why dots? And wouldn't @m.irc etc be more matrixy?
In the final "Home servers MAY canonicalise the user ID to be completely lower-case if desired." - I'm not sure it's worth saying this, or if it is, we should spell out that this canonicalisation would be happening at sign-up. We should spell out whether HSes do case-insensitive comparisons on user IDs when processing them in events (it sounds like they don't), and give some examples of how somebody trying to talk to @matthew:matrix.org will end up talking to @matthew:matrix.org instead (i.e. by doing a canonicalisation lookup on the matrix.org HS, if that's what we're doing).

NegativeMjark · 2014-12-29T19:29:11Z

What about whitespace? (I suggest we veto whitespace entirely in aliases & user IDs, as it's too ambiguous and hard to parse)

I agree.

Should we be preemptively banning other sigils than @ and .?
Shouldn't room aliases also have sigil restrictions (e.g. not starting with a ! or #?)

This can be left to individual HS policy. If there are particular sigils that are likely to be confusing then we should recommend that homeservers ban them. We could recommend that homeservers ban "@!#$:" and any other characters that have special meaning for IDs.

Why should user IDs be case-insensitive but not room aliases? What does case-insensitive actually mean in this context? Case sensitivity is locale specific - whose locale are we comparing in?
"Checks SHOULD be applied to room aliases" -- how and when are the checks actually applied? If it's just that they should be enforced on creation, it should be clearer. Why isn't this a MUST?

We want to stop people creating IDs or aliases that differ only by case. We apply the checks when the room or ID is created.

We have used SHOULD because the protocol as currently written will continue to function if the homeserver disobeys the recommendation. It is the responsibility of the homeserver to manage the IDs it owns and we are making recommendations about how it does so.

Should we be looking more at stringprep-style normalisation of the unicode (e.g. http://xmpp.org/extensions/xep-0328.html), to ensure we do accurate comparisons between strings? If so, where does this happen - at creation, or doing a canonicalisation lookup when discovering a contact, or whenever an event is sent?

We should definitely apply unicode canonicalisation. (I think DNS uses NFC?)
And it would be a sensible way to get rid of white space.
I'm not sure we would want to do any case-folding mappings though.
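
To make the normalisation point concrete, here is a minimal sketch using Python's standard unicodedata module. The canonicalise helper is a hypothetical name for illustration and is not part of the proposal. It suggests that NFC merges composed and decomposed spellings but leaves case and a no-break space untouched, so whitespace would probably still need an explicit ban or a compatibility form such as NFKC.

```python
import unicodedata


def canonicalise(identifier: str, form: str = "NFC") -> str:
    """Return the identifier in the given Unicode normalisation form."""
    return unicodedata.normalize(form, identifier)


decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "caf\u00e9"   # the single code point U+00E9

# NFC makes the two spellings compare equal.
same_after_nfc = canonicalise(decomposed) == canonicalise(precomposed)

# NFC keeps U+00A0 NO-BREAK SPACE as it is; NFKC folds it to a plain space.
nbsp_survives_nfc = canonicalise("a\u00a0b") == "a\u00a0b"
nbsp_folded_by_nfkc = canonicalise("a\u00a0b", "NFKC") == "a b"

# Neither form performs any case mapping.
case_untouched = canonicalise("Matthew", "NFKC") == "Matthew"
```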

I think we should be specifying how to handle ambiguous strings (e.g. @matthew:matrix.org v. @matthew:matrix.org) in the same document, alongside punycoding malicious-looking IDs. Whilst this is a slightly different problem, it's close enough that it'd be good to cover them all off. It sounds like the plan for doing this is a lookup API on the target HS that can be queried to find out the canonical name for an ID?

I think the options that work from a technical perspective are:

Canonicalise on the client using client locale. I think RFC 5891 suggests doing this for internationalized domain names. I guess using the client locale makes it more likely to catch capitalisation mistakes that the client would make.
Canonicalise on the client/hs using something like nameprep. I think RFC 3490 suggests doing this for internationalized domain names. I'd be interested to know why they stopped recommending this.
Use "case insensitive" comparisons for ids. I think RFC 4343 recommends this for the ASCII characters in domain names. This only works if you are from the US.
Use a lookup API on the HS that owns the user id or just make the invite API or room alias API do the canonicalisation. Gives us a lot of flexibility.
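
As a rough illustration of why ASCII-only case-insensitive comparison is limited, the sketch below sets it against Python's full Unicode case folding. ascii_lower is a hypothetical helper and only approximates the RFC 4343 behaviour mentioned above; it is not taken from the proposal.

```python
def ascii_lower(s: str) -> str:
    """Lower-case only the ASCII letters A-Z and leave everything else alone."""
    return "".join(ch.lower() if "A" <= ch <= "Z" else ch for ch in s)


# Plain ASCII IDs compare as expected.
ascii_match = ascii_lower("Matthew") == ascii_lower("MATTHEW")

# Non-ASCII capitals are left untouched, so these two IDs would not clash.
accent_match = ascii_lower("\u00c9COLE") == ascii_lower("\u00e9cole")

# Full Unicode case folding catches that pair, and also maps sharp s to "ss".
folded_accent_match = "\u00c9COLE".casefold() == "\u00e9cole".casefold()
folded_sharp_s_match = "STRASSE".casefold() == "Stra\u00dfe".casefold()
```

Neither comparison is locale-aware, which is the concern raised earlier in the thread about whose locale is used.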

Why can't we rename room aliases if they fail the check (and change the client-server API on creating them to accept alternative proposed aliases if the one submitted is inappropriate)?

I'm not sure how we would determine a suitable alternative for the room aliases.

I'm not sure we should encourage the @.irc.freenode.matrix.:domain style namespacing - it'd be better if 3rd party users were bridged into their real matrix equivalents (e.g. Arathorn being bridged into @matthew:matrix.org rather than creating the virtual @.irc.freenode.arathorn:matrix.org user). I guess namespacing is worthwhile if we need it, though. Why dots? And wouldn't @m.irc etc be more matrixy?

I'm not sure about bridging Arathorn into @.matthew:matrix.org. If we are going to have decent crypto and security it seems wrong to encourage insecure protocols to masquerade as you. (Then again if you trust freenode, connect over SSL and have registered your nick it might not be too bad).

I think we need to be able to bridge IRC users that don't have existing matrix accounts. I'd like it if we had a way to namespace those accounts. I guess dots match how we namespace event types. I personally don't mind using "-" like we currently do in the IRC bridge.

kegsay · 2014-12-29T23:08:29Z

it'd be better if 3rd party users were bridged into their real matrix equivalents

In that case, we'd really need to use OAuth2 to provide limited access for a given third party to access the matrix account. Without that, it's obviously a huge security risk if any bridge can act freely on your behalf without your permission. I think we should be providing OAuth2 anyway, given "acting on behalf of a client" is a recurring problem. The limited access token would, say, only be able to send m.room.message in rooms you're in, and would not be able to invite / join / leave rooms.

ara4n · 2014-12-29T23:19:50Z

I'm pretty sure we want to be able to link random services (and apps) to our main account, otherwise we're going to end up with a billion fragmented accounts, which could be a terrible experience (unless we fix portability of accounts, or make multi-account clients really painless). I'm not sure that OAuth2 is good enough for this. See https://matrix.org/jira/browse/SPEC-79 for more discussion on this problem.

kegsay · 2014-12-30T09:26:33Z

I'm not sure that OAuth2 is good enough for this.

In what way is OAuth2 deficient? You are free to choose an arbitrary set of "scope" permissions which can be granted via OAuth2, e.g. this is Google's: https://www.googleapis.com/discovery/v1/apis/oauth2/v2/rest?fields=auth(oauth2(scopes))

A good writeup on OAuth scopes: https://brandur.org/oauth-scope

I would prefer we didn't re-invent the wheel on this...

ara4n · 2015-10-13T00:04:18Z

Am I hallucinating that we made some progress in finally locking this down? I'd much rather that we got this one landed and implemented in Synapse than adding more bells & whistles like #93...

kegsay · 2015-10-13T11:03:08Z

I don't think we've made any more progress on this. I'll try to push forward a revised edition which incorporates the past 10 months ❗ of work. This is a tricky one to do correctly because it can get tangled up with "should messages to @ foo:bar go to @ Foo:bar" - I will attempt to avoid this conflict as much as possible to increase the odds of us getting consensus. This does mean that the revised edition will not cover all bases, but that means it shouldn't be blocked on things like the "business card lookup API" or "how do we resolve case-mappings", etc.


kegsay · 2015-10-13T14:22:06Z

Done. With respect to the exemplar characters on the CLDR datasets, there are Python bindings for the library International Components for Unicode (ICU) which exposes the CLDR datasets.

https://pypi.python.org/pypi/PyICU/

ICU itself exposes functions to do mappings to and from punycode: http://icu-project.org/apiref/icu4c/uidna_8h.html

There's also a handy spoofing library: http://icu-project.org/apiref/icu4c/uspoof_8h.html which has the USPOOF_MIXED_SCRIPT_CONFUSABLE and USPOOF_SINGLE_SCRIPT checks, which look to be what we want.

illicitonion · 2015-10-19T15:09:04Z

drafts/human-id-rules.rst

+These checks are:
+
+User ID Localparts:
+ - MUST NOT contain a ``:`` or start with a ``@`` or ``.``


Why can't they start with @s or .s?

Oh, the @ is described below (an arbitrary choice I guess) - I still don't know about the .

The rationale for forbidding a . prefix was because at one point we were
going to namespace gateway'd user IDs as
@.irc.whatever.nickname:foo.com. This has been lost as we namespace
bridges any old way nowadays. Personally I still think it'd be useful
to reserve some prefixes as secondary sigils (twigils) if needed.

On 19/10/2015 16:14, Daniel Wagner-Hall wrote:

In drafts/human-id-rules.rst
#3 (comment):

-- Error message MAY go into further information about which characters were
-  rejected and why.
-- Error message SHOULD contain a failed_keys key which contains an array
-  of strings which represent the keys which failed the check e.g::
-
-    failed_keys: [ user_id, room_alias ]
-
-Other considerations
----------------------
-- Basic security: Informational key on the event attached by HS to say "unsafe
+User IDs and Room Aliases MUST be Unicode as UTF-8. Checks are performed on
+these IDs by homeservers to protect users from phishing/spoofing attacks.
+These checks are:
+
+User ID Localparts:
+ - MUST NOT contain a ``:`` or start with a ``@`` or ``.``

Oh, the @ is described below (an arbitrary choice I guess) - I still
don't know about the .


I 100% agree with Ara on this: some future-proofing by reserving prefixes we may want in the future is a Good Thing imo.
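
For reference, the localpart rule quoted in the diff above could be checked along these lines. This is a minimal sketch: localpart_violations is a hypothetical helper name, and it covers only the "MUST NOT contain a : or start with a @ or ." rule, none of the spoofing checks.

```python
def localpart_violations(localpart: str) -> list:
    """Return human-readable reasons why the localpart breaks the quoted rule."""
    violations = []
    if ":" in localpart:
        violations.append("contains ':'")
    # '@' is the user ID sigil; '.' is the prefix reserved for namespacing.
    if localpart.startswith(("@", ".")):
        violations.append("starts with '@' or '.'")
    return violations


ok = localpart_violations("matthew")
dotted_inside = localpart_violations("irc.arathorn")
reserved_prefix = localpart_violations(".irc.freenode.arathorn")
has_colon = localpart_violations("foo:bar")
```

A dot inside the localpart is still allowed under this reading; only a leading dot is reserved.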

ara4n · 2015-12-10T01:03:36Z

After a discussion on #matrix:matrix.org about SPEC-1 I've gone and read through the latest proposal here. In general it feels good - but I have some concerns:

It feels like the SHOULD requirements disallowing case-insensitive clashes between IDs should be MUST? Having HSes which allow colliding IDs is surely a bug.
How do we handle canonical comparison that includes unicode ligatures etc? e.g. the difference between "LATIN SMALL LIGATURE FFI" (ﬃ) and ffi. Leo says this might be solvable by NFKC.
I'm not sure how much I trust the 2009 vintage "107 blacklisted IDN character" list from the Mozilla wiki. Unicode has moved on a lot in the last 7 years.
How do we handle combining diacritics (i.e. zalgos)? This definition of a homograph attack doesn't seem to cover the difference between S͂̌ and Š͂, although the two are similar enough (especially when buried in a zalgo) to be confused by a human. Or are we saved due to combining diacritics not being 'exemplar characters'?
Similarly, there's an entire fleet of homograph attacks with similar looking emojis - e.g. applying slightly different Fitzpatrick skin tone modifiers to an emoji.
We seem to have forgotten little-L versus capital-I homograph attacks within the same alphabet.
By rejecting 'invalid' IDs rather than implementing a canonicalisation or 'business card' API that clients must query in order to turn @matthew:matrix.org into @matthew:matrix.org, we lose the ability to tell the client what the correct ID formatting actually is (unless we rely on the client to infer it from subsequent events, or unless we include it in the M_FAILED_HUMAN_ID_CHECK error?)
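
A small sketch of the collision check these concerns point at, assuming NFKC plus Unicode case folding as the comparison key. ids_clash is a hypothetical helper, and the choice of key is an assumption rather than anything the proposal settles. It shows the ligature case being caught while the capital-I versus small-L case slips through.

```python
import unicodedata


def comparison_key(localpart: str) -> str:
    """Fold compatibility characters (NFKC) and then case."""
    return unicodedata.normalize("NFKC", localpart).casefold()


def ids_clash(a: str, b: str) -> bool:
    """True if two localparts would collide under the comparison key."""
    return comparison_key(a) == comparison_key(b)


ligature_clash = ids_clash("o\ufb03ce", "office")   # U+FB03 is the ffi ligature
case_clash = ids_clash("Matthew", "matthew")
homograph_clash = ids_clash("SIim", "Slim")          # capital I versus small L
```

Here ligature_clash and case_clash are True but homograph_clash is False, so normalisation alone does not address same-script look-alikes.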

I'm wondering whether avoiding homograph attacks is a bit of a fool's quest, especially given simple ambiguities between I's and l's and 1's etc, and instead Kegan's suggestion of basically copying the IDN behaviour that Chrome uses is good enough for basic use of room aliases and user IDs. Meanwhile we'd rely more heavily in future on a reputation score to differentiate the real Slim Shady from SIim Shady.

Another consideration is that user IDs, room aliases and user display names / room names all have subtly different uses:

User IDs will be used only for disambiguating users with the same display name. You would typically not use them out-of-band for identifying a user when contacting them on a business card etc, as you'd use a 3PID instead. They do not necessarily need to be i18n friendly for a given language, but it might be polite. We could equally disambiguate users on a hash or punycode of the user-id though.
Room aliases however may be used out-of-band as a way of identifying a specific room (at that point in time). As such, they need to be i18n friendly to a given language.
Display names and Room names are not meant to be unique. They could also be richer (including emojis and zalgos etc), which we probably want to support in order to losslessly federate with the widest set of other systems which support more exotic display names. Ideally, however, we would have a way of spotting and flagging homograph attacks, given that users rely on display names in practice to identify users in a room; when an attack happens we need to know to activate display name disambiguation. To try to avoid these homograph attacks and recognise them when they happen, do we actually want to apply the same rules as for IDs? Or do we rely entirely on reputation scores instead (in future)? Should we at least try to express the name in a canonical form (NFKC?) and strip out whitespace and non-exemplar characters before checking for a clash? Just how deep the HS wants to dig to warn its users about potential phishing and name collisions is probably an implementation detail of the HS (albeit based on recommendations we should spec).

The reason to try harder to disambiguate user IDs & aliases is because they MUST be unique, and they may be used out-of-band, and so a homograph attack which subverts that uniqueness is slightly more malicious than one applied to display names. For display names, we can hopefully rely on social means to generally keep people honest and avoid trying to impersonate one another within a room. If "fiona" is already in the room and a "ﬁona" (with a fi unicode ligature) joins with the same avatar and starts speaking, hopefully someone in the room will realise there's a doppelganger going on and check user IDs and call foul. Meanwhile, invites would always disambiguate using both reputation and user ID to avoid phishing attacks from an ambiguous display name & avatar. (We can also disambiguate within a room by using the date the user joined the room - "Matthew (joined Tue)" v. "Matthew (joined Feb)" etc. This disambiguation would be rendered clientside, the server just putting an advisory 'ambiguous' flag on the member so the client knows to warn somehow.)

If this sounds sane, we should add a section about display names & disambiguation to the spec - presumably in the same place as vdH's rules on how to calculate display names.

A silly suggestion that came up was to render IDs in a bunch of different fonts and check for sufficient structural dissimilarity to other IDs in the system, but this is pretty daft.

richvdh · 2015-12-11T15:48:54Z

I'd come to much the same conclusion (viz: that trying to prevent homograph attacks is a fool's errand; and that attempting to do so is likely to result in a situation where people mistakenly trust our ability to do so and then get caught out by the cases when we can't). As such, it's better to provide alternative means for users to establish trust where it matters: visual hashes, reputation scores, etc.

I'd also come to the conclusion that one size does not necessarily fit all.

Display names

If "fiona" is already in the room and a "ﬁona" (with a fi unicode ligature) joins with the same avatar and starts speaking, hopefully someone in the room will realise there's a doppelganger going on and check user IDs and call foul.

hrm. If I were the second ﬁona, I'd join quietly and let my presence go unnoticed for a couple of days, and then start speaking. The chances of someone noticing this are minimal.

So any attempt to disambiguate is only going to be a courtesy, where we have two non-malicious Matthews in a room and you'd like to keep track of which is which. I'm not even sure how effective that will be, particularly if people come and go from rooms, and Matthew in room X is different from Matthew in room Y. I'm tempted to give up any technological attempt to disambiguate and rely on some social means ("can one of you change your displayname?"); though of course that risks an "I was Matthew first!"/"I'm always Matthew" scenario.

So practically speaking, as a service to the user, we should try to identify visually-similar display names and provide a means of disambiguation. What exactly that means is somewhat open to debate.

The unicode consortium have some recommendations in this area: http://www.unicode.org/reports/tr36/#Visual_Spoofing_Recommendations.

The unicode normalisation transformations (especially NFKC) are a good first pass.
Normalisation doesn't do anything to distinguish homographs from different scripts (Cyrillic 'А' vs Latin 'A'). Disallowing mixed-script names does little to solve this problem, whilst risking annoying people with arbitrary restrictions. The unicode consortium publish lists of confusable characters (http://www.unicode.org/Public/security/latest/confusables.txt) and these could be used as a basis for deciding if disambiguation is required. This list includes small L vs big i, as well as all the weird whitespace characters.
Nor do they do anything to distinguish characters of different case or accenting (Matthew vs matthew vs Matthéw). My feeling is that we don't need to disambiguate these anyway. This means that Zalgoed nicks could be hard to distinguish, but I feel like you're setting yourself up for that if you Zalgo your nick.
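
The two-step idea above (normalise, then map confusable characters) can be sketched as follows. This is only an illustration: the CONFUSABLES table holds a few hand-picked pairs rather than the real confusables.txt data, skeleton and needs_disambiguation are hypothetical names, and the Unicode consortium's own skeleton algorithm differs in detail.

```python
import unicodedata

# Hand-picked pairs for illustration only; a real implementation would load
# its mappings from the Unicode consortium's confusables.txt.
CONFUSABLES = {
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "I": "l",       # capital i looks like small L in many fonts
    "1": "l",       # digit one, likewise
}


def skeleton(name: str) -> str:
    """NFKC-normalise, then replace each confusable with its prototype."""
    normalised = unicodedata.normalize("NFKC", name)
    return "".join(CONFUSABLES.get(ch, ch) for ch in normalised)


def needs_disambiguation(a: str, b: str) -> bool:
    """True if two different display names look alike under the table above."""
    return a != b and skeleton(a) == skeleton(b)


slim = needs_disambiguation("SIim Shady", "Slim Shady")
cyrillic = needs_disambiguation("\u0410ra", "Ara")
case_only = needs_disambiguation("Matthew", "matthew")
```

Case is deliberately left alone here, in line with the feeling above that Matthew and matthew do not need disambiguating.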

IDs

I'll have to think about these harder :)

kegsay · 2015-12-15T10:21:45Z

Are there any actionable items from your discussion? I'm seeing lots of wandering through the woods but nothing concrete which I can add to the proposal.

How do we feel about the mechanics that have been outlined in the proposal? That is:

Here are a set of recommendations that good HSes should use to determine what user ID local parts are allowed.
A receiving HS should check user IDs from federation for problems based on said recommendations. If problems are found, punycode it for client consumption.
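
A minimal sketch of the second mechanic, assuming Python's built-in "punycode" codec is an acceptable stand-in for whatever escaping the proposal ends up specifying. for_client is a hypothetical helper, and how the check result is computed is left out entirely.

```python
def for_client(localpart: str, passed_checks: bool) -> str:
    """Return the localpart unchanged if it passed the checks, else punycoded."""
    if passed_checks:
        return localpart
    # The codec returns bytes; the result is pure ASCII by construction.
    return localpart.encode("punycode").decode("ascii")


trusted = for_client("b\u00fccher", passed_checks=True)
escaped = for_client("b\u00fccher", passed_checks=False)

# The escaping is reversible, so no information is lost.
round_trip = escaped.encode("ascii").decode("punycode")
```

The escaped form makes the non-ASCII character visible to the user instead of silently rendering a possible look-alike.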

The debate then revolves around what the recommendations are, and how far we go wrt homograph attacks, case mappings, etc.

richvdh · 2015-12-15T12:14:23Z

Sorry, this ended up with a bit of a stream of consciousness from me, regarding things which are only tangentially related to the PR. I suspect Matthew did the same.

Here are a set of recommendations that good HSes should use to determine what user ID local parts are allowed.
A receiving HS should check user IDs from federation for problems based on said recommendations. If problems are found, punycode it for client consumption.

I think these are excellent principles. As you say, the debate is around exactly what the restrictions/identity mappings are.

richvdh · 2015-12-15T12:53:35Z

drafts/human-id-rules.rst

+ - MUST NOT contain a ``:`` or start with a ``@`` or ``.``
+ - MUST NOT contain one of the 107 blacklisted characters on this list: 
+     http://kb.mozillazine.org/Network.IDN.blacklist_chars
+ - After stripping " 0-9, +, -, [, ], _, and the space character it MUST NOT


I think this is the wrong test - or at least misworded. 'A' is in the exemplar characters for both English and French, for example...

I think what you're after is that it only contains characters from one script, after ignoring Common and Inherited script characters, as per http://unicode.org/reports/tr39/#Mixed_Script_Detection. That may be overly restrictive, but it's probably easier to have something restrictive and relax it later, than vice versa.

(ditto for room aliases)
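
A rough sketch of the single-script test, with a loud caveat: Python's standard library does not expose the Unicode Script property, so this approximates a character's script by the first word of its Unicode name, and treats every non-letter as Common or Inherited. Both shortcuts are assumptions made for illustration; a real implementation would use the Script data, for example through ICU. scripts_used and is_single_script are hypothetical names.

```python
import unicodedata


def scripts_used(text: str) -> set:
    """Approximate the set of scripts used by the letters in text."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # digits, punctuation, combining marks: ignored here
        name = unicodedata.name(ch, "")
        if name:
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(name.split()[0])
    return scripts


def is_single_script(text: str) -> bool:
    """True if the letters in text come from at most one script."""
    return len(scripts_used(text)) <= 1


latin_only = is_single_script("matthew_01")
mixed = is_single_script("m\u0430tthew")   # second letter is Cyrillic
found = scripts_used("m\u0430tthew")
```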

richvdh · 2017-09-26T13:55:54Z

After 3 years, it's sadly pretty clear that we're not going to progress this as it currently stands. I think the drafts folder is a better home for this work than a PR, so I'm going to land this.

ara4n · 2017-09-26T14:12:36Z

hang on... rich: i thought this was obsoleted generally by the mxid formatting stuff you landed?

richvdh · 2017-09-27T07:00:46Z

I'd love to know how github decides whether or not it's going to email me when someone comments on a PR I'm subscribed to.

Anyway:

hang on... rich: i thought this was obsoleted generally by the mxid formatting stuff you landed?

There's more here than mxids: It also addresses room aliases. Though yes, on looking at it, much of it does seem to be obsolete now. Maybe we should just get rid of this file, then.


kegsay and others added 4 commits December 22, 2014 10:46

Proposal for human ID rules.

4f3ee12

Includes handling of namespaces for bots, handling of capitalisation, spoof checks and escape sequences.

Update human-id-rules.rst

408a051

Clarify position on capitalisation.

Update human-id-rules.rst

f2422ea

Moar clarify.

Update human-id-rules.rst

37a7f21

Mention case canonicalisation on registration.

kegsay mentioned this pull request Jan 15, 2015

Initial formal proposal for the AS API #5

Merged

Merge branch 'master' into human-id-rules

0131543

kegsay added the stalled label Sep 25, 2015

kegsay and others added 4 commits October 13, 2015 15:09

Updated to reflect more recent progress

3d5ec5e

Make it valid RST

0ab2d66

Merge branch 'master' into human-id-rules

c9f6534

Merge branch 'human-id-rules' of github.com:matrix-org/matrix-doc int…

87f656e

…o human-id-rules

kegsay added pending-feedback and removed stalled labels Oct 13, 2015

Linkify

ee3fe98

kegsay assigned illicitonion Oct 19, 2015

illicitonion reviewed Oct 19, 2015

illicitonion assigned kegsay and unassigned illicitonion Oct 26, 2015

richvdh reviewed Dec 15, 2015

matrixbot mentioned this pull request Oct 28, 2016

Format for rich text messages (SPEC-48) #471

Closed

kegsay removed their assignment Apr 6, 2017

Merge branch 'master' into human-id-rules

aebfcda

richvdh merged commit a7c28fd into master Sep 26, 2017

ara4n mentioned this pull request Jan 11, 2018

Display names are vulnerable to homoglyph attacks element-hq/element-web#5826

Closed

This was referenced Mar 1, 2022

Ability for server admins to acquire privileges in arbitrary rooms to resolve power struggles (SPEC-159) matrix-org/matrix-spec#67

Open

We need a better way of disambiguating users with the same display name than mxids. (SPEC-221) matrix-org/matrix-spec#97

Open

This was referenced Jan 11, 2018

Grammar and disambiguation of display names (SPEC-392) matrix-org/matrix-spec#177

Open

Backwardly-extensible history in bridged portal rooms (SPEC-440) matrix-org/matrix-spec#193

Open

kegsay mentioned this pull request Feb 28, 2023

MSC3575: Sliding Sync (aka Sync v3) #3575

Open

sumnerevans pushed a commit to beeper/matrix-spec-proposals that referenced this pull request Apr 4, 2023

Merge pull request matrix-org#3 from beeper/async-uploads-rate-limiti…

fedc697

…ng-improvements Async uploads rate limiting improvements
