Proposal for human ID rules. #3

Merged
merged 11 commits into from
Sep 26, 2017

kegsay
Member

@kegsay kegsay commented Dec 22, 2014

Includes handling of capitalisation, spoof checks and escape sequences.

Rendered: https://github.com/matrix-org/matrix-doc/blob/human-id-rules/drafts/human-id-rules.rst

kegsay and others added 4 commits December 22, 2014 10:46
Includes handling of namespaces for bots, handling of capitalisation, spoof
checks and escape sequences.
Clarify position on capitalisation.
Mention case canonicalisation on registration.
@ara4n
Member

ara4n commented Dec 23, 2014

Thoughts:

  • What about whitespace? (I suggest we veto whitespace entirely in aliases & user IDs, as it's too ambiguous and hard to parse)
  • Should we be preemptively banning other sigils than @ and .?
  • Shouldn't room aliases also have sigil restrictions (e.g. not starting with a ! or #?)
  • Why should user IDs be case-insensitive but not room aliases? What does case-insensitive actually mean in this context? Case sensitivity is locale specific - whose locale are we comparing in?
  • "Checks SHOULD be applied to room aliases" -- how and when are the checks actually applied? If it's just that they should be enforced on creation, it should be clearer. Why isn't this a MUST?
  • Should we be looking more at stringprep-style normalisation of the unicode (e.g. http://xmpp.org/extensions/xep-0328.html), to ensure we do accurate comparisons between strings? If so, where does this happen - at creation, or doing a canonicalisation lookup when discovering a contact, or whenever an event is sent?
  • I think we should be specifying how to handle ambiguous strings (e.g. @matthew:matrix.org v. @matthew:matrix.org) in the same document, alongside punycoding malicious-looking IDs. Whilst this is a slightly different problem, it's close enough that it'd be good to cover them all off. It sounds like the plan for doing this is a lookup API on the target HS that can be queried to find out the canonical name for an ID?
  • Why can't we rename room aliases if they fail the check (and change the client-server API on creating them to accept alternative proposed aliases if the one submitted is inappropriate)?
  • I'm not sure we should encourage the @.irc.freenode.matrix.:domain style namespacing - it'd be better if 3rd party users were bridged into their real matrix equivalents (e.g. Arathorn being bridged into @matthew:matrix.org rather than creating the virtual @.irc.freenode.arathorn:matrix.org user). I guess namespacing is worthwhile if we need it, though. Why dots? And wouldn't @m.irc etc be more matrixy?
  • In the final "Home servers MAY canonicalise the user ID to be completely lower-case if desired." - I'm not sure it's worth saying this, or if it is, we should spell out that this canonicalisation would be happening at sign-up. We should spell out whether HSes do case-insensitive comparisons on user IDs when processing them in events (it sounds like they don't), and give some examples of how somebody trying to talk to @matthew:matrix.org will end up talking to @matthew:matrix.org instead (i.e. by doing a canonicalisation lookup on the matrix.org HS, if that's what we're doing).

@NegativeMjark
Contributor

What about whitespace? (I suggest we veto whitespace entirely in aliases & user IDs, as it's too ambiguous and hard to parse)

I agree.

Should we be preemptively banning other sigils than @ and .?
Shouldn't room aliases also have sigil restrictions (e.g. not starting with a ! or #?)

This can be left to individual HS policy. If there are particular sigils that are likely to be confusing then we should recommend that homeservers ban them. We could recommend that homeservers ban "@!#$:" and any other characters that have special meaning for ids.

Why should user IDs be case-insensitive but not room aliases? What does case-insensitive actually mean in this context? Case sensitivity is locale specific - whose locale are we comparing in?
"Checks SHOULD be applied to room aliases" -- how and when are the checks actually applied? If it's just that they should be enforced on creation, it should be clearer. Why isn't this a MUST?

We want to stop people creating IDs or aliases that differ only by case. We apply the checks when the room or ID is created.

We have used SHOULD because the protocol as currently written will continue to function if the homeserver disobeys the recommendation. It is the homeserver's responsibility to manage the IDs it owns and we are making recommendations about how it does so.
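The creation-time check described above could look roughly like the following. This is only a sketch: the `LocalpartRegistry` class is hypothetical, and using Python's locale-independent `str.casefold()` as the definition of "differs only by case" is an assumption, not something the proposal settles.

```python
class LocalpartRegistry:
    """Toy in-memory registry (hypothetical); a real homeserver would use its database."""

    def __init__(self):
        # Maps the case-folded localpart to the localpart as first registered.
        self._by_key = {}

    def register(self, localpart: str) -> bool:
        """Return True if registered, False if it clashes with an existing ID by case."""
        key = localpart.casefold()  # assumption: Unicode case folding defines a clash
        if key in self._by_key:
            return False
        self._by_key[key] = localpart
        return True
```

Because the check runs only at creation, a homeserver that skips it still interoperates, which matches the reasoning given for SHOULD rather than MUST.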

Should we be looking more at stringprep-style normalisation of the unicode (e.g. http://xmpp.org/extensions/xep-0328.html), to ensure we do accurate comparisons between strings? If so, where does this happen - at creation, or doing a canonicalisation lookup when discovering a contact, or whenever an event is sent?

We should definitely apply unicode canonicalisation. (I think DNS uses NFC?)
And it would be a sensible way to get rid of white space.
I'm not sure we would want to do any case-folding mappings though.
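To make the NFC suggestion concrete, here is a small sketch using Python's standard `unicodedata` module. The helper name `nfc` is just for illustration. Note that NFC on its own leaves ordinary whitespace and letter case untouched, so whitespace would still need a separate rule.

```python
import unicodedata

def nfc(s: str) -> str:
    """Return the NFC (canonical composition) form of s."""
    return unicodedata.normalize("NFC", s)

composed = "caf\u00e9"     # 'é' as a single code point
decomposed = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

# The raw strings differ, but compare equal once both are in NFC.
raw_equal = composed == decomposed
nfc_equal = nfc(composed) == nfc(decomposed)

# NFC does not strip spaces or change case.
keeps_space = " " in nfc("a b")
keeps_case = nfc("Matthew") == "Matthew"
```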

I think we should be specifying how to handle ambiguous strings (e.g. @matthew:matrix.org v. @matthew:matrix.org) in the same document, alongside punycoding malicious-looking IDs. Whilst this is a slightly different problem, it's close enough that it'd be good to cover them all off. It sounds like the plan for doing this is a lookup API on the target HS that can be queried to find out the canonical name for an ID?

I think the options that work from a technical perspective are:

  • Canonicalise on the client using client locale. I think RFC 5891 suggests doing this for internationalized domain names. I guess using the client locale makes it more likely to catch capitalisation mistakes that that client would make.
  • Canonicalise on the client/hs using something like nameprep. I think RFC 3490 suggests doing this for internationalized domain names. I'd be interested to know why they stopped recommending this.
  • Use "case insensitive" comparisons for ids. I think RFC 4343 recommends this for the ASCII characters in domain names. This only works if you are from the US.
  • Use a lookup API on the HS that owns the user id or just make the invite API or room alias API do the canonicalisation. Gives us a lot of flexibility.
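As an illustration of the third option, an ASCII-only case-insensitive comparison might be sketched as below. The function names are made up for this example. It shows why the approach "only works if you are from the US": letters outside A-Z are left alone, so accented capitals never match their lower-case forms.

```python
def ascii_fold(s: str) -> str:
    """Lower-case only the ASCII letters A-Z; every other character is unchanged."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

def ascii_case_equal(a: str, b: str) -> bool:
    """Compare two IDs, ignoring case for ASCII letters only."""
    return ascii_fold(a) == ascii_fold(b)

same_ascii = ascii_case_equal("Matthew", "matthew")          # ASCII letters fold
same_accented = ascii_case_equal("\u00c9ric", "\u00e9ric")  # 'É' vs 'é' do not
```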

Why can't we rename room aliases if they fail the check (and change the client-server API on creating them to accept alternative proposed aliases if the one submitted is inappropriate)?

I'm not sure how we would determine a suitable alternative for the room aliases.

I'm not sure we should encourage the @.irc.freenode.matrix.:domain style namespacing - it'd be better if 3rd party users were bridged into their real matrix equivalents (e.g. Arathorn being bridged into @matthew:matrix.org rather than creating the virtual @.irc.freenode.arathorn:matrix.org user). I guess namespacing is worthwhile if we need it, though. Why dots? And wouldn't @m.irc etc be more matrixy?

I'm not sure about bridging Arathorn into @.matthew:matrix.org. If we are going to have decent crypto and security it seems wrong to encourage insecure protocols to masquerade as you. (Then again if you trust freenode, connect over SSL and have registered your nick it might not be too bad).

I think we need to be able to bridge IRC users that don't have existing matrix accounts. I'd like it if we had a way to namespace those accounts. I guess dots match how we namespace event types. I personally don't mind using "-" like we currently do in the IRC bridge.

@kegsay
Member Author

kegsay commented Dec 29, 2014

it'd be better if 3rd party users were bridged into their real matrix equivalents

In that case, we'd really need to use OAuth2 to provide limited access for a given third party to access the matrix account. Without that, it's obviously a huge security risk if any bridge can act freely on your behalf without your permission. I think we should be providing OAuth2 anyway, given "acting on behalf of a client" is a recurring problem. The limited access token would, say, only be able to send m.room.message in rooms you're in, and would not be able to invite / join / leave rooms.
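A limited token of that kind could be modelled along these lines. Everything here is hypothetical: `ScopedToken`, its fields and the two check functions are invented for illustration and are not part of any Matrix or OAuth2 API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScopedToken:
    """Hypothetical limited-access token granted to a bridge."""
    user_id: str
    allowed_event_types: frozenset
    allowed_membership_actions: frozenset = frozenset()

def may_send(token: ScopedToken, event_type: str) -> bool:
    """True if the token's scope covers sending this event type."""
    return event_type in token.allowed_event_types

def may_change_membership(token: ScopedToken, action: str) -> bool:
    """True if the token's scope covers invite / join / leave style actions."""
    return action in token.allowed_membership_actions

# A bridge token that can only send messages on the user's behalf.
bridge_token = ScopedToken("@matthew:matrix.org", frozenset({"m.room.message"}))
```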

@ara4n
Member

ara4n commented Dec 29, 2014

I'm pretty sure we want to be able to link random services (and apps) to our main account, otherwise we're going to end up with a billion fragmented accounts, which could be a terrible experience (unless we fix portability of accounts, or make multi-account clients really painless). I'm not sure that OAuth2 is good enough for this. See https://matrix.org/jira/browse/SPEC-79 for more discussion on this problem.

@kegsay
Member Author

kegsay commented Dec 30, 2014

I'm not sure that OAuth2 is good enough for this.

In what way is OAuth2 deficient? You are free to choose an arbitrary set of "scope" permissions which can be granted via OAuth2, e.g. this is Google's: https://www.googleapis.com/discovery/v1/apis/oauth2/v2/rest?fields=auth(oauth2(scopes))

A good writeup on OAuth scopes: https://brandur.org/oauth-scope

I would prefer we didn't re-invent the wheel on this...

@kegsay kegsay added the stalled label Sep 25, 2015
@ara4n
Member

ara4n commented Oct 13, 2015

Am I hallucinating that we made some progress in finally locking this down? I'd much rather that we got this one landed and implemented in Synapse than adding more bells & whistles like #93...

@kegsay
Member Author

kegsay commented Oct 13, 2015

I don't think we've made any more progress on this. I'll try to push forward a revised edition which incorporates the past 10 months ❗ of work. This is a tricky one to do correctly because it can get tangled up with "should messages to @ foo:bar go to @ Foo:bar" - I will attempt to avoid this conflict as much as possible to increase the odds of us getting consensus. This does mean that the revised edition will not cover all bases, but that means it shouldn't be blocked on things like the "business card lookup API" or "how do we resolve case-mappings", etc.

@kegsay
Member Author

kegsay commented Oct 13, 2015

Done. With respect to the exemplar characters on the CLDR datasets, there are Python bindings for the library International Components for Unicode (ICU) which exposes the CLDR datasets.

https://pypi.python.org/pypi/PyICU/

ICU itself exposes functions to do mappings to and from punycode: http://icu-project.org/apiref/icu4c/uidna_8h.html

There's also a handy spoofing library: http://icu-project.org/apiref/icu4c/uspoof_8h.html which has the handy checks USPOOF_MIXED_SCRIPT_CONFUSABLE and USPOOF_SINGLE_SCRIPT, which look to be what we want.

These checks are:

User ID Localparts:
- MUST NOT contain a ``:`` or start with a ``@`` or ``.``
Contributor

Why can't they start with @s or .s?

Contributor

Oh, the @ is described below (an arbitrary choice I guess) - I still don't know about the .

Member

The rationale for forbidding a . prefix was because at one point we were
going to namespace gateway'd user IDs as
@.irc.whatever.nickname:foo.com. This has been lost as we namespace
bridges any old way nowadays. Personally I still think it'd be useful
to reserve some prefixes as secondary sigils (twigils) if needed.

On 19/10/2015 16:14, Daniel Wagner-Hall wrote:

In drafts/human-id-rules.rst
#3 (comment):

-- Error message MAY go into further information about which characters were
-  rejected and why.
-- Error message SHOULD contain a failed_keys key which contains an array
-  of strings which represent the keys which failed the check e.g::
-
-    failed_keys: [ user_id, room_alias ]
-
-Other considerations
---------------------
-- Basic security: Informational key on the event attached by HS to say "unsafe
+User IDs and Room Aliases MUST be Unicode as UTF-8. Checks are performed on
+these IDs by homeservers to protect users from phishing/spoofing attacks.
+These checks are:
+
+User ID Localparts:
+ - MUST NOT contain a : or start with a @ or .

Oh, the @ is described below (an arbitrary choice I guess) - I still
don't know about the .


Member Author

I 100% agree with Ara on this: some future-proofing by reserving prefixes we may want in the future is a Good Thing imo.
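For reference, the localpart rule under review in this thread (no `:` anywhere, no leading `@` or `.`) can be sketched in a few lines. The function name is invented for this example, and the sketch covers only that one rule, not the blacklist or script checks in the rest of the draft.

```python
def localpart_passes_prefix_rules(localpart: str) -> bool:
    """Check only the rule: MUST NOT contain ':' or start with '@' or '.'."""
    if ":" in localpart:
        # ':' separates the localpart from the domain in a full user ID.
        return False
    if localpart.startswith(("@", ".")):
        # '@' is the user ID sigil; '.' is reserved as a possible secondary sigil.
        return False
    return True
```

A `.` elsewhere in the localpart (for example `irc.freenode`) still passes; only the leading position is reserved.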

@illicitonion illicitonion assigned kegsay and unassigned illicitonion Oct 26, 2015
@ara4n
Member

ara4n commented Dec 10, 2015

After a discussion on #matrix:matrix.org about SPEC-1 i've gone and read through the latest proposal here. In general it feels good - but I have some concerns:

  • It feels like the SHOULD requirements disallowing case-insensitive clashes between IDs should be MUST? Having HSes which allow colliding IDs is surely a bug.
  • How do we handle canonical comparison that includes unicode ligatures etc? e.g. the difference between "LATIN SMALL LIGATURE FFI" (ﬃ) and ffi. Leo says this might be solvable by NFKC.
  • I'm not sure how much I trust the 2009 vintage "107 blacklisted IDN character" list from the Mozilla wiki. Unicode has moved on a lot in the last 7 years.
  • How do we handle combining diacritics (i.e. zalgos)? This definition of a homograph attack doesn't seem to cover the difference between S͂̌ and Š͂, although the two are similar enough (especially when buried in a zalgo) to be confused by a human. Or are we saved due to combining diacritics not being 'exemplar characters'?
  • Similarly, there's an entire fleet of homograph attacks with similar looking emojis - e.g. applying slightly different Fitzpatrick skin tone modifiers to an emoji.
  • We seem to have forgotten little-L versus capital-I homograph attacks within the same alphabet.
  • By rejecting 'invalid' IDs rather than implementing a canonicalisation or 'business card' API that clients must query in order to turn @matthew:matrix.org into @matthew:matrix.org, we lose the ability to tell the client what the correct ID formatting actually is (unless we rely on the client to infer it from subsequent events, or unless we include it in the M_FAILED_HUMAN_ID_CHECK error?)
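On the ligature point, a quick check with Python's standard `unicodedata` module shows the difference between the normalisation forms: NFKC expands the ligature to plain letters, while NFC leaves it alone. This is only a demonstration of the Unicode behaviour, not a statement of what the proposal requires.

```python
import unicodedata

ligature = "\ufb03"  # LATIN SMALL LIGATURE FFI

nfkc_form = unicodedata.normalize("NFKC", ligature)  # compatibility mapping applied
nfc_form = unicodedata.normalize("NFC", ligature)    # canonical only, ligature kept

# After NFKC the ligature compares equal to the three plain letters.
matches_plain = nfkc_form == "ffi"
```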

I'm wondering whether avoiding homograph attacks is a bit of a fool's quest, especially given simple ambiguities between I's and l's and 1's etc, and instead Kegan's suggestion of basically copying the IDN behaviour that Chrome uses is good enough for basic use of room aliases and user IDs. Meanwhile we'd rely more heavily in future on a reputation score to differentiate the real Slim Shady from SIim Shady.

Another consideration is that user IDs, room aliases and user display names / room names all have subtly different uses:

  • User IDs will be used only for disambiguating users with the same display name. You would typically not use them out-of-band for identifying a user when contacting them on a business card etc, as you'd use a 3PID instead. They do not necessarily need to be i18n friendly for a given language, but it might be polite. We could equally disambiguate users on a hash or punycode of the user-id though.
  • Room aliases however may be used out-of-band as a way of identifying a specific room (at that point in time). As such, they need to be i18n friendly to a given language.
  • Display names and Room names are not meant to be unique. They could also be richer (including emojis and zalgos etc), which we probably want to support in order to losslessly federate with the widest set of other systems which support more exotic display names. Ideally we would have a way of spotting and flagging homograph attacks however, given users use them in practice to identify users in a room, and when this happens we need to know when to activate display name disambiguation. To try to avoid these homograph attacks and recognise them when they happen, do we actually want to apply the same rules as for IDs? Or do we rely entirely on reputation scores instead (in future)? We should at least try to express the name in a canonical form (NFKC?) and strip out whitespace and non-exemplar characters before checking for a clash, although just how deep the HS wants to dig to warn its users about potential phishing and name collisions is probably an implementation detail of the HS (albeit based on recommendations we should spec).
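The "canonical form, then strip whitespace, then check for a clash" idea for display names might be sketched like this. The `clash_key` helper is hypothetical, it assumes NFKC as the canonical form, and it leaves out the exemplar-character filtering because that needs CLDR data.

```python
import unicodedata

def clash_key(display_name: str) -> str:
    """Key used only for spotting look-alike display names, never for display."""
    normalised = unicodedata.normalize("NFKC", display_name)
    # Drop anything Python classifies as whitespace (spaces, tabs, no-break spaces, ...).
    return "".join(ch for ch in normalised if not ch.isspace())

def names_clash(a: str, b: str) -> bool:
    """True if two different display names collapse to the same key."""
    return a != b and clash_key(a) == clash_key(b)
```

Case is deliberately left alone here, so "Fiona" and "fiona" do not clash under this sketch; whether they should is one of the open questions in the thread.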

The reason to try harder to disambiguate user IDs & aliases is because they MUST be unique, and they may be used out-of-band, and so a homograph attack which subverts that uniqueness is slightly more malicious than one applied to display names. For display names, we can hopefully rely on social means to generally keep people honest and avoid trying to impersonate one another within a room. If "fiona" is already in the room and a "ﬁona" (with a ﬁ unicode ligature) joins with the same avatar and starts speaking, hopefully someone in the room will realise there's a doppelganger going on and check user IDs and call foul. Meanwhile, invites would always disambiguate using both reputation and user ID to avoid phishing attacks from an ambiguous display name & avatar. (We can also disambiguate within a room by using the date the user joined the room - "Matthew (joined Tue)" v. "Matthew (joined Feb)" etc. This disambiguation would be rendered clientside, the server just putting an advisory 'ambiguous' flag on the member so the client knows to warn somehow.)

If this sounds sane, we should add a section about display names & disambiguation to the spec - presumably in the same place as vdH's rules on how to calculate display names.

A silly suggestion that came up was to render IDs in a bunch of different fonts and check for sufficient structural dissimilarity to other IDs in the system, but this is pretty daft.

@richvdh
Member

richvdh commented Dec 11, 2015

I'd come to much the same conclusion (viz: that trying to prevent homograph attacks is a fool's errand; and that attempting to do so is likely to result in a situation where people mistakenly trust our ability to do so and then get caught out by the cases when we can't). As such, it's better to provide alternative means for users to establish trust where it matters: visual hashes, reputation scores, etc.

I'd also come to the conclusion that one size does not necessarily fit all.

Display names

If "fiona" is already in the room and a "ﬁona" (with a ﬁ unicode ligature) joins with the same avatar and starts speaking, hopefully someone in the room will realise there's a doppelganger going on and check user IDs and call foul.

hrm. If I were the second fiona, I'd join quietly and let my presence go unnoticed for a couple of days, and then start speaking. The chances of someone noticing this are minimal.

So any attempt to disambiguate is only going to be a courtesy, where we have two non-malicious Matthews in a room and you'd like to keep track of which is which. I'm not even sure how effective that will be, particularly if people come and go from rooms, and Matthew in room X is different from Matthew in room Y. I'm tempted to give up any technological attempt to disambiguate and rely on some social means ("can one of you change your displayname?"); though of course that risks a "I was Matthew first!"/"I'm always Matthew" scenario.

So practically speaking, as a service to the user, we should try to identify visually-similar display names and provide a means of disambiguation. What exactly that means is somewhat open to debate.

The unicode consortium have some recommendations in this area: http://www.unicode.org/reports/tr36/#Visual_Spoofing_Recommendations.

  • The unicode normalisation transformations (especially NFKC) are a good first pass.
  • Normalisation doesn't do anything to distinguish homographs from different scripts (Cyrillic 'А' vs Latin 'A'). Disallowing mixed-script names does little to solve this problem, whilst risking annoying people with arbitrary restrictions. The unicode consortium publish lists of confusable characters (http://www.unicode.org/Public/security/latest/confusables.txt) and these could be used as a basis for deciding if disambiguation is required. This list includes small L vs big i, as well as all the weird whitespace characters.
  • Nor do they do anything to distinguish characters of different case or accenting (Matthew vs matthew vs Matthéw). My feeling is that we don't need to disambiguate these anyway. This means that Zalgoed nicks could be hard to distinguish, but I feel like you're setting yourself up for that if you Zalgo your nick.
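A confusables-based check along the lines of these bullets might be sketched as follows. The mapping is a tiny hand-picked subset written for this example, not the real confusables.txt data, and the `skeleton` function is a simplification of the skeleton algorithm that UTS #39 defines.

```python
import unicodedata

# Illustrative subset only; the published confusables.txt is far larger.
CONFUSABLE_SUBSET = {
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A looks like Latin 'A'
    "\u0430": "a",  # CYRILLIC SMALL LETTER A looks like Latin 'a'
    "I": "l",       # capital i looks like small L
    "1": "l",       # digit one looks like small L
}

def skeleton(name: str) -> str:
    """Normalise with NFKC, then map each character to a look-alike prototype."""
    normalised = unicodedata.normalize("NFKC", name)
    return "".join(CONFUSABLE_SUBSET.get(ch, ch) for ch in normalised)

def needs_disambiguation(a: str, b: str) -> bool:
    """True if two distinct display names look alike under the subset above."""
    return a != b and skeleton(a) == skeleton(b)
```

Case and accents are not collapsed, in line with the view above that "Matthew", "matthew" and "Matthéw" need no disambiguation.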

IDs

  • I'll have to think about these harder :)

@kegsay
Member Author

kegsay commented Dec 15, 2015

Are there any actionable items from your discussion? I'm seeing lots of wandering through the woods but nothing concrete which I can add to the proposal.

How do we feel about the mechanics that have been outlined in the proposal? That is:

  • Here are a set of recommendations that good HSes should use to determine what user ID local parts are allowed.
  • A receiving HS should check user IDs from federation for problems based on said recommendations. If problems are found, punycode it for client consumption.
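The "punycode it for client consumption" step can be illustrated with the `punycode` codec that ships with Python. The helper names are invented for this sketch, and how the encoded form would be marked up or prefixed for clients is not decided in this thread, so it is left out.

```python
def punycode_localpart(localpart: str) -> str:
    """Encode a localpart as ASCII-only Punycode (RFC 3492)."""
    return localpart.encode("punycode").decode("ascii")

def unpunycode_localpart(encoded: str) -> str:
    """Reverse of punycode_localpart."""
    return encoded.encode("ascii").decode("punycode")

encoded = punycode_localpart("b\u00fccher")  # → "bcher-kva"
round_trip = unpunycode_localpart(encoded)   # back to the original string
```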

The debate then revolves around what the recommendations are, and how far we go wrt homograph attacks, case mappings, etc.

@richvdh
Member

richvdh commented Dec 15, 2015

Sorry, this ended up with a bit of a stream of consciousness from me, regarding things which are only tangentially related to the PR. I suspect Matthew did the same.

Here are a set of recommendations that good HSes should use to determine what user ID local parts are allowed.
A receiving HS should check user IDs from federation for problems based on said recommendations. If problems are found, punycode it for client consumption.

I think these are excellent principles. As you say, the debate is around exactly what the restrictions/identity mappings are.

- MUST NOT contain a ``:`` or start with a ``@`` or ``.``
- MUST NOT contain one of the 107 blacklisted characters on this list:
http://kb.mozillazine.org/Network.IDN.blacklist_chars
- After stripping " 0-9, +, -, [, ], _, and the space character it MUST NOT
Member

I think this is the wrong test - or at least misworded. 'A' is in the exemplar characters for both English and French, for example...

I think what you're after is that it only contains characters from one script, after ignoring Common and Inherited script characters, as per http://unicode.org/reports/tr39/#Mixed_Script_Detection. That may be overly restrictive, but it's probably easier to have something restrictive and relax it later, than vice versa.

(ditto for room aliases)
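A rough approximation of that single-script test is sketched below. Python's standard library does not expose the Unicode Script property, so this sketch guesses the script from the first word of each letter's character name (for example "LATIN" or "CYRILLIC") and treats non-letters as Common or Inherited. That heuristic is an assumption made for this example and will misclassify some characters; a real implementation would more likely use ICU's spoof checker or the Scripts.txt data.

```python
import unicodedata

def scripts_used(text: str) -> set:
    """Approximate the set of scripts in text from Unicode character names."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            # Digits, punctuation and combining marks: treated as Common/Inherited.
            continue
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split(" ")[0])
    return scripts

def is_single_script(text: str) -> bool:
    """True if the letters in text appear to come from at most one script."""
    return len(scripts_used(text)) <= 1
```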

@kegsay kegsay removed their assignment Apr 6, 2017
@richvdh
Member

richvdh commented Sep 26, 2017

After 3 years, it's sadly pretty clear that we're not going to progress this as it currently stands. I think the drafts folder is a better home for this work than a PR, so I'm going to land this.

@richvdh richvdh merged commit a7c28fd into master Sep 26, 2017
@ara4n
Member

ara4n commented Sep 26, 2017

hang on... rich: i thought this was obsoleted generally by the mxid formatting stuff you landed?

@richvdh
Member

richvdh commented Sep 27, 2017

I'd love to know how github decides whether or not it's going to email me when someone comments on a PR I'm subscribed to.

Anyway:

hang on... rich: i thought this was obsoleted generally by the mxid formatting stuff you landed?

There's more here than mxids: It also addresses room aliases. Though yes, on looking at it, much of it does seem to be obsolete now. Maybe we should just get rid of this file, then.
