Alternative handling of illegal IDNs (such as domains with emojis) #18
This domain is invalid (see for example http://unicode.org/cldr/utility/idna.jsp?a=%5CU0001f414%5CU0001f414.tk), why do you need to convert it? You can call idna.uts46_remap directly if you want. |
For me, it shows OK in uts46 (but not in IDNA2008 or IDNA2003). I'm On 23/09/2015 08:41, Jon Ribbens wrote:
|
This is an IDNA library, and the domain is invalid under IDNA ;-)
decoded = idna.uts46_remap(".".join(
    label[4:].decode("punycode") if label.startswith(b"xn--") else
    label.decode("ascii") for label in domain.split(b".")))
Otherwise, if you want to provide a patch to add support for ignoring errors, that seems not entirely unreasonable to me, albeit it's not my decision whether it goes into the project! |
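For readers who want to try the label-by-label approach from the snippet above, here is a minimal Python 3 sketch that uses only the standard library's punycode codec. The function name `raw_decode` is hypothetical and not part of the idna library, and this sketch leaves out the `idna.uts46_remap` step, so it performs no mapping and no validity checks at all.

```python
def raw_decode(domain: bytes) -> str:
    """Decode each label of a domain without any IDNA validity checks.

    Hypothetical helper for illustration only; not part of the idna library.
    """
    labels = []
    for label in domain.split(b"."):
        if label.lower().startswith(b"xn--"):
            # Strip the "xn--" ACE prefix and decode the rest as Punycode.
            labels.append(label[4:].decode("punycode"))
        else:
            # Plain ASCII labels pass through unchanged.
            labels.append(label.decode("ascii"))
    return ".".join(labels)


# ascii() keeps the output printable on terminals without emoji support.
print(ascii(raw_decode(b"xn--co8ha.tk")))
```

Because nothing is validated, the result should be treated as display text for a possibly non-compliant name, not as proof that the domain is a legal IDN.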
I'm OK with having a more lenient conversion function when you pass an appropriate optional argument to the decode function, but the default should be standards compliance. At the end of the day this is an IDNA 2008 compliant library, and domains with emoji in them are illegal in IDNA 2008. These deprecated domains will ultimately stop working as domain registries and software implementers upgrade. |
Ok. So if I add a uts46=True option to decode (with the default as Thanks Philip On 24/09/2015 11:47, Kim Davies wrote:
|
|
Duh. You are exactly right. On 24/09/2015 14:28, Jon Ribbens wrote:
|
I have the same issue with another domain: If you correctly check this and the aforementioned domain, both are valid: The problem comes from the hardcoded code points (in this case codepoint_classes['PVALID']) in idna/idnadata.py, which are most likely not up to date. You can get a current table from here: http://www.unicode.org/Public/idna/latest/ The solution might be either to update idna/idnadata.py every time a new Unicode version comes out, or to hope Python's unicodedata library is always up to date and derive the code points with the help of the rules in RFC 5892. I'm not sure which is the preferred way, but I'm willing to take a stab at it, one way or another. |
@AlexNigl I am not clear on specifically what you are reporting. As to the version of Unicode, the IETF have temporarily fixed IDNA to Unicode 6.3.0 due to unintended issues with later versions (see issue #8), but that has no bearing on this specific issue. Unicode 9.0 would produce the exact same result based on RFC 5892 Section 2.1:
We can see the general category for, say, the CHICKEN (U+1F414) is "So" which is not on the permitted list:
|
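The general-category check mentioned above can be reproduced with Python's standard `unicodedata` module. This is only a sketch of the LetterDigits rule from RFC 5892 Section 2.1, not the full derivation of PVALID, and `in_letter_digits` is a hypothetical name. Note that `unicodedata` reflects whatever Unicode version the running Python was built with, which is an assumption that differs from the Unicode 6.3.0 pinning discussed in the thread.

```python
import unicodedata

# General categories listed by the LetterDigits rule (RFC 5892, Section 2.1).
LETTER_DIGITS = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def in_letter_digits(ch: str) -> bool:
    """Return True if the code point's general category is in LetterDigits."""
    return unicodedata.category(ch) in LETTER_DIGITS


for ch in ("a", "\u00df", "\U0001F414"):
    # Print the code point, its general category, and the rule's verdict.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), in_letter_digits(ch))
```

The CHICKEN code point reports category "So" (Symbol, other), which is why this rule does not admit it.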
@kjd I seem to have misread the use of PVALID (in your code and RFC 5892) regarding the "valid" code points in the IDNA Mapping Table from UTS 46. So please ignore my comments about the code points not being up to date. However, it seems that the uts46 flag doesn't trigger the use of the IDNA Mapping Table (according to UTS 46) in "check_label" and instead uses the PVALID table according to RFC 5892.
The reason is that despite "\U0001F410" being invalid in IDNA2008 it is valid according to UTS 46.
|
The IDNA library threw an exception while handling emoji domains from https://xn--qeiaa.ws/ (GoDaddy). Is this a related issue? Thanks!
>>> import idna
>>> print(idna.decode('xn--qeiaa.ws'))
|
I am trying to think of the best generic solution to this and a similar issue found in issue #27 and issue #32. What they all have in common is that they are not legal IDNs, but they are found in the wild due to other non-standards-compliant software. It is a common pattern to simply treat all potential hostnames, IDNA or not, as input to this library, so there is an argument for providing some mechanism for doing conversions around them. Current use cases:
Both could be some twist on using an "errors" argument like Python's native encode/decode functions. Currently the library is analogous to "strict" behavior, but these alternatives would not be analogous to "replace" and "ignore" behaviors. I'm wondering if adding an errors argument that has a number of potential, combinable values would make sense here:
Not sure if there could be others. (I was thinking you could limit In practice it would look something like this:
The two exception categories could be combined something along the lines of The biggest concern is that Does anyone have any thoughts or ideas on this approach or alternatives? |
@kjd Maybe you could emit a warning whenever this illegal conversion happens; a warning in the documentation would help too. A function that reports whether a name is valid under IDNA 2008 would also be great for cases where this library is used to process domain names. |
idna.encode() is a de-facto function to test IDNA 2008 validity of a domain. It will return the encoded domain if successful (and thus valid), and throw an IDNAError exception if not. |
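As a small sketch of the approach described above, a validity check can be wrapped around `idna.encode()` by catching `idna.IDNAError`. The helper name `is_valid_idna2008` is hypothetical and not something the library provides; the sketch assumes the third-party `idna` package is installed.

```python
import idna


def is_valid_idna2008(domain: str) -> bool:
    """Return True if idna.encode() accepts the domain, False otherwise.

    Hypothetical convenience wrapper; the library itself only exposes
    encode() and the IDNAError exception hierarchy.
    """
    try:
        idna.encode(domain)
    except idna.IDNAError:
        # Covers InvalidCodepoint and the other IDNAError subclasses.
        return False
    return True


print(is_valid_idna2008("example.com"))
print(is_valid_idna2008("\U0001F414\U0001F414.tk"))
```

Returning a boolean discards the reason for the failure, so callers that need to report which code point was rejected may prefer to catch the exception themselves.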
See also #40 and http://unicode.org/cldr/utility/idna.jsp?a=%E2%98%83.net
|
IMHO it's better to report the acceptance of emoji (etc) domains to the browser vendors as security bugs in the browsers... |
Hi.
>>> idna.decode('xn--238h.to')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.9/site-packages/idna/core.py", line 389, in decode
    s = ulabel(label)
  File "/usr/lib64/python3.9/site-packages/idna/core.py", line 308, in ulabel
    check_label(label)
  File "/usr/lib64/python3.9/site-packages/idna/core.py", line 257, in check_label
    raise InvalidCodepoint('Codepoint {} at position {} of {} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+1F63B at position 1 of '😻' not allowed
and when I checked it in the browser, the domain worked. |
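One way an application might cope with names like the one in the traceback above is to try the strict decode first and fall back to a plain Punycode decode only when the library raises. The sketch below assumes the `idna` package is installed; `decode_with_fallback` is a hypothetical name, and the fallback path is deliberately non-compliant, so the function also returns a flag saying whether the strict path succeeded.

```python
from typing import Tuple

import idna


def decode_with_fallback(domain: str) -> Tuple[str, bool]:
    """Return (decoded_domain, is_idna2008_compliant).

    Hypothetical helper: strict idna.decode() first, raw Punycode second.
    """
    try:
        return idna.decode(domain), True
    except idna.IDNAError:
        # Fallback: decode each A-label with the stdlib punycode codec,
        # skipping every IDNA 2008 check. May still raise UnicodeError
        # if a label is not well-formed Punycode.
        labels = [
            label[4:].encode("ascii").decode("punycode")
            if label.lower().startswith("xn--")
            else label
            for label in domain.split(".")
        ]
        return ".".join(labels), False


decoded, compliant = decode_with_fallback("xn--238h.to")
print(ascii(decoded), compliant)
```

Keeping the compliance flag alongside the decoded text lets the caller decide whether to display, warn about, or reject the name.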
@j12i Note that per Wikipedia (on "Emoji domains"), right now:
You also have the case, in any TLD, of names created before IDNA2008 started to be enforced. Most of the time, the registry will keep them (until the registrant deletes them). For example, one year ago |
…2008 eg https://todayinmarch2020.🦈🖥.ws/ , https://🕸💍.ws/ , https://🐷🔥.ws https://unicode.org/faq/idn.html#6 psf/requests#3687 kjd/idna#18 kjd/idna#40
Moved some items around and added text about version compatibility and emoji domains
Closing this issue. Mitigations for this are currently referenced in the project's documentation, which links to this issue for anyone that wants to read the discussion. |
The decode method can throw an exception when it finds characters not acceptable in IDNA2008. I think that the characters are acceptable in UTS46.
idna.decode("xn--co8ha.tk")
There isn't a way of signalling to decode that it should apply uts46 rules. UTS46 (in section 4.3) says:
The decode method currently indicates whether there was an error, but it does not always produce a converted unicode string.
The domain name above is a valid domain name and can be accessed: http://🐔🐔.tk/
Trying to encode this domain name also fails, even with uts46=True and transitional=True.
The Python call
"xn--co8ha.tk".decode("idna")
does produce the right answer.
I would stick with the Python IDNA2003 implementation, except that I need improved handling of the German ß character.