Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IDNA #53

Closed
annevk opened this issue Jul 30, 2015 · 26 comments
Closed

IDNA #53

annevk opened this issue Jul 30, 2015 · 26 comments

Comments

@annevk
Copy link
Member

annevk commented Jul 30, 2015

This issue tracks faults in http://www.unicode.org/reports/tr46/ since Unicode doesn't really do that well. If you find an issue, use http://www.unicode.org/reporting.html to report it and then report back here.

@SimonSapin
Copy link
Contributor

I’ve just submitted the following to http://www.unicode.org/reporting.html. I’ll update this when I get a response.

Subject: xn-- prefix never added in UTS # 46

In http://www.unicode.org/reports/tr46/tr46-15.html#ProcessingStepConvertValidate , the algorithm looks for a xn-- prefix and decodes the rest of the label per Punycode when it is present.

In http://www.unicode.org/reports/tr46/tr46-15.html#ToASCII however, the xn-- prefix is never added:

Convert each label with non-ASCII characters into Punycode [RFC3492]. This may record an error.

This should probably be replaced with something like:

For each label with non-ASCII characters, replace the label with “xn--” followed by the encoding of the label according to Punycode [RFC3492]. This may record an error.

@Sebmaster
Copy link

My report to Unicode from some time ago which seems to not be fixed yet:

The Format section (8.1) under Conformance Testing in UTS46 is confusing.

The explanation for the toASCII and toUnicode explains to use the provided processing_option for toUnicode, and always use nontransitional for toASCII.
However, in the implementation section of toUnicode (4.3), it explains to always call the processing step with nontransitional. The toASCII parameter list provides a processing_option, though.

It looks to me, as if the descriptions for toASCII and toUnicode in the conformance testing section got mixed up. This also applies to the descriptions in the header of IdnaTest.txt.

The other thing is that there's only a single IdnaTest file, but there's no explanation to which algorithm it applies. Is it for IDNA2008, IDNA2003 or UTS46? It seems to be categorized according to Unicode standard instead of IDNA reference, which makes this really confusing. Haven't reported that one yet though.

@SimonSapin
Copy link
Contributor

@Sebmaster regarding the other thing, http://www.unicode.org/reports/tr46/#Conformance_Testing explains how "To test for conformance to UTS46" using IdnaTest.txt.

@Sebmaster
Copy link

@SimonSapin I'm not sure that's totally correct either since:

Bn for Bidi Rule #n from Section 2. The Bidi Rule, in Right-to-Left Scripts for Internationalized Domain Names for Applications (IDNA) [IDNA2008]
Cn for ContextJ tests in, Appendix A.n in The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) [IDNA2008]. Thus C1 = Appendix A.1. ZERO WIDTH NON-JOINER, and C2 = Appendix A.2. ZERO WIDTH JOINER. The CONTEXTO tests are optional for client software, and not tested here.

is not described in TR46 at all. It's imported from the IDNA2008 standard, which has no relevance in the TR46 spec... I think 😕

@Sebmaster
Copy link

Got a mail today from Unicode (regarding conformance test description):

This was discussed at the UTC meeting in July, and has been forwarded to the author of the UTS for consideration in a subsequent version.

So that's pretty sweet.

@jcranmer
Copy link

Oh yeah, I came back into this and recall that the IdnaTest.txt is really bad at telling you how to process it.

@Sebmaster:
The ToASCII column uses nontransitional processing (read IdnaTest.txt's commented header) and UseSTD3ASCIIRules=true (see §8 of the input). However, they definitely appear to have some extra rules not described in their algorithm (for example, ToUnicode should never produce an [A4_1] or [A4_2] error, since those are specific to the ToASCII regime and ToUnicode never calls ToASCII, yet you can clearly see for yourself that they do).

@SimonSapin
Copy link
Contributor

I got a response to #53 (comment):

This has been added to the feedback document for next week's meeting.

@SimonSapin
Copy link
Contributor

… and today:

I was directed by the UTC to let you know that this has been sent to the editor for review during the next update cycle.

@valenting
Copy link
Collaborator

As per servo/rust-url#160 I submitted feedback regarding Validation rule no. 2 - "2. The label must not contain a U+002D HYPHEN-MINUS character in both the third and fourth positions."
This isn't being enforced by all UAs, as it's being used on youtube which uses domains such as https://r3---sn-2gb7ln7k.googlevideo.com/videoplayback?... This domain breaks that rule.

@srl295
Copy link

srl295 commented May 12, 2016

@valenting Your feedback is tracked as part of PRI317 http://www.unicode.org/review/pri317/ (being discussed now).

By the way @SimonSapin I'd think the right way to track is via UTC agenda items http://www.unicode.org/L2/L-curdoc.htm

@Sebmaster
Copy link

It seems like Unicode has closed that ticket without removing the -- validity requirement 😞

Does anybody have the ability to look into the Unicode ... process to see what's going on there?

@annevk
Copy link
Member Author

annevk commented Nov 21, 2016

@domenic
Copy link
Member

domenic commented Jan 6, 2017

The validation rule problem mentioned at #53 (comment) doesn't seem to have made it into https://docs.google.com/document/d/11PEww2N0PbXyPhbsCdW_PjD3BNgZMy5XHUv02SSXNqY/edit#heading=h.p7mmdt3ofe3 by my reading. What's the latest? It's item A, nevermind

@bagder bagder mentioned this issue Jan 30, 2017
@annevk
Copy link
Member Author

annevk commented Feb 13, 2017

Going forward, rather than tracking all UTS 46 feedback here, I suggest we just create new issues against this repository, so we can discuss each problem in isolation. I created an idna label that we can use to group them all.

@TimothyGu
Copy link
Member

As an update to the original issue, it seems the proposed changes to UTS#46 have been incorporated into its latest draft: http://www.unicode.org/reports/tr46/proposed.html.

Since traditionally UTS#46 updates are synced with Unicode Standard updates, a new version of UTS#46 with the CheckHyphens hook should be published next month, when Unicode 10.0.0 is scheduled to be released as well.

@annevk
Copy link
Member Author

annevk commented May 11, 2017

Yeah, but it's unclear if that's what we want as I commented in the document @domenic pointed to and discussed in email (primarily www-archive) and also at #267. However, it's been very hard to get implementers to give feedback on exactly what they're willing to try out here and what makes sense.

@domenic
Copy link
Member

domenic commented May 11, 2017

I think we should go for a quick-fix first so that people who are trying to use spec-complaint libraries like Node.js's URL and jsdom/whatwg-url don't continue to suffer. We can use #267 to figure out a longer-term browser-compatible plan.

@annevk
Copy link
Member Author

annevk commented May 11, 2017

  1. This is the generic IDNA issue. It's not for the hyphen case specifically.
  2. Is there actual suffering? Demonstrated compatibility issues would be really valuable.

@domenic
Copy link
Member

domenic commented May 11, 2017

Sure, I meant we should go for a quick-fix for the hyphens.

Examples of suffering:

annevk added a commit that referenced this issue May 12, 2017
As reported at
#53 (comment) this is
causing issues in non-browser implementations.
annevk added a commit to web-platform-tests/wpt that referenced this issue May 18, 2017
Tests for whatwg/url#53 and friends.
annevk added a commit to web-platform-tests/wpt that referenced this issue May 19, 2017
Tests for whatwg/url#53 and friends.
@annevk
Copy link
Member Author

annevk commented May 24, 2017

FWIW, I changed my mind after seeing #309 (comment). I think what UTS46 revision 18 defines is reasonable and that's what we should go with.

@annevk
Copy link
Member Author

annevk commented May 24, 2017

@Sebmaster did your testing issue ever got addressed? If not, could you file a new issue on that? I'm happy to help investigate that as I've made some attempts myself as well now.

annevk added a commit that referenced this issue May 24, 2017
Fixes #53 and fixes #267 by no longer breaking on on hyphens in the
3rd and 4th position of a domain label. This is known to break
YouTube: r3---sn-2gb7ln7k.googlevideo.com. This is done by setting
the proposed CheckHyphens flag to false.

Fixes #110 by clarifying that BIDI and CONTEXTJ checks are to be done
by setting the proposed CheckBidi and CheckJoiners flags to true.

Follow-up #313 is filed to remove the proposed bits once Unicode is
updated.
annevk added a commit that referenced this issue May 24, 2017
Tests: web-platform-tests/wpt#5976.

Fixes #53 and fixes #267 by no longer breaking on on hyphens in the
3rd and 4th position of a domain label. This is known to break
YouTube: r3---sn-2gb7ln7k.googlevideo.com. This is done by setting
the proposed CheckHyphens flag to false.

Fixes #110 by clarifying that BIDI and CONTEXTJ checks are to be done
by setting the proposed CheckBidi and CheckJoiners flags to true.

Follow-up #313 is filed to remove the proposed bits once Unicode is
updated.
@Sebmaster
Copy link

It's not addressed yet, but the latest draft contains a TODO for it.

@domenic domenic closed this as completed in dc9d831 Jun 1, 2017
domenic pushed a commit to web-platform-tests/wpt that referenced this issue Jun 1, 2017
Tests for whatwg/url#53 and friends, as fixed by whatwg/html#2627.
@domenic
Copy link
Member

domenic commented Jun 17, 2017

I reported the following editorial issue:

I help maintain a JavaScript library for implementing UTS 46. In the process of revising our public API for the upcoming proposed revision (http://www.unicode.org/reports/tr46/proposed.html#ToASCII), we noticed how strange it is that all other inputs to ToASCII, besides the input string, are booleans. Whereas processing_option is an enumeration with two values.

For editorial consistency, would it make sense to switch the processing option to a boolean flag, e.g. UseTransitionalProcessing?

@annevk
Copy link
Member Author

annevk commented Jun 22, 2017

@Sebmaster it seems http://www.unicode.org/reports/tr46/#Conformance_Testing was updated.

@domenic I asked for that change too. Note that if we want to track it here it should become its own issue. This is no longer a meta issue for all things IDNA as it got too unwieldy.

@domenic
Copy link
Member

domenic commented Jun 22, 2017

No need to really track it. I do think though having a public archive of feedback we've submitted is good, and my bad for my part in derailing the thread away from that. Maybe www-archive is OK?

@annevk
Copy link
Member Author

annevk commented Jun 22, 2017

Yeah or just a new issue for each piece of feedback. I don't think that would get crowded and if it does we can figure out a better approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

8 participants