Define hosts' public suffix and registrable domain. #391

Merged
merged 7 commits into from
Jun 7, 2018

Member

@mikewest mikewest commented May 25, 2018

This patch is another attempt at #72, and defers most of the actual work to the
algorithms defined at https://publicsuffix.org/list/.

I wonder if this is something we should expose on URL objects? I think @hillbrad was asking for it a looong time ago, but I don't know if he still has use cases.


Member

@annevk annevk left a comment

Thanks, I mostly have nits.

Do we need public suffix as a concept in practice? I haven't seen the need for that so far, but if we add an API it would make sense.

url.bs Outdated
obtain <var>host</var>'s <a for=host>public suffix</a>, run the following steps:

<ol>
<li><p>Let <var>parsed</var> be the result of <a lt="host parser">host parsing</a> <var>host</var>.
Member

A host is already parsed (otherwise it wouldn't be a host). You also need to introduce the host variable in the paragraph before the algorithm.

Member Author

A host is already parsed (otherwise it wouldn't be a host).

Hrm. Yeah, I guess it's reasonable to assume that we'll only be using this algorithm on already-parsed hosts.

You also need to introduce the host variable in the paragraph before the algorithm.

Line 277 introduces <var>host</var>. Would you prefer to be more explicit, like "To obtain the <a for=host>public suffix</a> for a <a for=/>host</a> <var>A</var>:"?

Member

Ah, I missed that on line 277. That's fine, but your alternative here works too.

url.bs Outdated
<var>host</var>'s <a for=host>registrable domain</a>, run the following steps:

<ol>
<li><p>Let <var>parsed</var> be the result of <a lt="host parser">host parsing</a> <var>host</var>.
Member

Same comment as above.

url.bs Outdated
<ol>
<li><p>Let <var>parsed</var> be the result of <a lt="host parser">host parsing</a> <var>host</var>.

<li><p>If <var>parsed</var> is not a <a>domain</a>, return the empty string.
Member

This kind of implies that the public suffix is also a string. Perhaps it's cleaner to return null?

Member Author

Is the public suffix a host? I guess it could be. I was assuming it was a string, but treating it as a host seems reasonable.

<td><code>com</code>
<td><code>example.com</code>
<tr>
<td><code>EXAMPLE.COM</code>
Member

This is not a host, but input to the host parser.

Member Author

I think it's helpful to point out that no matter how folks spell the URL, it's going to be normalized. Perhaps shifting this table to include a URL rather than a host would make that point, especially for the punycode bits?

Member

I think it's fine to just list hosts, but we should label it "host input" or some such, to not confuse it with host as a concept, which is already parsed and normalized.
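
To illustrate the "host input" versus normalized host distinction, here is a minimal Python sketch. The helper name `normalize_host_input` is hypothetical, and it relies on Python's built-in `idna` codec, which implements IDNA 2003 rather than the UTS #46 processing used by the URL Standard's host parser, so it only approximates what the parser does.

```python
def normalize_host_input(host_input: str) -> str:
    """Lowercase the input, then convert each label to its ASCII form.

    Assumption: Python's "idna" codec (IDNA 2003) stands in for the URL
    Standard's domain-to-ASCII step; edge cases can differ from UTS #46.
    """
    # The codec leaves pure-ASCII labels untouched, so lowercase explicitly.
    return host_input.lower().encode("idna").decode("ascii")


if __name__ == "__main__":
    for host_input in ("EXAMPLE.COM", "example.com", "إختبار"):
        print(repr(host_input), "->", normalize_host_input(host_input))
```

However the input is spelled, the value that reaches the public suffix and registrable domain algorithms is the normalized form, which is why the table column is better labeled as input.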

<td><code>github.io</code>
<td><code>whatwg.github.io</code>
<tr>
<td><code>إختبار</code>
Member

Same as above. And also applies below.

url.bs Outdated
</div>

<p>Two <a for=/>hosts</a>, <var>A</var> and <var>B</var> are said to be
<dfn for=host export id=concept-host-same-site>same-site</dfn> with each other if either of the
Member

We have it as "same origin". Should this be "same site"?

Member Author

Meh. I think I'd have spelled it "same-origin" if you hadn't already spelled it "same origin". :)

I'm happy to follow suit with "same site"; I'm not dogmatic about hyphenation.

url.bs Outdated
following statements are true:

<ul class=brief>
<li><p><var>A</var> is identical to <var>B</var>
Member

This should use concept-host-equals.

Member

annevk commented May 25, 2018

As for the API, I think we could make it part of #288. We could even leave out unicode() for now as that seems to be more controversial...

@mikewest
Member Author

Updated based on your feedback, WDYT?

Do we need public suffix as a concept in practice?

I think we do, as it seems to be what we want to reference from HTML's document.domain bit. It also makes the definition of "registrable domain" simpler. Potentially exposing it as an API in the future is just icing on the cake.

@sleevi sleevi left a comment

From an API perspective, I remain opposed to adding it (not least because each browser carries its own notion of a PSL and, as it turns out, browsers implement slightly different algorithms)

I die a little inside that we’re speccing more of this, but I suppose that’s inevitable. CC’ing @weppos and @dnsguru as other PSL maintainers

@mikewest
Member Author

From an API perspective, I remain opposed to adding it (not least because each browser carries its own notion of a PSL and, as it turns out, browsers implement slightly different algorithms)

Isn't detecting that kind of difference in platform restrictions a reason to add the API? Different browsers consider a different set of hosts to have different registrable domains over time: it seems reasonable to expose that to developers so they can make decisions about the environment in which their code is executing.

I die a little inside that we’re speccing more of this, but I suppose that’s inevitable.

Not writing it down seems worse. :) As long as cookies, document.domain, etc. depend on the notion, we need to explain it in a way that specs can rely on.

sleevi commented May 25, 2018

Isn't detecting that kind of difference in platform restrictions a reason to add the API? Different browsers consider a different set of hosts to have different registrable domains over time: it seems reasonable to expose that to developers so they can make decisions about the environment in which their code is executing.

No, it's actually a reason for not exposing - that the platform does not provide any guarantees about that environment, and you shouldn't be relying on trying to detect the environment. If there is something you could do if you're not on the PSL (for example, using separate domains under a gTLD vs using separate 3LDs), you should do that regardless. Anything you could or would do if you're not on the PSL, you should do regardless. And anything you would or could do 'if' you're on the PSL is wrong. So it's sort of win/win - always treat it as if you're not on it, and you're fine :)

I know it's a fairly extreme position, but the PSL isn't something we can or should be relying on, especially as the Internet scales on. The notion of having a static list of every hosting provider, CMS, and otherwise user-generated content platforms, which the PSL is, is, well, a very 1980s solution :)

Not writing it down seems worse. :) As long as cookies, document.domain, etc. depend on the notion, we need to explain it in a way that specs can rely on.

Except no one (on the author side) should be building systems that rely on it, and no new specs should be depending on it. To the extent writing it down gets to make explicit that this is a legacy aspect and any spec that references it has security flaws and should be redesigned before being implemented/shipped because of that, sure 👍

Member

annevk commented May 25, 2018

@sleevi if you were actually successful in stopping new standards from relying on it I'd be more persuaded. But meanwhile WebAuthn relies on it, Token Binding relies on it, the new Cross-Origin-Resource-Policy header relies on it, and to a large extent because of pressure from accounts.google.com (at least for the first two) as I understand it.

sleevi commented May 25, 2018

@annevk Token Binding is not yet shipping in any browser, and WebAuthN's use of facets remains problematic security. If C-O-R-P relies on it, we should be fixing that in C-O-R-P before shipping.

Member

annevk commented May 25, 2018

@sleevi @mikewest and I tried and it turns out cookies are just too attractive not to be compatible with. We better figure out how to make PSL scale somehow.

url.bs Outdated
@@ -272,6 +272,93 @@ for further processing.
U+0020 SPACE, U+0023 (#), U+0025 (%), U+002F (/), U+003A (:), U+003F (?), U+0040 (@), U+005B ([),
U+005C (\), or U+005D (]).

<p>A <a for=/>host</a>'s <dfn for=host export id=concept-host-public-suffix>public suffix</dfn> is
the portion of a <a for=/>host</a> which is controlled by a registrar, public or otherwise. To

pedantry:

which is controlled by a registrar

Isn't necessarily correct. This implies control over the DNS, which isn't always passed on (e.g. in the case of hosting or DNS providers), and in an example like appspot.com, that domain isn't controlled by a registrar.

That was the intent of the PSL originally - reflecting ccTLDs registration policies - but that predates the advent of the PRIVATE section where it all began the descent into hell :)

publicsuffix.org doesn't list 'what' a public suffix is, other than the result of running the algorithm. Logically, it represents the separation of domain boundaries indicating a change in administrative or technical control or security policy (which is why IETF called it DBOUND), but that's a bit of a mouthful... :/

Member Author

an example like appspot.com, that domain isn't controlled by a registrar.

Aren't we calling Google the registrar in this example? Or GitHub the registrar of *.github.io?

Is there a term I could use that would be more accurate (and less than a sentence long :) )?

@mikewest Yeah, except neither GitHub nor Google are actually or acting as registrars. That was why it was sort of weird. They don't necessarily allow registration either (and may instead assign names, such as Amazon, based on project IDs)

For a given domain input, the PSL splits the labels on the first administrative boundary, with the registered domain being the set of labels that are operated according to a different set of domain policies than the public suffix (which itself may contain more domain splits).

Definitely a mouthful, and this is part of why we dance around it on publicsuffix.org, because we haven't found a pithy way of describing left/right except in their relationship to each other. :/

I was hoping your ability to condense these concepts would be better than mine.

This is definitely one of the longest outstanding proper definitions that we eventually need to clarify on the PSL project as well, and connected to publicsuffix/publicsuffix.org#12

If we consider only the ICANN section (mistakenly named like that, as it should be IANA), then the definition is probably correct. If we also consider the PRIVATE section, and then the list as a whole, we must come up with a better definition of what effectively distinguishes a suffix from a host.

The "controlled" part is the key. In both cases, the common denominator is that an entity has control of a portion (a set of labels) of a host, and determines specific rules on how that portion of the name is operated. Everything beyond (to the left of) that label is basically not under the direct control of that entity, and therefore each subzone should be considered independent from the others.

In the case of a registry, the controlled portion is for sure the TLD and perhaps extra lower levels (generally second, sometimes third). In that case, the "registerable" definition potentially applies, as there is a direct assumption that the registrar makes those domains available for registration. Again, this is actually a potentially incorrect assumption, as domains that belong to that zone may not be open for registration, but assigned explicitly.

While "registerable" may potentially fit the registrar use case, it definitely doesn't fit the PRIVATE use case, because the suffixes in this section may be there for a variety of reasons.

However, regardless of the use case, the common pattern is that the entity that controls the suffix declares that every subzone beyond that suffix should be considered an independent zone, potentially managed by different users.

sleevi commented May 25, 2018

@annevk The PSL fundamentally can't scale, unless our goal is to deliver a snapshot of the Internet domains to users in real time, which we sort of put to bed when RFC 952 was obsoleted.

I don't want to derail this thread, so apologies if it comes off there - but definitely wanted to push back on the notion of exposing this as part of the platform. Anyone that is making security assumptions about the presence or non-presence of a domain on the PSL is making a flawed security decision. To the extent browsers are doing it, they're wrong - and while I understand they may be doing so for legacy reasons, we should push back. But as a concept for exposing it to/as part of the platform, as much as possible, we should be trying to hide it from the platform and developers, because it's a concept that should go away / should not be relied on. If there is anything authors would do differently (based on non-presence), they should do that, and if there's anything they would do based on presence, they should stop doing that :) Hopefully that would obviate the need for API exposure.

Member

annevk commented May 25, 2018

@mikewest I pushed a cleanup commit, but I didn't address @sleevi's now-hidden comment above.

Member

annevk commented May 25, 2018

@sleevi again, that'd be more convincing if Google didn't double down on relying upon it. Both server-side with accounts and in Chrome with site isolation.

sleevi commented May 25, 2018

@annevk I don't think I can productively respond to that. It feels very much "You're employed by Google. Google misuses a project you maintain. Therefore I don't need to consider that feedback until you change Google first.". I agree that it is unfortunate, I agree that it is not ideal, but to the extent possible, we should push back and try to find better solutions, and push back on those organizations that ignore that feedback.

@mikewest
Member Author

again, that'd be more convincing if Google didn't double down on relying upon it. Both server-side with accounts and in Chrome with site isolation.

Google is large, it contains multitudes.

FWIW, Chrome's isolation folks are actively working on origin isolation (document.domain and related weirdness makes that hard), and I think it's fair to say that they see "site" isolation as a stopgap they'd like to move past (though that itself was a ~4 year engineering project).

I think it's also true that Google's sign-in team is enthusiastic about separating accounts.google.com from everything else, but that there's real value in creating some association with docs.google.com and mail.google.com. I'm hopeful that we won't be stuck with that model forever, but it seems like one we're going to be dealing with for (at least) the next 5 years.

I think it's helpful to create primitives that help developers work within the model we've created for ourselves, on the one hand, and to use our other hand to poke at the model in the hopes of shifting it. Making Sec-Metadata less granular, or not shipping SameSite cookies or etc. seems like it would make the short term pain more acute, and wouldn't actually advance the goal of shifting to a more origin-based view of the world.

weppos commented May 25, 2018

Thanks @sleevi for bringing this discussion to my attention. For the record, I'm very bad at following up on threads instantaneously (I admire how Ryan can keep an eye on all the MLs he's involved in 😅 ).

Anyway, before weighing in on one side or another, I actually have a question:

I know it's a fairly extreme position, but the PSL isn't something we can or should be relying on, especially as the Internet scales on. The notion of having a static list of every hosting provider, CMS, and otherwise user-generated content platforms, which the PSL is, is, well, a very 1980s solution :)

@sleevi just to better understand, are you critical of the concept of a public suffix, or of the specific implementation of the PSL per se?

The reason I'm asking is that I concur that the PSL as it stands today is an obsolete project that may need some extra work, but I'm not totally against the concept of a public suffix (the hard part is to properly define it).

For now, for the sake of simplicity, let's take the suffixes in the PSL and treat them as the public suffixes. If the domain owner had a way, as was done for CAA, to communicate (e.g. via DNS) the zone's preferred public suffix interpretation and policy, would your position on the concept of a public suffix change?

Please set aside for a second the possible implementation constraints or implications.

I think the scope here is to try to clarify that a notion of public suffix exists, regardless of whether we use the PSL today to represent it. I don't completely disagree with that idea.

url.bs Outdated
@@ -272,6 +272,94 @@ for further processing.
U+0020 SPACE, U+0023 (#), U+0025 (%), U+002F (/), U+003A (:), U+003F (?), U+0040 (@), U+005B ([),
U+005C (\), or U+005D (]).

<p>A <a for=/>host</a>'s <dfn for=host export>public suffix</dfn> is the portion of a
<a for=/>host</a> which is controlled by a registrar, public or otherwise. To obtain
Member

Instead of "controlled by a registrar, public or otherwise" we could say "included on the Public Suffix List". This is boring, but factual and correct as I understand it.

I think that works (as boring as it is)

url.bs Outdated
</ol>

<p>A <a for=/>host</a>'s <dfn for=host export>registrable domain</dfn> is a <a>domain</a> that could
be registered at a registry. To obtain <var>host</var>'s <a for=host>registrable domain</a>, run
Member

"is its public suffix including one domain label preceding its public suffix". Again, boring, but factual?

So a given host may have multiple public suffixes expressed within it.

Perhaps:
The domain formed by the most specific public suffix for host, along with the domain label immediately preceding it?

From a spec question, what do you want this definition to entail for the appspot.com case?

That is,

  • foo.bar.appspot.com is "obviously" going to return bar.appspot.com as the registerable domain (with appspot.com as the public suffix), and the same would be expected if just bar.appspot.com.
  • What do you expect this machinery to return for appspot.com? appspot.com is on the PSL, so that is a public suffix, but appspot.com is also a registerable domain under the com PSL.

I seem to recall that different platform features interpret that differently (navigation vs cookies, for example)

Member

I wasn't really aware of this case. Do you know why they interpret it differently? I guess we want consistent answers with cookies, WebAuthn, etc. If by navigation you mean the address bar it seems consistency with that would not matter that much.

Member

@mikewest wrote that the registrable domain would be null in such a case (we have github.io as example).

I’d need to re-audit the Chrome code to figure out which cases are web visible. The results differ in this case based on whether or not you include private suffixes and whether you treat wildcards as implicit entries of the parent. Chrome and FF differ on the latter, and the former is specified by the caller.

sleevi commented May 25, 2018

@weppos Thanks for chiming in. I realized the more I wrote, the more this should be its own issue, so I filed publicsuffix/list#671 to try and track some of my thoughts on this, and on the overall domain holder boundary use case.

Member

annevk commented May 25, 2018

Based on the discussion I think we should go ahead, to ensure a consistent definition of all the standards that end up using these.

Given @sleevi's legitimate concerns, let's add a <p class=warning> block to discourage relying on "same site", "public suffix", and "registrable domain". However, I'm not personally motivated to police these (other than ensuring correct usage) anymore given that being-similar-to-cookie-interests seem to very easily win any argument.

Any API proposal should be discussed in a new issue on this repository, if there's still appetite. Discussion of scaling PSL can go in the issue raised by @sleevi above.

url.bs Outdated
<tr>
<td><code>com</code>
<td><code>com</code>
<td>
Member

@annevk annevk May 25, 2018

We should change this to <td>Null I suppose. (Though we could maybe add a paragraph that says that null values are omitted. Not sure what's nicer.)

Member

I would suggest <td><i>null</i>

Member

Why? We don't use that convention anywhere.

Member

Example tables like this are special. We omit the quotes, substitute strings for structs, and use other conventions meant for visual clarity and not consistency. I certainly don't think we should capitalize "null" here, and I think italicizing it so that it's clear it's not just a registrable domain named null is helpful.

Shrug. Just a thought.

<td><code>whatwg.github.io</code>
<td><code>github.io</code>
<td><code>whatwg.github.io</code>
<tr>

Isn't this row duplicated? The previous one looks the same.

dnsguru commented May 25, 2018

@sleevi @weppos I'll toss in some hopefully helpful but really wordy info here as color.

For the last decade we have struggled to pin down focused definitions of 'public suffix', 'eTLD', 'registerable domain', and other terms that get treated as interchangeable.

I agree that we need a glossary or something to help make these definitions more clear, and perhaps identify and expire the use of some of them if we can. But it gets tricky and nuanced.

Different users, developers, integrators, and contributors define these in a variety of ways, sometimes as synonyms, sometimes not. This seems to result from the variation in how the PSL gets implemented within libraries or used in development. Sometimes there is a granular distinction that drove a given term's usage.

Pardon if I go too deep on this one but it helps us out in the process of coming up with a good path forward.
In the past, @gerv the wise and powerful helped us tolerate these organic differences by reminding us of the origins of the PSL being about cookie horizons, and how it has grown since.

To avoid being drawn into the frothing energy around competing/alternative root TLD systems, the PSL maintainers opted to follow a document from ICANN called ICP-3, which defines a single authoritative root system for TLDs. The IANA maintains the listings of the TLDs listed in that root. The IANA does not go deeper levels than these initial entries (so it would include .UK but not CO.UK and include .AU but not COM.AU).

A long time ago (in a galaxy far, far away ;)), it was discovered that one might be able to issue a 'super cookie' for CO.UK and slurp up all kinds of interesting data for the subdomains of CO.UK, and the PSL was born to dig deeper into these TLD structures in order to know what to treat as though it was 'effectively' (hence 'eTLD') a Top Level Domain when really it was a second level (or deeper) domain such as CO.UK.

And thus was born use of a static list to identify these nuances in a more elegant manner than the IANA list, and go deeper into the effective namespaces.

The benefit of such a list is that it is possible to cache it or incorporate it within one's software to understand how to treat entries, but a drawback is that having this update creates challenges because it is held in a centralized location (which defeats the benefits of the distributed nature of DNS that replaced the hosts.txt situation in the 80s).

In the years since, the PSL has really become the only widely used, community maintained, frequently updated list of strings that might be expected to behave as-if they are a TLD, even if they are not at the top level.

This evolved further. While CO.UK is operated by Nominet who oversee .UK, and COM.AU is overseen by AUDA who oversee .AU, there are some TLD-like systems that leap that direct and authoritative connection. Over the course of time, systems like Centralnic started offering subdomain registrations, Github offered subdomain hosting, and Dyn (now Oracle) started to offer DNS host naming, etc. US.COM, operated by Centralnic, is technically under .COM, but not operated directly by the .COM registry. So there is a change in the administrative horizon that begins at the root, and we opted to split the PSL into two sections, putting the IANA top down / ICANN delegated zones into the 'ICANN' section, and located this stuff (mostly, it still needs constant audit) into a section that designated that horizon, the 'PRIVATE' section.

These lists seem at first like they are something that should be simple to compile, but the other maintainers and I would argue this is not the case. As a result, developers and integrators and security experts, software libraries, certificate authorities, and browsers and search engines (and I could riff for a while on this) have leveraged the PSL as a core list (sometimes authority) on handling this stuff.

We as maintainers know what we do with it, but know we are not all knowing and have a spectrum of use-cases that get impacted by changes we might make to the file due to the processing that is done on it after it is downloaded.

Maintaining entries is non-disruptive to the list and the derivative users. Renaming sections might be. Defining these terms in a glossary may be helpful for future integrations, but not as much for where there is 'set and forget' code or processes.

I hope this is helpful - and not too "ivory tower" - as background.

@mikewest
Member Author

Based on the discussion I think we should go ahead, to ensure a consistent definition of all the standards that end up using these.

(FYI: I'm OOO until the 4th; I can almost certainly add the note that @annevk requested above sometime today, but if this needs more discussion than that, then I hope y'all feel free to move forward without me. :) )

Member

annevk commented May 28, 2018

@mikewest I think the main thing I need input on is the appspot.com / github.io question. To quote @sleevi above:

From a spec question, what do you want this definition to entail for the appspot.com case?

That is,

  • foo.bar.appspot.com is "obviously" going to return bar.appspot.com as the registerable domain (with appspot.com as the public suffix), and the same would be expected if just bar.appspot.com.
  • What do you expect this machinery to return for appspot.com? appspot.com is on the PSL, so that is a public suffix, but appspot.com is also a registerable domain under the com PSL.

I seem to recall that different platform features interpret that differently (navigation vs cookies, for example)

sleevi commented May 28, 2018

To expand on that:

Doing an audit of Chromium for EXCLUDE_PRIVATE_REGISTRIES shows that the features are predominantly UI. It looks like we do expose some of the configurability to Blink, although it doesn't use it.

The wildcard issue is enough to call out, since you'll get different results if you call getPublicSuffix(getPublicSuffix('foo.platform.sh')) depending on browser.
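
As a rough illustration of how the wildcard reading can change a result, here is a small Python sketch. The rule set, the function name `public_suffix`, and the `wildcard_implies_parent` flag are all hypothetical; the sketch does not claim to reproduce either browser's matcher, and it ignores exception rules.

```python
# Hypothetical rule set: "sh" plus a wildcard rule under platform.sh.
RULES = {"sh", "*.platform.sh"}


def public_suffix(domain: str, wildcard_implies_parent: bool) -> str:
    """Return the longest matching suffix under one of two wildcard readings."""
    labels = domain.split(".")
    # Walk candidate suffixes from longest to shortest.
    for i in range(len(labels)):
        candidate = labels[i:]
        exact = ".".join(candidate)
        wildcard = ".".join(["*"] + candidate[1:])
        if exact in RULES or wildcard in RULES:
            return exact
        # Optional reading: "*.platform.sh" also makes "platform.sh" a suffix.
        if wildcard_implies_parent and f"*.{exact}" in RULES:
            return exact
    # No rule matched: fall back to the implicit "*" rule (last label).
    return labels[-1]


if __name__ == "__main__":
    for implied in (False, True):
        print(implied, public_suffix("platform.sh", wildcard_implies_parent=implied))
```

Under these assumed rules, `foo.platform.sh` matches the wildcard rule in both modes, while `platform.sh` itself is only treated as a public suffix when the wildcard is read as implying its parent.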

@mikewest
Member Author

Thanks, @sleevi and @annevk.

From a spec question, what do you want this definition to entail for the appspot.com case?

I think we have two options:

  1. We return appspot.com for both the registrable domain and public suffix.
  2. We return null for the registrable domain, and appspot.com for the public suffix.

I prefer 2 (and I'm pretty sure that's what the current text suggests, though the PSL's algorithm isn't exactly clear in step 7 of https://publicsuffix.org/list/ what we ought to do if there's no additional label). Running with 1 would require callers to understand that appspot.com needs to be treated differently, and to ensure that they do some sort of "If X's registrable domain is identical to X's public suffix" check, which is trivial to forget. It seems safer to fail closed by returning null as the registrable domain both for TLDs like com and for other public suffixes like appspot.com.
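
A minimal Python sketch of option 2 follows, under stated assumptions: the rule set is a tiny hand-picked stand-in for the real list, the function names are hypothetical, and wildcard and exception rules are left out. It is meant only to show the fail-closed behavior, not to mirror the spec text.

```python
from typing import Optional

# Toy stand-in for the Public Suffix List (assumption: plain rules only).
RULES = {"com", "appspot.com", "io", "github.io"}


def public_suffix(host: str) -> str:
    """Return the most specific matching suffix of `host`."""
    labels = host.lower().split(".")
    # Longest candidate first, so the most specific rule wins.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in RULES:
            return candidate
    # No rule matched: fall back to the implicit "*" rule (last label).
    return labels[-1]


def registrable_domain(host: str) -> Optional[str]:
    """Public suffix plus the label immediately preceding it, or None.

    None is returned when the host *is* a public suffix (option 2),
    so callers cannot mistake a suffix for a registrable domain.
    """
    labels = host.lower().split(".")
    suffix_len = len(public_suffix(host).split("."))
    if len(labels) == suffix_len:
        return None
    return ".".join(labels[-(suffix_len + 1):])
```

With this shape, `registrable_domain("foo.bar.appspot.com")` gives `bar.appspot.com`, while both `registrable_domain("appspot.com")` and `registrable_domain("com")` give `None`. A caller comparing two hosts' registrable domains then has to handle `None` explicitly rather than remembering a separate "is this a suffix?" check.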

The bugs you linked, Ryan, are curious. I would have expected neither appspot.com nor platform.sh to be able to have cookies set. It seems like you need to pick one or the other: you're either a TLD, or you're not. Being both is weird (but also outside the scope of this bug to define, since we're basically punting to the PSL :) ).

@annevk
Member

annevk commented May 29, 2018

@mikewest so that means you can never be same site with appspot.com, even if you are appspot.com. We better include some examples. Or if that's an open issue with the PSL, record that in their GitHub issues and link it.

@mikewest
Member Author

@mikewest so that means you can never be same site with appspot.com, even if you are appspot.com

This seems fine to me, at least partially because I'm not comfortable with actually serving anything from a public suffix, and I'm surprised every time I rediscover that we allow it. :)

You also can't set appspot.com as document.domain, nor can you set cookies at appspot.com. As Ryan notes, I don't actually expect this to have much web-visible impact due to those kinds of constraints, aside from the wildcard case which I think I misunderstood above. (The difference there, though, is not the registrable domain behavior, but the behavior of the wildcard in defining whether a given domain is a public suffix).

@mikewest
Member Author

mikewest commented Jun 4, 2018

cc03e7b addresses the concrete suggestions I picked out of the conversation above. WDYT?

@annevk
Member

annevk commented Jun 4, 2018

Thanks, instead of defining registrable domain in terms of a registry, we should use a variant of @sleevi's suggestion I think:

The domain formed by the most specific public suffix for host, along with the domain label immediately preceding it.

And I think we should add another paragraph to the warning, saying that when specifications nevertheless do rely on any of them for comparison purposes, they should carefully consider cross-scheme scenarios. I'm not entirely sure how to phrase that given that hosts don't have schemes, but you'll probably think of something?

@mikewest
Member Author

mikewest commented Jun 4, 2018

Thanks, instead of defining registrable domain in terms of a registry, we should use a variant of @sleevi's suggestion I think:

Agreed.

you'll probably think of something?

I thought of something in bd35e7e. Is it reasonable?

@annevk
Member

annevk commented Jun 4, 2018

Yeah that looks great, I have a couple minor editorial nits, but from my perspective this is good to go otherwise. I'll give the others copied here until Wednesday before I merge it (at which point it'd be good to raise problems as new issues instead).

<tr>
<td><code>example.إختبار</code>
<td><code>xn--kgbechtv</code>
<td><code>example.xn--kgbechtv</code>

@sleevi sleevi Jun 4, 2018

So one of the things is that the PSL doesn't specify whether it returns a U-Label or an A-Label (that's left to the implementation). I'm curious about the documentation here for the A-Label: is this an expectation of the contract?

That is, are you trying to show that either U-Label or A-Label can be returned regardless of U-Label or A-Label input, or are you trying to state that A-Labels should be the consistent return?

Member

Currently we don't rely on this anywhere (assuming it's consistent to be one or the other, is that at least required?), but A-label seems preferable as that'd be consistent with how the platform exposes URLs and origins overall.

I suspect this will only matter if we add an API, but it really depends on whether PSL dependencies keep getting added or not.
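
For readers unfamiliar with the distinction, this small Python sketch converts between the two forms. It is illustrative only: the helper names are hypothetical, and Python's built-in `idna` codec implements IDNA 2003 rather than the UTS #46 processing the URL Standard uses, so results can differ for some inputs.

```python
def to_a_label(host: str) -> str:
    """Convert each label to its ASCII (A-label) form.

    Uses Python's built-in "idna" codec (IDNA 2003, an assumption of
    this sketch; the URL Standard relies on UTS #46 instead).
    """
    return host.encode("idna").decode("ascii")


def to_u_label(host: str) -> str:
    """Convert each A-label back to its Unicode (U-label) form."""
    return host.encode("ascii").decode("idna")


# "example." followed by the Arabic test TLD, written with escapes so the
# code points are unambiguous.
ARABIC_TEST_HOST = "example.\u0625\u062e\u062a\u0628\u0627\u0631"
```

Calling `to_a_label(ARABIC_TEST_HOST)` produces `example.xn--kgbechtv`, and `to_u_label` reverses it. Input that is already in A-label form passes through `to_a_label` unchanged, which is what makes A-labels a convenient canonical return form.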

@annevk
Member

annevk commented Jun 7, 2018

Thanks all! I filed #396 as follow-up for the remaining issue.

6 participants