
Add an explainer for traffic advice. #10

Merged
4 commits merged on Apr 13, 2021

Conversation

jeremyroman
Contributor

@buettner @domenic PTAL? If this clears this round of bikeshedding I can expand this into a more detailed spec in Bikeshed or something (though honestly it won't be a very long spec anyhow).

```

Each agent has a series of identifiers it recognizes, in order of specificity:
* its own agent name
```

Contributor

What happens if I say {"user_agent": "Chrome", "disallow": true}? Will Chrome block user-initiated traffic to the origin?

Contributor Author

In my view, no, Chrome would not respect (or even request) this advice with regard to user-initiated traffic. Chrome could do so with regard to traffic which is not immediately user-visible, like prefetch traffic. This declaration, supposing Chrome implemented it (separately from the prefetch proxy), would advise Chrome to shed any traffic it was willing and able to shed. I think Chrome should not consider a user navigation sheddable, though.

Contributor

The latest draft is pretty clear now, but the relationship between the "client" and the "agent" could use a bit more explanation. I'd suggest:

  • Adding ", e.g. ExamplePrefetchProxy" here (perhaps linked to the below section)?
  • Adding an example of "a client of the proxy service": something like "(e.g. a web browser)" would be accurate, I think?


Each agent has a series of identifiers it recognizes, in order of specificity:
Contributor

I think "agent" needs much more definition. Is Chrome a user agent? Is curl? Is googlebot?

It sounds like the intent is to cover "agents" which don't send "direct user traffic". Those don't sound like user_agents to me...

Contributor Author

I mean the term in the HTTP sense: the application which is constructing HTTP requests and sending them to a server, i.e. the application which would ordinarily identify itself with the User-Agent request header.

I don't love the term in this context, but I think it's consistent with RFC 7231 and the robots.txt format, both of which would consider curl and googlebot to be user agents.
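
To make the "series of identifiers, in order of specificity" concrete: a minimal sketch (not from the explainer) of how such an agent might pick the applicable entry, assuming the advice document is an array of entries shaped like the `{"user_agent": ..., "disallow": ...}` example discussed above, and that a wildcard identifier like `"*"` (as in robots.txt) is the least specific. The identifier names below are hypothetical.

```typescript
interface TrafficAdviceEntry {
  user_agent: string;
  disallow?: boolean;
}

// Hypothetical identifiers this agent recognizes, most specific first:
// its own name, then a broader category, then a wildcard.
const recognizedIdentifiers = ["ExamplePrefetchProxy", "prefetch-proxy", "*"];

// Return the entry for the most specific identifier mentioned in the advice,
// or null if none of the agent's identifiers appear.
function selectAdvice(
  advice: TrafficAdviceEntry[],
  identifiers: string[] = recognizedIdentifiers,
): TrafficAdviceEntry | null {
  for (const id of identifiers) {
    const entry = advice.find((e) => e.user_agent === id);
    if (entry !== undefined) {
      return entry;
    }
  }
  return null;
}

// A "disallow": true entry advises the agent to shed whatever traffic it is
// willing and able to shed (not, per the discussion above, user navigations).
const applicable = selectAdvice([{ user_agent: "*", disallow: true }]);
const shouldShedTraffic = applicable?.disallow === true;
```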


Currently the only advice is the key `"disallow"`, which specifies a boolean which, if present and `true`, advises the agent not to establish connections to the origin. In the future other advice may be added.

If the response has a `404 Not Found` status, on the other hand, the agent should apply its default behavior.
Contributor

404 Not Found or another error condition (e.g. invalid JSON, network error, 5xx, 409, ...). You cover some of this above, so just making this less specific will do the trick.

Contributor Author

I was trying to focus on the obvious error case at the explainer level of depth.

There are a few details here that I think might merit some detailed discussion but which don't really affect the high-level proposal. For instance:

  • should 3XX redirects be followed? should they be cached?
  • perhaps a network error or 503 Service Unavailable (or maybe all 5XX codes) indicates that the server is busy and a prefetch proxy should prefer not to send it additional traffic at that time (either at the agent's discretion, or after the Retry-After if present), rather than assuming it's okay -- and is this a behavior that generalizes to all agents?

Agreed that this will require more detail to specify, but is this the right place for it?

Contributor

Well, my general philosophy is the explainer doesn't need to be exhaustive, but should also avoid being inaccurate. So this sentence in particular is troubling because it's over-specific, and implies that e.g. 409s will not get the same treatment.

Contributor Author

I've added "or a similar status" to hopefully imply that I'm not saying 404 need be the only status with this treatment. I have difficulty seeing why a 409 response would make sense here (can GET requests generally generate conflicts?), and some statuses like 429 Too Many Requests might conceptually make sense to parse as "disallow" (if there are too many requests for what is basically a static resource, should we really be sending more non-essential traffic?).

It definitely is the case that handling of the various 3XX, 4XX and 5XX statuses will need slightly more text than I want here. I only meant to say what happens in the two cases likely to happen in practice.
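
Purely as an illustration of the cases discussed in this thread (none of this is pinned down by the explainer), an agent's handling might look roughly like the sketch below: a parsable 2xx body yields advice, 404-like responses fall back to the default behavior, and a network error or 5xx is read as "the server is busy, back off", optionally honoring Retry-After. The path follows the draft quoted below.

```typescript
type AdviceOutcome =
  | { kind: "advice"; entries: unknown }                 // body parsed successfully
  | { kind: "default" }                                  // e.g. 404: apply default behavior
  | { kind: "back-off"; retryAfterSeconds?: number };    // server busy or unreachable

async function fetchTrafficAdvice(origin: string): Promise<AdviceOutcome> {
  let response: Response;
  try {
    response = await fetch(new URL("/.well-known/traffic-advice.json", origin));
  } catch {
    // Network error: treat as "busy" rather than "allowed".
    return { kind: "back-off" };
  }

  if (response.status >= 500) {
    // Only the delta-seconds form of Retry-After is handled in this sketch.
    const retryAfter = response.headers.get("Retry-After");
    return {
      kind: "back-off",
      retryAfterSeconds:
        retryAfter !== null && /^\d+$/.test(retryAfter) ? Number(retryAfter) : undefined,
    };
  }

  if (!response.ok) {
    // 404 or a similar status: no advice published; use the default behavior.
    return { kind: "default" };
  }

  try {
    return { kind: "advice", entries: await response.json() };
  } catch {
    // Invalid JSON also falls back to the default behavior.
    return { kind: "default" };
  }
}
```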


## Proposal

Agents which respect traffic advice should fetch the well-known path `/.well-known/traffic-advice.json`. If it returns a `200 OK` response with the `application/json` MIME type, the response body should contain valid JSON like the following:
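
The explainer's own example is not reproduced in this excerpt; as a hypothetical illustration only (shown as a TypeScript literal to match the other sketches in this thread), a document using the keys discussed above might look like this:

```typescript
// Hypothetical body of /.well-known/traffic-advice.json; the explainer, not
// this sketch, defines the actual shape.
const exampleBody = `[
  { "user_agent": "ExamplePrefetchProxy", "disallow": true }
]`;

// An agent would parse the body as JSON and look for the entry addressed to
// the most specific identifier it recognizes.
const entries = JSON.parse(exampleBody) as Array<{
  user_agent: string;
  disallow?: boolean;
}>;
```
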
Contributor

Will the request for this resource go through an applicable service worker?

Contributor Author

This probably arises from a lack of clarity in how I'm using the term "agent" here, which I mean to be closer to "HTTP user agent" than "web browser". Many agents, including web crawlers and the proposed proxy servers, don't implement HTML (and thus service workers) at all.

Contributor

Yeah, right now it's not clear whether the web browser would consult this file before using a proxy, versus the proxy doing this without the web browser in the loop.

Contributor Author

I've added an example at the bottom. For a prefetch proxy, it makes the most sense for the proxy to do this (as it can easily and transparently cache the conclusion).
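
A rough sketch of that division of labor, assuming (hypothetically) that the proxy keys a small in-memory cache by origin and identifies itself as "ExamplePrefetchProxy"; the browser never sees any of this.

```typescript
// Proxy-side cache of the per-origin conclusion, transparent to the browser.
const disallowCache = new Map<string, { disallow: boolean; expiresAt: number }>();

async function proxyShouldConnect(origin: string): Promise<boolean> {
  const cached = disallowCache.get(origin);
  if (cached !== undefined && cached.expiresAt > Date.now()) {
    return !cached.disallow;
  }

  let disallow = false; // default: no advice means traffic is acceptable
  try {
    const response = await fetch(new URL("/.well-known/traffic-advice.json", origin));
    if (response.ok) {
      const entries: Array<{ user_agent?: string; disallow?: boolean }> =
        await response.json();
      // Simplified matching for illustration: a real agent would walk its
      // specificity-ordered identifier list as described earlier.
      disallow = entries.some(
        (e) =>
          (e.user_agent === "ExamplePrefetchProxy" || e.user_agent === "*") &&
          e.disallow === true,
      );
    }
  } catch {
    // This sketch treats network errors as "no advice"; see the error-handling
    // discussion above for why a real proxy might instead back off.
  }

  // The lifetime here is arbitrary; a real proxy might follow HTTP caching
  // semantics for the advice response instead.
  disallowCache.set(origin, { disallow, expiresAt: Date.now() + 5 * 60 * 1000 });
  return !disallow;
}
```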


## Why not robots.txt?

`robots.txt` is designed for crawlers, especially search engine crawlers, and so site owners have likely already established robots rules because they wish to limit traffic from crawlers -- even though they have no such concern about prefetch proxy traffic. The `robots.txt` format is also designed to limit traffic by path, which isn't appropriate for agents which do not know the path of the requests they are responsible for throttling (as with a CONNECT proxy carrying TLS traffic).
Contributor

This kind of ties into my above feedback about agents. I think you want to divide the universe up into three categories:

  • Whatever agents this applies to
  • Whatever agents robots.txt applies to
  • Agents which won't care about either (I think normal navigations from browsers fall into this category)

Contributor Author

Added a short explanation of who should consider being an "agent which respects traffic advice". (I do consider it a politeness thing more than anything else; HTTP clients can and will send traffic anyway if they think it's more important than the server's advice.)


## Proposal

Agents which respect traffic advice should fetch the well-known path `/.well-known/traffic-advice.json`. If it returns a `200 OK` response with the `application/json` MIME type, the response body should contain valid JSON like the following:
Contributor

Any OK response should be allowed I think, not just 200

Contributor Author

I think so, possibly minus 204 and 205, though honestly any except 200 is pretty silly here. (201 Created? What did I create?) I'm trying to explain the "good" behavior here; it wasn't clear to me that going through the exact handling of this edge case made the idea clearer.
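
If the looser rule suggested above were adopted, the check could be as small as this sketch (hypothetical helper name), with 204 and 205 excluded because they carry no body to parse:

```typescript
// Accept any OK (2xx) status except 204 No Content and 205 Reset Content.
function hasUsableAdviceBody(response: Response): boolean {
  return response.ok && response.status !== 204 && response.status !== 205;
}
```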


## Proposal

Agents which respect traffic advice should fetch the well-known path `/.well-known/traffic-advice.json`. If it returns a `200 OK` response with the `application/json` MIME type, the response body should contain valid JSON like the following:
Contributor

Should this have a custom MIME type? Seems like most modern JSON formats do.

Contributor Author

I didn't know of a reason to do so here (if there is one, I'm happy to consider it). The usual justification (e.g. for application/importmap+json) seems to be to prevent some JSON not intended for use as an import map from being used as one to modify browser behavior.

That doesn't seem to apply here, though, since no resource besides /.well-known/traffic-advice.json could be parsed as such.


## Proposal

Agents which respect traffic advice should fetch the well-known path `/.well-known/traffic-advice.json`. If it returns a `200 OK` response with the `application/json` MIME type, the response body should contain valid JSON like the following:
Contributor

Including file extensions is generally not good for well-known URLs, from what I understand.

Contributor Author

I haven't seen this mentioned in RFC 8615 or the registry. I do note that most registrations do not use one; however, many of those correspond to types without a well-recognized file extension or are not intended to serve a response body.

My primary motivation for including a file extension and MIME type which are well established is that it enables a publisher to create this file in common HTTP servers like Apache and IIS by simply adding it to the web root, without needing to otherwise modify server configuration to set the correct MIME type.

Contributor

This would be the first web spec that uses file extensions from what I understand. I'd really strongly recommend we not do this. That means you do need server configuration access, but you also should need server configuration access to modify /.well-known anyway.

Contributor Author

Okay. I suspect modifying response headers is more work than adding a file in the web root, but probably at the same privilege level and hopefully not too difficult.


## Proposal

Agents which respect traffic advice should fetch the well-known path `/.well-known/traffic-advice.json`. If it returns a `200 OK` response with the `application/json` MIME type, the response body should contain valid JSON like the following:
Contributor

Somewhere you should mention that we'll always decode this as UTF-8 (ignoring charset) like other modern web platform features.

Contributor Author

Happy to do this; do you have a reference handy where this has been done elsewhere? We ignore charset even if specified as non-UTF-8?

Contributor

Import maps, WebSockets, server-sent events, module scripts, worker scripts, and Fetch's json() come to mind.
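
For illustration, the decoding behavior being referenced (always UTF-8, ignoring any charset parameter on the Content-Type) is roughly the following:

```typescript
// Decode the body as UTF-8 regardless of any declared charset, in the spirit
// of Fetch's json() and the other features listed above.
async function parseAdviceBody(response: Response): Promise<unknown> {
  const bytes = new Uint8Array(await response.arrayBuffer());
  const text = new TextDecoder("utf-8").decode(bytes);
  return JSON.parse(text);
}
```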

@domenic left a comment
Contributor

LGTM with a bit more clarity on the client/agent distinction.

@jeremyroman
Contributor Author

Resolved latest comments.

@mnot commented Mar 28, 2022

Please register the .well-known location if you're going to use this -- see https://github.com/protocol-registries/well-known-uris

@jeremyroman
Contributor Author

Provisional registration request submitted: protocol-registries/well-known-uris#22
