Add an explainer for traffic advice. #10

# Traffic Advice

Publishers may wish not to accept traffic from [private prefetch proxies](README.md) and sources other than direct user traffic, for instance to reduce server load due to speculative prefetch activity.

We propose a well-known "traffic advice" resource, analogous to `/robots.txt` (for web crawlers), which allows an HTTP server to declare that implementing agents should stop sending traffic to it for some time.

## Proposal

HTTP request activity can broadly be divided into:
* activity on behalf of a user interaction (e.g., a web browser loading a web page the user requested), or which for another reason cannot easily be discarded
* activity for which there is an existing specialized mechanism for throttling traffic (e.g., web crawlers respecting `robots.txt`)
* activity which can easily be discarded at the server's request (e.g., a prefetch which improves loading performance but not correctness), for instance because the server is under load or the operator otherwise does not wish to serve non-essential traffic

Applications in the third category should consider acting as *agents which respect traffic advice*, so as to respect the server operator's wishes with minimal resource impact.

Agents which respect traffic advice should fetch the well-known path `/.well-known/traffic-advice`. If it returns a response with an [ok status](https://fetch.spec.whatwg.org/#ok-status) and an `application/trafficadvice+json` MIME type, the response body should contain valid UTF-8 encoded JSON like the following:

```json
[
  {"user_agent": "prefetch-proxy", "disallow": true}
]
```
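
As a rough illustration, here is how an agent might fetch and parse this resource. This is a minimal sketch using only Python's standard library, not part of the proposal; the helper name `fetch_traffic_advice` is invented for the example, and error handling is simplified.

```python
# Illustrative sketch: fetch an origin's traffic advice.
import json
import urllib.error
import urllib.request


def fetch_traffic_advice(origin: str):
    """Return the parsed advice list for an origin, or None if the server
    provided no usable advice (e.g., a 404), in which case the agent
    simply applies its default behavior."""
    url = origin.rstrip("/") + "/.well-known/traffic-advice"
    try:
        with urllib.request.urlopen(url) as response:
            # An ok status is in the range 200-299.
            if not 200 <= response.status < 300:
                return None
            if response.headers.get_content_type() != "application/trafficadvice+json":
                return None
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, ValueError):
        # Network errors, non-2xx statuses, and malformed JSON all mean
        # "no advice": fall back to the agent's default behavior.
        return None
```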

Each agent has a series of identifiers it recognizes, in order of specificity:
* its own agent name (e.g., `"ExamplePrivatePrefetchProxy"`)
* decreasingly specific generic categories that describe it, like `"prefetch-proxy"`
* `"*"` (which applies to every implementing agent)

The agent finds the most specific matching element of the response and applies the corresponding advice (currently only a boolean which advises disallowing all traffic) to its behavior. The agent should respect the cache-related response headers to minimize the frequency of such requests and to revalidate the resource when it is stale.

Currently the only advice is the key `"disallow"`: a boolean which, if present and `true`, advises the agent not to establish connections to the origin. Other advice may be added in the future.

If the response instead has a `404 Not Found` status (or a similar status), the agent should apply its default behavior.
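
Building on the sketch above, selecting and applying the most specific advice might look like the following; `advises_disallow` is likewise an invented name, and the identifier list is the example one from this explainer.

```python
# Illustrative sketch: apply the most specific matching advice entry.
def advises_disallow(advice_list, identifiers) -> bool:
    """Return True if the most specific entry matching one of the agent's
    identifiers (ordered most to least specific) disallows traffic."""
    if advice_list is None:
        return False  # no usable advice: apply the agent's default behavior
    for identifier in identifiers:
        for entry in advice_list:
            if isinstance(entry, dict) and entry.get("user_agent") == identifier:
                # "disallow" advises disallowing traffic only when present
                # and exactly true.
                return entry.get("disallow") is True
    return False


# Example usage, with the sample response body from above:
advice = fetch_traffic_advice("https://www.example.com")
disallowed = advises_disallow(
    advice, ["ExamplePrivatePrefetchProxy", "prefetch-proxy", "*"]
)
```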

## Why not robots.txt?

`robots.txt` is designed for crawlers, especially search engine crawlers, so site owners have likely already established robots rules because they wish to limit traffic from crawlers, even if they have no such concern about prefetch proxy traffic. The `robots.txt` format is also designed to limit traffic by path, which isn't appropriate for agents which do not know the path of the requests they are responsible for throttling (as with a CONNECT proxy carrying TLS traffic).

> **Review comment:** This kind of ties into my above feedback about agents. I think you want to divide the universe up into three categories:
>
> **Reply:** Added a short explanation of who should consider being an "agent which respects traffic advice". (I do consider it a politeness thing more than anything else; HTTP clients can and will send traffic anyway if they think it's more important than the server's advice.)

A more similar textual format would be possible, but the format for parsing `robots.txt` is not consistently specified and implemented. By contrast, JSON implementations are widely available on a wide variety of platforms used by site owners and authors.

## Application to private prefetch proxies

For example, suppose a private prefetch proxy, `ExamplePrivatePrefetchProxy`, would like to respect traffic advice in order to allow site owners to limit inbound traffic from the proxy.

When a client of the proxy service (e.g., a web browser) requests a connection to `https://www.example.com`, the proxy server issues an HTTP request for `https://www.example.com/.well-known/traffic-advice`. It receives the sample response body from above and recognizes `"prefetch-proxy"` as the most specific advice that applies to itself.

It caches this result (traffic is presently disallowed) at the proxy server (or even across multiple proxy server instances run by the same operator), and refuses client connections to `https://www.example.com` until an updated `/.well-known/traffic-advice` resource no longer disallows traffic. Even if a large number of proxy clients request connections to `https://www.example.com`, the site operator and its CDN receive no traffic from the proxy except for infrequent requests to revalidate the traffic advice (which may be, for example, once per hour).
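
Tying the earlier sketches together, a proxy might gate client connections on cached advice roughly as follows. The fixed one-hour TTL is a simplification invented for this example; a real agent should derive freshness from the cache-related response headers instead.

```python
# Illustrative sketch: proxy-side gating of connections on cached advice.
import time

IDENTIFIERS = ["ExamplePrivatePrefetchProxy", "prefetch-proxy", "*"]
ADVICE_TTL_SECONDS = 3600  # simplification: revalidate roughly hourly

_advice_cache = {}  # origin -> (disallowed: bool, expires_at: float)


def connection_allowed(origin: str) -> bool:
    """Decide whether the proxy should carry a client connection to origin."""
    cached = _advice_cache.get(origin)
    if cached is not None and cached[1] > time.time():
        disallowed = cached[0]
    else:
        # Uses fetch_traffic_advice and advises_disallow from the sketches above.
        advice = fetch_traffic_advice(origin)
        disallowed = advises_disallow(advice, IDENTIFIERS)
        _advice_cache[origin] = (disallowed, time.time() + ADVICE_TTL_SECONDS)
    return not disallowed
```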

> **Review comment:** I think "agent" needs much more definition. Is Chrome a user agent? Is curl? Is googlebot? It sounds like the intent is to cover "agents" which don't send "direct user traffic". Those don't sound like `user_agent`s to me...
>
> **Reply:** I mean the term in the HTTP sense: the application which is constructing HTTP requests and sending them to a server, i.e. the application which would ordinarily identify itself with the `User-Agent` request header. I don't love the term in this context, but I think it's consistent with RFC 7231 and the robots.txt format, both of which would consider curl and googlebot to be user agents.